Compare commits


216 Commits

268de64005 [ROCm][Windows] Enable torchvision build with ROCm on Windows (#147382)
- Updated HIP flags for Windows (removed non-Windows flags in the Windows case, added the runtime library)
- Set up the hipcc call for the Windows case
- Removed CUDA flags (not used in ROCm) on Windows
- Updated the Windows compiler selection (added a case for ROCm on Windows)
- Fixed a path issue in hipify_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147382
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-18 23:37:05 +00:00
61a64c20c4 [MPSInductor] Move threadfence at the right location (#149437)
Not sure how it worked in the past, but the fence should come before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969, which removed an unnecessary barrier before calling `threadgroup_reduce` functions.
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before this change it produced gibberish; now it works fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-03-18 23:27:19 +00:00
ea02aac2ca [export] Update remove runtime asserts pass (#149198)
Test Plan: CI -- Removing asserts should be a no-op

Differential Revision: D69566851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149198
Approved by: https://github.com/pianpwk
2025-03-18 23:07:25 +00:00
5db3a4ac88 [Build] Guard per-op headers in ACLUtils.cpp (#149417)
Fixes internal build failures where per-op headers are not generated.
We really should have a lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2025-03-18 22:56:29 +00:00
45fec7843d Fix local compilation and hipification (#149384)
Summary:
As the title says: fix the issue introduced by
https://github.com/pytorch/pytorch/pull/148305

Test Plan: CI and e2e https://docs.google.com/document/d/1Bu-MxJCkN7WaRkKJLVBQvnSp8yV0v3Aeb3Y9R5sjeHw/edit?tab=t.0

Differential Revision: D71373001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149384
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/chenyang78
2025-03-18 22:56:02 +00:00
0d804dec0f [Profiler/Easy] Pass Overload Names To Kineto (#149333)
Summary: Right now we get overload names and forward them to the Event List frontend of the profiler, but we do not forward anything to Kineto. This diff checks whether each CPU op has an overload name and, if so, appends it to the op name.
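For illustration, a minimal sketch (mine, not from the diff) of where overload names show up:

```python
import torch
from torch.profiler import ProfilerActivity, profile

a, b = torch.randn(8), torch.randn(8)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.add(a, b)

# The Event List frontend already shows "aten::add.Tensor"; with this diff the
# overload suffix should also appear in the Kineto trace (e.g. via export_chrome_trace).
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```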

Test Plan: Added test in CI

Differential Revision: D71326670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149333
Approved by: https://github.com/aaronenyeshi
2025-03-18 22:15:51 +00:00
3b48c72141 [export] Minor refactor to trace.py (#149240)
Minor refactor to trace.py
* Removed `_strict_export_lower_to_aten_ir` in favor of just `_strict_export` and `_non_strict_export`
* Matched the APIs of `_strict_export` and `_non_strict_export`
    * Instead of a `lower_to_aten_callback` callable or a `dispatch_tracing_mode`, both functions now take a `_to_aten_func`, which can be either `_export_to_aten_ir_make_fx` or `_export_to_aten_ir`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149240
Approved by: https://github.com/pianpwk
2025-03-18 21:40:30 +00:00
010963032c [ONNX] Create onnx_symbolic (#148905)
In the old exporter we allowed users to define a symbolic() method to bypass JIT tracing for a block of logic. We can allow users to do similar things by creating symbolic ops at export time.

This PR implements `torch.onnx.ops.symbolic` and `torch.onnx.ops.symbolic_multi_out` to allow users to create ONNX nodes symbolically with pt2 & fx. The custom PyTorch ops are designed so that the attributes are encoded as part of a valid fx op. Users provide the shape and dtype for the meta function to produce the correct fake tensor during export.

An example is

![image](https://github.com/user-attachments/assets/c62f5f21-e038-456e-a71d-b9a5d0a7cd9d)
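For a text version of the example above, here is a hedged sketch based on the PR description (argument names approximate; defer to the PR for the exact signature):

```python
import torch

class CustomModel(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.onnx.ops.symbolic(
            "custom_domain::CustomOp",  # ONNX domain::op to emit
            (x,),                       # inputs
            dict(attr_key=1),           # ONNX attributes, encoded into the fx op
            dtype=x.dtype,              # metadata so the meta function can
            shape=x.shape,              # produce the correct fake tensor
            version=1,
        )

onnx_program = torch.onnx.export(CustomModel(), (torch.randn(2, 3),), dynamo=True)
```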

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148905
Approved by: https://github.com/titaiwangms
2025-03-18 21:32:06 +00:00
d80a70b58a Avoid unnecessary clone in torch.cuda.set_rng_state (#149283)
`clone` has a performance issue, according to f49c3eb6e6/megatron/core/tensor_parallel/random.py (L77-L80)
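The pattern where the extra clone hurts looks roughly like this (a sketch of RNG-state save/restore, as in Megatron's tensor-parallel RNG tracker; requires CUDA):

```python
import torch

state = torch.cuda.get_rng_state()  # snapshot the generator state
torch.cuda.manual_seed(1234)        # ... run work under a forked seed ...
torch.cuda.set_rng_state(state)     # restore; this used to incur an extra clone
```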
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149283
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-03-18 20:47:57 +00:00
cd5c13d8f0 [hop] Rework the check of Metadata in the functionalization key (#148789)
This PR is a more cosmetic rework of the metadata check performed by some HOPs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148789
Approved by: https://github.com/ydwu4
2025-03-18 20:30:59 +00:00
f06e366532 partitioner: treat inputs with static indices as free to save (#148922)
Fixes https://github.com/pytorch/pytorch/issues/141881

internal xref: https://fb.workplace.com/groups/1075192433118967/posts/1538435030128036/?comment_id=1556782068293332

I tried to make a test case out of the code linked in that github issue. The setup + bad outcome today was as follows:

(1) you have a graph where one of its inputs is a model weight

(2) in the backward, you do some downstream compute on `weight`, `tmp = f(weight)`, where (a) `tmp` is of a smaller size than `weight`, and (b) the compute is trivially fusible into other kernels (so the partitioner thinks it is "free" to recompute)

(3) since `sizeof(tmp) < sizeof(weight)` and the recompute is free, the partitioner decides that it would be strictly better to save `tmp` for backward instead of weight

(4) this is bad: `weight` is a static tensor that sits in GPU memory for the duration of your entire training loop, so saving it for backward has no negative impact on peak memory. Since we save `tmp` instead, we end up unnecessarily increasing peak memory. In particular, the repro involves an autograd.Function in eager that saves the weight for backward, so we end up hitting higher peak memory in compile (a minimal sketch of this setup follows below).
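A minimal sketch of that setup (names and the fp8 detail are illustrative, not the exact repro):

```python
import torch

class Fp8LinearLike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)  # eager saves the static weight itself
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # tmp = f(weight): smaller / low-precision compute on the weight that is
        # only needed in the backward. The partitioner saw recomputing this as
        # "free" and chose to save tmp instead of the already-resident weight.
        tmp = weight.to(torch.float8_e4m3fn)
        grad_x = grad_out @ tmp.to(grad_out.dtype)
        grad_w = grad_out.t() @ x
        return grad_x, grad_w
```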

The fix I'm trying out in this PR is to tell the partitioner that graph inputs that we know have static addresses (aka parameters) are "free" to save.

Below is the fw/bw graph before my change, where you can see that instead of `primals_2` being saved for backward, we save `t_8` (which involves some low-precision downstream compute on `primals_2` that is only needed in the backward).

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1])
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3);  div_3 = None
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, unsqueeze_8, t_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", t_8: "f8e4m3fn[64, 64][1, 64]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)

```

With the change, we save primals_2 for backward instead

```
 ===== Forward graph 0 =====
 /data/users/hirsheybar/checkout2/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", primals_3: "bf16[64][1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        abs_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_1)
        view: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_1, [64, 1, 64]);  abs_1 = None
        amax: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view, [-1]);  view = None
        abs_2: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(primals_2)
        view_1: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_2, [64, 1, 64]);  abs_2 = None
        amax_1: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_1, [-1]);  view_1 = None
        _to_copy: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax, dtype = torch.float32);  amax = None
        clamp: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy, 1e-12);  _to_copy = None
        div: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp, 448.0);  clamp = None
        reciprocal: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div)
        view_2: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_1, [64, 1, 64])
        view_3: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_2, [64, 1, 1, 64]);  view_2 = None
        slice_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal, 0, 0, 9223372036854775807);  reciprocal = None
        unsqueeze: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_1, 1);  slice_1 = None
        slice_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze, 2, 0, 9223372036854775807);  unsqueeze = None
        unsqueeze_1: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_2, 3);  slice_2 = None
        mul: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_3, unsqueeze_1);  view_3 = unsqueeze_1 = None
        view_4: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul, [64, 1, 64]);  mul = None
        view_5: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_4, [64, 64]);  view_4 = None
        _to_copy_1: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_5, dtype = torch.float8_e4m3fn);  view_5 = None
        _to_copy_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_1, dtype = torch.float32)
        clamp_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_2, 1e-12);  _to_copy_2 = None
        div_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_1, 448.0);  clamp_1 = None
        reciprocal_1: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_1)
        view_6: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(primals_2, [64, 1, 64])
        view_7: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_6, [64, 1, 1, 64]);  view_6 = None
        slice_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_1, 0, 0, 9223372036854775807);  reciprocal_1 = None
        unsqueeze_2: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_3, 1);  slice_3 = None
        slice_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_2, 2, 0, 9223372036854775807);  unsqueeze_2 = None
        unsqueeze_3: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_4, 3);  slice_4 = None
        mul_1: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_7, unsqueeze_3);  view_7 = unsqueeze_3 = None
        view_8: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_1, [64, 1, 64]);  mul_1 = None
        view_9: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_8, [64, 64]);  view_8 = None
        _to_copy_3: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_9, dtype = torch.float8_e4m3fn);  view_9 = None
        t: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_1);  div_1 = None
        new_ones: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div, [1, 1], pin_memory = False)
        new_ones_1: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t, [1, 1], pin_memory = False)
        t_2: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_3);  _to_copy_3 = None
        t_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_1);  new_ones_1 = None
        _scaled_mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_1, t_2, new_ones, t_3, None, None, torch.bfloat16);  _to_copy_1 = t_2 = new_ones = t_3 = None
        view_10: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm, [64, 1, 64]);  _scaled_mm = None
        view_11: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_10, [64, 1, 1, 64]);  view_10 = None
        slice_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div, 0, 0, 9223372036854775807);  div = None
        unsqueeze_4: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_5, 1);  slice_5 = None
        slice_6: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_4, 2, 0, 9223372036854775807);  unsqueeze_4 = None
        unsqueeze_5: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_6, 3);  slice_6 = None
        mul_2: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_11, unsqueeze_5);  view_11 = unsqueeze_5 = None
        view_12: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_2, [64, 1, 64]);  mul_2 = None
        view_13: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_12, [64, 64]);  view_12 = None
        view_14: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_13, [1, 64, 64]);  view_13 = None
        view_15: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_14, [1, 64, 64, 1]);  view_14 = None
        slice_7: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t, 0, 0, 9223372036854775807);  t = None
        unsqueeze_6: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_7, 1);  slice_7 = None
        slice_8: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_6, 2, 0, 9223372036854775807);  unsqueeze_6 = None
        unsqueeze_7: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_8, 3);  slice_8 = None
        mul_3: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_15, unsqueeze_7);  view_15 = unsqueeze_7 = None
        view_16: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_3, [64, 64, 1]);  mul_3 = None
        view_17: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_16, [64, 64]);  view_16 = None
        _to_copy_4: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_17, dtype = torch.bfloat16);  view_17 = None
        add: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.add.Tensor(_to_copy_4, primals_3);  _to_copy_4 = primals_3 = None
        t_5: "bf16[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(amax_1);  amax_1 = None
        view_21: "bf16[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.view.default(t_5, [1, 1, 64]);  t_5 = None
        amax_3: "bf16[1, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_21, [-1]);  view_21 = None
        unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(amax_3, 1);  amax_3 = None

        # No stacktrace found for following nodes
        view_39: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(add, [64, 64]);  add = None
        return (view_39, primals_1, primals_2, unsqueeze_8)

INFO: TRACED GRAPH
 ===== Backward graph 0 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "bf16[64, 64][64, 1]cuda:0", primals_2: "bf16[64, 64][64, 1]cuda:0", unsqueeze_8: "bf16[1, 1, 1][1, 1, 1]cuda:0", tangents_1: "bf16[64, 64][64, 1]cuda:0"):
         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6946 in forward, code: out = out.unflatten(0, input.shape[:-1])
        view_19: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(tangents_1, [64, 64]);  tangents_1 = None

         # File: /data/users/hirsheybar/checkout2/pytorch/test/dynamo/test_repros.py:6943 in forward, code: out = Fp8LinearFn.apply(
        t_4: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(primals_2);  primals_2 = None
        clone: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.clone.default(t_4, memory_format = torch.contiguous_format);  t_4 = None
        abs_3: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.abs.default(view_19)
        view_20: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(abs_3, [64, 1, 64]);  abs_3 = None
        amax_2: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.amax.default(view_20, [-1]);  view_20 = None
        expand: "bf16[1, 64, 1][1, 0, 1]cuda:0" = torch.ops.aten.expand.default(unsqueeze_8, [1, 64, 1]);  unsqueeze_8 = None
        clone_1: "bf16[1, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        view_22: "bf16[64, 1][1, 1]cuda:0" = torch.ops.aten.view.default(clone_1, [64, 1]);  clone_1 = None
        _to_copy_5: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(amax_2, dtype = torch.float32);  amax_2 = None
        clamp_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_5, 1e-12);  _to_copy_5 = None
        div_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_2, 448.0);  clamp_2 = None
        reciprocal_2: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_2)
        view_23: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_19, [64, 1, 64])
        view_24: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_23, [64, 1, 1, 64]);  view_23 = None
        slice_9: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_2, 0, 0, 9223372036854775807);  reciprocal_2 = None
        unsqueeze_9: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_9, 1);  slice_9 = None
        slice_10: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_9, 2, 0, 9223372036854775807);  unsqueeze_9 = None
        unsqueeze_10: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_10, 3);  slice_10 = None
        mul_4: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_24, unsqueeze_10);  view_24 = unsqueeze_10 = None
        view_25: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_4, [64, 1, 64]);  mul_4 = None
        view_26: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_25, [64, 64]);  view_25 = None
        _to_copy_6: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_26, dtype = torch.float8_e4m3fn);  view_26 = None
        _to_copy_7: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten._to_copy.default(view_22, dtype = torch.float32);  view_22 = None
        clamp_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.clamp.default(_to_copy_7, 1e-12);  _to_copy_7 = None
        div_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.div.Tensor(clamp_3, 448.0);  clamp_3 = None
        reciprocal_3: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.reciprocal.default(div_3)
        view_27: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(clone, [64, 1, 64]);  clone = None
        view_28: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_27, [64, 1, 1, 64]);  view_27 = None
        slice_11: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(reciprocal_3, 0, 0, 9223372036854775807);  reciprocal_3 = None
        unsqueeze_11: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_11, 1);  slice_11 = None
        slice_12: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_11, 2, 0, 9223372036854775807);  unsqueeze_11 = None
        unsqueeze_12: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_12, 3);  slice_12 = None
        mul_5: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_28, unsqueeze_12);  view_28 = unsqueeze_12 = None
        view_29: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_5, [64, 1, 64]);  mul_5 = None
        view_30: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_29, [64, 64]);  view_29 = None
        _to_copy_8: "f8e4m3fn[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_30, dtype = torch.float8_e4m3fn);  view_30 = None
        t_6: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.t.default(div_3);  div_3 = None
        new_ones_2: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(div_2, [1, 1], pin_memory = False)
        new_ones_3: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.new_ones.default(t_6, [1, 1], pin_memory = False)
        t_8: "f8e4m3fn[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(_to_copy_8);  _to_copy_8 = None
        t_9: "f32[1, 1][1, 1]cuda:0" = torch.ops.aten.t.default(new_ones_3);  new_ones_3 = None
        _scaled_mm_1: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._scaled_mm.default(_to_copy_6, t_8, new_ones_2, t_9, None, None, torch.bfloat16);  _to_copy_6 = t_8 = new_ones_2 = t_9 = None
        view_31: "bf16[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(_scaled_mm_1, [64, 1, 64]);  _scaled_mm_1 = None
        view_32: "bf16[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.view.default(view_31, [64, 1, 1, 64]);  view_31 = None
        slice_13: "f32[64, 1][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(div_2, 0, 0, 9223372036854775807);  div_2 = None
        unsqueeze_13: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_13, 1);  slice_13 = None
        slice_14: "f32[64, 1, 1][1, 1, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_13, 2, 0, 9223372036854775807);  unsqueeze_13 = None
        unsqueeze_14: "f32[64, 1, 1, 1][1, 1, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_14, 3);  slice_14 = None
        mul_6: "f32[64, 1, 1, 64][64, 64, 64, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_32, unsqueeze_14);  view_32 = unsqueeze_14 = None
        view_33: "f32[64, 1, 64][64, 64, 1]cuda:0" = torch.ops.aten.view.default(mul_6, [64, 1, 64]);  mul_6 = None
        view_34: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_33, [64, 64]);  view_33 = None
        view_35: "f32[1, 64, 64][4096, 64, 1]cuda:0" = torch.ops.aten.view.default(view_34, [1, 64, 64]);  view_34 = None
        view_36: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.view.default(view_35, [1, 64, 64, 1]);  view_35 = None
        slice_15: "f32[1, 64][1, 1]cuda:0" = torch.ops.aten.slice.Tensor(t_6, 0, 0, 9223372036854775807);  t_6 = None
        unsqueeze_15: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_15, 1);  slice_15 = None
        slice_16: "f32[1, 1, 64][1, 64, 1]cuda:0" = torch.ops.aten.slice.Tensor(unsqueeze_15, 2, 0, 9223372036854775807);  unsqueeze_15 = None
        unsqueeze_16: "f32[1, 1, 64, 1][1, 64, 1, 1]cuda:0" = torch.ops.aten.unsqueeze.default(slice_16, 3);  slice_16 = None
        mul_7: "f32[1, 64, 64, 1][4096, 64, 1, 1]cuda:0" = torch.ops.aten.mul.Tensor(view_36, unsqueeze_16);  view_36 = unsqueeze_16 = None
        view_37: "f32[64, 64, 1][64, 1, 1]cuda:0" = torch.ops.aten.view.default(mul_7, [64, 64, 1]);  mul_7 = None
        view_38: "f32[64, 64][64, 1]cuda:0" = torch.ops.aten.view.default(view_37, [64, 64]);  view_37 = None
        _to_copy_9: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten._to_copy.default(view_38, dtype = torch.bfloat16);  view_38 = None
        t_10: "bf16[64, 64][1, 64]cuda:0" = torch.ops.aten.t.default(view_19)
        mm: "bf16[64, 64][64, 1]cuda:0" = torch.ops.aten.mm.default(t_10, primals_1);  t_10 = primals_1 = None
        sum_1: "bf16[64][1]cuda:0" = torch.ops.aten.sum.dim_IntList(view_19, [0]);  view_19 = None
        return (_to_copy_9, mm, sum_1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148922
Approved by: https://github.com/zou3519
2025-03-18 20:08:11 +00:00
b8c0c50bbe Release.md readability improvements (#149402)
Improves a bunch of readability/grammatical issues with release.md.

Note: This was a claude code experiment, with all changes automatically generated. But it turns out minor edits like this are _not_ a good use of claude code, since it asked for approval on every single changed line. It would probably be way more efficient to toss the entire file into a simple LLM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149402
Approved by: https://github.com/atalman
2025-03-18 20:04:56 +00:00
dfdf58f8cb [ROCm] enable CK backend for bf16/fp16 on gfx11 (#143971)
This change enables the CK backend for bf16/fp16 on gfx11.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143971
Approved by: https://github.com/jeffdaily
2025-03-18 18:18:22 +00:00
e0e8639a10 [torchbench] fix dynamic_shapes spec for moco (#148772)
Fixes https://github.com/pytorch/pytorch/issues/148333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148772
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-03-18 18:16:54 +00:00
dbea13ed45 [ROCm][TunableOp] Minor fix to BLAS logging for ScaledGEMM with no bias vector. (#149357)
Omit the bias type argument for BLAS logging when there is a ScaledGEMM with no bias vector.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149357
Approved by: https://github.com/jeffdaily
2025-03-18 18:14:52 +00:00
c0566e0dbf [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)
Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where the conversion should not touch the -I flags, and cases where CUDA appears more than once, by replacing only the first instance.
- Fix a case where the nvcc key may not exist.
- Fix a case where hipify should ignore flag values and only touch the flag itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
2025-03-18 18:01:07 +00:00
585fd972b8 Iterate over dense dim first in split reduction reindexing (#147229)
Fix for https://github.com/pytorch/pytorch/issues/144431.

Improves perf from 0.2996 -> 0.0396.

In split reductions, we view an input tensor as a single dimension, then reduce over it. When we are reducing over a tensor which has a dimension other than the last dimension as the dense dimension, we should iterate over the dense dimension first in our re-indexing.

This PR also gives evidence for the general need for reduction tiling, e.g. for cooperative-reduction handling of this case.
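As a rough illustration (my own sketch, not from the PR), the problematic case is a reduction over a tensor whose dense (stride-1) dimension is not the last one, e.g. a transposed view:

```python
# Sketch of the layout this targets (cf. issue 144431); requires a CUDA device.
import torch

x = torch.randn(4096, 4096, device="cuda").t()  # strides (1, 4096): dim 0 is dense

fn = torch.compile(lambda t: t.sum())  # split reduction views t as one flat dim
fn(x)  # re-indexing should now walk the dense dim first for coalesced reads
```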

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147229
Approved by: https://github.com/jansel
2025-03-18 17:35:21 +00:00
ee3a2c6ee2 [State_dict] Remove functools.cache and add unit test (#149354)
Fixes https://github.com/pytorch/pytorch/issues/149100

`@functools.cache` keeps `self` alive, leading to unexpected memory behavior (e.g. in the linked issue, even after the model is deleted, the model's memory is still occupied).
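A minimal repro of the retention problem (my sketch, not the unit test from this PR):

```python
import functools
import gc
import weakref

import torch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(1024, 1024))

    @functools.cache  # the cache keys include self, holding a strong reference
    def state_dict_keys(self):
        return tuple(self.state_dict().keys())

m = Model()
m.state_dict_keys()
ref = weakref.ref(m)
del m
gc.collect()
print(ref() is not None)  # True: the cache is still keeping the model alive
```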

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149354
Approved by: https://github.com/fegin
2025-03-18 17:30:41 +00:00
5b8cc4709a [FSDP2] Add set_reshard_after_forward (#149103)
Fixes https://github.com/pytorch/pytorch/issues/149029

Add `set_reshard_after_forward` to set `post_forward_mesh_info`, which in turn decides `_reshard_after_forward`.

Add a unit test similar to `test_fully_shard_communication_count`: after calling `set_reshard_after_forward(True)`, the FSDPModule behaves as if `_reshard_after_forward=True`, and likewise when setting it to False.
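A hedged sketch of the resulting API (assumes FSDP2's `fully_shard` and an initialized process group / device mesh):

```python
import torch
from torch.distributed.fsdp import fully_shard

model = torch.nn.Linear(8, 8)
fully_shard(model)  # model is now also an FSDPModule

# Keep parameters unsharded after forward (e.g. when backward follows
# immediately), or flip back to resharding to trade communication for memory:
model.set_reshard_after_forward(False)
model.set_reshard_after_forward(True)
```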

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149103
Approved by: https://github.com/awgu
2025-03-18 17:21:54 +00:00
a8df5e5af9 [dynamo] Add mem leak test (#149358)
Test for https://github.com/pytorch/pytorch/pull/148480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149358
Approved by: https://github.com/malfet
2025-03-18 16:38:28 +00:00
d5b1d99f78 Enable more nightly tests on s390x (#148452)
Also enable some tests that were probably accidentally disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-18 16:09:39 +00:00
381d0cb239 [DCP] Avoid in-place update and deepcopy during dedupe (#149320)
Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive with models that have a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

#### Control job with deepcopy regression:
First save ~24.8s
Global step latency is ~7-8s

#### Test job with the new fix to avoid deepcopy:
First save is ~21s
Global step latency is ~2s
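For intuition, a standalone illustration of the cost (plain Python, not DCP code; sizes are arbitrary):

```python
import copy
import time

# A plan-like mapping with many FQN entries, as in large models.
plan = {f"layer.{i}.weight": {"offsets": [0, 0], "lengths": [128, 128]}
        for i in range(100_000)}

t0 = time.perf_counter(); copy.deepcopy(plan); t1 = time.perf_counter()
shallow = dict(plan); t2 = time.perf_counter()
print(f"deepcopy: {t1 - t0:.3f}s vs shallow rebuild: {t2 - t1:.3f}s")
```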

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
c41196a4d0 [EZ][Docker] Remove install_db.sh (#149360)
Which is a vestige of the caffe2 days and has been a no-op since https://github.com/pytorch/pytorch/pull/125092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360
Approved by: https://github.com/atalman, https://github.com/cyyever, https://github.com/seemethere, https://github.com/Skylion007
2025-03-18 16:07:47 +00:00
fdacf3c920 [ONNX] Update types in VerificationInfo (#149377)
torch.types.Number was rendered as-is in the documentation, which can be confusing. We now write out the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms
2025-03-18 15:37:39 +00:00
405025778d Revert "[AOTI] Update test runner to use the new APIs (#147105)"
This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60.

Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413))
2025-03-18 15:25:40 +00:00
5ba437fb45 Revert "[AOTI] Forward fix unit test failures (#149401)"
This reverts commit ec9e11145e1a86300aae0fe09a1d8917d21deba1.

Reverted https://github.com/pytorch/pytorch/pull/149401 on behalf of https://github.com/desertfire due to reverting the original PR instead ([comment](https://github.com/pytorch/pytorch/pull/149401#issuecomment-2733633516))
2025-03-18 15:18:48 +00:00
213eea216a [MTIA] Add _mtia_maybeExchangeDevice to MTIA module (#149340)
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
ec9e11145e [AOTI] Forward fix unit test failures (#149401)
Summary: There is a land conflict between https://github.com/pytorch/pytorch/pull/149161 and https://github.com/pytorch/pytorch/pull/147105. We just need to update the APIs used in two new unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149401
Approved by: https://github.com/ZainRizvi
2025-03-18 15:02:01 +00:00
6e2b2660b9 Make numpy check optional (#149356)
We may want to skip the numpy smoke tests, hence making them optional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149356
Approved by: https://github.com/ZainRizvi
2025-03-18 15:00:01 +00:00
bc88f6faa1 Use TorchVersion for triton version check (#149136)
Follow-up after https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321: use TorchVersion for triton version parsing.
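The pattern looks roughly like this (version threshold illustrative; assumes triton is installed):

```python
from torch.torch_version import TorchVersion

import triton

if TorchVersion(triton.__version__) >= "3.0.0":  # threshold is illustrative
    pass  # enable features that require a newer triton
```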

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 13:48:46 +00:00
b06b5c3e27 [ROCm] Use alternate mirror for drm repo (#149380)
Fixes an issue with building ROCm manywheel and libtorch images, e.g. https://github.com/pytorch/pytorch/actions/runs/13887711267/job/38854659005#step:4:8328

```
#53 2.832 Cloning into 'drm'...
#53 2.849 fatal: unable to access 'https://gitlab.freedesktop.org/mesa/drm.git/': The requested URL returned error: 503
#53 2.851 ./install_rocm_drm.sh: line 29: pushd: drm: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149380
Approved by: https://github.com/jeffdaily
2025-03-18 13:33:25 +00:00
6055a4f612 Refresh benchmark results. (#149347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347
Approved by: https://github.com/jamesjwu
2025-03-18 08:53:49 +00:00
9b92828d4b Add batch dim sharding rule to sdpa (#149253)
This is a trivial rule that isn't needed in most cases, but if we want to consider that the input data is actually `Shard(0)` (instead of `Replicated()`, as is currently assumed), then we need this rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
2025-03-18 07:54:02 +00:00
9cd52da45c [MPS/inductor] Add support for modified_bessel_i1. (#149379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379
Approved by: https://github.com/malfet
2025-03-18 06:02:33 +00:00
6c2db8fab0 Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)
This enables qint8 and quint8 add for AArch64 through the Arm Compute Library (ACL) directly.
The relative performance improvement is ~15x with OMP_NUM_THREADS=1 and ~5.4x with OMP_NUM_THREADS=32.
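The eager path this accelerates looks like the following sketch (scales and zero points are arbitrary):

```python
import torch

a = torch.quantize_per_tensor(torch.randn(1024), scale=0.1, zero_point=0, dtype=torch.qint8)
b = torch.quantize_per_tensor(torch.randn(1024), scale=0.1, zero_point=0, dtype=torch.qint8)

# qint8/quint8 add now dispatches to Arm Compute Library directly on AArch64.
out = torch.ops.quantized.add(a, b, 0.1, 0)
print(out.dequantize()[:4])
```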

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585
2025-03-18 05:38:39 +00:00
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of the template arguments, which closely follows ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Unit tests are populated by copying the `upsample_bilinear2d_aa` tests and reusing them for `upsample_bicubic2d_aa`.

At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and its size: 3x3 for bilinear, 5x5 for bicubic.
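From the user side, the newly supported op is reachable via `F.interpolate` (sketch; requires an MPS device):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 512, 512, device="mps")
# Anti-aliased bicubic downsample; previously unimplemented on MPS.
y = F.interpolate(x, size=(128, 128), mode="bicubic", antialias=True)
```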
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
dea7157160 nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)
Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
2025-03-18 05:23:18 +00:00
b8f91bcb14 [pt2_provenance_tracking] add support for cpp kernel (#149185)
Summary:
As title.

Add the inductor cpp kernel to the post-grad graph node mapping,
plus a unit test.

Context:
Raised as a feature request for the AOTI CPU case.

https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/

Differential Revision: D71181284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185
Approved by: https://github.com/jingsh
2025-03-18 04:43:07 +00:00
7869196482 Fix torchbind schema str generation (#149239)
Summary: Fix Torchbind HOP schema generation when there's no input

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
```

Differential Revision: D71231164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239
Approved by: https://github.com/zou3519
2025-03-18 04:29:56 +00:00
bca75fe97a [MAIA] [Autocast] Enable autocast on MAIA device (#148511)
Fixes #148510.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511
Approved by: https://github.com/albanD
2025-03-18 03:46:22 +00:00
c43e35d6f7 [MPS] Implement support for modified_bessel_i1 in eager. (#149368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 03:29:10 +00:00
bb42e4d137 [AOTInductor] Add function to free buffer (#149161)
Summary:
We add a function that allows users to free the unused buffer.

Test Plan:
Testing correctness:
    python test/inductor/test_aot_inductor.py -k free_inactive

Testing memory consumption:
    LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
    /home/$USER/local/pytorch/build/bin/test_aoti_inference


Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78, https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
cccdf860e2 [BE] Add STABLE_LIBRARY test for multiple returns (#149230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149230
Approved by: https://github.com/albanD, https://github.com/zou3519
ghstack dependencies: #149052
2025-03-18 02:40:54 +00:00
988827cdfb Use schema as source of truth + support ones_like/empty_like (#149052)
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important because IValue types are overloaded and can convert ambiguously and incorrectly. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
2025-03-18 02:40:54 +00:00
ebabd0efdd [ONNX] Expose verification utilities (#148603)
Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms
2025-03-18 02:10:34 +00:00
c36ac16da1 [Inductor] optimize welford reduction (#145061)
Fix https://github.com/pytorch/pytorch/issues/141541.
Fix https://github.com/pytorch/pytorch/issues/142839.
Fix https://github.com/pytorch/pytorch/issues/143182.

**Summary:**
In order to fix the issue that the accuracy of welford reduction is not good enough, we follow the eager implementation and combine the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use Welford algorithm to compute mean and variance.
2. Use cascade summation when computing the sum over the input, for both mean and variance (see the reference sketch below).
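For reference, a plain-Python sketch of the two ingredients (the real implementation is the generated C++ shown below; this is only to show the math):

```python
def welford_combine(s1, s2):
    """Combine two Welford states (count, mean, m2) exactly (Chan et al.)."""
    n1, mean1, m2_1 = s1
    n2, mean2, m2_2 = s2
    n = n1 + n2
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return (n, mean, m2)

def welford_cascade(xs):
    """Tree-reduce per-element states pairwise, like cascade summation."""
    states = [(1, float(x), 0.0) for x in xs]
    while len(states) > 1:
        merged = [welford_combine(a, b) for a, b in zip(states[::2], states[1::2])]
        if len(states) % 2:
            merged.append(states[-1])  # carry the odd state forward
        states = merged
    n, mean, m2 = states[0]
    return mean, m2 / n  # mean and (population) variance
```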

I ran the Inductor benchmark with this PR on CPU; no performance gains or regressions were seen.

**Example:**
Take https://github.com/pytorch/pytorch/issues/141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
c_model = torch.compile(model)
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- before
```
tensor(7.0095e-05)
False
```
- After
```
tensor(9.5367e-07)
True
```

- on CUDA
```
tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                        }
                    }
                }
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L));
                static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L));
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                            tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0);
                        }
                    }
                }
                tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0);
                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0);
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
                out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
                        auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
                        auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
                        auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
                        auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
                        auto tmp2 = at::vec::Vectorized<float>(tmp1);
                        auto tmp3 = tmp0 - tmp2;
                        auto tmp5 = static_cast<float>(2097152.0);
                        auto tmp6 = tmp4 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                        auto tmp9 = 1 / std::sqrt(tmp8);
                        auto tmp10 = at::vec::Vectorized<float>(tmp9);
                        auto tmp11 = tmp3 * tmp10;
                        auto tmp13 = at::vec::Vectorized<float>(tmp12);
                        auto tmp14 = tmp11 * tmp13;
                        auto tmp16 = at::vec::Vectorized<float>(tmp15);
                        auto tmp17 = tmp14 + tmp16;
                        tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2025-03-18 02:05:35 +00:00
1096443467 Use torch_compile_options for c10 libraries (#147821)
c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:54:23 +00:00
60523540f1 Force build to conform C++ standard on windows by adding /permissive- flag (#149035)
Fixes #147366

1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.

The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170

From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks),
>  By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.

Thus, it is reasonable to add this flag to the existing project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-03-18 01:51:46 +00:00
c1dd75e4dc Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031)
**Summary**
The previous implementation of the shim did not align with the design, and it was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the files of the MKLDNN backend and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-18 01:33:13 +00:00
425c6d8eba Replace c10::is_pod with std::is_trivial (#149286)
These remaining c10::is_pod calls can be replaced without compromising the semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286
Approved by: https://github.com/zou3519
2025-03-18 01:33:01 +00:00
f9a787224c [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves readability on recompilations; earlier, the recompile message would just show the `id`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-18 01:25:37 +00:00
186cc7327c [MPS/BE] Remove decorator that skipped test on macOS 12. (#149365)
macOS 12 is not really supported anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
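For reference, this is the kind of rewrite ruff's PERF403 (manual dict comprehension) suggests, shown on a made-up example:

```python
pairs = [("a", 1), ("b", 2)]

# Before (flagged by PERF403):
d = {}
for k, v in pairs:
    d[k] = v

# After:
d = {k: v for k, v in pairs}
```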

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
811f587d86 [MPS/BE] @parametrize generation of pointwise_ops. (#149363)
Makes this less error-prone and reduces duplication.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149363
Approved by: https://github.com/malfet
2025-03-18 00:37:43 +00:00
9a78513c3c [AOTI] Update test runner to use the new APIs (#147105)
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; we will follow up with internal test refactoring.

Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
b52a8bef01 Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)"
This reverts commit 5905bbe745b0acb4909243c93014c0e6f3512c2d.

Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949))
2025-03-18 00:10:50 +00:00
46226a90c8 [EZ][BE] Remove cross-compilation options from mac-build.yml (#149237)
It has long been gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149237
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-03-17 23:50:31 +00:00
523bffd388 cd: Add no-cache for test binaries (#149218)
This is to make it so that we don't experience issues like https://github.com/pytorch/vision/actions/runs/13861462856/job/38795684317#step:13:212

```
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 8e34a6f02ac5a63763251953063a19ba9df855ac2c8a13ef409dfef708e2ba26
             Got        341156cc5067488565c1e103be6e95105b0fc0d87d8ac24ff8891f63fd33216f
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149218
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2025-03-17 23:26:20 +00:00
37c914ca0c fix simple-spec crash (#147723)
Found an issue while running `python torchgen/fuse/gen_patterns.py`.

Exact error:
```shell
Traceback (most recent call last):
  File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
    joint_graph.lazy_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
    result = fn()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
    _pad_mm_init()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
    gen_register_replacement(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
    pat = _serialize_pattern(
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
    file_template = get_file_template()
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
    if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
  File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```

This PR fixes this issue.
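
A sketch of one way to make such a check robust (illustrative; the actual fix in pattern_matcher.py may differ):

```python
def safe_issubclass(attr, bases):
    # issubclass() can raise TypeError even when isinstance(attr, type) is
    # true for exotic "types" (e.g. some typing constructs), so guard it.
    try:
        return isinstance(attr, type) and issubclass(attr, bases)
    except TypeError:
        return False
```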

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-03-17 23:25:48 +00:00
78715a181f Convert Tensor lr to 0-dim as needed for the optimizer to normally work (#145674)
Fixes #145461
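
A minimal sketch of the scenario being fixed (hypothetical values; assumes a build with Tensor-lr support):

```python
import torch

model = torch.nn.Linear(4, 4)
# A 1-element, 1-dim Tensor lr is now converted to 0-dim as needed,
# so the optimizer behaves the same as with a scalar lr.
opt = torch.optim.SGD(model.parameters(), lr=torch.tensor([1e-2]))
model(torch.randn(2, 4)).sum().backward()
opt.step()
```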

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-17 23:07:05 +00:00
1157367c78 [AOTInductor] [BE] Add macro for loading symbols in aoti runner (#149249)
Summary:
Add macro for loading symbols in aoti runner

Test Plan:
Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249
Approved by: https://github.com/chenyang78
2025-03-17 23:02:01 +00:00
24cfeec2c7 Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)"
This reverts commit bfee141666319c80b6c5284394905beef8682515.

Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1 ([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812))
2025-03-17 22:57:00 +00:00
afa1eda901 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626))
2025-03-17 22:43:15 +00:00
a16ada41b9 Fix outdated docstring of torch.export.export regarding strict flag (#149077)
Summary: Fix outdated docstring of torch.export.export regarding strict flag

Test Plan: None, doc only change

Differential Revision: D71068215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime, since we have different logic to pass in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false —-predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
a3c6e3139a allow extra args for parameterization of tests in inductor (#149154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154
Approved by: https://github.com/amjames, https://github.com/eellison
2025-03-17 22:05:06 +00:00
e4f6e4ac84 [MPS] Add inductor support for modified_bessel_i0. (#149342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
8bc7bd94a5 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch exemplifies its use for input tensors with types (float,bfloat16) when functor type is float(float,float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-17 20:51:36 +00:00
e8dd58b8cf cpp_wrapper: Precompile device-specific header files (#146928)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests.  Total OpInfo test runtime is down about 2x from this change alone.

Relands #144002, with changes needed by fbcode internals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
5e9f792479 [ROCm] Unskip flex attention UTs after triton 3.3 bump (#148327)
Enable `test_flex_attention.py::TestLearnableBiases` unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327
Approved by: https://github.com/jeffdaily
2025-03-17 20:15:14 +00:00
6c7d8419e3 fix two accuracy regression (#149172)
There are 2 accuracy regressions in the 3/12 nightly perf run. I cannot repro them locally, thus there is no effective way to bisect. Raise the tolerance to make them pass the accuracy check.

- error log for HF MegatronBertForQuestionAnswering https://gist.github.com/shunting314/25322b66e15e98feed32e0d9a1e43316
- error log for TIMM gluon_inception_v3 https://gist.github.com/shunting314/df64ce22327df27a7057bbbd19ef5164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149172
Approved by: https://github.com/jansel, https://github.com/eellison
2025-03-17 19:34:00 +00:00
769f19bf95 [MTIA] Add _mtia_exchangeDevice to MTIA module (#149322)
Summary: The FlexAttention path uses `_exchange_device`, so it will be needed eventually for MTIA as well.

Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_exchange_device`

Reviewed By: chaos5958

Differential Revision: D70072059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149322
Approved by: https://github.com/chaos5958
2025-03-17 19:31:10 +00:00
8d7c430e84 Symintify transpose_ (#149057)
Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi
2025-03-17 19:11:54 +00:00
08a644a4c4 Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)
This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context lengths of 2^3 up to 2^9) of ~**50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (`u8s8u8`)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Specify quantization configurations
    model.qconfig = qconfig
    # Prepare the model for static quantization
    model_prepared = torch.quantization.prepare(model)

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-17 18:21:10 +00:00
c41c2130be Fix printing INT64_MIN (#149148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149148
Approved by: https://github.com/anijain2305
2025-03-17 17:57:18 +00:00
8cdb9adc05 do not run test_ck_blas_library on cpu (#148316)
Fix on non-rocm:

```
root@e01-tw-ue5g2g3sap6:~/pytorch/test# python test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu
E
======================================================================
ERROR: test_ck_blas_library_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 480, in instantiated_test
    raise rte
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 460, in instantiated_test
    result = test(self, **param_kwargs)
  File "/root/pytorch/torch/testing/_internal/common_device_type.py", line 1242, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/root/pytorch/torch/testing/_internal/common_utils.py", line 1981, in _fn
    fn(*args, **kwargs)
  File "/root/pytorch/test/test_linalg.py", line 8621, in test_ck_blas_library
    torch.backends.cuda.preferred_blas_library('ck')
  File "/root/pytorch/torch/backends/cuda/__init__.py", line 258, in preferred_blas_library
    torch._C._set_blas_preferred_backend(_BlasBackends[backend])
RuntimeError: Cannot set preferred backend to Ck if PyTorch has not been compiled for ROCm.

To execute this test, run the following from the base repo dir:
    python test/test_linalg.py TestLinalgCPU.test_ck_blas_library_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.346s

FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148316
Approved by: https://github.com/jeffdaily
2025-03-17 17:45:45 +00:00
224cd9f055 [ez] Flush trymerge print statements (#149012)
Logs of trymerge don't match up with timestamps, ex
https://github.com/pytorch/pytorch/actions/runs/13766246347/job/38493307591
Ex:
```
2025-03-10T14:20:41.4899509Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (0.003460856278737386 minutes elapsed)
...
2025-03-10T14:20:41.4907867Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 16 jobs to finish, first few of them are: Check Labels / Check labels, trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build. Retrying in 5 min
2025-03-10T14:20:41.4909772Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (5.280085611343384 minutes elapsed)
...
2025-03-10T14:20:41.4916812Z Merge of https://github.com/pytorch/pytorch/pull/148648 failed due to: Still waiting for 15 jobs to finish, first few of them are: trunk / macos-py3-arm64 / build, trunk / win-vs2022-cpu-py3 / build, trunk / cuda12.4-py3.10-gcc9-sm80 / build, trunk / win-vs2022-cuda12.6-py3 / build, trunk / linux-focal-cuda12.6-py3.10-gcc11-no-ops / build. Retrying in 5 min
2025-03-10T14:20:41.4918183Z Attempting merge of https://github.com/pytorch/pytorch/pull/148648 (10.590279157956441 minutes elapsed)
```

Either print buffering or GitHub Actions log handling is being weird?

Print with flush to see if it helps.
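
The change is essentially this (illustrative values):

```python
pr_url = "https://github.com/pytorch/pytorch/pull/148648"  # example values
elapsed = 5.28
# flush=True defeats stdout buffering so the GHA timestamp matches emission time.
print(f"Attempting merge of {pr_url} ({elapsed} minutes elapsed)", flush=True)
```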
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149012
Approved by: https://github.com/malfet
2025-03-17 17:04:48 +00:00
aaa4c3d60b [mm_logs] make aten mm info readable (#148800)
Summary:
As title: render the aten mm info as a readable table, e.g. (also see the pic in the test plan):

| Name    | M  | N | K  | Count |
|---------|----|---|----|-------|
| aten.mm | 16 | 6 | 16 | 1     |

...

Test Plan: {F1975907876}
<img width="1090" alt="Screenshot 2025-03-11 at 3 13 00 PM" src="https://github.com/user-attachments/assets/ffae8c56-e32c-49cc-bbfb-5b8d216b8657" />

Differential Revision: D70825664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148800
Approved by: https://github.com/henrylhtsang
2025-03-17 17:00:58 +00:00
2a011ca904 [ROCm] testing: enable MEFF/FA unittests for gfx1100 (#148911)
Include gfx1100, and optionally enable gfx1201/gfx950 according to env var TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148911
Approved by: https://github.com/jeffdaily
2025-03-17 16:41:15 +00:00
9d37b501db Revert "[ROCm] enable HIPMallocAsyncAllocator (#149145)"
This reverts commit 2e02c07a5d1c432547542f90de2885be9ffd13cf.

Reverted https://github.com/pytorch/pytorch/pull/149145 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, might you be able to help get this PR landed? See D71214814 for more details on the failure. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149145#issuecomment-2730104736))
2025-03-17 16:17:02 +00:00
c7c3e77324 Refine XPU oneDNN context manager API (#147349)
# Motivation
This PR introduces improvements to the XPU oneDNN context manager API:

- `GpuEngineManager::get_engine`: Added a new API that accepts a `DeviceIndex` to simplify code and improve usability - by default, using the current device index.
- `GpuStreamManager::get_stream`: Now explicitly requires a `DeviceIndex` as input to ensure correctness and consistency - by default, using the current device index.

Additionally, it enhances integration with `c10::DeviceGuard`, ensuring correct device management.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147349
Approved by: https://github.com/EikanWang
2025-03-17 14:45:56 +00:00
790f93db3a Update slow tests (#149300)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149300
Approved by: https://github.com/pytorchbot
2025-03-17 11:39:29 +00:00
b2862f1435 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed in the range [N, C, *],
3. out = out * weight + bias, computed in the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, computed in the range [N, C],
3. out = x * new_weight + new_bias, computed in the range [N, C, *].

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) see about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) see about a 2% performance improvement. On A100, no performance gains or regressions were seen.
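
A plain-PyTorch sketch of the reordered decomposition (illustrative, not the actual decomposition code):

```python
import torch

def group_norm_decomp(x, num_groups, weight, bias, eps=1e-5):
    # x: [N, C, *]; weight, bias: [C]; C must be divisible by num_groups.
    N, C = x.shape[:2]
    g = x.reshape(N, num_groups, -1)
    mean = g.mean(dim=-1)                    # [N, num_groups]
    rstd = torch.rsqrt(g.var(dim=-1, unbiased=False) + eps)
    # Fold weight/bias into per-(N, C) affine terms first (the cheap range),
    # so the expensive [N, C, *] range is a single multiply-add.
    mean_c = mean.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    rstd_c = rstd.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    new_weight = rstd_c * weight
    new_bias = bias - mean_c * rstd_c * weight
    shape = (N, C) + (1,) * (x.dim() - 2)
    return x * new_weight.reshape(shape) + new_bias.reshape(shape)
```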

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-17 09:27:01 +00:00
1cc5f6b623 Optimize MaxPool1d param ceil_mode description (#148869)
Fixes #148123

Add output shape formula based on `ceil_mode` value, according to

00199acdb8/aten/src/ATen/native/Pool.h (L61-L75)
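
In code, that formula looks roughly like this (a Python transcription of `pooling_output_shape` from the linked Pool.h, as a sketch):

```python
import math

def maxpool1d_out_len(L_in, kernel_size, stride, padding=0, dilation=1, ceil_mode=False):
    num = L_in + 2 * padding - dilation * (kernel_size - 1) - 1
    L_out = (math.ceil if ceil_mode else math.floor)(num / stride) + 1
    # With ceil_mode, the last window must start inside the (left-padded) input.
    if ceil_mode and (L_out - 1) * stride >= L_in + padding:
        L_out -= 1
    return L_out

# e.g. MaxPool1d(kernel_size=2, stride=2) on length 5:
print(maxpool1d_out_len(5, 2, 2))                  # 2 (floor)
print(maxpool1d_out_len(5, 2, 2, ceil_mode=True))  # 3 (ceil)
```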

## Test Result

### Before

![image](https://github.com/user-attachments/assets/0a175178-a104-4348-a14b-516e866d533a)

### After

![image](https://github.com/user-attachments/assets/ce621d4b-1986-41fb-bd71-2b03c0aa996e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148869
Approved by: https://github.com/mikaylagawarecki
2025-03-17 08:50:40 +00:00
916e8979d3 Skip some tests not using gradcheck on slowgradcheck (#149220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149220
Approved by: https://github.com/seemethere
2025-03-17 00:34:52 +00:00
6048d88afe [ARM64][CUDA] skip string pattern matching in test_workspace_allocation_error (#149236)
`unwind()` on ARM64 seems to elide the strings of interest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149236
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/BoyuanFeng
2025-03-17 00:30:43 +00:00
bfee141666 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-16 23:52:58 +00:00
6b1b95ad2a Support subclass constructor capturing in export (#147014)
Notable TODOs:
1. Need to implement AutogradHOP to get rid of subclasses before serializing
2. Need to implement mechanism to figure out what subclasses will be used in export when they are not expressed in the inputs

Differential Revision: [D69640673](https://our.internmc.facebook.com/intern/diff/D69640673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147014
Approved by: https://github.com/bdhirsh
2025-03-16 18:19:19 +00:00
5905bbe745 [dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None (#149228)
Doing this removes the need to collect `id` and therefore facilitates serialization. It also improves readability on recompilations; earlier, the recompile message would just show the `id`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-16 15:56:17 +00:00
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
f80bee4934 [MPS][BE] Move common binary ops macros to indexing.h (#149263)
And binary op invocation logic to OperationUtils.mm

This is a no-op change, additional sanity checks/logic improvements will be added as followups
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149263
Approved by: https://github.com/dcci
ghstack dependencies: #149262
2025-03-16 02:06:40 +00:00
21c2edfec8 [MPS/metal] Add missing inline to function definitions. (#149265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149265
Approved by: https://github.com/malfet
2025-03-16 00:33:27 +00:00
3e2c4086ad [EZ][BE] Reuse result_of from c10/metal/utils.h (#149262)
No need for one more implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149262
Approved by: https://github.com/dcci
2025-03-16 00:21:28 +00:00
acf42b0048 Fix memory leak in subproc_pool future (#149259)
Summary: The future holds a reference to the callback, and the callback captures the outer future. This seems to create a cycle that the garbage collector doesn't clean up. Verified by compiling 15k synthetic Triton kernels and observing that subprocess memory overhead improves.
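
A stripped-down illustration of this kind of cycle (not the actual subproc_pool code):

```python
from concurrent.futures import ThreadPoolExecutor

def submit_with_cycle(pool):
    fut = pool.submit(lambda: 42)
    # `fut` holds the callback, and the callback's closure holds `fut`:
    # a reference cycle that keeps everything reachable from it alive
    # until the (non-deterministic) cycle collector runs.
    fut.add_done_callback(lambda _: print(fut.result()))
    return fut

with ThreadPoolExecutor() as pool:
    submit_with_cycle(pool).result()
```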

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149259
Approved by: https://github.com/Skylion007
2025-03-15 20:26:30 +00:00
a9c55277d7 [Reland] First version of statically compiled launcher for triton compiled CUDA kernels (#149238)
This is a new version of https://github.com/pytorch/pytorch/pull/148561 fixing the ROCM test failure

Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Probably lots of features of the triton C++ generated code that I haven't handled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149238
Approved by: https://github.com/oulgen
2025-03-15 15:06:46 +00:00
c83c711da8 Remove some memory overhead in parallel compile workers (#149168)
Summary: The parallel compile workers are holding on to more memory than they need to because they're loading the compiled modules into memory. Update the post-fork initializer to record when in a subprocess and skip some of the unnecessary overhead.

Test Plan: Ran a test script to compile 15k Triton kernels and used tracemalloc in the subprocs to investigate the overhead. On my devgpu:
* After importing torch in a subproc: 371M
* Without this PR, after compiling 15k kernels: 825M
* With this PR, after compiling 15k kernels: 531M

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149168
Approved by: https://github.com/jansel
2025-03-15 14:20:40 +00:00
e7e477c1f9 Not generate custom obj json when it's empty (#149246)
Summary: as title.

See internal Diff summary for more context.

Test Plan: buck run @fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r config_not_generated

Differential Revision: D71241676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149246
Approved by: https://github.com/houseroad

Co-authored-by: Huamin Li <huaminli@meta.com>
2025-03-15 13:00:48 +00:00
4482a65fef Add side_effect to avoid dce custom op in CA graph (#149181)
We found that in compiled_autograd, when defining a custom op, the custom op would be DCE'd in the backward graph. We added a side-effect condition to the DCE function to prevent eliminating custom ops with side effects in the CA graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149181
Approved by: https://github.com/xmfan
2025-03-15 04:15:49 +00:00
115fc98cc0 Migrate aten.split.Tensor from using Sharding Rule to Sharding Strategy (#149106)
Summary:
Use Sharding Strategy for aten.split.Tensor instead of sharding rule

Test Plan:
pytest test/distributed/tensor/test_dtensor_ops.py -s -k split

Reviewers:
xilunwu


Pull Request resolved: https://github.com/pytorch/pytorch/pull/149106
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-03-15 04:03:40 +00:00
740ce0fa5f op should NOT be static in aoti_torch_call_dispatcher (#149208)
aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet
2025-03-15 01:47:11 +00:00
578160c875 [ca] don't inline accumulate grad op (#149014)
We use dummy tensors in our initial trace, so we should never inline: the subclass dispatch might not support the dummy tensor, e.g. DTensor accumulate grad will check that both param and grad are DTensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149014
Approved by: https://github.com/jansel
ghstack dependencies: #149064
2025-03-15 01:10:54 +00:00
f4368d8872 [ca] clean up aot node deduping (#149064)
Rename the AOT nodes as we copy-paste them into the CA graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149064
Approved by: https://github.com/jansel
2025-03-15 01:10:54 +00:00
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No op change, just splits a longer tests into a series of a smaller ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
1c7196f04b Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394)
Refiling https://github.com/pytorch/pytorch/pull/148387 from pytorch repo branch to get AWS login via OIDC working

Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535
Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460
![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032)
Run with cached docker image:
![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813)
~6 min vs 3 s :)

Thanks @saienduri for the help on the MI300 infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394
Approved by: https://github.com/jeffdaily
2025-03-15 00:34:04 +00:00
9ad6265d04 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)
Omitting model_container_runner_xpu.cpp from the build causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel
2025-03-15 00:30:04 +00:00
7537b19c73 [FSDP2] Update ignored_params docstring and add unit test (#149074)
Fixes https://github.com/pytorch/pytorch/issues/148242

ignored_params won't be moved to devices in full_shard(); update the docstring accordingly.
Add unit test `test_move_states_to_device_ignored_param_device` to show that ignored_params won't be moved during full_shard(), but will be after `model.cuda()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149074
Approved by: https://github.com/awgu
2025-03-15 00:23:09 +00:00
09f7f62cfe Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and clang promote it correctly; this resolves the compatibility issue with ARMv8-A while SVE continues to work as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
2025-03-15 00:02:38 +00:00
08af311fc2 [MPS] Fix type promotion for torch.floor_divide (#149233)
Also deletes some duplicated glue code by relying on the stub.
After this change, `torch.arange(10, device='mps') // torch.arange(10., device='mps')` will return a tensor of floats, which is the common dtype for a float + integral operation, rather than a tensor of ints.
Checked by `test_div2` inductor testing
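
Concretely (requires an MPS device):

```python
import torch

a = torch.arange(10, device="mps")    # int64
b = torch.arange(10., device="mps")   # float32
print((a // b).dtype)                 # torch.float32 after this change, not int64
```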

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149233
Approved by: https://github.com/atalman
ghstack dependencies: #149216
2025-03-15 00:00:42 +00:00
eb7bf4202d Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
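
For example, models shaped like this (a made-up module) should no longer trip the dynamism checks:

```python
import torch

class Model(torch.nn.Module):
    @property
    def fancy_config(self):
        # Accessing this property raises; dynamism inspection must tolerate it.
        raise NotImplementedError

m = Model()
```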

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-14 23:38:19 +00:00
ff58ccec6c [ATen-CPU] Add math.h for Gelu (#149164)
Summary:
## Context

This PR is mostly to enable ExecuTorch build for Windows: https://github.com/pytorch/executorch/pull/9198

In ExecuTorch, the optimized GeLU kernel calls the ATen implementation. However, on Windows `math.h` needs to be included with `#define _USE_MATH_DEFINES` in order for math constants to be defined.

Test Plan:
Rely on CI to make sure existing tests do not break. Tested separately with ExecuTorch to make sure Windows build is successful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149164
Approved by: https://github.com/swolchok
2025-03-14 23:37:25 +00:00
f9b4856989 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit c95a6b416b4d1b830535f82e2719c055d077cbad.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @zou3519 can you please help land this internally? See the sigmoid tests in D71198793 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2725982539))
2025-03-14 23:13:34 +00:00
643aaea133 Revert "[RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)"
This reverts commit 5a843f8973d7fc6a601f089fc969d2a5ac7e5338.

Reverted https://github.com/pytorch/pytorch/pull/148561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/148561#issuecomment-2725969268))
2025-03-14 23:01:26 +00:00
05f2cbfe19 Add meta function for out variants of ones,zeros,empty (#149098)
Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832

For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions.

For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098
Approved by: https://github.com/williamwen42
2025-03-14 22:17:30 +00:00
d7d9a71e19 [MPSInductor] Add support for atan2 (#149216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149216
Approved by: https://github.com/dcci
2025-03-14 21:53:03 +00:00
dd6e9df3d0 [MPS] fix attention enable_gqa crash on mps (#149147)
Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet
2025-03-14 21:25:54 +00:00
0bd863a62f [MPS] Add inductor support for i1e. (#149221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149221
Approved by: https://github.com/malfet
2025-03-14 21:18:38 +00:00
a0893475ba Enable oneDNN dispatch for gemm bf16bf16->bf16 (#148197)
Currently, `linear` layers using BF16 are dispatched to OpenBLAS, provided that sbgemm_ is available.
However, profiling on AArch64 shows that dispatching to oneDNN results in a significant speedup. This PR updates the dispatch logic to leverage oneDNN for improved performance.

Attaching some benchmark results. Instance: Neoverse-V1, on 16 threads.

<img width="482" alt="Screenshot 2025-02-28 at 17 18 38" src="https://github.com/user-attachments/assets/b84e7455-af6e-417f-920d-bdd2bec2e8f9" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148197
Approved by: https://github.com/malfet
2025-03-14 20:58:24 +00:00
1bdbf12672 Update as strided doc (#149146)
Make it clearer why it is not recommended to use it and when the resulting Tensor will have undefined behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149146
Approved by: https://github.com/gchanan, https://github.com/jbschlosser
2025-03-14 19:49:57 +00:00
69aeb87eca Update error message in get_backend() with more detail (#141796)
When attempting to reconfigure the environment without properly handling the PyTorch-related settings, you may encounter the following message.
```
/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1215 in get_backend

   1212     if _rank_not_in_group(pg):
   1213         raise ValueError("Invalid process group specified")
   1214     pg_store = _world.pg_map[pg] if pg in _world.pg_map else None
 ❱ 1215     return Backend(not_none(pg_store)[0])
   1216
   1217
   1218 def _get_process_group_uid(pg: ProcessGroup) -> int:

/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/torch/utils/_typing_utils.py:13 in not_none

   10
   11 def not_none(obj: Optional[T]) -> T:
   12     if obj is None:
 ❱ 13         raise TypeError("Invariant encountered: value was None when it should not be")
   14     return obj
   15

TypeError: Invariant encountered: value was None when it should not be
Exception ignored in: <function Vllm.__del__ at 0x7f35f96b6dd0>
```
Since this message can confuse developers, this PR adds more detail to the error message to clarify the situation.
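
A minimal sketch of the situation (hedged: the exact exception type and text depend on the PyTorch version and on how the process group state was torn down):

```python
import torch.distributed as dist

# Calling get_backend() against a process group whose state was never set up
# (or was already destroyed) surfaces the opaque TypeError above; this PR
# makes the message point at the missing/invalid process group instead.
try:
    dist.get_backend()
except (ValueError, TypeError) as e:
    print(type(e).__name__, e)
```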

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141796
Approved by: https://github.com/kwen2501
2025-03-14 19:42:42 +00:00
5e79b61e8a add PrivateUse1 backend in fsdp collectives (#147260)
add PrivateUse1 backend in fsdp collectives

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147260
Approved by: https://github.com/weifengpy
2025-03-14 19:41:41 +00:00
fe01af2242 [AOTI][debug logger] small fix for intermediate value debugger for jit when arg is not tensor (#149007)
repro:
```
import torch
import torch._inductor.config as config

config.aot_inductor.debug_intermediate_value_printer = "2"
config.aot_inductor.filtered_kernel_names = "triton_poi_fused__to_copy_add_0"

class Model(torch.nn.Module):
    def forward(self, x):
        x = x.to(torch.float)
        return x + 1

model = Model().cuda()
x = torch.randn(10).cuda().to(torch.float8_e4m3fn)
_ = torch.compile(model, fullgraph=True)(x)

print("done")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149007
Approved by: https://github.com/jingsh
2025-03-14 19:40:41 +00:00
c96ed7e6f5 [BE]: No include left behind - recursive glob setuptools support (#148258)
Fixes #148256
Test plan: check the printout from the setup.py build and verify the files are still included.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148258
Approved by: https://github.com/malfet, https://github.com/benjaminglass1
2025-03-14 19:39:21 +00:00
9d7945e382 [EZ] Fix typo in UnaryOps.mm (#149217)
s/imput/input/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149217
Approved by: https://github.com/ZainRizvi, https://github.com/dcci
2025-03-14 19:31:20 +00:00
a7f8de2198 Add nn.Bilinear param validation (#149018)
Fixes #103425

## Changes

- Add doc description size value `must be > 0`
- Add validation for `in1_features` param

Currently, only `in1_features` causes a runtime error; adding checks for `in2_features` and `out_features` as well might be somewhat BC-breaking.

```python
import torch
from torch import nn

class lenet(nn.Module):
    def __init__(self):
        super(lenet, self).__init__()
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1)

        # Raises an error; `in1_features=1, in2_features=0, out_features=0` raises no error
        self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0)

    def forward(self, x):
        # 1st block
        x = self.conv(x)
        x = self.linear(x)

        return x

if __name__ == '__main__':
    net = lenet()

```

## Test Result

```bash
pytest test/test_nn.py -k test_bilinear -vv
```

![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43)

![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018
Approved by: https://github.com/mikaylagawarecki
2025-03-14 19:26:12 +00:00
5a843f8973 [RFC] First version of statically compiled launcher for triton compiled CUDA kernels (#148561)
Putting this up for a first pass review, though I will likely make a bunch of changes before landing to add more features, etc.

This diff implements a first version of a static CUDA kernel launcher in `torch._C`. The goal here is to take a cubin file and some metadata from a CompiledKernel from `triton`, and launch the cubin file directly.

Background doc: https://docs.google.com/document/d/1rjRcHl6MfauHG30nCoQX-9UKvKyIs4WWMy_GsGyqb9g/edit?tab=t.0#heading=h.ut5lf39lzq66

Normally, using triton's CompiledKernel.make_launcher(), we would pay the cost of codegenning C++ and running it at compile time. With this new approach, we can use one statically compiled library to launch the kernel.

The tradeoff here is that this new kernel launcher will not be able to use codegen to deal with different lengths/types of arguments. So we use templating to handle up to 10 arguments for now. We also allocate 8 bytes on the stack per argument no matter the argument type, which can take more memory than codegenning. On the other hand, we improve compile time on cold and warm start by not having to call the C++ compiler at all.
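
As a rough illustration of the no-codegen approach (not this PR's implementation, which lives in `torch._C`): the CUDA driver API can load a cubin and launch a kernel directly, staging each argument in a fixed-size slot. The cubin path, kernel name, and argument values below are hypothetical.

```python
import ctypes

cuda = ctypes.CDLL("libcuda.so")

def check(res):
    if res != 0:
        raise RuntimeError(f"CUDA driver error {res}")

check(cuda.cuInit(0))
dev = ctypes.c_int()
check(cuda.cuDeviceGet(ctypes.byref(dev), 0))
ctx = ctypes.c_void_p()
check(cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev))
check(cuda.cuCtxSetCurrent(ctx))

mod = ctypes.c_void_p()
check(cuda.cuModuleLoadData(ctypes.byref(mod), open("kernel.cubin", "rb").read()))
fn = ctypes.c_void_p()
check(cuda.cuModuleGetFunction(ctypes.byref(fn), mod, b"my_kernel"))

# Mirror the "8 bytes per argument" tradeoff: every arg gets a uint64 slot,
# whether it is a device pointer or a scalar.
arg0 = ctypes.c_uint64(0)     # e.g. a device pointer
arg1 = ctypes.c_uint64(1024)  # e.g. a length
params = (ctypes.c_void_p * 2)(
    ctypes.cast(ctypes.byref(arg0), ctypes.c_void_p),
    ctypes.cast(ctypes.byref(arg1), ctypes.c_void_p),
)
check(cuda.cuLaunchKernel(fn, 4, 1, 1, 256, 1, 1, 0, None, params, None))
check(cuda.cuCtxSynchronize())
```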

This diff does not add the launcher to torch, but introduces a basic test suite.

A list of TODOs that are not yet complete, will do in separate diff:
- Handle `nvTmaDesc` and `cuTensorMap`, which triton handles
- Embed the grid logic instead of passing in gridX,Y,Z. With https://github.com/pytorch/pytorch/pull/147583, we should be able to handle all of the grid logic directly in _StaticCudaLauncher.launch_kernel, and get rid of the python evaluation.
- Handle launch_enter and exit hooks? (Not sure if inductor has these)
- Benchmarking to see if there's runtime performance loss
- Hooking it up with a config to inductor
- Testing harness to test against torch generated triton kernels

Differential Revision: [D69926783](https://our.internmc.facebook.com/intern/diff/D69926783/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148561
Approved by: https://github.com/aorenste, https://github.com/syed-ahmed
2025-03-14 19:12:13 +00:00
97272e4b49 Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient compute condition to match [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable CUDA for the test `test_hardswish_grad_corner`
- Add a test case for value=-3 (see the sketch below)
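
A minimal sketch of the corner case, assuming only the documented piecewise definition of hardswish (the values reported at exactly the boundaries are what this PR aligns with the docs):

```python
import torch
import torch.nn.functional as F

# hardswish is piecewise: 0 for x <= -3, x for x >= 3, x*(x+3)/6 in between,
# so its derivative is 0, 1, and (2x+3)/6 respectively; x = ±3 are the corners.
x = torch.tensor([-4.0, -3.0, 0.0, 3.0, 4.0], requires_grad=True)
F.hardswish(x).sum().backward()
print(x.grad)  # the gradients at exactly -3 and 3 are the corner case under test
```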

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-14 18:53:10 +00:00
2e02c07a5d [ROCm] enable HIPMallocAsyncAllocator (#149145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149145
Approved by: https://github.com/jeffdaily
2025-03-14 18:21:27 +00:00
f2221b2fce [MPS] Add support for i1e (#149203)
Followup after https://github.com/pytorch/pytorch/pull/149174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149203
Approved by: https://github.com/dcci
2025-03-14 17:33:52 +00:00
f067eafabb [MPS] Modify a test to test the correct function. (#149204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149204
Approved by: https://github.com/malfet
2025-03-14 17:27:47 +00:00
42e468d9b0 [MPSInductor] Adjust check_bounds (#147205)
To make the upper bound inclusive, which fixes `test_vectorized_ops_masked` and results in the following code:
```python
mps_lib_0 = compile_mps_shader("""
    #include <c10/metal/random.h>
    #include <c10/metal/special_math.h>
    #include <c10/metal/utils.h>
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = (xindex) % (64);
        int x1 = (xindex) / (64);
        auto tmp5 = in_ptr0[x0 + 63*x1];
        int x2 = xindex;
        auto tmp0 = x0;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 63;
        auto tmp3 = tmp1 < tmp2;
        if (x0 > 63) return;
        auto tmp6 = tmp3 ? tmp5 : 7;
        out_ptr0[x2] = static_cast<float>(tmp6);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147205
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #147211
2025-03-14 17:26:00 +00:00
cyy
a9aae05a6b Remove test decorations on MacOS 12 (#148942)
macOS 12 may have already reached EOL, per https://endoflife.date/macos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148942
Approved by: https://github.com/malfet
2025-03-14 17:22:37 +00:00
f2ea77c099 [MPS] Add inductor support for i0e. (#149180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149180
Approved by: https://github.com/malfet
2025-03-14 16:15:52 +00:00
71795f159e Revert "[AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)"
This reverts commit bea181ff7eeead9fcdd806e286846296c4ab2d67.

Reverted https://github.com/pytorch/pytorch/pull/149167 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D71177501 for the failure. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/149167#issuecomment-2725001232))
2025-03-14 15:16:21 +00:00
706c22549c [MPS] Add support for i0e in eager. (#149174)
Add `special.i0e` to XFAIL_GRADLIST for now, as its backward op is not yet implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-14 14:43:46 +00:00
68bbe20db7 Add test coverage (#149182)
Summary: Follow up from D71160718

Differential Revision: D71177037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149182
Approved by: https://github.com/houseroad
2025-03-14 09:38:29 +00:00
c95a6b416b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq`, which can be used as the registration key for PyStructSequence types, just as `namedtuple` is for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple. Before this PR, only namedtuple classes directly created by `collections.namedtuple` or `typing.NamedTuple` were considered namedtuple classes, while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes (see the sketch after this list).
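
A small standard-library illustration of the distinction drawn above (hedged: `is_structseq`/`is_namedtuple` are the PR's pytree helpers; this only shows the type properties they discriminate on):

```python
import collections
import time

Point = collections.namedtuple("Point", ["x", "y"])

class Point3(Point):
    """A namedtuple subclass; after this PR, is_namedtuple() accepts it too."""

# time.struct_time is a PyStructSequence: a C-level tuple with named fields
# but none of the Python namedtuple machinery (no _fields, no _make).
t = time.localtime()
print(isinstance(t, tuple), hasattr(type(t), "_fields"))  # True False

p = Point3(1, 2)
print(isinstance(p, tuple), p._fields)  # True ('x', 'y')
```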

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-14 08:50:30 +00:00
05ac99042f Clean up grid in execution trace (#149159)
Summary: This diff https://www.internalfb.com/diff/D70471332 removed the input "grid" when calling triton kernels. The PyTorch execution trace needs to make the corresponding change, which includes both capturing and replaying the execution trace (ET).

Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda  -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_with_pt2_cuda

buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Differential Revision: D71152464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149159
Approved by: https://github.com/sraikund16, https://github.com/jansel
2025-03-14 07:12:16 +00:00
be4e6c1c8e Revert "[MPS] Add support for i0e in eager. (#149174)"
This reverts commit b4745db90482ff139ea62d06ec0a18468e1131b7.

Reverted https://github.com/pytorch/pytorch/pull/149174 on behalf of https://github.com/malfet due to MPS are red on trunk ([comment](https://github.com/pytorch/pytorch/pull/149174#issuecomment-2723774600))
2025-03-14 06:35:01 +00:00
e162758051 [MPSInductor] Add bessel_[jy][01] ops (#149179)
By simply calling the corresponding special functions

Followup TODO: tweak bessel_y0 to match CPU implementation for `torch.half` dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149179
Approved by: https://github.com/dcci
ghstack dependencies: #149123
2025-03-14 06:33:30 +00:00
d4496346b9 Update logic when producing key name for keep_original_weights (#149171)
Differential Revision: D71160718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149171
Approved by: https://github.com/houseroad
2025-03-14 05:29:54 +00:00
db6d72213b [MPS] Add torch.special.bessel_[jy][01] implementations (#149123)
By copy-n-pasting functions from
f59064f2b7/aten/src/ATen/native/cuda/Math.cuh (L1463)

With an ugly workaround for `bessel_y[01]` to avoid an internal compiler exception on M1/M2 machines (see FB16863363 / https://gist.github.com/malfet/e7785e4b572e7740887a83a2386ef769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149123
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-03-14 05:13:55 +00:00
e6839819c8 Revert "[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)"
This reverts commit 4f8391db55c8c3a574d61d99d6d6a4a0b6723acb.

Reverted https://github.com/pytorch/pytorch/pull/147527 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally.  @albanD, would you be able to help them land the fixes internally? The error looks really simple. See D71152448 for details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/147527#issuecomment-2723531085))
2025-03-14 05:11:01 +00:00
9e6b2ca58d Fix sympy float priting (#147552)
Fixes https://github.com/pytorch/pytorch/pull/147261
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147552
Approved by: https://github.com/bobrenjc93, https://github.com/cyyever
2025-03-14 05:07:06 +00:00
bea181ff7e [AOTInductor] [BE] Add swap_constant_buffer into pybind for tests. (#149167)
Summary:
We add swap_constant_buffer in pybind to add tests.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_update_inactive_constant_buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149167
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:12:48 +00:00
e567900998 [AOTInductor] Activate CPU test for update_constant_buffer (#149162)
Summary:
Fixed by #145459

Test Plan:
Re-activating tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149162
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-14 04:09:57 +00:00
aed0b7a742 [c10d] Add param recording for uniqueID broadcasting and allgather (#149166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149166
Approved by: https://github.com/kwen2501
2025-03-14 03:51:30 +00:00
b4745db904 [MPS] Add support for i0e in eager. (#149174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149174
Approved by: https://github.com/malfet
2025-03-14 02:51:28 +00:00
c179971bfc xpu: update filter out of dg2 AOT target (#148677)
torch-xpu-ops has updated its list of AOT targets and now uses `dg2` instead of `dg2-g10`. This requires an update in cpp_extension.py, which currently filters out `dg2-`-prefixed AOT targets.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148677
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD
2025-03-14 02:24:06 +00:00
56b2e4b8f0 ci: Update linux.20_04 --> linux.24_04 (#149142)
Ubuntu 20.04 is getting deprecated soon, so we might as well proactively
move to the latest LTS, which is 24.04.

> [!NOTE]
> The oldest supported version of python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test we need to have this particular job stick with 20.04 for now until we decide to upgrade it to a newer python version.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
2025-03-14 02:20:10 +00:00
cyy
e66ad221e9 Use std::string_view in get_fully_qualified_type_name (#145197)
The same as #139164 but open a new PR due to messy history there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145197
Approved by: https://github.com/r-barnes
2025-03-14 01:58:35 +00:00
e8d36019d4 [c10d] Make getDefaultBackend more fault tolerant without relying on exceptions (#149152)
Summary: no-except builds are terminating when this exception is thrown. We should proactively check if a backend is available before calling has_hooks, instead of trying and failing.

Test Plan: CI

Differential Revision: D71144456

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149152
Approved by: https://github.com/kwen2501
2025-03-14 01:27:52 +00:00
15cd6921a5 [export] Fix tensor_constant and buffer naming conflicts in TS converter (#148803)
Summary: In the TS converter, tensor constants are traced as BUFFER and later converted back to CONSTANT_TENSOR, so we need to prevent naming conflicts during the lift-constants pass.

Test Plan: CI

Differential Revision: D70826426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148803
Approved by: https://github.com/angelayi
2025-03-14 00:38:12 +00:00
49570cb402 Revert "Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)"
This reverts commit 9a3d26cfcdb1c1be84a04baa3ee554dbe67cb049.

Reverted https://github.com/pytorch/pytorch/pull/148936 on behalf of https://github.com/ZainRizvi due to Breaks lint in trunk [GH job link](https://github.com/pytorch/pytorch/actions/runs/13845459825/job/38742803351) [HUD commit link](9a3d26cfcd) ([comment](https://github.com/pytorch/pytorch/pull/148936#issuecomment-2722853628))
2025-03-13 22:54:33 +00:00
4cae8f48cc [ROCm] Improve softmax performance (#149076)
This patch improves the performance of softmax for 2D tensors by:

- using a softmax calculation that eliminates the growth of shared-memory usage with tensor size: tensor data accesses go through global memory, while shared memory is still used for the actual reduction step (the shared memory used for the reduction is constant and does not increase with tensor size);
- replacing the division by the sum in the final computation with multiplication by 1/sum, where 1/sum is computed as the last step of the warp reduction (sketched below);
- replacing the use of the exp function with the __expf function.

The impact on numerical accuracy is within 1e-5 for half precision and 1e-7 for full precision.

The impact on performance on MI300X is between a 22% and 50% improvement over current runtimes.
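
A small sketch of the reciprocal-multiply restructuring at the PyTorch level (the actual change is a HIP kernel; this only demonstrates that multiplying by a precomputed 1/sum matches the divide-based softmax within the stated tolerances):

```python
import torch

def softmax_2d(x):
    m = x.max(dim=-1, keepdim=True).values
    e = torch.exp(x - m)
    inv = 1.0 / e.sum(dim=-1, keepdim=True)  # compute 1/sum once...
    return e * inv                           # ...then multiply instead of divide

x = torch.randn(4, 4096)
print(torch.allclose(softmax_2d(x), torch.softmax(x, dim=-1), atol=1e-7))
```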

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149076
Approved by: https://github.com/jeffdaily
2025-03-13 22:07:28 +00:00
9a3d26cfcd Split up cub-RadixSortPairs.cu to parallelize compilation (#148936)
Summary: `cub-RadixSortPairs.cu` has slow compilation times, especially on Windows. These changes split up the file into smaller components to allow each component to compile in parallel. On Windows, I observed a compile time drop from about 20 minutes to 6 minutes.

Differential Revision: D70539649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148936
Approved by: https://github.com/suo, https://github.com/eqy
2025-03-13 22:02:05 +00:00
4098a229a0 Add back fake class registration to test_torchbind (#149137)
Fixes #149121

Summary: as title, to fix https://github.com/pytorch/pytorch/issues/149121

Test Plan:
```
 python test/export/test_torchbind.py
```

Differential Revision: D71129321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149137
Approved by: https://github.com/yiming0416
2025-03-13 21:26:37 +00:00
e5fccb2bab [pytorch] Fix duplicated Malloc/Free insertion when using IRBuilderBase::CreateMalloc/CreateFree in LLVM 18+ (#149058)
Summary:
PyTorch unit tests hang when jitting the tensor kernel. The problem exists for LLVM versions >= 18 due to this upstream change: 45bb45f2ae

`IRBuilderBase::CreateCall` already inserts the instruction into the BasicBlock by default, so we don't need to insert it explicitly when compiling the tensor kernel.

Test Plan:
## Test with the release toolchain
```
buck test 'mode/dev' //caffe2/test:jit -- --exact 'caffe2/test:jit - test_concat_invariant (test_jit_fuser_te.TestTEFuserDynamic)'
```
## Test with the Buckified toolchain
Apply this D71046097 to select the LLVM libraries.
```
# Build tests
buck build 'mode/dev-asan' //caffe2/test:jit --show-output
```
```
# Run test (Change HASH and paths accordingly)
HASH="b755f1c435832a1e"

ENABLE_FLATBUFFER=0 FB_OVERRIDE_PYBIND11_GIL_INCREF_DECREF_CHECK=1 MKL_NUM_THREADS=1 NO_MULTIPROCESSING_SPAWN=0 OMP_NUM_THREADS=1 PYTORCH_TEST=1 PYTORCH_TEST_FBCODE=1 PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_DEV_DBG_ASAN=1 PYTORCH_TEST_WITH_TSAN=0 PYTORCH_TEST_WITH_UBSAN=1 SKIP_TEST_BOTTLENECK=1 TENSORPIPE_TLS_DATACENTER=test_dc TEST_PILOT=True TPX_IS_TEST_EXECUTION=true TPX_TIMEOUT_SEC=6000 \
buck-out/v2/gen/$HASH/caffe2/test/__jit__/jit.par --test-filter test_jit_fuser_te.TestTEFuserDynamic.test_concat_invariant
```

Differential Revision: D71046799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149058
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-03-13 20:37:47 +00:00
38e81a5332 [ROCm] Use generated CK config.h rather than system (#147993)
Prevents PyTorch from potentially using the system version of config.h and instead prioritizes the CK submodule's version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147993
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-13 20:04:12 +00:00
4f8391db55 [ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types (#147527)
This patch exemplifies its use for input tensors with types (float, bfloat16) when the functor type is float(float, float).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-03-13 19:56:26 +00:00
0dcd482e54 [SDPA] Respect sdpa_kernel's priority_order setting in torch.compile (#147768)
[https://github.com/pytorch/pytorch/pull/140467](https://github.com/pytorch/pytorch/pull/140467) added the option to specify a priority order for SDPA, but the `torch.compile` path silently ignored this setting, as I wasn't aware of the separate context-manager handling in `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768
Approved by: https://github.com/drisspg
2025-03-13 18:52:34 +00:00
5e1b715dda BC fix for AOTIModelPackageLoader() constructor defaults (#149082)
The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire
2025-03-13 18:40:53 +00:00
cyy
970fefcc53 Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940)
Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940
Approved by: https://github.com/mikaylagawarecki
2025-03-13 18:02:50 +00:00
c73c72b1e1 ci: Update linux_job references to v2 (#149102)
This is probably a bit overdue, but trying to update these so we can
finally get rid of all the remnants that rely on non-manylinux2_28 stuff
and conda stuff.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149102
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #149104
2025-03-13 17:31:55 +00:00
77ea66695a ci: Fix check_binary gcc abi check (#149104)
All of our binaries should be built with the cxx11 ABI now, so let's fix
this check to reflect reality.

I also noticed that this particular script is not used widely since this
issue should've been caught in nightlies a long time ago.

Maybe worth an investigation to just remove this script if it's not
actually being used.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149104
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/malfet
2025-03-13 17:31:55 +00:00
7c87ec1b50 [ca] always do initial trace with dynamic shapes (#148801)
HUD: https://fburl.com/wzvx6tax no regressions (ignore the pass rate improvements, those come from #149030)
<img width="864" alt="image" src="https://github.com/user-attachments/assets/d7598f98-b378-4abb-a0c7-e4311162f681" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148801
Approved by: https://github.com/jansel
ghstack dependencies: #148799, #149030
2025-03-13 17:30:29 +00:00
b263b272fa [ca] fix lazily compiled aot bwd (#149030)
FIXES https://github.com/pytorch/pytorch/issues/137372

Sometimes the AOT backward is lowered lazily, so the bw_module we saved in CompiledFunction._lazy_backward_info hasn't gone through post-grad passes, specifically the view_to_reshape pass. Running it directly will then sometimes error, because the AOT forward has already changed its views to reshapes, and that is reflected in the gradients we see in CA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149030
Approved by: https://github.com/bdhirsh
ghstack dependencies: #148799
2025-03-13 17:30:29 +00:00
e6f560a262 [ca] support for dynamic shapes CopySlices (#148799)
I'm changing the CA initial trace to always trace as dynamic, which fixes these errors:
```python
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.2139s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_autograd_python_custom_function_inplace - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_autograd_python_custom_function_inplace
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0057s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_copy_slices_graph_task_updates - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_copy_slices_graph_task_updates
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.9662s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_inplace_on_view_weak_grad_fn - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_inplace_on_view_weak_grad_fn
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0077s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_leaf_assignment - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_leaf_assignment
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.0485s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_setitem_mask - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_setitem_mask
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [0.0102s] test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_tensor_hooks_inplace_over_view - RuntimeError: !has_symbolic_sizes_strides_ INTERNAL ASSERT FAILED at "/home/xmfan/core/a/pytorch/aten/src/ATen/TensorGeometry.h":63, please report a bug to PyTorch.
To execute this test, run the following from the base repo dir:
    python test/test_autograd.py TestAutogradWithCompiledAutograd.test_tensor_hooks_inplace_over_view
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148799
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-03-13 17:30:20 +00:00
e84cc4c052 Update Kineto Submodule (#149089)
Summary: We have made a lot of changes in Kineto this month. It is a good idea to update the submodule now, especially since the roctracer-sdk change will be very large.

Test Plan: CI

Differential Revision: D71082829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149089
Approved by: https://github.com/Skylion007
2025-03-13 17:18:16 +00:00
6856d81c60 [BE]: Update CU128 cudnn to 9.8.0.87 (#148963)
Also, cu12.6 is on an old cuDNN version; we may want to upgrade it for all the performance reasons, as I don't see a manylinux reason to stay back on the old 9.5 release. I might split that into its own PR. This one just updates CU128 to the latest and greatest.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148963
Approved by: https://github.com/jansel, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/atalman
2025-03-13 16:59:12 +00:00
b9803a5c81 [AOTI] Re-enable AOTI cpp unit test (#149085)
Summary: test_inductor_aoti was removed by accident previously. Add it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149085
Approved by: https://github.com/jbschlosser
2025-03-13 16:00:38 +00:00
3e605fe46d [CUDAGraph] Graph Partition (#147648)
This PR implements cudagraph partition, following previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.

## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x,y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```

w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
   File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
    y_cpu = y1.cpu() + 1 # 3
```

w/ graph partition, we can see two cudagraphify under the same torch-compiled region:
![image](https://github.com/user-attachments/assets/4e22d428-2687-433d-b92a-0814a2201b25)

## Design

PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as static_input_idxs, and we need to provide such metadata for each graph partition. However, we previously only had such metadata for the original graph, not for graph partitions.

The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns`.

#### Q: How does it work with codecache?

While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it will be serialized and could be deserialized later.

## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.

## Edge Case 2
Cudagraphifying `call` function gives 2 cudagraph managed tensors `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph managed tensor `buf0`. This leads to a semantic difference between cudagraph w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

![image](https://github.com/user-attachments/assets/03d08ce0-f1d1-4d1d-8432-805a07e1dd40)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph managed tensors and is important for handling saved tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
2025-03-13 16:00:21 +00:00
65d19a5699 Remove runtime dependency on packaging (#149092)
Looks like after https://github.com/pytorch/pytorch/pull/148924,
we are seeing this error in the nightly test:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence, removing the runtime dependency on packaging, since it may not be installed by default.
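
For reference, one dependency-free way to express such a version check (a sketch under the assumption that only a major/minor comparison is needed; the PR's actual replacement may differ):

```python
import torch

def version_at_least(version_str, major, minor):
    # "2.8.0a0+git1234" -> (2, 8); avoids importing packaging.version
    head = version_str.split("+", 1)[0]
    parts = head.split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)

print(version_at_least(torch.__version__, 2, 0))
```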

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98
2025-03-13 14:53:13 +00:00
f59064f2b7 [FIX] remove the duplicate key in DEFAULT_STATIC_QUANT_MODULE_MAPPINGS (#149043)
The duplicate nn.Dropout key appeared at line 81.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149043
Approved by: https://github.com/jingsh
2025-03-13 12:42:33 +00:00
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
a8b1767ae5 [DTensor] Fix local_map with multi-threading (#149070)
Using `nonlocal device_mesh` is not safe with multi-threading
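
A minimal illustration of the hazard, with illustrative names rather than the DTensor internals: every thread rebinds the same closure cell, so concurrent calls observe each other's writes:

```python
import threading

def make_worker(barrier):
    captured = None
    def worker(value, out):
        nonlocal captured
        captured = value   # every thread writes the same closure cell
        barrier.wait()     # force the writes to overlap
        out.append((value, captured))
    return worker

barrier = threading.Barrier(4)
worker = make_worker(barrier)
results = []
threads = [threading.Thread(target=worker, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # most pairs disagree: the last writer's value wins for all
```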

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149070
Approved by: https://github.com/wanchaol
2025-03-13 10:58:59 +00:00
df60500ab8 Fix too big to optimize in test, actually use O0 when aot_inductor.compile_wrapper_with_O0 is set (#148714)
Summary:
1. Check against the "0" char instead

2. We got the following error when using anything other than the O0 flag: `error: Function ZN5torch12aot_inductorL22__check_inputs_outputsEPP16AtenTensorOpaqueS3 is too big to optimize [-Werror,-Wignored-optimization-argument]`. So we use the O0 flag in wrapper code when `aot_inductor.compile_wrapper_opt_level` is set to `O0`.

Test Plan:
```
 buck run  'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:ads_second_stage_dsnn_models_aoti_lowering_test -- -r AdsSecondStageDSNNModelsAOTILoweringTest
```

Differential Revision: D70670957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148714
Approved by: https://github.com/desertfire
2025-03-13 10:22:06 +00:00
96a6a71ac7 skip test_torch_dynamo_codegen_pow if CPU backend is not cpp (#146595)
The test asserts that `aten.pow` is not present in the generated kernel code. When using a CPU backend other than cpp, the kernel contains comments referencing the aten ops that produced the kernel, in this case `aten.pow`.

This PR skips that test case if the CPU backend is not cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146595
Approved by: https://github.com/williamwen42
2025-03-13 10:03:29 +00:00
d90f9e9a34 [inductor] Fix issue with set_linter, improve linter framework (#144620)
### `set_linter` only

* Fix gnarly [bug](dbed747aae/tools/test/set_linter_testdata/python_code.py.txt.python (L42)) which would have garbled Python files involving sets contained in sets.
* Better handling of new Python3.12 token types

### Both linters.

* Recover from and report on unparseable Python files
* Remove `ParseError.check()` (it made it harder to read the code)
* FileLinter is now generic on `PythonFile`

### Notes

As I started working on new docstring features, I found a nasty bug and an edge-case bug in set linter, and realized both linters crash when there is a badly-formed Python file in the repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144620
Approved by: https://github.com/amjames, https://github.com/jansel
2025-03-13 09:49:40 +00:00
f4bffb7461 [docs] fix autograd description on convex function case (#148658)
The sub-gradient of minimum norm is the least steep descent direction.

```python
import torch

x = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad) # tensor([0., 0., 0., 1., 1.])

y = torch.tensor([-2, -1, 0, 1, 2.], requires_grad=True)
torch.abs(y).sum().backward()
print(y.grad) # tensor([-1., -1.,  0.,  1.,  1.])
```

(How can I request a reviewer? I don't have the button on the right)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148658
Approved by: https://github.com/lezcano
2025-03-13 09:06:15 +00:00
75c8b7d972 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD
2025-03-13 08:03:52 +00:00
eqy
ec93aa7f84 fix cuDNN SDPA meta registration (#148921)
Update the `cuDNN SDPA` meta registration to match the memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2025-03-13 07:33:16 +00:00
2a7d583452 Consolidate torchbind fake class registration (#149063)
Summary: Remove duplicated fake class registration

Test Plan: CI

Differential Revision: D71052419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149063
Approved by: https://github.com/angelayi
2025-03-13 06:57:13 +00:00
c208f21791 [Dynamo] Replace unimplemented with unimplemented_v2 in torch/_dynamo/variables/base.py (#148177)
Part of #147913

Replace `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/base.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148177
Approved by: https://github.com/williamwen42
2025-03-13 06:35:51 +00:00
037d7af778 [Inductor UT] Enable PYTORCH_TESTING_DEVICE_ONLY_FOR test case filter for test_torchinductor.py (#149023)
The environment variable PYTORCH_TESTING_DEVICE_ONLY_FOR controls the devices
in get_desired_device_type_test_bases, so we add RUN_CPU and RUN_GPU to
make sure cases are only enabled for the devices specified in PYTORCH_TESTING_DEVICE_ONLY_FOR,
e.g., only enable GPU cases, not CPU cases, even when HAS_CPU is true.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149023
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-13 05:15:28 +00:00
7cdbb913e7 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-13 03:50:58 +00:00
3646d4dbc8 [partitioner] always ban compiler-driven recompute of collectives by default (#147561)
This should fix the hang in https://fb.workplace.com/groups/1075192433118967/permalink/1603268720311333/

The argument here is that:

(1) in general, it is not safe for the partitioner to sometimes choose to recompute collectives in the backward. Why? If we are running a distributed job, where many ranks are compiling at the same time, we need every rank to make a consistent decision about which collectives are recomputed for backward. If we let each compiler instance make its own choice without any cross-rank communication, they can make different choices and cause NCCL hangs (see the link above)

(2) later on, we'll want an `spmd_mode` flag that causes the compiler to issue collectives and communicate info across ranks. Once we have such a config, then turning it on should make it safe for the partitioner to potentially choose to recompute collectives (and agree on the binary "recompute-or-save" choice across all ranks)

(3) even without an `spmd_mode`, users can override this choice by using `torch.utils.checkpoint()` in their user code. User checkpointing generally always overrides the partitioner, and this should be safe because we expect the user to apply checkpointing consistently across ranks
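
A minimal sketch of the user-side override from point (3): activation checkpointing recomputes the wrapped region in backward, and as long as every rank checkpoints the same region, the recompute decision stays consistent across ranks (the region below is a stand-in, not a collective):

```python
import torch
from torch.utils.checkpoint import checkpoint

def region(x):
    # stand-in for a region the user wants recomputed in backward
    return torch.relu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)
y = checkpoint(region, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 8])
```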

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147561
Approved by: https://github.com/zou3519
2025-03-13 03:36:13 +00:00
420a9be743 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)
PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when the tensor is first created and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.

With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures the device is initialized before we pin a tensor.
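
A minimal repro sketch of the scenario (assumes a CUDA-capable build; before the fix, the standalone `pin_memory()` call could report unpinned because the device was never initialized):

```python
import torch

t = torch.randn(16)     # plain CPU tensor; no CUDA call has happened yet
p = t.pin_memory()      # lazy device init now happens inside the allocator
print(p.is_pinned())    # True (previously could be False on a cold start)
```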

Fixes #149032

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-03-13 02:56:24 +00:00
f2d43d866c [cutlass backend] switch layout for cutlass backend benchmark (#149009)
```
python benchmarks/inductor_backends/cutlass.py
```

logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 13.059554621577263 |  1.580178506206721   |         NA          |
|        triton         | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
|  cutlass_lvl_default  | 12.882896699011326 |  231.14990583620965  | -1.3527101626732294 |
|   cutlass_lvl_1111    | 11.362981051206589 |  126.41650272067636  | -12.99105229490415  |
|   cutlass_lvl_2222    | 11.107578873634338 |  555.8380545829423   | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.037585817277431 | 0.21587548777461052  |         NA          |
|        triton         | 10.571777820587158 |  78.15654796129093   | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 |  1.3195342738181353  | -23.337364672110443 |
|  cutlass_lvl_default  | 12.872588820755482 |  237.0100042372942   | -8.299126443010406  |
|   cutlass_lvl_1111    | 11.08622644096613  |  137.55013868492097  | -21.02469338195443  |
|   cutlass_lvl_2222    | 11.044904589653015 |   551.265836935956   | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.483894050121307 | 0.27990864124149084  |         NA          |
|        triton         | 29.567627236247063 |  99.87172158574685   | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  |  1.3695051120594144  | -2.692027748401006  |
|  cutlass_lvl_default  | 29.82821688055992  |  72.61214569816366   | -2.150897022812533  |
|   cutlass_lvl_1111    | 29.476772993803024 |   67.7428645719774   | -3.303780857728953  |
|   cutlass_lvl_2222    | 30.113255605101585 |  233.84051702311262  | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.58255836367607  | 0.058386584743857384 |         NA          |
|        triton         | 29.799651354551315 |  100.18178300186992  | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 |  1.534341821912676   | -3.990885861562106  |
|  cutlass_lvl_default  |  29.4346883893013  |  73.68858492700383   | -3.7533484305817093 |
|   cutlass_lvl_1111    | 29.164200648665428 |  75.44329373072833   | -4.637799421958348  |
|   cutlass_lvl_2222    | 29.13798950612545  |  227.33327346481383  |  -4.7235056020244   |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1656.6237211227417 |  0.0549461180344224  |         NA         |
|        triton         | 1892.8285837173462 |  2.3174119112081826  | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  |  2.7922237082384527  | 0.525683419747917  |
|  cutlass_lvl_default  | 1705.5492401123047 |  108.31571159465238  | 2.9533272019312116 |
|   cutlass_lvl_1111    | 1714.9059772491455 |  17.64627545280382   | 3.518134829489478  |
|   cutlass_lvl_2222    | 1680.4152727127075 |  306.9972395859659   | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1621.416687965393  | 0.06300561130046844  |         NA         |
|        triton         | 1782.3902368545532 |  2.318530729971826   | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 |  2.7931175641715527  | -2.178543151605614 |
|  cutlass_lvl_default  | 1657.4617624282837 |  43.31810224894434   | 2.2230605328307784 |
|   cutlass_lvl_1111    | 1641.5367126464844 |  17.648567833006382  | 1.2408916739557292 |
|   cutlass_lvl_2222    | 1645.8417177200317 |  249.33647010894492  | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-13 01:57:47 +00:00
4a12777ffe [Partitioner] Remove unnecessary upstream nodes in dependency viewer (#146580)
We iterate over upstream nodes to update the partition map, but this actually did nothing: since we iterate over nodes in reversed topological order https://github.com/pytorch/pytorch/pull/136608/files#diff-f2f9dd3903fd99955732eb694941fea0cb7301a58d59554787f3311d417e5615L193, no upstream nodes exist in the assignment. Remove the loop to reduce for-loop overhead, which is up to O(N * N) complexity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146580
Approved by: https://github.com/Skylion007, https://github.com/jerome-habana
2025-03-13 01:42:10 +00:00
1e37e5b836 Update nightly PyTorch version to 2.8.0 (#149038)
Branch for 2.7: https://github.com/pytorch/pytorch/tree/release/2.7
Same as https://github.com/pytorch/pytorch/pull/135916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149038
Approved by: https://github.com/ZainRizvi
2025-03-12 23:51:04 +00:00
e51615cb73 Revert "[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)"
This reverts commit 28b78800b92a4d847a2360ab0e0b87d3e00a6138.

Reverted https://github.com/pytorch/pytorch/pull/148663 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @albanD, could you please help get this relanded? See D71052806 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148663#issuecomment-2719297055))
2025-03-12 22:52:11 +00:00
b1980b2405 Revert "Make dynamism code robust to NotImplementedException (#148823)"
This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverted https://github.com/pytorch/pytorch/pull/148823 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D71042206 for details. To validate your fixes internally before relanding, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148823#issuecomment-2719287467))
2025-03-12 22:45:39 +00:00
38c5cf99b3 [CI] Don't clean workspace when fetching repo (#147994)
Tested on https://github.com/pytorch/pytorch/pull/148995
Do two checkouts: the first one attempts to use an existing checkout if possible. The second one removes the workspace and re-pulls everything if the first one fails.

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:29:52 +00:00
3f1769f785 Add ninja to requirements-ci for all arch (#148778)
So I can get ninja_logs for the builds
No negative consequences afaik
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148778
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-12 22:07:46 +00:00
0c8ec26d3b [ROCm][TunableOp] hipblaslt tf32 support (#145946)
TF32 is supported by hipblaslt. Support added by #143549.  This PR expands integration to the TunableOp feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145946
Approved by: https://github.com/pruthvistony, https://github.com/echen4096, https://github.com/yoyoyocmu

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-12 21:17:11 +00:00
ab45aaca97 Set non-strict export as default mode (#148790)
Summary:
- Flip the default value of the `strict` argument in `torch.export.export` from `True` to `False` (see the usage sketch after this list)
- Update test infra to cope with the change; some tests assumed strict mode as the default
- Disabled some tests that fail in non-strict mode
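For callers that depended on strict tracing, the old behavior is still available by passing the flag explicitly; a minimal usage sketch (the module and inputs are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

# Non-strict tracing is now the default ...
ep = torch.export.export(M(), (torch.randn(3),))
# ... and strict mode must be requested explicitly.
ep_strict = torch.export.export(M(), (torch.randn(3),), strict=True)
```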

Test Plan: Sandcastle

Differential Revision: D70228628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148790
Approved by: https://github.com/angelayi
2025-03-12 21:10:58 +00:00
e3ebf61589 Create and send full_tensor on ProcessGroup-supported device in _broadcast_tensors (#148865)
Fixes #138842

`device` is always the device of the `local_state_dict`, which may or may not be CPU; CPU is not supported by the NCCL backend.

Instead, create broadcasted tensors on one of `pg._device_types` and then move the tensors back if `local_state_dict`'s `device` was not supported by the `ProcessGroup`.
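A rough sketch of the pattern (the helper name is illustrative, not the actual private API):

```python
import torch
import torch.distributed as dist

def broadcast_via_pg_device(tensor: torch.Tensor, pg_device: torch.device, src: int = 0):
    # Sketch only: stage the broadcast on a device the ProcessGroup supports
    # (e.g. "cuda" for NCCL), then move the result back if the state dict
    # lived on an unsupported device such as CPU.
    orig_device = tensor.device
    staged = tensor.to(pg_device)
    dist.broadcast(staged, src=src)
    return staged.to(orig_device) if orig_device != pg_device else staged
```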

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148865
Approved by: https://github.com/mori360
2025-03-12 20:56:31 +00:00
b5191b9312 [codemod][lowrisk] Fix deprecated use of 0/NULL in caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/fc-unpack.cc + 1 (#148996)
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70939306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148996
Approved by: https://github.com/Skylion007
2025-03-12 20:06:19 +00:00
eqy
b90698f5ba [CUDA] try to abate some flakiness in test_stream_event_nogil (#148796)
Threshold twiddling, as roughly one in a few dozen runs tends to fail the current threshold.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148796
Approved by: https://github.com/Skylion007
2025-03-12 19:12:50 +00:00
215f856142 Add XPU device to nested_layer_norm (#148593)
Work with https://github.com/intel/torch-xpu-ops/pull/1416 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148593
Approved by: https://github.com/guangyey, https://github.com/jbschlosser
2025-03-12 19:07:08 +00:00
66300d3d55 [cutlass backend] try make cutlass backend benchmark more robust (#149015)
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/)

I want to make sure that, even if some experiments fail, the benchmark can still print most of the results.

```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 6.175220478326082 |  0.5982149520423263  |         NA          |
|        triton         | 5.326753947883844 |  3.2067150759976357  | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 |  3.279932268196717   | -13.51126615004617  |
|  cutlass_lvl_default  |        inf        |         inf          |         inf         |
|   cutlass_lvl_1111    |        inf        |         inf          |         inf         |
|   cutlass_lvl_2222    |        inf        |         inf          |         inf         |
|   cutlass_lvl_3333    |        inf        |         inf          |         inf         |
+-----------------------+-------------------+----------------------+---------------------+
```
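A minimal sketch of the guarding pattern (the wrapper name is illustrative): each experiment is isolated so a failure records `inf` instead of aborting the whole table:

```python
import math

def run_experiment_safely(fn, *args, **kwargs):
    # Sketch only: report inf on any failure so the results of the
    # experiments that did succeed can still be printed.
    try:
        return fn(*args, **kwargs)
    except Exception:
        return math.inf
```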
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-12 18:59:49 +00:00
86bc154d61 [scan] Flattened output of HOP scan (#148955)
This is required because downstream operations expect HOPs to return a flattened list of output elements.
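A small illustration of the flattening contract using `torch.utils._pytree` (the example values are made up):

```python
import torch.utils._pytree as pytree

# A body_fn returning nested structure ...
nested = {"carry": (1, 2), "ys": [3]}
# ... must be flattened to a plain list of leaves before downstream
# HOP machinery, which only understands flat outputs, consumes it.
leaves, spec = pytree.tree_flatten(nested)
print(leaves)  # [1, 2, 3]
print(pytree.tree_unflatten(leaves, spec))  # round-trips to the original
```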

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148955
Approved by: https://github.com/ydwu4
2025-03-12 18:27:27 +00:00
fb0e9cb0a0 Remove warnings on non-buffer tensor constants (#148483)
Export already registers tensor constants directly in the graph, and this is also true for Torchbind objects. This removes warnings that pollute the output.

Differential Revision: [D70577856](https://our.internmc.facebook.com/intern/diff/D70577856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148483
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
ghstack dependencies: #148364
2025-03-12 18:20:04 +00:00
29fd875bc1 Automate stable CUDA update and linter using min Python verison (#148912)
1. Fixes: https://github.com/pytorch/pytorch/issues/145571. "CUDA stable" is the CUDA version that is published to PyPI; it is also used to set the Metadata section in the rest of the wheel scripts and to tag the Docker releases with the `latest` tag.
2. Updates the min Python version used in the linter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-12 18:12:34 +00:00
01e9036bd2 skip torchbind in constant folding (#148993)
Summary:
Do not fold torchbind objects in constant folding

Any operation on these torchbind objects can have arbitrary side effects, so we can't effectively constant fold anything torchbind-obj-related anyway.
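A hedged sketch of the kind of guard this implies (the predicate name is illustrative; the real pass may detect torchbind objects differently):

```python
import torch

def touches_torchbind(node) -> bool:
    # Sketch only: treat an FX node as non-foldable if it produces or
    # consumes a torch.ScriptObject, since such calls may have side effects.
    candidates = [node.meta.get("val")] + [
        inp.meta.get("val") for inp in node.all_input_nodes
    ]
    return any(isinstance(v, torch.ScriptObject) for v in candidates)
```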

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile_constant_folding
```

Reviewed By: angelayi

Differential Revision: D69946541

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148993
Approved by: https://github.com/angelayi
2025-03-12 18:08:08 +00:00
923ce10f6c [while_loop] require stride to be the same as input for body_fn (#148002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148002
Approved by: https://github.com/zou3519
2025-03-12 17:15:10 +00:00
28b78800b9 [Profiler][HPU] Fix incorrect availabilities for HPU (#148663)
Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/albanD
2025-03-12 17:06:57 +00:00
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
626a5e22eb Revert "[CI] Don't clean workspace when fetching repo (#147994)"
This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.

Reverted https://github.com/pytorch/pytorch/pull/147994 on behalf of https://github.com/clee2000 due to broke checkout on xpu, probably lack of sudo? ([comment](https://github.com/pytorch/pytorch/pull/147994#issuecomment-2718335186))
2025-03-12 15:50:38 +00:00
9a0f65d3d3 [TD] test_cpp_extensions_aot_ninja corresponds to things in test/cpp_extensions (#148992)
Manually map test_cpp_extensions_aot_ninja to files in test/cpp_extensions, since test_cpp_extensions_aot_ninja isn't an actual file you can edit but a wrapper for the files in test/cpp_extensions.

I don't know if this is a good idea; it feels very manual. Maybe it would be better to classify this the same as any other TD failure where TD simply can't figure out which tests it needs to run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148992
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/janeyx99
2025-03-12 15:40:06 +00:00
488c4480f9 [inductor] Fix profiler tests with latest Triton (#149025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang
2025-03-12 15:34:26 +00:00
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and is available thanks to the recent update to Clang-Tidy 19.

The check marks functions and variables that are used only in their translation unit as static. Therefore undesired symbols are not leaked into other units, more link-time optimisations are possible, and the resulting binaries may be smaller.

The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by other files, so their declaring headers were included. Still, some declarations were wrong and have been fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
f349304c08 [Inductor][CPP] Fix expr issue in loop split (#148882)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/148058. In this case, an `indexing_expr` can be a plain integer, which doesn't have the `find` method.
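A minimal sketch of the failure and the defensive fix (the names are illustrative):

```python
def expr_mentions(indexing_expr, name: str) -> bool:
    # Sketch only: calling `indexing_expr.find(name)` works for string-like
    # expressions but raises AttributeError once the expression has been
    # simplified down to a plain int, so guard on the type first.
    if isinstance(indexing_expr, int):
        return False
    return str(indexing_expr).find(name) != -1
```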

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_148058
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148882
Approved by: https://github.com/jgong5
2025-03-12 11:08:07 +00:00
81aee3c9c4 [Partitioner] Reduce time consuming of partitions merger (#146582)
This patch optimizes the `maybe_merge_partition` function in three ways:

1. Remove an unnecessary copy (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L99). The number of copied nodes is large when all of the graph's nodes can be merged into one partition.
2. Record the users of each partition to avoid duplicate iteration over nodes (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L133). The trip count of this loop can be very large.
3. The node counts of partitions may be unbalanced (https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/infra/partitioner.py#L145). We frequently hit the case where one partition has n nodes and the other has a single node; merging the smaller partition into the larger one reduces the time spent (see the sketch after this list).
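A minimal sketch of the size-aware merge in point 3 (the data structures are illustrative): always folding the smaller node set into the larger one bounds the total number of element moves, the classic union-by-size trick:

```python
def merge_partitions(a: set, b: set) -> set:
    # Sketch only: merging the smaller set into the larger keeps repeated
    # merges near O(N log N) total work instead of O(N^2).
    if len(a) < len(b):
        a, b = b, a
    a.update(b)
    return a
```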

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146582
Approved by: https://github.com/jerome-habana, https://github.com/Skylion007
2025-03-12 09:24:38 +00:00
d547a56668 [AMD] Various fixes for mem efficient attention on CK backend (#148986)
Summary: Decouple aotriton from CK for memory-efficient attention. Also fixed the HW check.

Reviewed By: henryhu6

Differential Revision: D70872677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148986
Approved by: https://github.com/jianyuh, https://github.com/houseroad
2025-03-12 07:36:46 +00:00
480 changed files with 13722 additions and 4189 deletions

```diff
@@ -105,7 +105,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -119,7 +118,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -134,7 +132,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.12
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -149,7 +146,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.13
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -164,7 +160,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -178,7 +173,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -193,7 +187,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.12
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -208,7 +201,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.13
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -223,7 +215,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     UCX_COMMIT=${_UCX_COMMIT}
@@ -235,7 +226,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     CLANG_VERSION=10
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     CONDA_CMAKE=yes
     ONNX=yes
@@ -244,7 +234,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     CLANG_VERSION=10
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     VULKAN_SDK_VERSION=1.2.162.1
     SWIFTSHADER=yes
@@ -255,7 +244,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.11
     CLANG_VERSION=10
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     VULKAN_SDK_VERSION=1.2.162.1
     SWIFTSHADER=yes
@@ -266,7 +254,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     GCC_VERSION=9
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     CONDA_CMAKE=yes
     TRITON=yes
@@ -275,7 +262,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     ROCM_VERSION=6.2.4
     NINJA_VERSION=1.9.0
@@ -290,7 +276,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.10
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     ROCM_VERSION=6.3
     NINJA_VERSION=1.9.0
@@ -305,7 +290,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     XPU_VERSION=0.5
     NINJA_VERSION=1.9.0
@@ -316,7 +300,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     XPU_VERSION=2025.0
     NINJA_VERSION=1.9.0
@@ -327,7 +310,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     CONDA_CMAKE=yes
@@ -341,7 +323,6 @@ case "$image" in
     CUDNN_VERSION=9
     CLANG_VERSION=12
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     TRITON=yes
     ;;
@@ -349,7 +330,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     CLANG_VERSION=12
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     CONDA_CMAKE=yes
     TRITON=yes
@@ -370,7 +350,6 @@ case "$image" in
     ANACONDA_PYTHON_VERSION=3.9
     GCC_VERSION=11
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     KATEX=yes
     CONDA_CMAKE=yes
@@ -416,7 +395,6 @@ case "$image" in
     GCC_VERSION=11
     ACL=yes
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     CONDA_CMAKE=yes
     # snadampal: skipping llvm src build install because the current version
@@ -428,7 +406,6 @@ case "$image" in
     GCC_VERSION=11
     ACL=yes
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     CONDA_CMAKE=yes
     # snadampal: skipping llvm src build install because the current version
@@ -439,7 +416,6 @@ case "$image" in
   *)
     # Catch-all for builds that are not hardcoded.
     PROTOBUF=yes
-    DB=yes
     VISION=yes
     echo "image '$image' did not match an existing build configuration"
    if [[ "$image" == *py* ]]; then
@@ -495,7 +471,6 @@ docker build \
  --build-arg "BUILD_ENVIRONMENT=${image}" \
  --build-arg "PROTOBUF=${PROTOBUF:-}" \
  --build-arg "LLVMDEV=${LLVMDEV:-}" \
-  --build-arg "DB=${DB:-}" \
  --build-arg "VISION=${VISION:-}" \
  --build-arg "UBUNTU_VERSION=${UBUNTU_VERSION}" \
  --build-arg "CENTOS_VERSION=${CENTOS_VERSION}" \
```

```diff
@@ -55,13 +55,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
 RUN rm install_protobuf.sh
 ENV INSTALLED_PROTOBUF ${PROTOBUF}
 
-# (optional) Install database packages like LMDB and LevelDB
-ARG DB
-COPY ./common/install_db.sh install_db.sh
-RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
-RUN rm install_db.sh
-ENV INSTALLED_DB ${DB}
-
 # (optional) Install vision packages like OpenCV
 ARG VISION
 COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
```

```diff
@@ -1 +1 @@
-v2.25.1-1
+v2.26.2-1
```

```diff
@@ -240,7 +240,7 @@ function prune_126 {
 }
 
 function install_128 {
-  CUDNN_VERSION=9.7.1.26
+  CUDNN_VERSION=9.8.0.87
   echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
   rm -rf /usr/local/cuda-12.8 /usr/local/cuda
   # install CUDA 12.8.0 in the same container
```

```diff
@@ -161,7 +161,7 @@ function prune_126 {
 }
 
 function install_128 {
-  CUDNN_VERSION=9.7.1.26
+  CUDNN_VERSION=9.8.0.87
   echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
   rm -rf /usr/local/cuda-12.8 /usr/local/cuda
   # install CUDA 12.8.0 in the same container
```

```diff
@@ -5,7 +5,7 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
     mkdir tmp_cudnn
     pushd tmp_cudnn
     if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
-        CUDNN_NAME="cudnn-linux-x86_64-9.7.1.26_cuda12-archive"
+        CUDNN_NAME="cudnn-linux-x86_64-9.8.0.87_cuda12-archive"
    elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
        CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
    elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
```

```diff
@@ -1,38 +0,0 @@
-#!/bin/bash
-
-set -ex
-
-install_ubuntu() {
-  apt-get update
-
-  # Cleanup
-  apt-get autoclean && apt-get clean
-  rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
-}
-
-install_centos() {
-  # Need EPEL for many packages we depend on.
-  # See http://fedoraproject.org/wiki/EPEL
-  yum --enablerepo=extras install -y epel-release
-
-  # Cleanup
-  yum clean all
-  rm -rf /var/cache/yum
-  rm -rf /var/lib/yum/yumdb
-  rm -rf /var/lib/yum/history
-}
-
-# Install base packages depending on the base OS
-ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
-case "$ID" in
-  ubuntu)
-    install_ubuntu
-    ;;
-  centos)
-    install_centos
-    ;;
-  *)
-    echo "Unable to determine OS..."
-    exit 1
-    ;;
-esac
```

```diff
@@ -25,7 +25,9 @@ python3 -m pip install meson ninja
 ###########################
 ### clone repo
 ###########################
-GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
+# TEMPORARY FIX: https://gitlab.freedesktop.org/mesa/drm.git is down until 2025/03/22
+# GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
+GIT_SSL_NO_VERIFY=true git clone git://anongit.freedesktop.org/mesa/drm
 pushd drm
 
 ###########################
```

```diff
@@ -41,11 +41,14 @@ fbscribelogger==0.1.7
 #Pinned versions: 0.1.6
 #test that import:
 
-flatbuffers==2.0
+flatbuffers==2.0 ; platform_machine != "s390x"
 #Description: cross platform serialization library
 #Pinned versions: 2.0
 #test that import:
 
+flatbuffers ; platform_machine == "s390x"
+#Description: cross platform serialization library; Newer version is required on s390x for new python version
+
 hypothesis==5.35.1
 # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
 #Description: advanced library for generating parametrized tests
@@ -102,10 +105,10 @@ networkx==2.8.8
 #Pinned versions: 2.8.8
 #test that import: functorch
 
-#ninja
-#Description: build system. Note that it install from
-#here breaks things so it is commented out
-#Pinned versions: 1.10.0.post1
+ninja==1.11.1.3
+#Description: build system. Used in some tests. Used in build to generate build
+#time tracing information
+#Pinned versions: 1.11.1.3
 #test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
 
 numba==0.49.0 ; python_version < "3.9"
@@ -365,7 +368,6 @@ PyYAML
 pyzstd
 setuptools
 
-ninja==1.11.1 ; platform_machine == "aarch64"
 scons==4.5.2 ; platform_machine == "aarch64"
 
 pulp==2.9.0 ; python_version >= "3.8"
```

```diff
@@ -50,13 +50,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
 RUN rm install_protobuf.sh
 ENV INSTALLED_PROTOBUF ${PROTOBUF}
 
-# (optional) Install database packages like LMDB and LevelDB
-ARG DB
-COPY ./common/install_db.sh install_db.sh
-RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
-RUN rm install_db.sh
-ENV INSTALLED_DB ${DB}
-
 # (optional) Install vision packages like OpenCV
 ARG VISION
 COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
```

```diff
@@ -50,13 +50,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
 RUN rm install_protobuf.sh
 ENV INSTALLED_PROTOBUF ${PROTOBUF}
 
-# (optional) Install database packages like LMDB and LevelDB
-ARG DB
-COPY ./common/install_db.sh install_db.sh
-RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
-RUN rm install_db.sh
-ENV INSTALLED_DB ${DB}
-
 # (optional) Install vision packages like OpenCV
 ARG VISION
 COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
```

```diff
@@ -77,13 +77,6 @@ COPY triton_version.txt triton_version.txt
 RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
 RUN rm install_triton.sh common_utils.sh triton-xpu.txt triton_version.txt
 
-# (optional) Install database packages like LMDB and LevelDB
-ARG DB
-COPY ./common/install_db.sh install_db.sh
-RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
-RUN rm install_db.sh
-ENV INSTALLED_DB ${DB}
-
 # (optional) Install vision packages like OpenCV
 ARG VISION
 COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
```

```diff
@@ -74,13 +74,6 @@ RUN if [ -n "${PROTOBUF}" ]; then bash ./install_protobuf.sh; fi
 RUN rm install_protobuf.sh
 ENV INSTALLED_PROTOBUF ${PROTOBUF}
 
-# (optional) Install database packages like LMDB and LevelDB
-ARG DB
-COPY ./common/install_db.sh install_db.sh
-RUN if [ -n "${DB}" ]; then bash ./install_db.sh; fi
-RUN rm install_db.sh
-ENV INSTALLED_DB ${DB}
-
 # (optional) Install vision packages like OpenCV
 ARG VISION
 COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
```

```diff
@@ -73,26 +73,14 @@ fi
 # Check GCC ABI
 ###############################################################################
 
-# NOTE [ Building libtorch with old vs. new gcc ABI ]
-#
-# Packages built with one version of ABI could not be linked against by client
-# C++ libraries that were compiled using the other version of ABI. Since both
-# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
-#
-# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
-# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
+# NOTE: As of https://github.com/pytorch/pytorch/issues/126551 we only produce
+# wheels with cxx11-abi
 
 echo "Checking that the gcc ABI is what we expect"
 if [[ "$(uname)" != 'Darwin' ]]; then
   function is_expected() {
-    if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
-      if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
-        echo 1
-      fi
-    else
-      if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
-        echo 1
-      fi
+    if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
+      echo 1
     fi
   }
```

```diff
@@ -121,9 +121,9 @@ def main() -> None:
     else:
         install_root = Path(distutils.sysconfig.get_python_lib()) / "torch"
 
-    libtorch_cpu_path = install_root / "lib" / "libtorch_cpu.so"
-    pre_cxx11_abi = "cxx11-abi" not in os.getenv("DESIRED_DEVTOOLSET", "")
-    check_lib_symbols_for_abi_correctness(libtorch_cpu_path, pre_cxx11_abi)
+    libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")
+    # NOTE: All binaries are built with cxx11abi now
+    check_lib_symbols_for_abi_correctness(libtorch_cpu_path, False)
 
 
 if __name__ == "__main__":
```

```diff
@@ -76,10 +76,13 @@ def read_release_matrix():
 
 
 def test_numpy():
-    import numpy as np
-
-    x = np.arange(5)
-    torch.tensor(x)
+    try:
+        import numpy as np
+
+        x = np.arange(5)
+        torch.tensor(x)
+    except ImportError:
+        print("Numpy check skipped. Numpy is not installed.")
 
 
 def check_version(package: str) -> None:
@@ -410,6 +413,7 @@ def main() -> None:
     smoke_test_conv2d()
     test_linalg()
     test_numpy()
+
     if is_cuda_system:
         test_linalg("cuda")
         test_cuda_gds_errors_captured()
```

```diff
@@ -1619,6 +1619,7 @@ elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
   install_torchvision
   checkout_install_torchbench hf_T5 llama moco
   PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
+  test_inductor_aoti
 elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
   install_torchvision
   test_inductor_shard "${SHARD_NUMBER}"
```

```diff
@@ -55,12 +55,16 @@ s3_upload() {
     s3_upload_dir="${s3_root_dir}/${UPLOAD_SUBFOLDER}/"
   fi
   (
+    cache_control_flag=""
+    if [[ "${UPLOAD_CHANNEL}" = "test" ]]; then
+      cache_control_flag="--cache-control='no-cache,no-store,must-revalidate'"
+    fi
     for pkg in ${PKG_DIR}/*.${extension}; do
       (
         set -x
         shm_id=$(sha256sum "${pkg}" | awk '{print $1}')
         ${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}" \
-          --metadata "checksum-sha256=${shm_id}"
+          --metadata "checksum-sha256=${shm_id}" ${cache_control_flag}
       )
     done
   )
```

```diff
@@ -48,7 +48,6 @@ misc-*,
 -misc-no-recursion,
 -misc-non-private-member-variables-in-classes,
 -misc-unused-using-decls,
--misc-use-internal-linkage,
 modernize-*,
 -modernize-macro-to-enum,
 -modernize-return-braced-init-list,
```

```diff
@@ -3,8 +3,11 @@ self-hosted-runner:
   # GitHub hosted runner that actionlint doesn't recognize because actionlint version (1.6.21) is too old
   - ubuntu-24.04
   # GitHub hosted x86 Linux runners
+  # TODO: Cleanup mentions of linux.20_04 when upgrade to linux.24_04 is complete
   - linux.20_04.4x
   - linux.20_04.16x
+  - linux.24_04.4x
+  - linux.24_04.16x
   # Organization-wide AWS Linux Runners
   - linux.large
   - linux.2xlarge
@@ -49,6 +52,7 @@ self-hosted-runner:
   - linux.rocm.gpu
   - linux.rocm.gpu.2
   - linux.rocm.gpu.4
+  - rocm-docker
   # Repo-specific Apple hosted runners
   - macos-m1-ultra
   - macos-m2-14
```

```diff
@@ -24,8 +24,12 @@ runs:
       run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
 
     - name: Set up parallel fetch and clean workspace
+      id: first-clean
+      continue-on-error: true
       shell: bash
       if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
+      env:
+        NO_SUDO: ${{ inputs.no-sudo }}
       run: |
         # Use all available CPUs for fetching
         cd "${GITHUB_WORKSPACE}"
@@ -35,10 +39,16 @@ runs:
         # Clean workspace. The default checkout action should also do this, but
         # do it here as well just in case
         if [[ -d .git ]]; then
-          git clean -ffdx
+          if [ -z "${NO_SUDO}" ]; then
+            sudo git clean -ffdx
+          else
+            git clean -ffdx
+          fi
         fi
 
     - name: Checkout PyTorch
+      id: first-checkout-attempt
+      continue-on-error: true
       uses: actions/checkout@v4
       with:
         ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
@@ -46,3 +56,30 @@ runs:
         fetch-depth: ${{ inputs.fetch-depth }}
         submodules: ${{ inputs.submodules }}
         show-progress: false
+
+    - name: Clean workspace (try again)
+      if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' &&
+              (steps.first-clean.outcome != 'success' || steps.first-checkout-attempt.outcome != 'success') }}
+      shell: bash
+      env:
+        NO_SUDO: ${{ inputs.no-sudo }}
+      run: |
+        retry () {
+          $* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
+        }
+        echo "${GITHUB_WORKSPACE}"
+        if [ -z "${NO_SUDO}" ]; then
+          retry sudo rm -rf "${GITHUB_WORKSPACE}"
+        else
+          retry rm -rf "${GITHUB_WORKSPACE}"
+        fi
+        mkdir "${GITHUB_WORKSPACE}"
+
+    - name: Checkout PyTorch (try again)
+      uses: actions/checkout@v4
+      if: ${{ steps.first-clean.outcome != 'success' || steps.first-checkout-attempt.outcome != 'success' }}
+      with:
+        ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
+        fetch-depth: ${{ inputs.fetch-depth }}
+        submodules: ${{ inputs.submodules }}
+        show-progress: false
```

```diff
@@ -17,6 +17,7 @@ from typing import Optional
 
 # NOTE: Also update the CUDA sources in tools/nightly.py when changing this list
 CUDA_ARCHES = ["11.8", "12.6", "12.8"]
+CUDA_STABLE = "12.6"
 CUDA_ARCHES_FULL_VERSION = {
     "11.8": "11.8.0",
     "12.6": "12.6.3",
@@ -67,7 +68,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
         "nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
-        "nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
+        "nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
@@ -76,14 +77,14 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
         "nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | "
-        "nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | "
+        "nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
-        "nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
+        "nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | "
         "nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'"
@@ -373,7 +374,7 @@ def generate_wheels_matrix(
                 }
             )
             # Special build building to use on Colab. Python 3.11 for 12.6 CUDA
-            if python_version == "3.11" and arch_version == "12.6":
+            if python_version == "3.11" and arch_version == CUDA_STABLE:
                 ret.append(
                     {
                         "python_version": python_version,
@@ -416,7 +417,7 @@ def generate_wheels_matrix(
                 "pytorch_extra_install_requirements": (
                     PYTORCH_EXTRA_INSTALL_REQUIREMENTS["xpu"]
                     if gpu_arch_type == "xpu"
-                    else PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.6"]
+                    else PYTORCH_EXTRA_INSTALL_REQUIREMENTS[CUDA_STABLE]
                     if os != "linux"
                     else ""
                 ),
```

.github/scripts/get_ci_variable.py (new executable file)

```diff
@@ -0,0 +1,30 @@
+#!/usr/bin/env python3
+"""Helper script - Return CI variables such as stable cuda, min python version, etc."""
+
+import argparse
+import sys
+
+
+def main(args: list[str]) -> None:
+    import generate_binary_build_matrix
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--cuda-stable-version",
+        action="store_true",
+        help="get cuda stable version",
+    )
+    parser.add_argument(
+        "--min-python-version",
+        action="store_true",
+        help="get min supported python version",
+    )
+    options = parser.parse_args(args)
+    if options.cuda_stable_version:
+        return print(generate_binary_build_matrix.CUDA_STABLE)
+    if options.min_python_version:
+        return print(generate_binary_build_matrix.FULL_PYTHON_VERSIONS[0])
+
+
+if __name__ == "__main__":
+    main(sys.argv[1:])
```

.github/scripts/s390x-ci/tests_list.py (new executable file)

```diff
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+
+import os
+import re
+import sys
+
+sys.path.insert(1, os.path.join(sys.path[0], "..", "..", ".."))
+
+from tools.testing.discover_tests import TESTS
+
+skip_list = [
+    # these tests fail due to various reasons
+    "dynamo/test_misc",
+    "inductor/test_aot_inductor",
+    "inductor/test_cpu_repro",
+    "inductor/test_cpu_select_algorithm",
+    "inductor/test_aot_inductor_arrayref",
+    "inductor/test_torchinductor_codegen_dynamic_shapes",
+    "lazy/test_meta_kernel",
+    "onnx/test_utility_funs",
+    "profiler/test_profiler",
+    "test_ao_sparsity",
+    "test_cpp_extensions_open_device_registration",
+    "test_jit",
+    "test_metal",
+    "test_mps",
+    "dynamo/test_torchrec",
+    "inductor/test_aot_inductor_utils",
+    "inductor/test_coordinate_descent_tuner",
+    "test_jiterator",
+    # these tests run long and fail in addition to that
+    "dynamo/test_dynamic_shapes",
+    "test_quantization",
+    "inductor/test_torchinductor",
+    "inductor/test_torchinductor_dynamic_shapes",
+    "inductor/test_torchinductor_opinfo",
+    "test_binary_ufuncs",
+    "test_unary_ufuncs",
+    # these tests fail when cuda is not available
+    "inductor/test_cudacodecache",
+    "inductor/test_inductor_utils",
+    "inductor/test_inplacing_pass",
+    "inductor/test_kernel_benchmark",
+    "inductor/test_max_autotune",
+    "inductor/test_move_constructors_to_cuda",
+    "inductor/test_multi_kernel",
+    "inductor/test_pattern_matcher",
+    "inductor/test_perf",
+    "inductor/test_select_algorithm",
+    "inductor/test_snode_runtime",
+    "inductor/test_triton_wrapper",
+    # these tests fail when mkldnn is not available
+    "inductor/test_custom_post_grad_passes",
+    "inductor/test_mkldnn_pattern_matcher",
+    # lacks quantization support
+    "onnx/test_models_quantized_onnxruntime",
+    "onnx/test_pytorch_onnx_onnxruntime",
+    # https://github.com/pytorch/pytorch/issues/102078
+    "test_decomp",
+    # https://github.com/pytorch/pytorch/issues/146698
+    "test_model_exports_to_core_aten",
+    # runs very long, skip for now
+    "inductor/test_layout_optim",
+    "test_fx",
+    # some false errors
+    "doctests",
+]
+
+skip_list_regex = [
+    # distributed tests fail randomly
+    "distributed/.*",
+]
+
+all_testfiles = sorted(TESTS)
+
+filtered_testfiles = []
+
+for filename in all_testfiles:
+    if filename in skip_list:
+        continue
+
+    regex_filtered = False
+
+    for regex_string in skip_list_regex:
+        if re.fullmatch(regex_string, filename):
+            regex_filtered = True
+            break
+
+    if regex_filtered:
+        continue
+
+    filtered_testfiles.append(filename)
+
+for filename in filtered_testfiles:
+    print(' "' + filename + '",')
```

```diff
@@ -819,10 +819,9 @@ class GitHubPR:
                 cursor=info["reviews"]["pageInfo"]["startCursor"],
             )
             info = rc["data"]["repository"]["pullRequest"]
-        reviews = {}
-        for author, state in self._reviews:
-            if state != "COMMENTED":
-                reviews[author] = state
+        reviews = {
+            author: state for author, state in self._reviews if state != "COMMENTED"
+        }
         return list(reviews.items())
 
     def get_approved_by(self) -> list[str]:
@@ -2282,7 +2281,8 @@ def merge(
         except MandatoryChecksMissingError as ex:
             last_exception = str(ex)
             print(
-                f"Merge of https://github.com/{pr.org}/{pr.project}/pull/{pr.pr_num} failed due to: {ex}. Retrying in 5 min"
+                f"Merge of https://github.com/{pr.org}/{pr.project}/pull/{pr.pr_num} failed due to: {ex}. Retrying in 5 min",
+                flush=True,
             )
             time.sleep(5 * 60)
     # Finally report timeout back
```

```diff
@@ -33,10 +33,6 @@ on:
         default: "3.9"
         description: |
           The python version to be used. Will be 3.9 by default
-      environment-file:
-        required: false
-        type: string
-        description: Set the conda environment file used to setup macOS build.
       test-matrix:
         required: false
         type: string
@@ -86,23 +82,12 @@ jobs:
           fi
 
       - name: Setup miniconda
-        if: inputs.environment-file == ''
         uses: pytorch/test-infra/.github/actions/setup-miniconda@main
         with:
           python-version: ${{ inputs.python-version }}
           environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
           pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
 
-      # This option is used when cross-compiling arm64 from x86-64. Specifically, we need arm64 conda
-      # environment even though the arch is x86-64
-      - name: Setup miniconda using the provided environment file
-        if: inputs.environment-file != ''
-        uses: pytorch/test-infra/.github/actions/setup-miniconda@main
-        with:
-          python-version: ${{ inputs.python-version }}
-          environment-file: ${{ inputs.environment-file }}
-          pip-requirements-file: .github/requirements/pip-requirements-${{ runner.os }}.txt
-
       - name: Install sccache (only for non-forked PRs, and pushes to trunk)
         uses: nick-fields/retry@v3.0.0
         if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
```

```diff
@@ -35,7 +35,7 @@ jobs:
       pull-requests: write
     name: Check labels
     if: github.repository_owner == 'pytorch'
-    runs-on: linux.20_04.4x
+    runs-on: linux.24_04.4x
    steps:
      - name: Checkout PyTorch
        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
```

```diff
@@ -0,0 +1,55 @@
+name: docker-cache-mi300
+
+on:
+  # run every 6 hours
+  schedule:
+    - cron: 0 0,6,12,18 * * *
+  workflow_dispatch:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
+  cancel-in-progress: true
+
+permissions:
+  id-token: write
+  contents: read
+
+jobs:
+  docker-cache:
+    if: github.repository_owner == 'pytorch'
+    runs-on: rocm-docker
+    steps:
+      - name: Checkout PyTorch
+        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
+        with:
+          no-sudo: true
+      - name: configure aws credentials
+        id: aws_creds
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
+          aws-region: us-east-1
+          role-duration-seconds: 18000
+      - name: Login to Amazon ECR
+        id: login-ecr
+        continue-on-error: false
+        uses: aws-actions/amazon-ecr-login@v2
+      - name: Calculate docker image
+        id: calculate-docker-image
+        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
+        with:
+          docker-image-name: pytorch-linux-focal-rocm-n-py3
+          push: false
+      - name: Pull docker image
+        uses: pytorch/test-infra/.github/actions/pull-docker-image@main
+        with:
+          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
+      - name: Tar and upload to S3 bucket
+        run: |
+          sudo docker save -o ~/docker-data/pytorch/pytorch_docker_image.tar ${{ steps.calculate-docker-image.outputs.docker-image }}
+          sudo rclone copy -P --s3-upload-concurrency 64 --s3-chunk-size 200M --s3-upload-cutoff 300M ~/docker-data/pytorch/pytorch_docker_image.tar oci:pytorchbucket0002/pytorch_docker_image --progress
```

```diff
@@ -117,7 +117,10 @@ jobs:
           # To get QEMU binaries in our PATH
           echo "${RUNNER_TEMP}/bin" >> "${GITHUB_PATH}"
           # Generate PyTorch version to use
-          echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py --no-build-suffix)" >> "${GITHUB_ENV}"
+          {
+            echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py --no-build-suffix)";
+            echo "STABLE_CUDA_VERSION=$(python3 .github/scripts/get_ci_variable.py --stable-cuda-version)"
+          } >> "${GITHUB_ENV}"
       - name: Setup test specific variables
         if: ${{ startsWith(github.event.ref, 'refs/tags/v') }}
         run: |
@@ -154,7 +157,7 @@ jobs:
           docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
 
           # Please note, here we ned to pin specific verison of CUDA as with latest label
-          if [[ ${CUDA_VERSION_SHORT} == "12.4" ]]; then
+          if [[ ${CUDA_VERSION_SHORT} == "${STABLE_CUDA_VERSION}" ]]; then
             docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}" \
               ghcr.io/pytorch/pytorch-nightly:latest
             docker push ghcr.io/pytorch/pytorch-nightly:latest
```

```diff
@@ -64,7 +64,7 @@ jobs:
       ALPINE_IMAGE: "arm64v8/alpine"
       build_name: manywheel-py3_9-cpu-aarch64
       build_environment: linux-aarch64-binary-manywheel
-      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
   manywheel-py3_9-cpu-aarch64-test:  # Testing
@@ -134,7 +134,7 @@ jobs:
       ALPINE_IMAGE: "arm64v8/alpine"
       build_name: manywheel-py3_9-cuda-aarch64-12_8
       build_environment: linux-aarch64-binary-manywheel
-      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
       timeout-minutes: 420
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -181,7 +181,7 @@ jobs:
       ALPINE_IMAGE: "arm64v8/alpine"
       build_name: manywheel-py3_10-cpu-aarch64
       build_environment: linux-aarch64-binary-manywheel
-      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
     secrets:
       github-token: ${{ secrets.GITHUB_TOKEN }}
   manywheel-py3_10-cpu-aarch64-test:  # Testing
@@ -251,7 +251,7 @@ jobs:
       ALPINE_IMAGE: "arm64v8/alpine"
       build_name: manywheel-py3_10-cuda-aarch64-12_8
       build_environment: linux-aarch64-binary-manywheel
-      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+      PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
```
timeout-minutes: 420 timeout-minutes: 420
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
@ -298,7 +298,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64 build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing manywheel-py3_11-cpu-aarch64-test: # Testing
@ -368,7 +368,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8 build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420 timeout-minutes: 420
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
@ -415,7 +415,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64 build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing manywheel-py3_12-cpu-aarch64-test: # Testing
@ -485,7 +485,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8 build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420 timeout-minutes: 420
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
@ -532,7 +532,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cpu-aarch64 build_name: manywheel-py3_13-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cpu-aarch64-test: # Testing manywheel-py3_13-cpu-aarch64-test: # Testing
@ -602,7 +602,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8 build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420 timeout-minutes: 420
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
@ -649,7 +649,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cpu-aarch64 build_name: manywheel-py3_13t-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-aarch64-test: # Testing manywheel-py3_13t-cpu-aarch64-test: # Testing
@ -719,7 +719,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine" ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8 build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420 timeout-minutes: 420
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
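Each PYTORCH_EXTRA_INSTALL_REQUIREMENTS value above is a " | "-separated list of PEP 508 requirement strings, and the environment marker on each entry decides at install time whether that wheel applies. Notably, even in this linux-aarch64 workflow the markers require platform_machine == 'x86_64', so pip would skip these extra wheels on aarch64 hosts. A minimal sketch of how such a value parses, using the `packaging` library; the two-entry string is abbreviated from the hunks above, and the check itself is illustrative rather than part of the workflow:

```python
# Minimal sketch (not part of the workflow): parse a pipe-separated
# PYTORCH_EXTRA_INSTALL_REQUIREMENTS value as PEP 508 requirement strings.
# The value below is abbreviated to two entries from the diff above.
from packaging.requirements import Requirement

value = (
    "nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64'"
    " | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64'"
)

for entry in value.split(" | "):
    req = Requirement(entry)
    pin = next(iter(req.specifier))  # each entry carries exactly one "==" pin
    # The marker gates installation: evaluated against a hypothetical aarch64
    # environment, these x86_64-only entries come back False (i.e. skipped).
    on_aarch64 = req.marker.evaluate(
        {"platform_system": "Linux", "platform_machine": "aarch64"}
    )
    print(f"{req.name} {pin.operator}{pin.version} -> installs on aarch64: {on_aarch64}")
```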


@@ -105,7 +105,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@@ -152,7 +152,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing


@@ -262,7 +262,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@@ -331,7 +331,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@@ -891,7 +891,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@@ -960,7 +960,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@@ -1520,7 +1520,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@@ -1654,7 +1654,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@@ -2214,7 +2214,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@@ -2283,7 +2283,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing manywheel-py3_12-cuda12_8-test: # Testing
@ -2843,7 +2843,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}" runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6 build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing manywheel-py3_13-cuda12_6-test: # Testing
@ -2912,7 +2912,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}" runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8 build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing manywheel-py3_13-cuda12_8-test: # Testing
@ -3472,7 +3472,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}" runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6 build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing manywheel-py3_13t-cuda12_6-test: # Testing
@ -3541,7 +3541,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}" runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8 build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets: secrets:
github-token: ${{ secrets.GITHUB_TOKEN }} github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing manywheel-py3_13t-cuda12_8-test: # Testing
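Note: each PYTORCH_EXTRA_INSTALL_REQUIREMENTS value above is a list of PEP 508 requirement strings joined with " | "; the change here bumps nvidia-nccl-cu12 to 2.26.2 everywhere and, for the CUDA 12.8 wheels only, nvidia-cudnn-cu12 to 9.8.0.87. A minimal sketch of how one such entry can be split, parsed, and marker-checked, assuming the third-party `packaging` library (the variable name `extra_requirements` is illustrative, not part of the workflow):
```python
# Hedged sketch: parse a "pkg==ver; marker | pkg==ver; marker" string like the
# PYTORCH_EXTRA_INSTALL_REQUIREMENTS values above. Requires `pip install packaging`.
from packaging.requirements import Requirement

extra_requirements = (
    "nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-cudnn-cu12==9.8.0.87; platform_system == 'Linux' and platform_machine == 'x86_64'"
)

for entry in extra_requirements.split(" | "):
    req = Requirement(entry)  # parses the PEP 508 name, version specifier, and marker
    applies = req.marker is None or req.marker.evaluate()  # evaluate against this host
    print(f"{req.name}{req.specifier}: install on this machine -> {applies}")
```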

View File

@@ -63,7 +63,7 @@ jobs:
 timeout-minutes: 420
 build_name: manywheel-py3_9-cpu-s390x
 build_environment: linux-s390x-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 secrets:
 github-token: ${{ secrets.GITHUB_TOKEN }}
 manywheel-py3_9-cpu-s390x-test: # Testing
@@ -128,7 +128,7 @@ jobs:
 timeout-minutes: 420
 build_name: manywheel-py3_10-cpu-s390x
 build_environment: linux-s390x-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 secrets:
 github-token: ${{ secrets.GITHUB_TOKEN }}
 manywheel-py3_10-cpu-s390x-test: # Testing
@@ -193,7 +193,7 @@ jobs:
 timeout-minutes: 420
 build_name: manywheel-py3_11-cpu-s390x
 build_environment: linux-s390x-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 secrets:
 github-token: ${{ secrets.GITHUB_TOKEN }}
 manywheel-py3_11-cpu-s390x-test: # Testing
@@ -258,7 +258,7 @@ jobs:
 timeout-minutes: 420
 build_name: manywheel-py3_12-cpu-s390x
 build_environment: linux-s390x-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 secrets:
 github-token: ${{ secrets.GITHUB_TOKEN }}
 manywheel-py3_12-cpu-s390x-test: # Testing
@@ -323,7 +323,7 @@ jobs:
 timeout-minutes: 420
 build_name: manywheel-py3_13-cpu-s390x
 build_environment: linux-s390x-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 secrets:
 github-token: ${{ secrets.GITHUB_TOKEN }}
 manywheel-py3_13-cpu-s390x-test: # Testing
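The same nccl pin bump lands in the s390x workflows even though every entry is gated on platform_machine == 'x86_64', so these requirements are inert on s390x hosts: the marker simply evaluates to False there. A small sketch of that marker logic, again assuming the `packaging` library:
```python
# Hedged sketch: Marker.evaluate() accepts an environment dict that overrides
# the defaults, so we can simulate an s390x Linux host versus an x86_64 one.
from packaging.markers import Marker

marker = Marker("platform_system == 'Linux' and platform_machine == 'x86_64'")
print(marker.evaluate({"platform_system": "Linux", "platform_machine": "s390x"}))   # False
print(marker.evaluate({"platform_system": "Linux", "platform_machine": "x86_64"}))  # True
```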

View File

@@ -43,7 +43,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the
@@ -167,7 +167,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the
@@ -291,7 +291,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the
@@ -415,7 +415,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the
@@ -539,7 +539,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the
@@ -663,7 +663,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13t"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the

View File

@@ -54,7 +54,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 # NOTE: These environment variables are put here so that they can be applied on every job equally
 # They are also here because setting them at a workflow level doesn't give us access to the

View File

@@ -54,7 +54,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -290,7 +290,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -528,7 +528,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -766,7 +766,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.9"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -1238,7 +1238,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -1474,7 +1474,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -1712,7 +1712,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -1950,7 +1950,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.10"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -2422,7 +2422,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -2658,7 +2658,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -2896,7 +2896,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -3134,7 +3134,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.11"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -3606,7 +3606,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -3842,7 +3842,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -4080,7 +4080,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -4318,7 +4318,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.12"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -4790,7 +4790,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -5026,7 +5026,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -5264,7 +5264,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -5502,7 +5502,7 @@ jobs:
 GPU_ARCH_TYPE: cuda
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@@ -5974,7 +5974,7 @@ jobs:
 GPU_ARCH_TYPE: cpu
 SKIP_ALL_TESTS: 1
 DESIRED_PYTHON: "3.13t"
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
 steps:
 - name: Display EC2 information
   shell: bash
@ -6210,7 +6210,7 @@ jobs:
GPU_ARCH_TYPE: cuda GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1 SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.13t" DESIRED_PYTHON: "3.13t"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
steps: steps:
- name: Display EC2 information - name: Display EC2 information
shell: bash shell: bash
@ -6448,7 +6448,7 @@ jobs:
GPU_ARCH_TYPE: cuda GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1 SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.13t" DESIRED_PYTHON: "3.13t"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
steps: steps:
- name: Display EC2 information - name: Display EC2 information
shell: bash shell: bash
@ -6686,7 +6686,7 @@ jobs:
GPU_ARCH_TYPE: cuda GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1 SKIP_ALL_TESTS: 1
DESIRED_PYTHON: "3.13t" DESIRED_PYTHON: "3.13t"
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64' PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
steps: steps:
- name: Display EC2 information - name: Display EC2 information
shell: bash shell: bash


@@ -26,7 +26,7 @@ jobs:
       curr_branch: ${{ github.head_ref || github.ref_name }}

   lintrunner-clang:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -43,7 +43,7 @@ jobs:
         .github/scripts/lintrunner.sh

   lintrunner-noclang:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -59,7 +59,7 @@ jobs:
         .github/scripts/lintrunner.sh

   quick-checks:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -116,7 +116,7 @@ jobs:
         bash .github/scripts/pr-sanity-check.sh

   workflow-checks:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -154,7 +154,7 @@ jobs:
           exit $RC

   toc:
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -194,7 +194,7 @@ jobs:
   test-tools:
     name: Test tools
     if: ${{ github.repository == 'pytorch/pytorch' }}
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     needs: get-label-type
     with:
       timeout: 120
@@ -215,7 +215,7 @@ jobs:
   test_run_test:
     name: Test `run_test.py` is usable without boto3
     if: ${{ github.repository == 'pytorch/pytorch' }}
-    runs-on: linux.20_04.4x
+    runs-on: linux.24_04.4x
     steps:
       - name: Checkout PyTorch
         uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
@@ -241,10 +241,18 @@ jobs:
   test_collect_env:
     if: ${{ github.repository == 'pytorch/pytorch' }}
     name: Test collect_env
-    runs-on: linux.20_04.4x
+    runs-on: ${{ matrix.runner }}
     strategy:
       matrix:
-        test_type: [with_torch, without_torch, older_python_version]
+        include:
+          - test_type: with_torch
+            runner: linux.24_04.4x
+          - test_type: without_torch
+            runner: linux.24_04.4x
+          # NOTE: The oldest supported version of python for 24.04 is 3.8
+          # so this cannot be updated if we want to keep this test at 3.6
+          - test_type: older_python_version
+            runner: linux.20_04.4x
     steps:
       # [see note: pytorch repo ref]
       # deep clone (fetch-depth 0) required, to allow us to use git log
@@ -253,21 +261,28 @@ jobs:
         with:
           submodules: false
           fetch-depth: 1
-      - name: Setup Python 3.6
+      - name: Get min python version
+        id: get-min-python-version
+        if: matrix.test_type == 'older_python_version'
+        run: |
+          set -eou pipefail
+          # Generate PyTorch version to use
+          echo "MIN_PYTHON_VERSION=$(python3 .github/scripts/get_ci_variable.py --min-python-version)" >> "${GITHUB_OUTPUT}"
+      - name: Setup Old Python version
         if: matrix.test_type == 'older_python_version'
         uses: actions/setup-python@v4
         with:
-          python-version: '3.6'
+          python-version: 3.6
           architecture: x64
           check-latest: false
           cache: pip
           cache-dependency-path: |
            **/requirements.txt
-      - name: Setup Python 3.9
+      - name: Setup Min Python version
         if: matrix.test_type != 'older_python_version'
         uses: actions/setup-python@v4
         with:
-          python-version: '3.9'
+          python-version: ${{ steps.get-min-python-version.outputs.MIN_PYTHON_VERSION }}
          architecture: x64
          check-latest: false
          cache: pip


@@ -7,7 +7,7 @@ on:
 jobs:
   do_revert:
     name: try_revert_pr_${{ github.event.client_payload.pr_num }}
-    runs-on: linux.20_04.4x
+    runs-on: linux.24_04.4x
     environment: mergebot
     env:
       GH_RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}


@@ -15,7 +15,7 @@ jobs:
   check_binary_linux_cpu:
     if: github.repository_owner == 'pytorch'
     name: Test check_binary.sh for Linux CPU
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     with:
       docker-image: python:3.11
       docker-build-dir: "skip-docker-build"
@@ -28,7 +28,7 @@ jobs:
   check_binary_linux_cuda:
     if: github.repository_owner == 'pytorch'
     name: Test check_binary.sh for Linux CUDA
-    uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
+    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
     with:
       runner: linux.4xlarge.nvidia.gpu
       docker-image: python:3.11


@@ -7,7 +7,7 @@ on:
 jobs:
   do_merge:
     name: try_merge_pr_${{ github.event.client_payload.pr_num }}
-    runs-on: linux.20_04.4x
+    runs-on: linux.24_04.4x
     environment: mergebot
     permissions:
       id-token: write


@@ -19,7 +19,7 @@
   - [Cherry Picking Fixes](#cherry-picking-fixes)
     - [How to do Cherry Picking](#how-to-do-cherry-picking)
   - [Cherry Picking Reverts](#cherry-picking-reverts)
-  - [Preparing and Creating Final Release candidate](#preparing-and-creating-final-release-candidate)
+  - [Preparing and Creating Final Release Candidate](#preparing-and-creating-final-release-candidate)
   - [Promoting RCs to Stable](#promoting-rcs-to-stable)
   - [Additional Steps to prepare for release day](#additional-steps-to-prepare-for-release-day)
     - [Modify release matrix](#modify-release-matrix)
@@ -63,7 +63,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:

 ## Release Cadence

-Following is the release cadence. All future dates below are tentative, for latest updates on the release scheduled please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27). Please note: Patch Releases are optional.
+Following is the release cadence. All future dates below are tentative. For latest updates on the release schedule, please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27). Please note: Patch Releases are optional.

 | Minor Version | Release branch cut | Release date | First patch release date | Second patch release date|
 | --- | --- | --- | --- | --- |
@@ -91,20 +91,20 @@ Releasing a new version of PyTorch generally entails 3 major steps:

 ### Frequently Asked Questions

-* Q: What is release branch cut ?
+* Q: What is a release branch cut ?
 * A: When bulk of the tracked features merged into the main branch, the primary release engineer starts the release process of cutting the release branch by creating a new git branch based off of the current `main` development branch of PyTorch. This allows PyTorch development flow on `main` to continue uninterrupted, while the release engineering team focuses on stabilizing the release branch in order to release a series of release candidates (RC). The activities in the release branch include both regression and performance testing as well as polishing new features and fixing release-specific bugs. In general, new features *are not* added to the release branch after it was created.

-* Q: What is cherry-pick ?
+* Q: What is a cherry-pick ?
 * A: A cherry pick is a process of propagating commits from the main into the release branch, utilizing git's built in [cherry-pick feature](https://git-scm.com/docs/git-cherry-pick). These commits are typically limited to small fixes or documentation updates to ensure that the release engineering team has sufficient time to complete a thorough round of testing on the release branch. To nominate a fix for cherry-picking, a separate pull request must be created against the respective release branch and then mentioned in the Release Tracker issue (example: https://github.com/pytorch/pytorch/issues/94937) following the template from the issue description. The comment nominating a particular cherry-pick for inclusion in the release should include the committed PR against main branch, the newly created cherry-pick PR, as well as the acceptance criteria for why the cherry-pick is needed in the first place.

 ## Cutting a release branch preparations

-Following Requirements needs to be met prior to cutting a release branch:
+Following requirements need to be met prior to cutting a release branch:
-* Resolve all outstanding issues in the milestones(for example [1.11.0](https://github.com/pytorch/pytorch/milestone/28))before first RC cut is completed. After RC cut is completed following script should be executed from test-infra repo in order to validate the presence of the fixes in the release branch :
+* Resolve all outstanding issues in the milestones (for example [1.11.0](https://github.com/pytorch/pytorch/milestone/28)) before first RC cut is completed. After RC cut is completed, the following script should be executed from test-infra repo in order to validate the presence of the fixes in the release branch:
 ``` python github_analyze.py --repo-path ~/local/pytorch --remote upstream --branch release/1.11 --milestone-id 26 --missing-in-branch ```
-* Validate that all new workflows have been created in the PyTorch and domain libraries included in the release. Validate it against all dimensions of release matrix, including operating systems(Linux, MacOS, Windows), Python versions as well as CPU architectures(x86 and arm) and accelerator versions(CUDA, ROCm, XPU).
+* Validate that all new workflows have been created in the PyTorch and domain libraries included in the release. Validate it against all dimensions of release matrix, including operating systems (Linux, MacOS, Windows), Python versions as well as CPU architectures (x86 and arm) and accelerator versions (CUDA, ROCm, XPU).
-* All the nightly jobs for pytorch and domain libraries should be green. Validate this using following HUD links:
+* All the nightly jobs for pytorch and domain libraries should be green. Validate this using the following HUD links:
   * [Pytorch](https://hud.pytorch.org/hud/pytorch/pytorch/nightly)
   * [TorchVision](https://hud.pytorch.org/hud/pytorch/vision/nightly)
   * [TorchAudio](https://hud.pytorch.org/hud/pytorch/audio/nightly)
@@ -224,12 +224,12 @@ Backups are stored in a non-public S3 bucket at [`s3://pytorch-backup`](https://

 ### Release Candidate health validation

-Validate the release jobs for pytorch and domain libraries should be green. Validate this using following HUD links:
+Validate that the release jobs for pytorch and domain libraries are green. Validate this using the following HUD links:
   * [Pytorch](https://hud.pytorch.org/hud/pytorch/pytorch/release%2F1.12)
   * [TorchVision](https://hud.pytorch.org/hud/pytorch/vision/release%2F1.12)
   * [TorchAudio](https://hud.pytorch.org/hud/pytorch/audio/release%2F1.12)

-Validate that the documentation build has completed and generated entry corresponding to the release in [docs repository](https://github.com/pytorch/docs/tree/main/).
+Validate that the documentation build has completed and generated an entry corresponding to the release in the [docs repository](https://github.com/pytorch/docs/tree/main/).

 ### Cherry Picking Fixes
@@ -274,15 +274,15 @@ requires `pytorchbot`, so it's only available in PyTorch atm.

 ### Cherry Picking Reverts

-If PR that has been cherry-picked into release branch has been reverted, its cherry-pick must be reverted as well.
+If a PR that has been cherry-picked into the release branch has been reverted, its cherry-pick must be reverted as well.

-Reverts for changes that was committed into the main branch prior to the branch cut, must be propagated into release branch as well.
+Reverts for changes that were committed into the main branch prior to the branch cut must be propagated into the release branch as well.

-## Preparing and Creating Final Release candidate
+## Preparing and Creating Final Release Candidate

-The following requirements need to be met prior to creating final Release Candidate :
+The following requirements need to be met prior to creating the final Release Candidate:

-* Resolve all outstanding open issues in the milestone. There should be no open issues/PRs (for example [2.1.2](https://github.com/pytorch/pytorch/milestone/39)). The issue should either be closed or de-milestoned.
+* Resolve all outstanding open issues in the milestone. There should be no open issues/PRs (for example [2.1.2](https://github.com/pytorch/pytorch/milestone/39)). Each issue should either be closed or de-milestoned.
 * Validate that all closed milestone PRs are present in the release branch. Confirm this by running:
 ``` python github_analyze.py --repo-path ~/local/pytorch --remote upstream --branch release/2.2 --milestone-id 40 --missing-in-branch ```
@@ -291,7 +291,7 @@ The following requirements need to be met prior to creating final Release Candid
 * Perform [Release Candidate health validation](#release-candidate-health-validation). CI should have the green signal.

-After the final RC is created. The following tasks should be performed :
+After the final RC is created, the following tasks should be performed:

 * Perform [Release Candidate health validation](#release-candidate-health-validation). CI should have the green signal.
@@ -323,25 +323,25 @@ Promotion should occur in two steps:

 ## Additional Steps to prepare for release day

-The following should be prepared for the release day
+The following should be prepared for the release day:

 ### Modify release matrix

-Need to modify release matrix for get started page. See following [PR](https://github.com/pytorch/test-infra/pull/4611) as reference.
+Modify the release matrix for the get started page. See the following [PR](https://github.com/pytorch/test-infra/pull/4611) as reference.

-The PR to update published_versions.json and quick-start-module.js is auto generated. See following [PR](https://github.com/pytorch/pytorch.github.io/pull/1467) as reference.
+The PR to update published_versions.json and quick-start-module.js is auto generated. See the following [PR](https://github.com/pytorch/pytorch.github.io/pull/1467) as reference.

-Please note: This PR needs to be merged on the release day and hence it should be absolutely free of any failures. To test this PR, open another test PR but pointing to the Release candidate location as above [Release Candidate Storage](RELEASE.md#release-candidate-storage)
+Please note: This PR needs to be merged on the release day and hence it should be absolutely free of any failures. To test this PR, open another test PR pointing to the Release Candidate location as described in the [Release Candidate Storage](#release-candidate-storage) section.

 ### Open Google Colab issue

-This is normally done right after the release is completed. We would need to create Google Colab Issue see following [PR](https://github.com/googlecolab/colabtools/issues/2372)
+This is normally done right after the release is completed. We need to create a Google Colab issue. See the following example [issue](https://github.com/googlecolab/colabtools/issues/2372)

 # Patch Releases

 A patch release is a maintenance release of PyTorch that includes fixes for regressions found in a previous minor release. Patch releases typically will bump the `patch` version from semver (i.e. `[major].[minor].[patch]`).

-Please note: Starting from 2.1 one can expect up to 2 patch releases after every minor ones. Patch releases would only be published for latest minor release.
+Please note: Starting from 2.1, one can expect up to 2 patch releases after every minor release. Patch releases are only published for the latest minor release.

 ## Patch Release Criteria
@@ -363,29 +363,29 @@ Patch releases should be considered if a regression meets the following criteria

 > Main POC: Patch Release Managers, Triage Reviewers

 Patch releases should follow these high-level phases. This process starts immediately after the previous release has completed.
-Patch release process takes around 4-5 weeks to complete.
+The patch release process takes around 4-5 weeks to complete.

-1. Triage, is a process where issues are identified, graded, compared to Patch Release Criteria and added to Patch Release milestone. This process normally takes 2 weeks after the release completion.
+1. Triage is a process where issues are identified, graded, compared to Patch Release Criteria and added to Patch Release milestone. This process normally takes 2 weeks after the release completion.
 2. Go/No Go meeting between PyTorch Releng, PyTorch Core and Project Managers where potential issues triggering a release in milestones are reviewed, and following decisions are made:
-    * Should the new patch Release be created ?
+    * Should the new patch release be created?
     * Timeline execution for the patch release
-3. Cherry picking phase starts after the decision is made to create patch release. At this point a new release tracker for the patch release is created, and an announcement will be made on official channels [example announcement](https://dev-discuss.pytorch.org/t/pytorch-release-2-0-1-important-information/1176). The authors of the fixes to regressions will be asked to create their own cherry picks. This process normally takes 2 weeks.
+3. Cherry picking phase starts after the decision is made to create a patch release. At this point, a new release tracker for the patch release is created, and an announcement will be made on official channels [example announcement](https://dev-discuss.pytorch.org/t/pytorch-release-2-0-1-important-information/1176). The authors of the fixes to regressions will be asked to create their own cherry picks. This process normally takes 2 weeks.
-4. Building Binaries, Promotion to Stable and testing. After all cherry picks have been merged, Release Managers trigger new build and produce new release candidate. Announcement is made on the official channel about the RC availability at this point. This process normally takes 2 weeks.
+4. Building Binaries, Promotion to Stable and testing. After all cherry picks have been merged, Release Managers trigger a new build and produce a new release candidate. An announcement is made on the official channel about the RC availability at this point. This process normally takes 2 weeks.
 5. General Availability

 ### Triage

 > Main POC: Triage Reviewers

-1. Tag issues / pull requests that are candidates for a potential patch release with `triage review`
+1. Tag issues/pull requests that are candidates for a potential patch release with `triage review`
     * ![adding triage review label](https://user-images.githubusercontent.com/1700823/132589089-a9210a14-6159-409d-95e5-f79067f6fa38.png)
-2. Triage reviewers will then check if the regression / fix identified fits within above mentioned [Patch Release Criteria](#patch-release-criteria)
+2. Triage reviewers will then check if the regression/fix identified fits within the above mentioned [Patch Release Criteria](#patch-release-criteria)
-3. Triage reviewers will then add the issue / pull request to the related milestone (i.e. `1.9.1`) if the regressions is found to be within the [Patch Release Criteria](#patch-release-criteria)
+3. Triage reviewers will then add the issue/pull request to the related milestone (i.e. `1.9.1`) if the regression is found to be within the [Patch Release Criteria](#patch-release-criteria)
     * ![adding to milestone](https://user-images.githubusercontent.com/1700823/131175980-148ff38d-44c3-4611-8a1f-cd2fd1f4c49d.png)

 ### Issue Tracker for Patch releases

-For patch releases issue tracker needs to be created. For patch release, we require all cherry-pick changes to have links to either a high-priority GitHub issue or a CI failure from previous RC. An example of this would look like:
+For patch releases, an issue tracker needs to be created. For a patch release, we require all cherry-pick changes to have links to either a high-priority GitHub issue or a CI failure from previous RC. An example of this would look like:
 * https://github.com/pytorch/pytorch/issues/128436

 Only following issues are accepted:


@@ -343,9 +343,32 @@ if(USE_CUDA)
 endif()

 if(USE_ROCM)
+  # NOTE: The PyTorch build does not actually add_subdirectory
+  # third_party/composable_kernel or use it as a CMake library. What is used
+  # is header only, so this should be ok, except that the CMake build generates
+  # a ck/config.h. We just do that part here. Without this, the ck.h from the
+  # ROCM SDK may get accidentally used instead.
+  function(_pytorch_rocm_generate_ck_conf)
+    set(CK_ENABLE_INT8 "ON")
+    set(CK_ENABLE_FP16 "ON")
+    set(CK_ENABLE_FP32 "ON")
+    set(CK_ENABLE_FP64 "ON")
+    set(CK_ENABLE_BF16 "ON")
+    set(CK_ENABLE_FP8 "ON")
+    set(CK_ENABLE_BF8 "ON")
+    set(CK_USE_XDL "ON")
+    set(CK_USE_WMMA "ON")
+    configure_file(
+      "${Torch_SOURCE_DIR}/third_party/composable_kernel/include/ck/config.h.in"
+      "${CMAKE_CURRENT_BINARY_DIR}/composable_kernel/ck/config.h"
+    )
+  endfunction()
   list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/hip)
   list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/include)
   list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/library/include)
+  list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_BINARY_DIR}/composable_kernel)
+  _pytorch_rocm_generate_ck_conf()
   # Next two lines are needed because TunableOp uses third-party/fmt
   list(APPEND ATen_HIP_INCLUDE $<TARGET_PROPERTY:fmt::fmt-header-only,INTERFACE_INCLUDE_DIRECTORIES>)
   list(APPEND ATen_HIP_DEPENDENCY_LIBS fmt::fmt-header-only)


@@ -69,7 +69,7 @@ Generator createCPUGenerator(uint64_t seed_val) {
  * Helper function to concatenate two 32 bit unsigned int
  * and return them as a 64 bit unsigned int
  */
-inline uint64_t make64BitsFrom32Bits(uint32_t hi, uint32_t lo) {
+inline static uint64_t make64BitsFrom32Bits(uint32_t hi, uint32_t lo) {
   return (static_cast<uint64_t>(hi) << 32) | lo;
 }
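Reader's note: adding `static` to an `inline` function gives it internal linkage, which is the recurring theme of the `inline static` / `thread_local static` changes across the files below (each translation unit keeps its own private copy, avoiding cross-TU symbol clashes). A minimal standalone sketch of the helper's bit layout, with a hypothetical `main` harness that is not part of the commit:
```
#include <cassert>
#include <cstdint>

// Same bit layout as the helper above: hi -> bits 63..32, lo -> bits 31..0.
// `static` restricts the symbol to this translation unit.
inline static uint64_t make64BitsFrom32Bits(uint32_t hi, uint32_t lo) {
  return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main() {
  assert(make64BitsFrom32Bits(0x12345678u, 0x9ABCDEF0u) == 0x123456789ABCDEF0ULL);
  return 0;
}
```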


@@ -588,7 +588,7 @@ Allocator* getCPUAllocator() {
 // means the allow_tf32 flags are overridden and tf32 is force disabled
 // override_allow_tf32_flag = false
 // means the original allow_tf32 flags are followed
-thread_local bool override_allow_tf32_flag = false;
+thread_local static bool override_allow_tf32_flag = false;

 NoTF32Guard::NoTF32Guard() {
   if (!override_allow_tf32_flag) {
@@ -611,7 +611,7 @@ bool NoTF32Guard::should_disable_tf32() {
 // This information can be used, for example, to select implementations
 // with different numerical or performance characteristics.
 // See https://pytorch.org/docs/stable/notes/numerical_accuracy.html for details.
-thread_local bool rocm_is_backward_pass;
+thread_local static bool rocm_is_backward_pass;

 ROCmBackwardPassGuard::ROCmBackwardPassGuard() {
   rocm_is_backward_pass = true;
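These `thread_local` flags back RAII guards such as `NoTF32Guard` and `ROCmBackwardPassGuard`: the constructor flips a per-thread flag, the destructor restores it. A rough standalone sketch of that pattern, with stand-in names rather than the actual aten API:
```
#include <iostream>

// Stand-in for a per-thread override flag such as override_allow_tf32_flag.
thread_local static bool override_flag = false;

// RAII guard: set the flag for the enclosing scope, restore it on exit.
struct OverrideGuard {
  bool saved_;
  OverrideGuard() : saved_(override_flag) { override_flag = true; }
  ~OverrideGuard() { override_flag = saved_; }
};

int main() {
  std::cout << override_flag << '\n';    // 0
  {
    OverrideGuard guard;
    std::cout << override_flag << '\n';  // 1 inside the guarded scope
  }
  std::cout << override_flag << '\n';    // 0 again after the guard is destroyed
  return 0;
}
```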


@@ -110,6 +110,11 @@ class TORCH_API Context {
   Allocator* getPinnedMemoryAllocator(
       std::optional<c10::DeviceType> device_type = std::nullopt) {
+    auto opt_device_type =
+        device_type.has_value() ? device_type : at::getAccelerator();
+    if (opt_device_type) {
+      lazyInitDevice(opt_device_type.value());
+    }
     return getAcceleratorHooksInterface(device_type).getPinnedMemoryAllocator();
   }


@@ -28,10 +28,8 @@ c10::Allocator* GetCPUAllocatorMaybePinned(bool pin_memory) {
     opt_device_type = at::getAccelerator(false);
   }
   if (opt_device_type.has_value()) {
-    at::globalContext().lazyInitDevice(opt_device_type.value());
-    return at::globalContext()
-        .getAcceleratorHooksInterface(opt_device_type)
-        .getPinnedMemoryAllocator();
+    return at::globalContext().getPinnedMemoryAllocator(
+        opt_device_type.value());
   } else {
     TORCH_CHECK(
         false, "Need to provide pin_memory allocator to use pin memory.")
@@ -172,7 +170,7 @@ SymInt computeStorageNbytes(
 }

 template <typename T>
-TensorBase _empty_generic(
+static TensorBase _empty_generic(
     ArrayRef<T> size,
     c10::Allocator* allocator,
     c10::DispatchKeySet ks,
@@ -225,7 +223,7 @@ TensorBase empty_generic_symint(
 }

 template <typename T>
-TensorBase _empty_strided_generic(
+static TensorBase _empty_strided_generic(
     T size,
     T stride,
     c10::Allocator* allocator,


@@ -59,7 +59,7 @@ SymDimVector infer_size_symdimvector(SymIntArrayRef a, SymIntArrayRef b) {
 }

 template<typename Container>
-C10_ALWAYS_INLINE InferExpandGeometryResult<Container> inferExpandGeometryImpl(
+C10_ALWAYS_INLINE static InferExpandGeometryResult<Container> inferExpandGeometryImpl(
     IntArrayRef tensor_sizes,
     IntArrayRef tensor_strides,
     IntArrayRef sizes) {


@@ -737,7 +737,7 @@ bool isFunctionalTensor(const c10::List<::std::optional<Tensor>>& t_list) {
 }

 template <typename T>
-bool isFunctionalTensorIListRef(c10::IListRef<T> list) {
+static bool isFunctionalTensorIListRef(c10::IListRef<T> list) {
   if (list.size() == 0) return false;
   auto functional_count = 0;
   for (const auto& tensor : list) {
@@ -803,7 +803,7 @@ void set_sizes_strides_offset(const std::vector<Tensor>& outs, const std::vector
 }
 }

-thread_local bool _functionalizationReapplyViews;
+thread_local static bool _functionalizationReapplyViews;

 bool getFunctionalizationReapplyViewsTLS() {
   return _functionalizationReapplyViews;


@@ -2,7 +2,7 @@

 namespace at::impl {

-thread_local int64_t VmapMode_current_vmap_level = 0;
+thread_local static int64_t VmapMode_current_vmap_level = 0;

 int64_t VmapMode::current_vmap_level() {
   return VmapMode_current_vmap_level;


@@ -71,7 +71,7 @@ c10::DispatchKeySet get_view_key_set(const at::Tensor& base) {

 namespace at::native {

-inline std::vector<int64_t> construct_opt_sizes(const at::Tensor& sizes) {
+inline static std::vector<int64_t> construct_opt_sizes(const at::Tensor& sizes) {
   // torch.tensor([]) is considered to have `dim() = 1` and `size(0) = 0`
   // torch.nested_tensor([]) should also has `dim() = 1` and `size(0) = 0`
   if (sizes.dim() == 0) {


@@ -5,7 +5,7 @@ namespace at {

 // See TensorGeometry.h on why this is useful now that we cache is_contiguous.
 template <typename T>
-bool _geometry_is_contiguous(ArrayRef<T> sizes, ArrayRef<T> strides) {
+static bool _geometry_is_contiguous(ArrayRef<T> sizes, ArrayRef<T> strides) {
   assert(!overflows<std::int64_t>(sizes.size()));
   auto dim = static_cast<std::int64_t>(sizes.size());
   T expected_stride = 1;


@@ -327,7 +327,7 @@ std::vector<int64_t> defaultStrides(IntArrayRef sizes) {
 // see overloads of computeStride() below.
 //
 template <typename ResultVec, typename NewShapeVec, typename Numel>
-inline std::optional<ResultVec> computeStride_impl(
+inline static std::optional<ResultVec> computeStride_impl(
     const NewShapeVec& oldshape,
     const NewShapeVec& oldstride,
     const NewShapeVec& newshape,


@@ -20,12 +20,12 @@ namespace at {
 // We haven't made a decision on that yet so we are temporarily banning random
 // operations inside of vmap while we gather user feedback.

-template <typename... Args> Tensor unsupportedRandomOp(Args... args) {
+template <typename... Args> static Tensor unsupportedRandomOp(Args... args) {
   TORCH_CHECK(false, "vmap: We do not yet support calling random operations inside of vmap. ",
               "Please perform random operations outside of vmap as a workaround");
 }

-template <typename... Args> Tensor& unsupportedRandomOp_(Args... args) {
+template <typename... Args> static Tensor& unsupportedRandomOp_(Args... args) {
   TORCH_CHECK(false, "vmap: We do not yet support calling random operations inside of vmap. ",
               "Please perform random operations outside of vmap as a workaround");
 }
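The trick here is a catch-all variadic template: one definition matches any random-op signature and raises the same error at call time. A self-contained sketch of the pattern (stand-in names, `std::runtime_error` instead of `TORCH_CHECK`):
```
#include <cstdio>
#include <stdexcept>

// One variadic stub matches any operator signature, as unsupportedRandomOp
// does above, and raises a uniform error instead of dispatching.
template <typename... Args>
static int unsupportedOp(Args&&...) {
  throw std::runtime_error("random operations are not supported here");
}

int main() {
  try {
    unsupportedOp(42, 3.14, "seed");  // any argument list binds to the stub
  } catch (const std::exception& e) {
    std::printf("%s\n", e.what());
  }
  return 0;
}
```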


@@ -64,7 +64,7 @@ thread_local std::array<at::ScalarType, at::COMPILE_TIME_MAX_DEVICE_TYPES>
     at::ScalarType::Undefined, // IDEEP.
     at::kHalf, // AMD HIP
     at::ScalarType::Undefined, // FPGA
-    at::ScalarType::Undefined, // ONNX Runtime / Microsoft
+    at::kBFloat16, // ONNX Runtime / Microsoft
     at::kBFloat16, // XLA / TPU
     at::ScalarType::Undefined, // Vulkan
     at::ScalarType::Undefined, // Metal
@@ -500,6 +500,44 @@ TORCH_LIBRARY_IMPL(aten, AutocastMTIA, m) {
       TORCH_FN((&at::autocast::binary_cross_entropy_banned)));
 }

+// MAIA
+TORCH_LIBRARY_IMPL(_, AutocastMAIA, m) {
+  m.fallback(torch::CppFunction::makeFallthrough());
+}
+
+TORCH_LIBRARY_IMPL(aten, AutocastMAIA, m) {
+  // lower_precision_fp
+#define _KERNEL_MAIA_LOW_PRECISION_FP(...) \
+  KERNEL_MAIA(__VA_ARGS__, lower_precision_fp)
+
+  AT_FORALL_LOWER_PRECISION_FP(_KERNEL_MAIA_LOW_PRECISION_FP)
+
+  // fp32
+#define _KERNEL_MAIA_FP32(...) KERNEL_MAIA(__VA_ARGS__, fp32)
+
+  AT_FORALL_FP32(_KERNEL_MAIA_FP32)
+
+  // fp32_set_opt_dtype
+#define _KERNEL_MAIA_FP32_SET_OPT_DTYPE(...) \
+  KERNEL_MAIA(__VA_ARGS__, fp32_set_opt_dtype)
+
+  AT_FORALL_FP32_SET_OPT_DTYPE(_KERNEL_MAIA_FP32_SET_OPT_DTYPE)
+
+  // fp32_append_dtype
+  // The fp32_append_dtype wrapper overrides implicit promotion behavior.
+  // norm does not implicitly promote, but be aware when adding new ops to this policy.
+  AT_FORALL_DIFFERENT_REDISPATCH_SIGNATURE(
+      KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_MAIA)
+
+  // promote
+#define _KERNEL_MAIA_PROMOTE(...) KERNEL_MAIA(__VA_ARGS__, promote)
+
+  AT_FORALL_PROMOTE(_KERNEL_MAIA_PROMOTE)
+
+  m.impl(TORCH_SELECTIVE_NAME("aten::binary_cross_entropy"),
+         TORCH_FN((&at::autocast::binary_cross_entropy_banned)));
+}
+
 // XPU
 TORCH_LIBRARY_IMPL(_, AutocastXPU, m) {
   m.fallback(torch::CppFunction::makeFallthrough());


@@ -123,12 +123,14 @@ TORCH_API inline void set_autocast_gpu_dtype(at::ScalarType dtype) {
   _(privateuseone, at::kPrivateUse1)

 // deprecated other backend specific autocast APIs
+// NOLINTNEXTLINE(misc-use-internal-linkage)
 AT_FORALL_DEPRECATED_AUTOCAST_BACKENDS(DECLARE_DEPRECATED_AUTOCAST_APIS)

-const std::array<at::DeviceType, 9> _AUTOCAST_SUPPORTED_DEVICES{
+const std::array<at::DeviceType, 10> _AUTOCAST_SUPPORTED_DEVICES{
     at::kCPU,
     at::kCUDA,
     at::kMTIA,
+    at::kMAIA,
     at::kXPU,
     at::kIPU,
     at::kHPU,
@@ -149,6 +151,8 @@ inline bool is_autocast_eligible(
           tensor.is_floating_point();
     case c10::DeviceType::MTIA:
       return tensor.is_mtia() && tensor.is_floating_point();
+    case c10::DeviceType::MAIA:
+      return tensor.is_maia() && tensor.is_floating_point();
     case c10::DeviceType::XPU:
       return tensor.is_xpu() && tensor.is_floating_point();
     case c10::DeviceType::IPU:
@@ -176,6 +180,8 @@ inline DispatchKey get_autocast_dispatch_key_from_device_type(
       return DispatchKey::AutocastCPU;
     case c10::DeviceType::MTIA:
       return DispatchKey::AutocastMTIA;
+    case c10::DeviceType::MAIA:
+      return DispatchKey::AutocastMAIA;
     case c10::DeviceType::XPU:
       return DispatchKey::AutocastXPU;
     case c10::DeviceType::IPU:
@@ -747,6 +753,24 @@ copy pasted in from VariableTypeEverything.cpp with appropriate substitutions.
       REDISPATCH_SIGNATURE, \
       POLICY)

+// KERNEL_MAIA/KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_MAIA
+// registration (OP, POLICY) or (OP, OVERLOAD, POLICY) for AutocastMAIA
+#define KERNEL_MAIA(...) KERNEL(c10::DeviceType::MAIA, __VA_ARGS__)
+
+#define KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_MAIA( \
+    REDISPATCH_FUNC, \
+    REGISTER_NAME, \
+    REGISTER_SIGNATURE, \
+    REDISPATCH_SIGNATURE, \
+    POLICY) \
+  KERNEL_DIFFERENT_REDISPATCH_SIGNATURE( \
+      c10::DeviceType::MAIA, \
+      REDISPATCH_FUNC, \
+      REGISTER_NAME, \
+      REGISTER_SIGNATURE, \
+      REDISPATCH_SIGNATURE, \
+      POLICY)
+
 // KERNEL_XPU/KERNEL_DIFFERENT_REDISPATCH_SIGNATURE_XPU
 // registration (OP, POLICY) or (OP, OVERLOAD, POLICY) for AutocastXPU
 #define KERNEL_XPU(...) KERNEL(c10::DeviceType::XPU, __VA_ARGS__)
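The per-device registration macros are thin variadic forwarders: `KERNEL_MAIA` pins the device tag and passes everything else through to the generic `KERNEL`. A toy illustration of that forwarding, using `printf` and string tags instead of the real `c10::DeviceType` machinery:
```
#include <cstdio>

// The generic registration macro takes the device tag as its first argument.
#define KERNEL(dev, op, policy) std::printf("%s %s %s\n", dev, op, policy)

// The device-specific wrapper pins the tag and forwards the rest,
// mirroring KERNEL_MAIA above (strings here, not the actual autocast API).
#define KERNEL_MAIA(...) KERNEL("MAIA", __VA_ARGS__)

int main() {
  KERNEL_MAIA("aten::mm", "lower_precision_fp");  // -> KERNEL("MAIA", ...)
  return 0;
}
```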


@@ -43,7 +43,7 @@ std::string toString(const Scalar& s) {
 namespace at {

 //not all C++ compilers have default float so we define our own here
-inline std::ios_base& defaultfloat(std::ios_base& __base) {
+inline static std::ios_base& defaultfloat(std::ios_base& __base) {
   __base.unsetf(std::ios_base::floatfield);
   return __base;
 }


@@ -42,7 +42,7 @@ static std::vector<at::OptionalTensorRef> get_unboxed_opt_tensor_vector() {
 }

 template <typename T>
-void check_elements_same(at::ITensorListRef list, const T& thing, int use_count) {
+static void check_elements_same(at::ITensorListRef list, const T& thing, int use_count) {
   EXPECT_EQ(thing.size(), list.size());
   size_t i = 0;
   for (const auto& t : list) {


@@ -5,7 +5,7 @@
 namespace at {

-thread_local bool NamesMode_enabled = true;
+thread_local static bool NamesMode_enabled = true;

 bool NamesMode::is_enabled() {
   return NamesMode_enabled;


@@ -80,6 +80,10 @@ TORCH_LIBRARY_IMPL(_, AutogradMTIA, m) {
   m.fallback(AUTOGRAD_FALLBACK);
 }

+TORCH_LIBRARY_IMPL(_, AutogradMAIA, m) {
+  m.fallback(AUTOGRAD_FALLBACK);
+}
+
 TORCH_LIBRARY_IMPL(_, AutogradXLA, m) {
   m.fallback(AUTOGRAD_FALLBACK);
 }


@@ -329,7 +329,7 @@ class CuBlasLtMatmulPreference : public CuBlasLtDescriptor<

 template <typename Dtype>
-inline void bgemm_internal_cublaslt(CUDABLAS_BGEMM_ARGTYPES(Dtype)) {
+static inline void bgemm_internal_cublaslt(CUDABLAS_BGEMM_ARGTYPES(Dtype)) {
   cudaDataType_t abcType = CUDA_R_32F;
   cublasComputeType_t computeType = CUBLAS_COMPUTE_32F;
   cudaDataType_t scaleType = CUDA_R_32F;
@@ -1079,7 +1079,13 @@ void gemm_internal<float>(CUDABLAS_GEMM_ARGTYPES(float))
 }
 #ifdef USE_ROCM
   else if (at::globalContext().blasPreferredBackend() == BlasBackend::Ck) {
-    at::native::gemm_internal_ck<float>(CUDABLAS_GEMM_ARGS(float));
+    auto dprops = at::cuda::getCurrentDeviceProperties();
+    c10::string_view arch(dprops->gcnArchName);
+    if (arch == "gfx1100") { //no CK GEMM version for gfx1100
+      gemm_internal_cublaslt<float>(CUDABLAS_GEMM_ARGS(float));
+    } else{
+      at::native::gemm_internal_ck<float>(CUDABLAS_GEMM_ARGS(float));
+    }
   }
 #endif
   else {
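The hunk gates the composable_kernel path on the reported GPU architecture string. A stand-in sketch of that dispatch shape, with a hypothetical `pick_backend` helper instead of the real `at::cuda::getCurrentDeviceProperties()` call:
```
#include <cstdio>
#include <string>

// Stand-in for the dispatch above: CK kernels are skipped on gfx1100 and the
// call falls back to the hipBLASLt path. pick_backend is hypothetical.
static const char* pick_backend(const std::string& gcn_arch_name) {
  if (gcn_arch_name == "gfx1100") {  // no CK GEMM version for gfx1100
    return "cublaslt";
  }
  return "ck";
}

int main() {
  std::printf("%s\n", pick_backend("gfx1100"));  // cublaslt
  std::printf("%s\n", pick_backend("gfx90a"));   // ck
  return 0;
}
```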


@@ -156,6 +156,7 @@ NVRTC_STUB2(nvrtcGetProgramLogSize,nvrtcProgram, size_t*)
 NVRTC_STUB2(nvrtcGetProgramLog, nvrtcProgram, char *)
 NVRTC_STUB3(nvrtcGetLoweredName, nvrtcProgram, const char *, const char **)

+CUDA_STUB2(cuModuleLoad, CUmodule*, const char*)
 CUDA_STUB2(cuModuleLoadData, CUmodule *, const void *)
 CUDA_STUB3(cuModuleGetFunction, CUfunction *, CUmodule, const char *)
 CUDA_STUB4(cuOccupancyMaxActiveBlocksPerMultiprocessor, int *, CUfunction, int, size_t)
@@ -169,6 +170,8 @@ CUDA_STUB4(cuLinkCreate, unsigned int, CUjit_option *, void **, CUlinkState *)
 CUDA_STUB3(cuLinkComplete, CUlinkState, void **, size_t *)
 CUDA_STUB3(cuFuncSetAttribute, CUfunction, CUfunction_attribute, int)
 CUDA_STUB3(cuFuncGetAttribute, int*, CUfunction_attribute, CUfunction)
+CUDA_STUB3(cuPointerGetAttribute, void*, CUpointer_attribute, CUdeviceptr)
+
 #if defined(CUDA_VERSION) && CUDA_VERSION >= 12000
 CUresult CUDAAPI


@@ -43,6 +43,7 @@ namespace at::cuda {
   _(nvrtcGetProgramLogSize) \
   _(nvrtcGetProgramLog) \
   _(nvrtcGetLoweredName) \
+  _(cuModuleLoad) \
   _(cuModuleLoadData) \
   _(cuModuleLoadDataEx) \
   _(cuModuleGetFunction) \
@@ -60,6 +61,7 @@ namespace at::cuda {
   _(cuLinkComplete) \
   _(cuFuncSetAttribute) \
   _(cuFuncGetAttribute) \
+  _(cuPointerGetAttribute) \

 #if defined(CUDA_VERSION) && CUDA_VERSION >= 12000
 #define AT_FORALL_NVRTC_EXTENDED(_) \


@@ -575,11 +575,20 @@ struct ScaledGemmParams : OpParams {
  std::string BLASSignature() const override {
    // Excluding use_fast_accum and use_rowise booleans for now
-   return fmt::sprintf("- { function: matmul, M: %ld, N: %ld, K: %ld, lda: %ld, ldb: %ld, ldc: %ld, ldd: %ld, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, "
-       "transA: %c, transB: %c, batch_count: 1, scaleA: f32_r, scaleB: f32_r, a_type: %s, b_type: %s, c_type: %s, d_type: %s, bias_type: %s, scale_type: %s, compute_type: %s }",
-       m, n, k, lda, ldb, ldc, ldc, transa, transb,
-       ScalarTypeToBLASType(a_dtype), ScalarTypeToBLASType(b_dtype), ScalarTypeToBLASType(c_dtype), ScalarTypeToBLASType(c_dtype), ScalarTypeToBLASType(bias_dtype),
-       ComputeTypeFor<T>(), ComputeTypeFor<T>());
+   if (bias_ptr == nullptr) {
+     return fmt::sprintf("- { function: matmul, M: %ld, N: %ld, K: %ld, lda: %ld, ldb: %ld, ldc: %ld, ldd: %ld, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, "
+         "transA: %c, transB: %c, batch_count: 1, scaleA: f32_r, scaleB: f32_r, a_type: %s, b_type: %s, c_type: %s, d_type: %s, scale_type: %s, compute_type: %s }",
+         m, n, k, lda, ldb, ldc, ldc, transa, transb,
+         ScalarTypeToBLASType(a_dtype), ScalarTypeToBLASType(b_dtype), ScalarTypeToBLASType(c_dtype), ScalarTypeToBLASType(c_dtype),
+         ComputeTypeFor<T>(), ComputeTypeFor<T>());
+   }
+   else {
+     return fmt::sprintf("- { function: matmul, M: %ld, N: %ld, K: %ld, lda: %ld, ldb: %ld, ldc: %ld, ldd: %ld, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, "
+         "transA: %c, transB: %c, batch_count: 1, scaleA: f32_r, scaleB: f32_r, a_type: %s, b_type: %s, c_type: %s, d_type: %s, bias_type: %s, scale_type: %s, compute_type: %s }",
+         m, n, k, lda, ldb, ldc, ldc, transa, transb,
+         ScalarTypeToBLASType(a_dtype), ScalarTypeToBLASType(b_dtype), ScalarTypeToBLASType(c_dtype), ScalarTypeToBLASType(c_dtype), ScalarTypeToBLASType(bias_dtype),
+         ComputeTypeFor<T>(), ComputeTypeFor<T>());
+   }
  }
  std::string Signature() const override {
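For illustration, the no-bias branch serializes to a single YAML-style record like the following (the values here are hypothetical; only the field layout comes from the format string above):

```
- { function: matmul, M: 1024, N: 1024, K: 512, lda: 512, ldb: 512, ldc: 1024, ldd: 1024, stride_a: 0, stride_b: 0, stride_c: 0, stride_d: 0, transA: T, transB: N, batch_count: 1, scaleA: f32_r, scaleB: f32_r, a_type: f8_r, b_type: f8_r, c_type: f16_r, d_type: f16_r, scale_type: f32_r, compute_type: f32_r }
```

The bias branch emits the same record plus a `bias_type:` field, so entries with and without bias remain distinguishable.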


@@ -498,7 +498,11 @@ class HipblasltGemmOp : public Callable<ParamsT> {
        mat_c, HIPBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET, &stride_c, sizeof(stride_c)));
    }
-   HipBlasLtMatmulDescriptor matmul(HIPBLAS_COMPUTE_32F, HIP_R_32F);
+   hipblasComputeType_t computeType = HIPBLAS_COMPUTE_32F;
+   if (at::globalContext().allowTF32CuBLAS()) {
+     computeType = HIPBLAS_COMPUTE_32F_FAST_TF32;
+   }
+   HipBlasLtMatmulDescriptor matmul(computeType, HIP_R_32F);
    matmul.setAttribute(HIPBLASLT_MATMUL_DESC_TRANSA, opa);
    matmul.setAttribute(HIPBLASLT_MATMUL_DESC_TRANSB, opb);
@@ -611,6 +615,11 @@ auto GetHipBlasLtTypeStringAndOps() {
  auto in_out_datatype = HipDataTypeFor<CT>();
  std::vector<hipblasLtMatmulHeuristicResult_t> heuristic_result;
+ hipblasComputeType_t computeType = HIPBLAS_COMPUTE_32F;
+ if (at::globalContext().allowTF32CuBLAS()) {
+   computeType = HIPBLAS_COMPUTE_32F_FAST_TF32;
+ }
  hipblasLtHandle_t handle;
  TORCH_HIPBLASLT_CHECK(hipblasLtCreate(&handle));
  TORCH_HIPBLASLT_CHECK(hipblaslt_ext::getAllAlgos(handle,
@@ -621,7 +630,7 @@ auto GetHipBlasLtTypeStringAndOps() {
      b_datatype,
      in_out_datatype,
      in_out_datatype,
-     HIPBLAS_COMPUTE_32F,
+     computeType,
      heuristic_result));
  TORCH_HIPBLASLT_CHECK(hipblasLtDestroy(handle));
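All of these TunableOp hunks key off the same global TF32 switch. A minimal sketch of flipping it from C++ (assuming the `at::Context` setter; the Python-side equivalent is `torch.backends.cuda.matmul.allow_tf32`):

```
#include <ATen/Context.h>

void enable_tf32_for_fp32_gemms() {
  // With this set, the hipBLASLt paths above request
  // HIPBLAS_COMPUTE_32F_FAST_TF32 instead of plain HIPBLAS_COMPUTE_32F.
  at::globalContext().setAllowTF32CuBLAS(true);
}
```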


@@ -141,6 +141,8 @@ class RocblasGemmOp : public Callable<GemmParams<T>> {
  TuningStatus Call(const GemmParams<T>* params) override {
    auto input_output_type = RocBlasDataTypeFor<T>();
+   if (at::globalContext().allowTF32CuBLAS() && input_output_type == rocblas_datatype_f32_r)
+     return FAIL;  // no support for TF32 in rocBLAS
    auto compute_type = RocBlasComputeTypeFor<T>();
    auto h_a = DoCastForHalfOrBfloat16(params->alpha);
    auto h_b = DoCastForHalfOrBfloat16(params->beta);
@@ -207,6 +209,8 @@ class RocblasGemmStridedBatchedOp : public Callable<GemmStridedBatchedParams<T>>
  TuningStatus Call(const GemmStridedBatchedParams<T>* params) override {
    auto input_output_type = RocBlasDataTypeFor<T>();
+   if (at::globalContext().allowTF32CuBLAS() && input_output_type == rocblas_datatype_f32_r)
+     return FAIL;  // no support for TF32 in rocBLAS
    auto compute_type = RocBlasComputeTypeFor<T>();
    auto h_a = DoCastForHalfOrBfloat16(params->alpha);
    auto h_b = DoCastForHalfOrBfloat16(params->beta);


@@ -12,7 +12,7 @@
namespace at::functorch {
template <typename Func>
-std::tuple<Tensor, std::optional<int64_t>,Tensor, std::optional<int64_t>>
+static std::tuple<Tensor, std::optional<int64_t>,Tensor, std::optional<int64_t>>
max_pool_with_indices_batch_rule_helper(
    const Tensor& self, std::optional<int64_t> self_bdim,
    IntArrayRef kernel_size, IntArrayRef stride,


@@ -20,7 +20,7 @@
namespace at::functorch {
template <typename F, F Func, typename... ExtraArgs>
-Tensor random_batching_rule(SymIntArrayRef shape, ExtraArgs... extra_args) {
+static Tensor random_batching_rule(SymIntArrayRef shape, ExtraArgs... extra_args) {
  c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
  auto maybe_layer = maybeCurrentDynamicLayer();
  TORCH_INTERNAL_ASSERT(maybe_layer.has_value());
@@ -37,7 +37,7 @@ Tensor random_batching_rule(SymIntArrayRef shape, ExtraArgs... extra_args) {
}
template <typename F, F Func, typename... ExtraArgs>
-Tensor& random_inplace_batching_rule(Tensor& self, ExtraArgs... extra_args) {
+static Tensor& random_inplace_batching_rule(Tensor& self, ExtraArgs... extra_args) {
  c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
  auto maybe_layer = maybeCurrentDynamicLayer();
  TORCH_INTERNAL_ASSERT(maybe_layer.has_value());
@@ -108,7 +108,7 @@ static Tensor& bernoulli_inplace_Tensor_batching_rule(Tensor& self, const Tensor
}
template <typename F, F Func, typename... ExtraArgs>
-Tensor randperm_batching_rule(int64_t n, ExtraArgs... extra_args) {
+static Tensor randperm_batching_rule(int64_t n, ExtraArgs... extra_args) {
  c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
  auto maybe_layer = maybeCurrentDynamicLayer();
  auto const batch_size = maybe_layer->batchSize();
@@ -127,7 +127,7 @@ Tensor randperm_batching_rule(int64_t n, ExtraArgs... extra_args) {
}
template <typename F, F Func, typename... ExtraArgs>
-Tensor unary_pointwise_random_batch_rule(const Tensor& tensor, ExtraArgs... extra_args) {
+static Tensor unary_pointwise_random_batch_rule(const Tensor& tensor, ExtraArgs... extra_args) {
  c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
  auto maybe_layer = maybeCurrentDynamicLayer();
  const auto cur_level = maybe_layer->layerId();
@@ -153,7 +153,7 @@ Tensor unary_pointwise_random_batch_rule(const Tensor& tensor, ExtraArgs... extr
}
template<typename F, F Func, typename... ExtraArgs>
-Tensor tensor_like_random_batch_rule(const Tensor& self, ExtraArgs... extra_args) {
+static Tensor tensor_like_random_batch_rule(const Tensor& self, ExtraArgs... extra_args) {
  c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
  auto maybe_layer = maybeCurrentDynamicLayer();
  const auto cur_level = maybe_layer->layerId();
@@ -272,7 +272,7 @@ struct RandomBatchRuleHelper<F, Func, typelist<T1, T...>> {
};
template <typename F, F Func, typename... T>
-Tensor rand_int_wrapper(SymIntArrayRef shape, c10::SymInt high, T... extra_args) {
+static Tensor rand_int_wrapper(SymIntArrayRef shape, c10::SymInt high, T... extra_args) {
  return Func(high, shape, std::forward<T>(extra_args)...);
}
@@ -299,7 +299,7 @@ struct RandIntBatchRuleHelper<F, Func, typelist<T1, T2, T...>> {
};
template <typename F, F Func, typename T0, typename T1, typename... T>
-Tensor rand_int_low_wrapper(SymIntArrayRef shape, T0 scalar0, T1 scalar1, T... extra_args) {
+static Tensor rand_int_low_wrapper(SymIntArrayRef shape, T0 scalar0, T1 scalar1, T... extra_args) {
  return Func(scalar0, scalar1, shape, std::forward<T>(extra_args)...);
}
@@ -346,7 +346,7 @@ struct NormalPointwiseBatchRule<F, Func, typelist<A0, T...>> {
};
template<typename F, F Func, typename... T>
-Tensor normal_wrapper(const Tensor& tensor, double scalar, T... extra_args) {
+static Tensor normal_wrapper(const Tensor& tensor, double scalar, T... extra_args) {
  return Func(scalar, tensor, extra_args...);
}
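Marking these file-local helpers `static` gives them internal linkage: the symbols stay out of the shared object's export table and `-Wmissing-prototypes`-style warnings go away. A minimal standalone illustration of the distinction (hypothetical file, not from the tree):

```
// illustrative.cpp
// External linkage: visible to every translation unit, so compilers may
// warn that no header declares it.
int external_helper(int x) { return x + 1; }

// Internal linkage: confined to this translation unit; no declaration is
// needed anywhere else, and the symbol is not exported.
static int internal_helper(int x) { return x * 2; }

int entry_point(int x) { return external_helper(internal_helper(x)); }
```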


@@ -19,7 +19,7 @@
namespace at::functorch {
-bool kVmapFallbackWarningEnabled = true;
+static bool kVmapFallbackWarningEnabled = true;
bool isVmapFallbackWarningEnabled() {
  return kVmapFallbackWarningEnabled;
@@ -29,7 +29,7 @@ void setVmapFallbackWarningEnabled(bool enabled) {
  kVmapFallbackWarningEnabled = enabled;
}
-bool kVmapFallbackEnabled = true;
+static bool kVmapFallbackEnabled = true;
bool isVmapFallbackEnabled() {
  return kVmapFallbackEnabled;


@@ -322,6 +322,24 @@ void gemm(
    const float beta,
    at::BFloat16 *c, int64_t ldc) {
  internal::normalize_last_dims(transa, transb, m, n, k, &lda, &ldb, &ldc);
#if AT_MKLDNN_ENABLED()
#ifdef __aarch64__
// MKLDNN also supports ARM for bf16, and the bypass is only
// currently intended for x86/x86_64.
const bool use_bf16_gemv_trans = false;
#elif defined(__powerpc__)
const bool use_bf16_gemv_trans = false;
#else
const bool bf16_gemv_trans_would_be_faster = cpuinfo_initialize() &&
!cpuinfo_has_x86_avx512bf16();
const bool use_bf16_gemv_trans = bf16_gemv_trans_would_be_faster &&
transa == TransposeType::Transpose &&
transb == TransposeType::NoTranspose && n == 1 && alpha == 1.0;
#endif
if (!use_bf16_gemv_trans && mkldnn_bf16_gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)) {
return;
}
#endif
#if AT_BUILD_WITH_BLAS() && defined(BLAS_HAS_SBGEMM)
  if (use_blas_gemm(transa, transb, m, n, k, lda, ldb, ldc)) {
    int m_ = m, n_ = n, k_ = k, lda_ = lda, ldb_ = ldb, ldc_ = ldc;
@@ -342,24 +360,6 @@ void gemm(
    }
    return;
  }
-#endif
-#if AT_MKLDNN_ENABLED()
-#ifdef __aarch64__
-  // MKLDNN also supports ARM for bf16, and the bypass is only
-  // currently intended for x86/x86_64.
-  const bool use_bf16_gemv_trans = false;
-#elif defined(__powerpc__)
-  const bool use_bf16_gemv_trans = false;
-#else
-  const bool bf16_gemv_trans_would_be_faster = cpuinfo_initialize() &&
-    !cpuinfo_has_x86_avx512bf16();
-  const bool use_bf16_gemv_trans = bf16_gemv_trans_would_be_faster &&
-    transa == TransposeType::Transpose &&
-    transb == TransposeType::NoTranspose && n == 1 && alpha == 1.0;
-#endif
-  if (!use_bf16_gemv_trans && mkldnn_bf16_gemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)) {
-    return;
-  }
#endif
  gemm_stub(
    at::kCPU, at::kBFloat16,


@@ -3610,11 +3610,11 @@ Tensor& transpose_(Tensor& self, int64_t dim0, int64_t dim1) {
    return at::_mkldnn_transpose_(self, dim0, dim1);
  }
- DimVector sizes(self.sizes().begin(), self.sizes().end());
- DimVector strides(self.strides().begin(), self.strides().end());
- std::swap(strides[dim0], strides[dim1]);
- std::swap(sizes[dim0], sizes[dim1]);
- self.as_strided_(sizes, strides);
+ SymDimVector sizes(self.sym_sizes().begin(), self.sym_sizes().end());
+ std::swap(sizes[dim0], sizes[dim1]);
+ SymDimVector strides(self.sym_strides().begin(), self.sym_strides().end());
+ std::swap(strides[dim0], strides[dim1]);
+ auto result = self.as_strided__symint(std::move(sizes), std::move(strides));
  return self;
}
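An in-place transpose only needs the size and stride entries swapped; moving to `SymDimVector` and the `_symint` variant keeps that bookkeeping valid when sizes are symbolic (dynamic shapes). A worked example with concrete, hypothetical numbers:

```
#include <torch/torch.h>

void transpose_is_metadata_only() {
  auto t = torch::arange(6).reshape({2, 3});  // sizes {2,3}, strides {3,1}
  t.transpose_(0, 1);                         // sizes {3,2}, strides {1,3}
  // Same storage, no copy: element (i, j) now reads storage[i + 3 * j],
  // i.e. the memory that held element (j, i) of the original tensor.
}
```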


@@ -832,9 +832,9 @@ void hardswish_backward_kernel(TensorIterator& iter) {
      cpu_kernel_vec(
        iter,
        [&](scalar_t grad_val, scalar_t self_val) -> scalar_t {
-         if (float(self_val) < neg_three) {
+         if (float(self_val) <= neg_three) {
            return zero;
-         } else if (float(self_val) <= three) {
+         } else if (float(self_val) < three) {
            return float(grad_val) * ((float(self_val) / three) + one_half);
          } else {
            return grad_val;
@@ -847,19 +847,19 @@ void hardswish_backward_kernel(TensorIterator& iter) {
            Vec::blendv(
              grad_val0 * ((self_val0 / kThreeVec) + kOneHalfVec),
              grad_val0,
-             self_val0 > kThreeVec
+             self_val0 >= kThreeVec
            ),
            kZeroVec,
-           self_val0 < kNegThreeVec
+           self_val0 <= kNegThreeVec
          );
          self_val1 = Vec::blendv(
            Vec::blendv(
              grad_val1 * ((self_val1 / kThreeVec) + kOneHalfVec),
              grad_val1,
-             self_val1 > kThreeVec
+             self_val1 >= kThreeVec
            ),
            kZeroVec,
-           self_val1 < kNegThreeVec
+           self_val1 <= kNegThreeVec
          );
          return convert_from_float<scalar_t>(self_val0, self_val1);
        });
@@ -878,9 +878,9 @@ void hardswish_backward_kernel(TensorIterator& iter) {
      cpu_kernel_vec(
        iter,
        [&](scalar_t grad_val, scalar_t self_val) {
-         if (self_val <= neg_three) {
+         if (self_val <= neg_three) {
            return zero;
-         } else if (self_val <= three) {
+         } else if (self_val < three) {
            return grad_val * ((self_val / three) + one_half);
          } else {
            return grad_val;
@@ -891,10 +891,10 @@ void hardswish_backward_kernel(TensorIterator& iter) {
          Vec::blendv(
            grad_val * ((self_val / kThreeVec) + kOneHalfVec),
            grad_val,
-           self_val > kThreeVec
+           self_val >= kThreeVec
          ),
          kZeroVec,
-         self_val < kNegThreeVec
+         self_val <= kNegThreeVec
        );
      }
    );
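The comparison changes move the boundary points x = ±3 from the interior branch to the flat branches. The forward value is unaffected, but the backward at exactly ±3 now returns the flat-region gradient (0 and grad_val) rather than the interior expression, which evaluates to -1/2 and 3/2 there. For reference:

$$\operatorname{hardswish}(x)=\begin{cases}0, & x\le -3\\ x(x+3)/6, & -3<x<3\\ x, & x\ge 3\end{cases}\qquad \operatorname{hardswish}'(x)=\begin{cases}0, & x\le -3\\ x/3+1/2, & -3<x<3\\ 1, & x\ge 3\end{cases}$$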


@@ -1,5 +1,12 @@
#pragma once
// On Windows, math.h needs to be included with _USE_MATH_DEFINES defined to
// access constants such as M_SQRT2 and M_2_SQRTPI.
#ifdef _WIN32
#define _USE_MATH_DEFINES
#include <cmath>
#endif // _WIN32
#include <ATen/cpu/vec/vec.h>
#include <c10/util/BFloat16.h> // For c10::is_reduced_floating_point_v.
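On MSVC the POSIX-style math constants are opt-in, and because `<cmath>` has include guards, whichever inclusion happens first fixes the outcome; defining the macro at the top of this widely included header makes it take effect reliably. A minimal reproduction (assumed example):

```
// Without the define, MSVC's <cmath> omits the constants:
//   error C2065: 'M_SQRT2': undeclared identifier
#define _USE_MATH_DEFINES  // must come before the first <cmath>/<math.h>
#include <cmath>

double unit_square_diagonal() { return M_SQRT2; }
```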


@@ -45,9 +45,9 @@ void hardswish_backward_kernel(TensorIterator& iter) {
    [zero, three, neg_three, one_half]GPU_LAMBDA(scalar_t grad_val_, scalar_t self_val_) -> scalar_t {
      opmath_t grad_val = static_cast<opmath_t>(grad_val_);
      opmath_t self_val = static_cast<opmath_t>(self_val_);
-     if (self_val < neg_three) {
+     if (self_val <= neg_three) {
        return zero;
-     } else if (self_val <= three) {
+     } else if (self_val < three) {
        return grad_val * ((self_val / three) + one_half);
      } else {
        return grad_val;


@@ -51,6 +51,23 @@
namespace at::native {
#ifdef USE_ROCM
// Custom configuration for vectorized elementwise kernel
// with template instantiation.
namespace vectorized_templated_config {
constexpr int num_threads() {
return 512;
}
constexpr int elems_per_thread() {
return 32;
}
constexpr int block_work_size() {
return elems_per_thread() * num_threads();
}
} // namespace vectorized_templated_config
#endif
template <typename args_t, size_t... Is>
constexpr auto sum_of_sizes(args_t args, std::index_sequence<Is...>) {
@@ -255,6 +272,139 @@ static inline void launch_vectorized_kernel(
  }
}
#ifdef USE_ROCM
template <
int vec_size,
typename func_t,
typename array_t,
typename inp_calc_t,
typename out_calc_t,
typename loader_t,
typename storer_t,
typename OutputType,
typename... InputTypes>
C10_LAUNCH_BOUNDS_1(vectorized_templated_config::num_threads())
__global__ void vectorized_templated_elementwise_kernel(
int N,
func_t f,
array_t data,
inp_calc_t inp_calc,
out_calc_t out_calc,
loader_t loader,
storer_t storer) {
int remaining =
N - vectorized_templated_config::block_work_size() * blockIdx.x;
if (remaining <
vectorized_templated_config::block_work_size()) { // if this block handles
// the remainder,
// just do a naive unrolled loop
auto policy = memory::policies::unroll_base<
vectorized_templated_config::num_threads(),
array_t,
inp_calc_t,
out_calc_t,
loader_t,
storer_t,
vectorized_templated_config::elems_per_thread()>(
data, remaining, inp_calc, out_calc, loader, storer);
elementwise_kernel_helper(f, policy);
} else { // if this block has a full `block_work_size` data to handle, use
// vectorized memory access
elementwise_kernel_helper(
f,
memory::policies::vectorized_templated<
vec_size,
array_t,
vectorized_templated_config::elems_per_thread(),
vectorized_templated_config::num_threads(),
OutputType,
InputTypes...>(data));
}
}
// This function assumes trivial 1d and supports template specialization
// to avoid dynamic casting.
// Input vectorization size is based on runtime information, i.e.
// the actual data types of the input and output tensor and cannot
// be determined using the functor type, as in regular non-templated
// vectorized kernels. The caller is in charge of selecting the correct input
// vectorization length.
template <
typename func_t,
typename array_t,
typename inp_calc_t,
typename out_calc_t,
typename loader_t,
typename storer_t,
typename OutputType,
typename... InputTypes>
static inline void launch_vectorized_templated_kernel(
int64_t N,
const func_t& f,
array_t data,
inp_calc_t ic,
out_calc_t oc,
loader_t l,
storer_t s) {
TORCH_INTERNAL_ASSERT(N > 0 && N <= std::numeric_limits<int32_t>::max());
using traits = function_traits<func_t>;
int64_t grid = (N + vectorized_templated_config::block_work_size() - 1) /
vectorized_templated_config::block_work_size();
auto stream = at::cuda::getCurrentCUDAStream();
int vec_size = memory::can_vectorize_up_to<func_t>(data);
switch (vec_size) {
case 8:
vectorized_templated_elementwise_kernel<
8,
func_t,
array_t,
inp_calc_t,
out_calc_t,
loader_t,
storer_t,
OutputType,
InputTypes...>
<<<grid, vectorized_templated_config::num_threads(), 0, stream>>>(
N, f, data, ic, oc, l, s);
C10_CUDA_KERNEL_LAUNCH_CHECK();
break;
case 4:
vectorized_templated_elementwise_kernel<
4,
func_t,
array_t,
inp_calc_t,
out_calc_t,
loader_t,
storer_t,
OutputType,
InputTypes...>
<<<grid, vectorized_templated_config::num_threads(), 0, stream>>>(
N, f, data, ic, oc, l, s);
C10_CUDA_KERNEL_LAUNCH_CHECK();
break;
case 2:
vectorized_templated_elementwise_kernel<
2,
func_t,
array_t,
inp_calc_t,
out_calc_t,
loader_t,
storer_t,
OutputType,
InputTypes...>
<<<grid, vectorized_templated_config::num_threads(), 0, stream>>>(
N, f, data, ic, oc, l, s);
C10_CUDA_KERNEL_LAUNCH_CHECK();
break;
default:
// vector size 1 is not handled as part of vectorize_templated kernel
TORCH_INTERNAL_ASSERT(false, "Unexpected vectorization size");
}
}
#endif
template <
    typename func_t,
    typename array_t,
@@ -392,6 +542,46 @@ void gpu_kernel_impl_nocast(TensorIteratorBase& iter, const func_t& f) {
  });
}
#ifdef USE_ROCM
namespace {
template <typename TupleLike, size_t arity, size_t arg_num = 0>
struct check_types {
constexpr static inline bool check() {
if constexpr (arity != 2)
return false;
if constexpr (arg_num == 0) {
using SelectedType = std::tuple_element_t<arg_num, TupleLike>;
if constexpr (std::is_same_v<float, SelectedType>)
return check_types<TupleLike, arity, arg_num + 1>::check();
} else if constexpr (arg_num == 1) {
using SelectedType2 = std::tuple_element_t<arg_num, TupleLike>;
if constexpr (std::is_same_v<float, SelectedType2>)
return check_types<TupleLike, arity, arg_num + 1>::check();
}
return false;
}
};
// Bottom case: if we got this far, assume correct type matching except
// when there are no arguments (arity == 0).
template <typename TupleLike, size_t arity>
struct check_types<TupleLike, arity, arity> {
constexpr static inline bool check() {
if constexpr (arity != 0)
return true;
return false;
}
};
template <typename TupleLike>
struct check_types<TupleLike, 0, 0> {
constexpr static inline bool check() {
return false;
}
};
} // namespace
#endif
template <typename func_t>
void gpu_kernel_impl(TensorIteratorBase& iter, const func_t& f) {
  if (!needs_dynamic_casting<func_t>::check(iter)) {
@@ -416,6 +606,45 @@ void gpu_kernel_impl(TensorIteratorBase& iter, const func_t& f) {
  if (contiguous) {
#ifdef USE_ROCM
// Attempt to call specialized vectorized elementwise kernel
// that enables interleaving.
using float_map = c10::CppTypeToScalarType<float>;
using bfloat16_map = c10::CppTypeToScalarType<BFloat16>;
if (iter.ninputs() == 2 && iter.input_dtype(0) == float_map::value &&
iter.input_dtype(1) == bfloat16_map::value &&
memory::can_vectorize_up_to<func_t>(data) > 1) {
// constexpr to reduce the amount of kernels (empty) generated for
// vectorized templated elementwise and limit which functors are actually
// applied to the load and store at compile time.
using func_tuple = typename traits::ArgsTuple;
if constexpr (
std::is_same_v<float, arg0_t> && traits::arity == 2 &&
check_types<func_tuple, traits::arity, 0>::check()) {
auto input_offset_calculator = TrivialOffsetCalculator<traits::arity>();
auto output_offset_calculator = TrivialOffsetCalculator<1>();
auto loader = memory::LoadWithCast<traits::arity>(iter);
auto storer = memory::StoreWithCast<1>(iter);
launch_vectorized_templated_kernel<
func_t,
std::array<char*, ntensors>,
decltype(input_offset_calculator),
decltype(output_offset_calculator),
decltype(loader),
decltype(storer),
float,
float,
BFloat16>(
numel,
f,
data,
input_offset_calculator,
output_offset_calculator,
loader,
storer);
return;
}
}
  std::array<ScalarType, ntensors> dtypes;
  auto inner_strides = iter.get_inner_strides();
  std::array<int, ntensors> strides;
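With the configuration above, each block covers `num_threads() * elems_per_thread() = 512 * 32 = 16384` elements, and only the final block can fall into the non-vectorized remainder branch. The launch arithmetic, worked through for a hypothetical N:

```
// block_work_size() = 512 threads * 32 elements/thread = 16384
// For N = 1'000'000 elements:
//   grid = (1'000'000 + 16384 - 1) / 16384 = 62 blocks
// Blocks 0..60 each see remaining >= 16384 and take the vectorized path;
// block 61 sees remaining = 1'000'000 - 61 * 16384 = 576 and takes the
// unrolled remainder path.
```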


@@ -67,6 +67,28 @@ struct vectorized_load_helper {
  }
};
#ifdef USE_ROCM
// Templated version of vectorized load helper.
// It can be used on heterogeneous input tensor element types.
template <int arg_index>
struct vectorized_templated_load_helper {
template <typename args_t, typename policy_t>
static __device__ void apply(policy_t& self, args_t* args, int idx) {
using arg_t = std::tuple_element_t<arg_index, args_t>;
// `data` holds the data_ptr for tensors [output, input0, input1, ...], so we
// need a +1 offset to get the input
// Delay pointer arithmetic to the policy loader where we know the actual
// type of the current argument.
char* ptr = (self.data[arg_index + 1]);
auto args_accessor = [&args] __device__(int thread_unroll_idx) -> arg_t& {
return std::get<arg_index>(args[thread_unroll_idx]);
};
self.template load_single_arg<arg_index>(args_accessor, ptr, idx);
}
};
#endif
template<int arg_index>
struct unroll_load_helper {
  template <typename args_t, typename policy_t, typename offset_t, typename loader_t>
@@ -181,9 +203,16 @@ __device__ aligned_vector<bool, vec_size> load_vector(const bool *base_ptr, uint
namespace policies {
-template<typename data_t, typename inp_calc_t, typename out_calc_t, typename loader_t, typename storer_t, int elems_per_thread, int num_outputs=1>
-struct unroll {
+template <
+    int num_threads,
+    typename data_t,
+    typename inp_calc_t,
+    typename out_calc_t,
+    typename loader_t,
+    typename storer_t,
+    int elems_per_thread,
+    int num_outputs = 1>
+struct unroll_base {
  data_t data;
  int remaining;
  inp_calc_t input_offset_calculator;
@@ -191,12 +220,24 @@ struct unroll {
  loader_t loader;
  storer_t storer;
  static constexpr int tws = elems_per_thread;
static constexpr int block_work_size = elems_per_thread * num_threads;
- __device__ unroll(data_t data, int remaining, inp_calc_t ic, out_calc_t oc, loader_t l, storer_t s):
-   data(data), remaining(remaining), input_offset_calculator(ic), output_offset_calculator(oc), loader(l), storer(s) {}
+ __device__ unroll_base(
+     data_t data,
+     int remaining,
+     inp_calc_t ic,
+     out_calc_t oc,
+     loader_t l,
+     storer_t s)
+     : data(data),
+       remaining(remaining),
+       input_offset_calculator(ic),
+       output_offset_calculator(oc),
+       loader(l),
+       storer(s) {}
  __device__ inline bool check_inbounds(int thread_work_elem) {
-   return ((int)(threadIdx.x + thread_work_elem*num_threads()) < remaining);
+   return ((int)(threadIdx.x + thread_work_elem * num_threads) < remaining);
  }
  template<typename args_t>
@@ -205,13 +246,13 @@ struct unroll {
    int thread_idx = threadIdx.x;
    #pragma unroll
    for (int i = 0; i < elems_per_thread; i++) {
-     if (thread_idx >= remaining) {
-       return;
-     }
-     int linear_idx = thread_idx + elems_per_thread * num_threads() * idx;
-     auto offset = input_offset_calculator.get(linear_idx);
-     detail::static_unroll<detail::unroll_load_helper, arity>::with_args(*this, args, offset, loader, i, num_outputs);
-     thread_idx += num_threads();
+     if (thread_idx < remaining) {
+       int linear_idx = thread_idx + block_work_size * idx;
+       auto offset = input_offset_calculator.get(linear_idx);
+       detail::static_unroll<detail::unroll_load_helper, arity>::with_args(
+           *this, args, offset, loader, i, num_outputs);
+       thread_idx += num_threads;
+     }
    }
  }
@@ -220,22 +261,36 @@ struct unroll {
    int thread_idx = threadIdx.x;
    #pragma unroll
    for (int i = 0; i < elems_per_thread; i++) {
-     if (thread_idx >= remaining) {
-       return;
-     }
-     int linear_idx = thread_idx + elems_per_thread * num_threads() * idx;
-     int offset = output_offset_calculator.get(linear_idx)[0];
-     storer.store(from[i], data[0], offset);
-     thread_idx += num_threads();
+     if (thread_idx < remaining) {
+       int linear_idx = thread_idx + block_work_size * idx;
+       int offset = output_offset_calculator.get(linear_idx)[0];
+       storer.store(from[i], data[0], offset);
+       thread_idx += num_threads;
+     }
    }
  }
};
-// Assumption:
-// all tensors are contiguous, that is: stride == sizeof(type) for all tensors
-// Note:
-// Functions in vectorized policy does not do boundary check. It assumes the whole block
-// has its job to do. So the reminders should be handled by the caller manually.
+// Utility type for all users of unroll that extract the num_threads value from
+// the caller scope.
+template <
+    typename data_t,
+    typename inp_calc_t,
+    typename out_calc_t,
+    typename loader_t,
+    typename storer_t,
+    int elems_per_thread,
+    int num_outputs = 1>
+using unroll = unroll_base<
+    num_threads(),
+    data_t,
+    inp_calc_t,
+    out_calc_t,
+    loader_t,
+    storer_t,
+    elems_per_thread,
+    num_outputs>;
template <int vec_size, typename data_t, int elems_per_thread> // vec_size: number of scalars, can be 1, 2, or 4.
struct vectorized {
@@ -289,6 +344,86 @@ struct vectorized {
  }
};
#ifdef USE_ROCM
// This is similar to vectorized policy above, but this one supports
// heterogeneous input tensor types as templated parameters.
// Its use should be limited to frequently used heterogeneous data types
// as each instantiation will generate a separate kernel, leading to code
// bloating if applied to all combinations supported in PyTorch. Assumption: all
// tensors are contiguous, that is: stride == sizeof(type) for all tensors.
template <
int vec_size,
typename data_t,
int elems_per_thread,
int num_threads,
typename CastToT,
typename... CastFromTs> // vec_size: number of scalars, can be 1, 2, or 4.
struct vectorized_templated {
static_assert(
elems_per_thread % vec_size == 0,
"The workload per thread must be a multiple of vec_size");
static constexpr int loop_size = elems_per_thread / vec_size;
static constexpr int tws = elems_per_thread;
static constexpr int block_work_size = elems_per_thread * num_threads;
data_t data;
__device__ vectorized_templated(data_t data) : data(data) {}
__device__ inline constexpr bool check_inbounds(int thread_work_elem) {
return true;
}
template <int arg_index, typename accessor_t>
__device__ inline void load_single_arg(accessor_t to, char* ptr, int idx) {
// extract the arg_index-th input tensor element type from the
// variadic template argument.
using CastFromT =
std::tuple_element_t<arg_index, std::tuple<CastFromTs...>>;
// Delayed pointer arithmetic from the caller: this is the place
// where we know the type of the argument.
CastFromT* block_ptr =
reinterpret_cast<CastFromT*>(ptr) + block_work_size * idx;
int thread_idx = threadIdx.x;
#pragma unroll
for (int i = 0; i < loop_size; i++) {
int index = thread_idx + i * num_threads;
auto v = load_vector<vec_size>(block_ptr, index);
#pragma unroll
for (int j = 0; j < vec_size; j++) {
to(vec_size * i + j) = c10::convert<CastToT>(v.val[j]);
}
}
}
template <typename args_t>
__device__ inline void load(args_t* args, int idx) {
constexpr int arity = std::tuple_size<args_t>::value;
detail::static_unroll<detail::vectorized_templated_load_helper, arity>::
with_args(*this, args, idx);
}
// Assume for now that from (temporary array per thread) is of the same
// type as to (destination tensor), which is the case for
// float(float,bfloat16) and functor add on float(float,float).
template <typename scalar_t>
__device__ inline void store(scalar_t* from, int idx) {
using vec_t = aligned_vector<scalar_t, vec_size>;
scalar_t* to = reinterpret_cast<scalar_t*>(data[0]) + block_work_size * idx;
vec_t* to_ = reinterpret_cast<vec_t*>(to);
int thread_idx = threadIdx.x;
#pragma unroll
for (int i = 0; i < loop_size; i++) {
int index = thread_idx + i * num_threads;
vec_t v;
for (int j = 0; j < vec_size; j++) {
v.val[j] = from[vec_size * i + j];
}
to_[index] = v;
}
}
};
#endif
template <typename data_t, typename inp_calc_t, typename out_calc_t, int num_outputs>
struct multi_outputs_unroll {
  //multi_outputs_unroll struct members and check_inbounds and load methods are copypasted from unroll struct


@@ -89,6 +89,20 @@ struct SoftMaxBackwardEpilogue {
  const AccumT sum;
};
template<typename T, typename AccumT, typename OutT>
struct SoftMaxForwardWithMulEpilogue {
__device__ __forceinline__ SoftMaxForwardWithMulEpilogue(AccumT max_input, AccumT sum)
: max_input(max_input)
, sum(sum) {}
__device__ __forceinline__ OutT operator()(T input) const {
return static_cast<OutT>(__expf(input - max_input) * sum);
}
const AccumT max_input;
const AccumT sum;
};
@@ -387,6 +401,19 @@ struct SumExpFloat
  const AccumT max_k;
};
template<typename T, typename AccumT>
struct SumExpfFloat
{
__device__ __forceinline__ SumExpfFloat(AccumT v)
: max_k(v) {}
__device__ __forceinline__ AccumT operator()(AccumT sum, T v) const {
return sum + __expf(v - max_k);
}
const AccumT max_k;
};
template <template<typename> class Reduction, typename AccumT>
__device__ __forceinline__ AccumT
blockReduce(AccumT* smem, AccumT val,
@@ -449,6 +476,19 @@ T blockReduceWarp(T* smem_cache, T value, const Reduction<T>& op, T defaultVal)
  return smem_cache[0];
}
template <template<typename> class Reduction, typename T>
__device__ __forceinline__
T blockReduceWarpInverse(T* smem_cache, T value, const Reduction<T>& op, T defaultVal)
{
T result = cuda_utils::BlockReduce<T, Reduction<T>>(value, op, defaultVal, smem_cache);
if (threadIdx.x == 0) {
smem_cache[0] = 1 / result;
}
__syncthreads();
return smem_cache[0];
}
template <template<typename, typename> class Reduction, int ILP, typename T, typename AccumT, typename index_t=int>
__device__ __forceinline__ AccumT
ilpReduce(index_t shift,
@@ -664,6 +704,38 @@ WriteBpropResults(
  }
}
template <int ILP, typename scalar_t, typename accscalar_t, typename outscalar_t, template <typename, typename, typename> class EpilogueWithMul>
__global__ void
cunn_SoftMaxForwardFast(outscalar_t *output, const scalar_t *input, int classes)
{
extern __shared__ unsigned char smem[];
auto sdata = reinterpret_cast<accscalar_t*>(smem);
// each block handles a sample in the mini-batch
input += static_cast<int64_t>(blockIdx.x) * classes;
output += static_cast<int64_t>(blockIdx.x) * classes;
const int shift = ((uint64_t)input) % ALIGN_BYTES / sizeof(scalar_t);
// find the max
accscalar_t threadMax = ilpReduce<MaxFloat, ILP, scalar_t, accscalar_t>(
shift, input, classes, MaxFloat<scalar_t, accscalar_t>(), -at::numeric_limits<accscalar_t>::max());
accscalar_t max_k = blockReduceWarp<Max, accscalar_t>(sdata, threadMax,
Max<accscalar_t>(), -at::numeric_limits<accscalar_t>::max());
// reduce all values
accscalar_t threadExp = ilpReduce<SumExpfFloat, ILP, scalar_t, accscalar_t>(
shift, input, classes, SumExpfFloat<scalar_t, accscalar_t>(max_k), static_cast<accscalar_t>(0));
accscalar_t sumAll = blockReduceWarpInverse<Add, accscalar_t>(sdata, threadExp,
Add<accscalar_t>(), static_cast<accscalar_t>(0));
EpilogueWithMul<scalar_t, accscalar_t, outscalar_t> epilogue(max_k, sumAll);
for (int offset = threadIdx.x; offset < classes; offset += blockDim.x) {
output[offset] = epilogue(input[offset]);
}
}
template <int ILP, typename scalar_t, typename accscalar_t, typename outscalar_t, template <typename, typename, typename> class Epilogue>
__global__ void
cunn_SoftMaxForward(outscalar_t *output, const scalar_t *input, int classes)
@@ -755,6 +827,68 @@ cunn_SoftMaxForwardReg(outscalar_t *output, const scalar_t *input, index_t class
  }
}
template <int ILP, typename scalar_t, typename accscalar_t, typename outscalar_t,
template <typename, typename, typename> class EpilogueWithMul, typename index_t = int32_t>
__global__ void
cunn_SoftMaxForwardGmem(outscalar_t *output, const scalar_t *input, index_t classes)
{
// Each thread block processes a sample in the batch
input += static_cast<int64_t>(blockIdx.x) * classes;
output += static_cast<int64_t>(blockIdx.x) * classes;
accscalar_t threadMax = -at::numeric_limits<accscalar_t>::max();
accscalar_t threadExp = static_cast<accscalar_t>(0);
// The first smem segment is used to cache input values and the last
// segment is used for thread block reductions
extern __shared__ unsigned char smem[];
auto smem_reduction_cache = reinterpret_cast<accscalar_t*>(smem);
using LoadT = at::native::memory::aligned_vector<scalar_t, ILP>;
const LoadT* const input_vec_ptr = reinterpret_cast<const LoadT*>(input);
// Do the first step in max calculation:
MaxFloat<scalar_t, accscalar_t> maxFunc;
for (index_t offset = threadIdx.x; offset * ILP < classes; offset += blockDim.x) {
LoadT crnt_vec = input_vec_ptr[offset];
#pragma unroll
for (int i = 0; i < ILP; ++i) {
threadMax = maxFunc(threadMax, crnt_vec.val[i]);
}
}
accscalar_t max_k = blockReduceWarp<Max, accscalar_t>(smem_reduction_cache, threadMax,
Max<accscalar_t>(), -at::numeric_limits<accscalar_t>::max());
// Do the second step in sum exp calculation:
SumExpfFloat<scalar_t, accscalar_t> sumExpFunc(max_k);
for (index_t offset = threadIdx.x; offset * ILP < classes; offset += blockDim.x) {
LoadT crnt_vec = input_vec_ptr[offset];
#pragma unroll
for (int i = 0; i < ILP; ++i) {
threadExp = sumExpFunc(threadExp, crnt_vec.val[i]);
}
}
accscalar_t sumAll = blockReduceWarpInverse<Add, accscalar_t>(smem_reduction_cache, threadExp,
Add<accscalar_t>(), static_cast<accscalar_t>(0));
EpilogueWithMul<scalar_t, accscalar_t, outscalar_t> epilogue(max_k, sumAll);
using StoreT = at::native::memory::aligned_vector<outscalar_t, ILP>;
StoreT* output_vec_ptr = reinterpret_cast<StoreT*>(output);
for (index_t offset = threadIdx.x; offset * ILP < classes; offset += blockDim.x) {
LoadT crnt_vec = input_vec_ptr[offset];
StoreT out_vec;
#pragma unroll
for (int i = 0; i < ILP; ++i) {
out_vec.val[i] = epilogue(crnt_vec.val[i]);
}
output_vec_ptr[offset] = out_vec;
}
}
template <int ILP, typename scalar_t, typename accscalar_t, typename outscalar_t,
  template <typename, typename, typename> class Epilogue, typename index_t = int32_t>
__global__ void
@@ -935,7 +1069,9 @@ cunn_SoftMaxBackwardSmem(scalar_t *gradInput, const outscalar_t *output, const o
  }
}
-template<template<typename, typename, typename> class Epilogue, bool is_log_softmax>
+template<template<typename, typename, typename> class Epilogue,
+  template<typename, typename, typename> class EpilogueWithMul, bool is_log_softmax, bool use_fast_softmax>
Tensor host_softmax(const Tensor & input_, const int64_t dim_, const bool half_to_float, const Tensor& output){
  if (half_to_float) {
    TORCH_CHECK(input_.scalar_type() == ScalarType::Half, "conversion is supported for Half type only");
@@ -977,66 +1113,78 @@ Tensor host_softmax(const Tensor & input_, const int64_t dim_, const bool half_t
      }
    } else {
      constexpr int ILP = sizeof(float4) / sizeof(scalar_t);
-     dim3 block = SoftMaxForward_getBlockSize(dim_size);
-     size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
-     auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
-       smem_reduction_sz) / sizeof(scalar_t);
-     bool can_use_smem = static_cast<size_t>(dim_size) < max_elements_per_smem;
-     can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
-     can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
-     can_use_smem &= !(dim_size % ILP);
-     int32_t potential_reg_cnt = potential_register_count(dim_size, block.x);
-     if(potential_reg_cnt < 10){
-       TORCH_INTERNAL_ASSERT(potential_reg_cnt > 0, "potential_reg_cnt for softmax with register should be greater than 0.");
-       switch (potential_reg_cnt) {
-         // TODO(Wenqin): try to investigate why we couldn't use macro for below code,
-         // because it seems on MSVS, it seems the macro way didn't expand correct.
-         case 1:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 1>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 2:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 2>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 3:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 3>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 4:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 4>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 5:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 5>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 6:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 6>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 7:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 7>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 8:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 8>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-         case 9:
-           cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 9>
-             <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-           break;
-       }
-     } else if (can_use_smem) {
-       size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz;
-       cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue>
-         <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size);
-     } else {
-       cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue>
-         <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-     }
+     if constexpr (use_fast_softmax) {
+       dim3 block(512);
+       size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
+       if (dim_size % ILP == 0) {
+         cunn_SoftMaxForwardGmem<ILP, scalar_t, accscalar_t, scalar_t, EpilogueWithMul>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       } else {
+         cunn_SoftMaxForwardFast<ILP, scalar_t, accscalar_t, scalar_t, EpilogueWithMul>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       }
+     } else {
+       dim3 block = SoftMaxForward_getBlockSize(dim_size);
+       size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
+       auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
+         smem_reduction_sz) / sizeof(scalar_t);
+       bool can_use_smem = static_cast<size_t>(dim_size) < max_elements_per_smem;
+       can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
+       can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
+       can_use_smem &= !(dim_size % ILP);
+       int32_t potential_reg_cnt = potential_register_count(dim_size, block.x);
+       if(potential_reg_cnt < 10){
+         TORCH_INTERNAL_ASSERT(potential_reg_cnt > 0, "potential_reg_cnt for softmax with register should be greater than 0.");
+         switch (potential_reg_cnt) {
+           // TODO(Wenqin): try to investigate why we couldn't use macro for below code,
+           // because it seems on MSVS, it seems the macro way didn't expand correct.
+           case 1:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 1>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 2:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 2>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 3:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 3>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 4:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 4>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 5:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 5>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 6:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 6>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 7:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 7>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 8:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 8>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+           case 9:
+             cunn_SoftMaxForwardReg<scalar_t, accscalar_t, scalar_t, Epilogue, int64_t, 9>
+               <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+             break;
+         }
+       } else if (can_use_smem) {
+         size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz;
+         cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue>
+           <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       } else {
+         cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       }
+     }
      C10_CUDA_KERNEL_LAUNCH_CHECK();
@@ -1056,23 +1204,35 @@ Tensor host_softmax(const Tensor & input_, const int64_t dim_, const bool half_t
      }
    } else {
      constexpr int ILP = sizeof(float4) / sizeof(scalar_t);
-     dim3 block = SoftMaxForward_getBlockSize(dim_size);
-     size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
-     auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
-       smem_reduction_sz) / sizeof(scalar_t);
-     bool can_use_smem = static_cast<size_t>(dim_size) < max_elements_per_smem;
-     can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
-     can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
-     can_use_smem &= !(dim_size % ILP);
-     if (can_use_smem) {
-       size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz;
-       cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue>
-         <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size);
-     } else {
-       cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue>
-         <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
-     }
+     if constexpr (use_fast_softmax) {
+       dim3 block(512);
+       size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
+       if (dim_size % ILP == 0) {
+         cunn_SoftMaxForwardGmem<ILP, scalar_t, accscalar_t, accscalar_t, EpilogueWithMul>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       } else {
+         cunn_SoftMaxForwardFast<ILP, scalar_t, accscalar_t, accscalar_t, EpilogueWithMul>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       }
+     } else {
+       dim3 block = SoftMaxForward_getBlockSize(dim_size);
+       size_t smem_reduction_sz = block.x / C10_WARP_SIZE * sizeof(accscalar_t);
+       auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
+         smem_reduction_sz) / sizeof(scalar_t);
+       bool can_use_smem = static_cast<size_t>(dim_size) < max_elements_per_smem;
+       can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
+       can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
+       can_use_smem &= !(dim_size % ILP);
+       if (can_use_smem) {
+         size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz;
+         cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue>
+           <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       } else {
+         cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue>
+           <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size);
+       }
+     }
      C10_CUDA_KERNEL_LAUNCH_CHECK();
@@ -1252,7 +1412,7 @@ TORCH_IMPL_FUNC(log_softmax_cuda_out) (
  const int64_t dim,
  const bool half_to_float,
  const Tensor &output) {
- host_softmax<LogSoftMaxForwardEpilogue,true>(input, dim, half_to_float, output);
+ host_softmax<LogSoftMaxForwardEpilogue, LogSoftMaxForwardEpilogue, true, false>(input, dim, half_to_float, output);
}
TORCH_IMPL_FUNC(log_softmax_backward_cuda_out) (
@@ -1276,7 +1436,11 @@ TORCH_IMPL_FUNC(softmax_cuda_out) (
  const int64_t dim,
  const bool half_to_float,
  const Tensor &output) {
- host_softmax<SoftMaxForwardEpilogue,false>(input, dim, half_to_float, output);
+#if defined(USE_ROCM)
+ host_softmax<SoftMaxForwardEpilogue, SoftMaxForwardWithMulEpilogue, false, true>(input, dim, half_to_float, output);
+#else
+ host_softmax<SoftMaxForwardEpilogue, SoftMaxForwardWithMulEpilogue, false, false>(input, dim, half_to_float, output);
+#endif
}
TORCH_IMPL_FUNC(softmax_backward_cuda_out)
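The fast path computes the usual numerically stabilized softmax, but `blockReduceWarpInverse` caches the reciprocal of the sum so the epilogue multiplies instead of divides, and `__expf` trades a little accuracy for throughput (which is why the path is enabled only under `USE_ROCM`):

$$\operatorname{softmax}(x_i)=\frac{e^{x_i-m}}{\sum_j e^{x_j-m}},\qquad m=\max_j x_j,$$

evaluated by the kernels as $e^{x_i-m}\cdot s$ with $s=\bigl(\sum_j e^{x_j-m}\bigr)^{-1}$ computed once per block.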


@@ -469,11 +469,315 @@ void dispatch_bfloat16_gemm(CUDABLAS_GEMM_ARGTYPES(at::BFloat16)) {
  }
}
void dispatch_bfloat16_gemm_wmma(CUDABLAS_GEMM_ARGTYPES(at::BFloat16)) {
// If any of the shapes can't be tiled, we must use padding.
bool use_padding = ((m % 256 != 0) || (n % 128 != 0) || (k % 64 != 0));
// Dispatch to best implementation.
// TODO add more configurations. Optimize.
bool transa_ = std::tolower(transa) != 'n';
bool transb_ = std::tolower(transb) != 'n';
if (use_padding) {
if(transa_ && transb_) { // col , col
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
true,
true>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(transa_ && !transb_) { // row, col
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
true,
false>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(!transa_ && transb_) { //col, row
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
false,
true>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(!transa_ && !transb_) { //row, row
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
false,
false>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else {
TORCH_CHECK(false, "unreachable");
}
} else {
if(transa_ && transb_) { // col , col
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
true,
true>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(transa_ && !transb_) { // row, col
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
true,
false>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(!transa_ && transb_) { //col, row
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
false,
true>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else if(!transa_ && !transb_) { //row, row
gemm_impl_wmma<
at::BFloat16,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>, 8,
false,
false,
false>
(CUDABLAS_GEMM_ARGS(at::BFloat16));
}
else {
TORCH_CHECK(false, "unreachable");
}
}
}
template <>
void gemm_internal_ck<at::BFloat16>(CUDABLAS_GEMM_ARGTYPES(at::BFloat16)) {
- dispatch_bfloat16_gemm(CUDABLAS_GEMM_ARGS(at::BFloat16));
+ auto dprops = at::cuda::getCurrentDeviceProperties();
+ c10::string_view arch(dprops->gcnArchName);
+ if (arch == "gfx1100") {
+   dispatch_bfloat16_gemm_wmma(CUDABLAS_GEMM_ARGS(at::BFloat16));
+ } else {
+   dispatch_bfloat16_gemm(CUDABLAS_GEMM_ARGS(at::BFloat16));
+ }
}
} // namespace at::native
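`gfx1100` is the RDNA3 consumer architecture (e.g. the Radeon RX 7900 series); it exposes WMMA matrix instructions rather than the MFMA/XDL path CK uses on CDNA datacenter parts, hence the arch-keyed dispatch. A hedged sketch of the same query in plain HIP (note the diff compares the full arch string, while device strings can also carry feature suffixes such as `gfx90a:sramecc+:xnack-`):

```
#include <hip/hip_runtime.h>
#include <cstring>

bool is_gfx1100(int device) {
  hipDeviceProp_t props;
  if (hipGetDeviceProperties(&props, device) != hipSuccess) {
    return false;
  }
  // Match on the "gfx1100" prefix to tolerate feature suffixes.
  return std::strncmp(props.gcnArchName, "gfx1100", 7) == 0;
}
```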


@@ -297,10 +297,314 @@ void dispatch_half_gemm(CUDABLAS_GEMM_ARGTYPES(at::Half)) {
  }
#endif
}
void dispatch_half_gemm_wmma(CUDABLAS_GEMM_ARGTYPES(at::Half)) {
// If any of the shapes can't be tiled, we must use padding.
bool use_padding = ((m % 256 != 0) || (n % 128 != 0) || (k % 64 != 0));
// Dispatch to best implementation.
// TODO add more configurations. Optimize.
bool transa_ = std::tolower(transa) != 'n';
bool transb_ = std::tolower(transb) != 'n';
if (use_padding) {
if(transa_ && transb_) { // col , col
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
true,
true>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(transa_ && !transb_) { // row, col
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
true,
false>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(!transa_ && transb_) { //col, row
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
false,
true>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(!transa_ && !transb_) { //row, row
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
true,
false,
false>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else {
TORCH_CHECK(false, "unreachable");
}
} else {
if(transa_ && transb_) { // col , col
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
true,
true>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(transa_ && !transb_) { // row, col
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
true,
false>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(!transa_ && transb_) { //col, row
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>,
8,
false,
false,
true>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else if(!transa_ && !transb_) { //row, row
gemm_impl_wmma<
at::Half,
256,
128,
256,
64,
8,
16,
16,
4,
4,
S<4, 64, 1>,
S<1, 0, 2>,
S<1, 0, 2>,
2,
8,
8,
true,
S<4, 64, 1>,
S<0, 2, 1>,
S<0, 2, 1>,
1,
1,
8,
true,
1,
1,
S<1, 32, 1, 8>, 8,
false,
false,
false>
(CUDABLAS_GEMM_ARGS(at::Half));
}
else {
TORCH_CHECK(false, "unreachable");
}
}
}
template <>
void gemm_internal_ck<at::Half>(CUDABLAS_GEMM_ARGTYPES(at::Half)) {
- dispatch_half_gemm(CUDABLAS_GEMM_ARGS(at::Half));
+ auto dprops = at::cuda::getCurrentDeviceProperties();
+ c10::string_view arch(dprops->gcnArchName);
+ if (arch == "gfx1100") {
+   dispatch_half_gemm_wmma(CUDABLAS_GEMM_ARGS(at::Half));
+ } else {
+   dispatch_half_gemm(CUDABLAS_GEMM_ARGS(at::Half));
+ }
}
} // namespace at::native


@@ -30,6 +30,7 @@
#include <ck/library/utility/literals.hpp>
#include <ck/tensor_operation/gpu/device/impl/device_gemm_multiple_d_xdl_cshuffle_v3.hpp>
#include <ck/tensor_operation/gpu/device/impl/device_gemm_wmma.hpp>
// Define commonly used types.
template <ck::index_t... Is>
@@ -236,4 +237,180 @@ void gemm_impl(CUDABLAS_GEMM_ARGTYPES(Dtype)) {
invoker.Run(argument, StreamConfig{stream, false});
}
template <
typename Dtype,
int BLOCK_SIZE,
int MBLOCK,
int NBLOCK,
int KBLOCK,
int K1,
int MPER_WMMA,
int NPER_WMMA,
int MPER_WAVE,
int NPER_WAVE,
typename ABLOCK_CLUSTER_LENS,
typename ABLOCK_CLUSTER_ORDER,
typename ABLOCK_SRC_ORDER,
int ABLOCK_VECTOR_DIM,
int ABLOCK_SCALAR_VEC,
int ABLOCK_SCALAR_VEC_K1,
bool ABLOCK_LDS_EXTRAM,
typename BBLOCK_CLUSTER_LENS,
typename BBLOCK_CLUSTER_ORDER,
typename BBLOCK_SRC_ORDER,
int BBLOCK_VECTOR_DIM,
int BBLOCK_SCALAR_VEC,
int BBLOCK_SCALAR_VEC_AK1,
bool BBLOCK_LDS_EXTRAN,
int CMPER_WAVE,
int CNPER_WAVE,
typename CBLOCK_CLUSTER_LENS,
int CNPER_BLOCK,
bool PADDING = false,
bool TRANSA = false,
bool TRANSB = false>
void gemm_impl_wmma(CUDABLAS_GEMM_ARGTYPES(Dtype)) {
// Get input information.
int M = m;
int N = n;
int K = k;
int StrideA = lda;
int StrideB = ldb;
int StrideC = ldc;
int KBatch = 1;
float falpha = alpha;
float fbeta = beta;
using ADataType = typename CkMathType<Dtype>::dtype;
using BDataType = typename CkMathType<Dtype>::dtype;
using CDataType = typename CkMathType<Dtype>::dtype;
using DDataType = typename CkMathType<Dtype>::dtype;
using AccDataType = float;
using CShuffleDataType = typename CkMathType<Dtype>::dtype;
using ALayout = typename CkTensorLayout<TRANSA, TRANSB>::a_layout;
using BLayout = typename CkTensorLayout<TRANSA, TRANSB>::b_layout;
using DLayout = Row;
using CLayout = Row;
using AElementOp = PassThrough;
using BElementOp = PassThrough;
using CElementOp = PassThrough;
static constexpr auto GemmDefault =
ck::tensor_operation::device::GemmSpecialization::Default;
static constexpr auto GemmMNKPadding =
ck::tensor_operation::device::GemmSpecialization::MNKPadding;
static constexpr auto GemmSpec = PADDING ? GemmMNKPadding : GemmDefault;
using DeviceGemmInstance =
ck::tensor_operation::device::DeviceGemmWmma_CShuffle<ALayout,
BLayout,
CLayout,
ADataType,
BDataType,
CDataType,
AccDataType,
CShuffleDataType,
AElementOp,
BElementOp,
CElementOp,
GemmSpec,
1, // NumPrefetch
BLOCK_SIZE,
MBLOCK,
NBLOCK,
KBLOCK,
K1,
MPER_WMMA,
NPER_WMMA,
MPER_WAVE,
NPER_WAVE,
ABLOCK_CLUSTER_LENS,
ABLOCK_CLUSTER_ORDER,
ABLOCK_SRC_ORDER,
ABLOCK_VECTOR_DIM,
ABLOCK_SCALAR_VEC,
ABLOCK_SCALAR_VEC_K1,
ABLOCK_LDS_EXTRAM,
BBLOCK_CLUSTER_LENS,
BBLOCK_CLUSTER_ORDER,
BBLOCK_SRC_ORDER,
BBLOCK_VECTOR_DIM,
BBLOCK_SCALAR_VEC,
BBLOCK_SCALAR_VEC_AK1,
BBLOCK_LDS_EXTRAN,
CMPER_WAVE,
CNPER_WAVE,
CBLOCK_CLUSTER_LENS,
CNPER_BLOCK>;
auto gemm = DeviceGemmInstance{};
auto invoker = gemm.MakeInvoker();
auto a_element_op = AElementOp{};
auto b_element_op = BElementOp{};
auto c_element_op = CElementOp{};
using DDataArrayType = std::array<const void*, 0>;
DDataArrayType DDataArray;
// We swap A and B inputs here as a temporary workaround
auto argument = gemm.MakeArgument(
reinterpret_cast<const ADataType*>(b),
reinterpret_cast<const BDataType*>(a),
reinterpret_cast<CDataType*>(c),
N,
M,
K,
StrideB,
StrideA,
StrideC,
b_element_op,
a_element_op,
c_element_op);
if (!gemm.IsSupportedArgument(argument))
{
printf("error shape = %d %d %d TRANSA=%d TRANSB=%d \n",
n, m, k,TRANSA, TRANSB);
throw std::runtime_error(
"wrong! device_gemm with the specified compilation parameters does "
"not support this GEMM problem");
}
auto stream = at::cuda::getCurrentHIPStream().stream();
#if 1
invoker.Run(argument, StreamConfig{stream, false});
#else
float ave_time = invoker.Run(argument, StreamConfig{stream, true});
std::size_t flop = std::size_t(2) * M * N * K;
std::size_t num_btype =
sizeof(ADataType) * M * K + sizeof(BDataType) * K * N + sizeof(CDataType) * M * N;
float tflops = static_cast<float>(flop) / 1.E9 / ave_time;
float gb_per_sec = num_btype / 1.E6 / ave_time;
std::cout << "Perf: " << std::setw(10) << ave_time << " ms, " << tflops << " TFlops, "
<< gb_per_sec << " GB/s, " << N << " " << M << " " << k << " "
<< "stride: "<<StrideA <<" "<<StrideB <<" "<<StrideC <<" "
<< gemm.GetTypeString()
<< std::endl;
#endif
}
} // namespace at::native
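On the "swap A and B inputs" workaround above: the caller passes cuBLAS-style column-major operands while this CK instance produces a row-major C, and the identity C^T = op(B)^T * op(A)^T lets a row-major kernel produce the column-major result once the operands, the M/N extents, and the strides are swapped, which is exactly what MakeArgument receives. A small host-side illustration of the identity (standalone sketch, not part of this diff):

```cpp
#include <array>
#include <cstdio>

// Row-major 2x2 multiply: C = A * B.
static std::array<float, 4> matmul_rm(const std::array<float, 4>& A,
                                      const std::array<float, 4>& B) {
  return {A[0] * B[0] + A[1] * B[2], A[0] * B[1] + A[1] * B[3],
          A[2] * B[0] + A[3] * B[2], A[2] * B[1] + A[3] * B[3]};
}

int main() {
  std::array<float, 4> A{1, 3, 2, 4}; // column-major [[1,2],[3,4]]
  std::array<float, 4> B{5, 7, 6, 8}; // column-major [[5,6],[7,8]]
  // A row-major product of the swapped operands yields C^T, whose
  // row-major storage is the column-major storage of C = A * B.
  auto Ct = matmul_rm(B, A);
  // Prints C row by row: 19 22 / 43 50.
  std::printf("%g %g\n%g %g\n", Ct[0], Ct[2], Ct[1], Ct[3]);
}
```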


@@ -311,9 +311,8 @@ void gpu_float_sdpa(
bool is_causal,
float softmax_scale,
const Tensor& output) {
auto& eng = GpuEngineManager::Instance().get_engine();
auto& strm = GpuStreamManager::Instance().get_stream();
const auto get_tril_mask = [&]() {
auto opts = query.options();


@@ -338,8 +338,7 @@ class Attr {
// [1, C, 1, 1], channel broadcast
// [dst.shape], no broadcast and element-wise binary operations on dst
auto& engine = GpuEngineManager::Instance().get_engine();
for (size_t i = 0; i < ops_params_.size(); ++i) {
kind_t kind = ops_params_[i].kind_;
if (kind == kind_t::binary) {


@@ -83,9 +83,8 @@ sycl::event convolution(
int64_t groups,
Attr& attr,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last = use_channels_last_for_conv(src, weight);
@ -184,9 +183,8 @@ sycl::event convolution_backward_weights(
IntArrayRef dilation,
int64_t groups,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last = use_channels_last_for_conv(src, diff_dst);
@ -292,9 +290,8 @@ sycl::event convolution_backward_data(
int64_t groups,
bool bias_defined,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last = use_channels_last_for_conv(diff_dst, weight);


@ -158,9 +158,8 @@ sycl::event deconvolution(
int64_t groups,
Attr& attr,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last_suggested = use_channels_last_for_conv(src, weight);
@ -249,9 +248,8 @@ sycl::event deconvolution_backward_data(
int64_t groups,
bool bias_defined,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last_suggested =
use_channels_last_for_conv(diff_dst, weight);
@ -347,9 +345,8 @@ sycl::event deconvolution_backward_weights(
IntArrayRef dilation,
int64_t groups,
const std::vector<sycl::event>& deps) {
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
bool is_channels_last_suggested = use_channels_last_for_conv(src, diff_dst);


@ -30,9 +30,8 @@ sycl::event matmul(
"oneDNN input matrixes must have the same ranks"); "oneDNN input matrixes must have the same ranks");
TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined"); TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined");
at::Device cur_device = at::Device(at::kXPU, c10::xpu::current_device()); auto& engine = GpuEngineManager::Instance().get_engine();
auto engine = GpuEngineManager::Instance().get_engine(cur_device); auto& stream = GpuStreamManager::Instance().get_stream();
auto stream = GpuStreamManager::Instance().get_stream();
at::Tensor m1 = mat1; at::Tensor m1 = mat1;
at::Tensor m2 = mat2; at::Tensor m2 = mat2;


@@ -5,6 +5,7 @@
#include <ATen/native/mkldnn/xpu/detail/Attr.h>
#include <ATen/native/mkldnn/xpu/detail/Utils.h>
#include <ATen/native/mkldnn/xpu/detail/oneDNN.h>
#include <ATen/native/mkldnn/xpu/detail/oneDNNContext.h>
#include <oneapi/dnnl/dnnl.hpp>
@@ -106,9 +107,8 @@ at::Tensor quantized_convolution(
output.defined(),
"A valid output is required for quantized convolution.");
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
// input tensors config
dnnl::memory::dims src_dims = act.sizes().vec();


@ -125,9 +125,8 @@ void quantized_matmul(
attr);
size_t dims = result.dim();
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
at::Tensor m1 = is_onednn_matmul_strides(mat1) ? mat1 : mat1.contiguous();
at::Tensor m2 = is_onednn_matmul_strides(mat2) ? mat2 : mat2.contiguous();


@ -29,8 +29,7 @@ static inline void dnnl_delete(
}
GpuEngineManager::GpuEngineManager() {
c10::DeviceIndex device_count = c10::xpu::device_count_ensure_non_zero();
for (const auto i : c10::irange(device_count)) {
static dnnl::graph::allocator alloc =
dnnl::graph::sycl_interop::make_allocator(dnnl_alloc, dnnl_delete);


@@ -25,10 +25,15 @@ bool set_onednn_verbose(int level);
struct TORCH_XPU_API GpuEngineManager {
static GpuEngineManager& Instance(); // Singleton
dnnl::engine& get_engine(
DeviceIndex device_index = c10::xpu::current_device()) {
c10::xpu::check_device_index(device_index);
return *engine_pool[device_index];
}
dnnl::engine& get_engine(const Device& device) {
TORCH_INTERNAL_ASSERT(device.type() == kXPU);
return get_engine(device.index());
}
GpuEngineManager(GpuEngineManager const&) = delete;
@ -48,16 +53,15 @@ struct TORCH_XPU_API GpuEngineManager {
struct TORCH_XPU_API GpuStreamManager {
static GpuStreamManager& Instance(); // Singleton
dnnl::stream& get_stream(
DeviceIndex device_index = c10::xpu::current_device()) {
auto stream = c10::xpu::getCurrentXPUStream(device_index);
auto priority = stream.priority();
if (stream_pool[device_index][priority].find(stream) ==
stream_pool[device_index][priority].end()) {
stream_pool[device_index][priority][stream] =
std::make_shared<dnnl::stream>(dnnl::sycl_interop::make_stream(
GpuEngineManager::Instance().get_engine(device_index),
stream.queue()));
}
return *stream_pool[device_index][priority][stream];
@ -70,8 +74,7 @@ struct TORCH_XPU_API GpuStreamManager {
protected:
GpuStreamManager() {
c10::DeviceIndex device_count = c10::xpu::device_count_ensure_non_zero();
stream_pool.resize(device_count);
}
~GpuStreamManager() = default;
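The call-site effect of the two new getters is visible throughout this diff: instead of constructing a c10::Device just to fetch the engine, callers now take references keyed by a device index that defaults to the current device. A before/after sketch (assumes a PyTorch XPU build):

```cpp
// Before: build a Device by hand and copy the dnnl::stream wrapper.
auto engine_old = GpuEngineManager::Instance().get_engine(
    {c10::kXPU, c10::xpu::current_device()});
auto stream_old = GpuStreamManager::Instance().get_stream();

// After: index-based getters with a current-device default, returning
// references into the singletons' pools.
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();
```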


@ -19,7 +19,7 @@ static inline c10::ScalarType qconv_decide_out_dtype(
return dst_dtype;
}
static at::Tensor qconv_prepack_xpu(
at::Tensor weight,
at::Tensor weight_scales,
double input_scale,


@ -19,7 +19,7 @@ static inline c10::ScalarType qlinear_decide_out_dtype(
return dst_dtype;
}
static Tensor q_linear_pointwise(
Tensor act,
double act_scale,
int64_t act_zero_point,
@ -78,7 +78,7 @@ Tensor q_linear_pointwise(
return qout;
}
static Tensor q_linear_pointwise_tensor(
Tensor act,
Tensor act_scale,
Tensor act_zero_point,
@ -137,7 +137,7 @@ Tensor q_linear_pointwise_tensor(
return qout;
}
static Tensor q_linear_pointwise_binary(
Tensor act,
double act_scale,
int64_t act_zero_point,
@ -208,7 +208,7 @@ Tensor q_linear_pointwise_binary(
return dim == 3 ? qout.reshape({act.size(0), -1, N}) : qout;
}
static Tensor q_linear_pointwise_binary_tensor(
Tensor act,
Tensor act_scale,
Tensor act_zero_point,
@ -248,7 +248,7 @@ Tensor q_linear_pointwise_binary_tensor(
unary_post_op_algorithm);
}
static at::Tensor q_linear_prepack_onednn(
at::Tensor weight,
std::optional<torch::List<int64_t>> input_shape) {
at::Tensor weight_transposed = weight.transpose(0, 1);


@ -133,6 +133,10 @@ class MetalShaderLibrary {
TensorIteratorBase& iter,
const std::string& name,
std::optional<int64_t> extra = std::nullopt);
void exec_binary_kernel(
TensorIteratorBase& iter,
const std::string& name,
const bool supports_dense = true);
protected:
virtual MTLLibrary_t getLibrary();


@@ -1010,6 +1010,49 @@ void MetalShaderLibrary::exec_unary_kernel(TensorIteratorBase& iter,
}
}
void MetalShaderLibrary::exec_binary_kernel(TensorIteratorBase& iter,
const std::string& name,
const bool supports_dense) {
TORCH_CHECK(iter.common_dtype() != at::kDouble, "float64 is not supported on MPS");
Tensor input = iter.input(0);
Tensor other = iter.input(1);
Tensor out = iter.output();
id<MTLDevice> device = MPSDevice::getInstance()->device();
MPSStream* mpsStream = getCurrentMPSStream();
const uint32_t nDim = iter.ndim();
constexpr uint32_t nOffsets = 3;
const uint32_t numThreads = iter.numel();
dispatch_sync_with_rethrow(mpsStream->queue(), ^() {
@autoreleasepool {
auto computeEncoder = mpsStream->commandEncoder();
if (supports_dense && iter.is_contiguous()) {
const auto kernel_name = fmt::format("{}_dense_{}", name, scalarToMetalTypeString(input));
auto binaryPSO = getPipelineStateForFunc(kernel_name);
[computeEncoder setComputePipelineState:binaryPSO];
mtl_setArgs(computeEncoder, input, other, out);
mtl_dispatch1DJob(computeEncoder, binaryPSO, numThreads);
return;
}
const auto kernel = fmt::format("{}_{}", name, scalarToMetalTypeString(input));
auto kernelDataOffsets = generateKernelDataOffsets(computeEncoder, iter);
auto binaryPSO = getPipelineStateForFunc(kernel);
// this function call is a no-op if MPS Profiler is not enabled
getMPSProfiler().beginProfileKernel(binaryPSO, kernel, {input, other});
[computeEncoder setComputePipelineState:binaryPSO];
mtl_setArgs(computeEncoder, input, other, out);
[computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:3];
mtl_dispatch1DJob(computeEncoder, binaryPSO, numThreads);
getMPSProfiler().endProfileKernel(binaryPSO);
}
});
}
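Call sites later in this diff show the intended usage of the new member function: build a TensorIterator over the operands and dispatch by kernel name, with kernels that only register a strided variant opting out of the contiguous fast path. Sketch (the second wrapper name is illustrative):

```cpp
// Dense fast path is taken automatically for contiguous iterators.
static void fmax_mps_kernel(TensorIteratorBase& iter) {
  lib.exec_binary_kernel(iter, "fmax");
}

// Kernels without a *_dense specialization opt out explicitly.
static void complex_kernel_mps(TensorIteratorBase& iter) {
  lib.exec_binary_kernel(iter, "complex_kernel", /*supports_dense=*/false);
}
```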
MetalShaderLibrary& MetalShaderLibrary::getBundledLibrary() {
static BundledShaderLibary l;
return l;


@@ -1,3 +1,4 @@
#include <c10/metal/indexing.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <metal_stdlib>
@@ -91,59 +92,6 @@ struct polar_functor {
}
};
// Future BinaryTensorIterator
template <typename T, typename F>
using result_of = decltype(::metal::declval<F>()(
::metal::declval<T>(),
::metal::declval<T>()));
template <typename T, typename F>
kernel void binary_indexing(
constant void* input_ [[buffer(0)]],
constant void* other_ [[buffer(1)]],
device void* out_ [[buffer(2)]],
constant uint3* offsets [[buffer(3)]],
uint tid [[thread_position_in_grid]]) {
auto out = (device result_of<T, F>*)((device uint8_t*)out_ + offsets[tid].x);
auto input = (constant T*)((constant uint8_t*)input_ + offsets[tid].y);
auto other = (constant T*)((constant uint8_t*)other_ + offsets[tid].z);
F f;
*out = f(*input, *other);
}
template <typename T, typename F>
kernel void binary_dense(
constant T* input [[buffer(0)]],
constant T* other [[buffer(1)]],
device result_of<T, F>* out [[buffer(2)]],
uint tid [[thread_position_in_grid]]) {
F f;
out[tid] = f(input[tid], other[tid]);
}
#define REGISTER_BINARY_INDEXING_OP(NAME, DTYPE) \
template [[host_name(#NAME "_" #DTYPE)]] kernel void \
binary_indexing<DTYPE, NAME##_functor>( \
constant void* input_, \
constant void* other_, \
device void* out_, \
constant uint3* offsets, \
uint tid); \
template [[host_name(#NAME "_dense_" #DTYPE)]] kernel void \
binary_dense<DTYPE, NAME##_functor>( \
constant DTYPE * input_, \
constant DTYPE * other_, \
device result_of<DTYPE, NAME##_functor> * out_, \
uint tid)
#define REGISTER_BINARY_OP(NAME, DTYPE) \
template [[host_name(#NAME "_" #DTYPE)]] kernel void NAME<DTYPE>( \
constant void* input_, \
constant void* other_, \
device void* out_, \
constant uint3* offsets, \
uint tid)
REGISTER_BINARY_INDEXING_OP(copysign, long);
REGISTER_BINARY_INDEXING_OP(copysign, int);
REGISTER_BINARY_INDEXING_OP(copysign, float);
@ -190,9 +138,7 @@ kernel void complex_mul(
out[1] = input[0] * other[1] + input[1] * other[0];
}
REGISTER_BINARY_OP(complex_mul, float);
REGISTER_BINARY_OP(complex_mul, half);
// Constructs complex tensor from real and imaginary planes
template <typename T>
kernel void complex_kernel(
constant void* real_ [[buffer(0)]],
@ -207,5 +153,15 @@ kernel void complex_kernel(
out[1] = imag[0]; out[1] = imag[0];
} }
#define REGISTER_BINARY_OP(NAME, DTYPE) \
template [[host_name(#NAME "_" #DTYPE)]] kernel void NAME<DTYPE>( \
constant void* input_, \
constant void* other_, \
device void* out_, \
constant uint3* offsets, \
uint tid)
REGISTER_BINARY_OP(complex_mul, float);
REGISTER_BINARY_OP(complex_mul, half);
REGISTER_BINARY_OP(complex_kernel, float);
REGISTER_BINARY_OP(complex_kernel, half);


@@ -1,16 +1,63 @@
#include <c10/metal/indexing.h>
#include <c10/metal/special_math.h>
using namespace c10::metal;
using namespace metal;
DEFINE_UNARY_FLOATING_FUNCTOR(bessel_j0_forward);
DEFINE_UNARY_FLOATING_FUNCTOR(bessel_j1_forward);
DEFINE_UNARY_FLOATING_FUNCTOR(modified_bessel_i0_forward);
DEFINE_UNARY_FLOATING_FUNCTOR(modified_bessel_i1_forward);
DEFINE_UNARY_FLOATING_FUNCTOR(i0);
DEFINE_UNARY_FLOATING_FUNCTOR(i0e);
DEFINE_UNARY_FLOATING_FUNCTOR(i1);
DEFINE_UNARY_FLOATING_FUNCTOR(i1e);
DEFINE_UNARY_FLOATING_FUNCTOR(spherical_bessel_j0);
DEFINE_UNARY_FLOATING_FUNCTOR(entr);
// TODO: Replace me with DEFINE_UNARY_FLOATING_FUNCTOR
// But for some reason instantiating bessel_y[01] on M1/M2 results in
// Failed to created pipeline state object, error: Error Domain=AGXMetalG14X
// Code=3 "Compiler encountered an internal error"
struct bessel_y0_forward_functor {
template <typename T>
inline enable_if_t<is_floating_point_v<T>, T> operator()(const T x) {
return static_cast<T>(bessel_y0_forward(x));
}
template <typename T>
inline enable_if_t<is_integral_v<T>, float> operator()(const T x) {
return bessel_y0_forward(static_cast<float>(x));
}
inline float operator()(const bool x) {
return x ? 0.08825694769620895 : -INFINITY;
}
};
struct bessel_y1_forward_functor {
template <typename T>
inline enable_if_t<is_floating_point_v<T>, T> operator()(const T x) {
return static_cast<T>(bessel_y1_forward(x));
}
template <typename T>
inline enable_if_t<is_integral_v<T>, float> operator()(const T x) {
return bessel_y1_forward(static_cast<float>(x));
}
inline float operator()(const bool x) {
return x ? -0.7812128067016602 : -INFINITY;
}
};
#define REGISTER_SPECIAL(DTI, DTO) \
REGISTER_UNARY_OP(bessel_j0_forward, DTI, DTO); \
REGISTER_UNARY_OP(bessel_j1_forward, DTI, DTO); \
REGISTER_UNARY_OP(modified_bessel_i0_forward, DTI, DTO); \
REGISTER_UNARY_OP(modified_bessel_i1_forward, DTI, DTO); \
REGISTER_UNARY_OP(bessel_y0_forward, DTI, DTO); \
REGISTER_UNARY_OP(bessel_y1_forward, DTI, DTO); \
REGISTER_UNARY_OP(i0, DTI, DTO); \
REGISTER_UNARY_OP(i0e, DTI, DTO); \
REGISTER_UNARY_OP(i1, DTI, DTO); \
REGISTER_UNARY_OP(i1e, DTI, DTO); \
REGISTER_UNARY_OP(spherical_bessel_j0, DTI, DTO); \
REGISTER_UNARY_OP(entr, DTI, DTO)
REGISTER_SPECIAL(float, float);
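The bool specializations above pin down the only two inputs a bool can produce: Y0 and Y1 diverge to -inf at 0, and the hard-coded constants are, to roughly float precision, Y0(1) and Y1(1). A quick host-side cross-check against the C++17 special math functions (assumes a standard library that ships them; not part of this diff):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Cylindrical Neumann (Bessel second kind) functions at x = 1.
  // Expected: roughly 0.0882570 and -0.7812128, close to the constants
  // hard-coded in the bool overloads of the functors above.
  std::printf("Y0(1) = %.17g\n", std::cyl_neumann(0.0, 1.0));
  std::printf("Y1(1) = %.17g\n", std::cyl_neumann(1.0, 1.0));
}
```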


@ -268,12 +268,31 @@ kernel void upsample_bilinear2d(
}
}
struct BilinearFunctor {
inline float operator()(float x) {
x = abs(x);
return x < 1.0 ? 1.0 - x : x;
}
static constant constexpr float area_factor = 1.0;
};
struct BicubicFunctor {
inline float operator()(float x) {
// https://en.wikipedia.org/wiki/Bicubic_interpolation#Bicubic_convolution_algorithm
x = abs(x);
if (x < 1.0) {
return 1.0 + (1.5 * x - 2.5) * x * x;
}
if (x < 2.0) {
return 2.0 - 0.5 * ((x - 5.0) * x + 8.0) * x;
}
return 0;
}
static constant constexpr float area_factor = 2.0;
};
template <typename T, typename F>
kernel void upsample_2d_aa(
constant T* inputData [[buffer(0)]],
device T* outputData [[buffer(1)]],
constant ulong4& input_strides [[buffer(2)]],
@@ -286,15 +305,26 @@ kernel void upsample_bilinear2d_aa(
auto output_x = thread_index % static_cast<uint>(output_sizes.w);
auto output_y = thread_index / static_cast<uint>(output_sizes.w);
(void)align_corners; // Align corners is unused for AA algorithm
F f;
auto x_center = area_pixel_compute_source_index(
scales.x,
output_x,
/*align_corners=*/false,
/*cubic=*/F::area_factor == 2.0);
auto y_center = area_pixel_compute_source_index(
scales.y,
output_y,
/*align_corners=*/false,
/*cubic=*/F::area_factor == 2.0);
auto clamped_scales = max(1.0, scales);
auto x_min =
max(0L, long(floor(x_center - f.area_factor * clamped_scales.x + 1)));
auto x_max = min(
input_sizes.w, long(ceil(x_center + f.area_factor * clamped_scales.x)));
auto y_min =
max(0L, long(floor(y_center - f.area_factor * clamped_scales.y + 1)));
auto y_max = min(
input_sizes.z, long(ceil(y_center + f.area_factor * clamped_scales.y)));
for (int n = 0; n < output_sizes.x; n++) {
for (int c = 0; c < output_sizes.y; c++) {
float res = 0.0;
@ -302,9 +332,9 @@ kernel void upsample_bilinear2d_aa(
constant auto* input =
inputData + n * input_strides.x + c * input_strides.y;
for (auto y = y_min; y < y_max; ++y) {
auto dy = f((y - y_center) / clamped_scales.y);
for (auto x = x_min; x < x_max; ++x) {
auto dx = f((x - x_center) / clamped_scales.x);
auto val = input[x * input_strides.w + y * input_strides.z];
res += val * dx * dy;
ws += dx * dy;
@ -456,6 +486,19 @@ kernel void upsample_bicubic2d_backward(
constant bool& align_corners [[buffer(7)]], \
uint thread_index [[thread_position_in_grid]])
#define INSTANTIATE_UPSAMPLE_2D_AA(NAME, FUNCTOR, DTYPE) \
template [[host_name("upsample_" #NAME "_" #DTYPE)]] kernel void \
upsample_2d_aa<DTYPE, FUNCTOR>( \
constant DTYPE * inputData [[buffer(0)]], \
device DTYPE * outputData [[buffer(1)]], \
constant ulong4 & input_strides [[buffer(2)]], \
constant ulong4 & output_strides [[buffer(3)]], \
constant long4 & input_sizes [[buffer(4)]], \
constant long4 & output_sizes [[buffer(5)]], \
constant float2 & scales [[buffer(6)]], \
constant bool& align_corners [[buffer(7)]], \
uint thread_index [[thread_position_in_grid]])
#define INSTANTIATE_UPSAMPLE_2D_BACKWARD(NAME, DTYPE) \
template [[host_name("upsample_" #NAME "_backward_" #DTYPE)]] kernel void \
upsample_##NAME##_backward<DTYPE>( \
@ -482,11 +525,12 @@ kernel void upsample_bicubic2d_backward(
constant bool& align_corners [[buffer(7)]], \
uint thread_index [[thread_position_in_grid]])
#define INSTANTIATE_UPSAMPLE_ALL(DTYPE) \
INSTANTIATE_UPSAMPLE_2D(bicubic2d, DTYPE); \
INSTANTIATE_UPSAMPLE_2D_AA(bicubic2d_aa, BicubicFunctor, DTYPE); \
INSTANTIATE_UPSAMPLE_2D_BACKWARD(bicubic2d, DTYPE); \
INSTANTIATE_UPSAMPLE_2D(bilinear2d, DTYPE); \
INSTANTIATE_UPSAMPLE_2D_AA(bilinear2d_aa, BilinearFunctor, DTYPE); \
INSTANTIATE_UPSAMPLE_LINEAR(DTYPE);
INSTANTIATE_UPSAMPLE_2D(bilinear2d, uchar);
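BicubicFunctor above is the bicubic convolution kernel with a = -0.5, written in Horner form, and area_factor is the kernel's one-sided support (1 for the bilinear triangle kernel, 2 for bicubic), which is what widens the x_min/x_max and y_min/y_max sampling windows in upsample_2d_aa. A standalone check that the Horner form matches the textbook piecewise polynomial (helper names are illustrative):

```cpp
#include <cmath>
#include <cstdio>

static float functor_form(float x) { // as written in BicubicFunctor
  x = std::fabs(x);
  if (x < 1.0f) return 1.0f + (1.5f * x - 2.5f) * x * x;
  if (x < 2.0f) return 2.0f - 0.5f * ((x - 5.0f) * x + 8.0f) * x;
  return 0.0f;
}

static float reference_form(float x) { // a = -0.5 convolution kernel
  constexpr float a = -0.5f;
  x = std::fabs(x);
  if (x < 1.0f) return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
  if (x < 2.0f) return ((((x - 5.0f) * x + 8.0f) * x) - 4.0f) * a;
  return 0.0f;
}

int main() {
  for (float x = -2.5f; x <= 2.5f; x += 0.25f)
    std::printf("%5.2f %+.6f %+.6f\n", x, functor_form(x), reference_form(x));
}
```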


@ -44,7 +44,8 @@ std::tuple<Tensor, Tensor> _scaled_dot_product_attention_math_mps(const Tensor&
TORCH_CHECK(!attn_mask.has_value(),
"_scaled_dot_product_attention: Explicit attn_mask should not be set when is_causal=True");
}
TORCH_CHECK(query.size(-3) == key.size(-3) && key.size(-3) == value.size(-3),
"number of heads in query/key/value should match");
TORCH_CHECK(dropout_p == 0.0, "_scaled_dot_product_attention_math_for_mps: dropout_p != 0.0 is not supported");
TORCH_CHECK(macOS15_0_plus || (query.is_contiguous() && key.is_contiguous() && value.is_contiguous()),
"_scaled_dot_product_attention_math_for_mps: query, key, and value must be contiguous");
@ -55,6 +56,7 @@ std::tuple<Tensor, Tensor> _scaled_dot_product_attention_math_mps(const Tensor&
auto [q_, sq] = ensure_4d(query);
auto [k_, sk] = ensure_4d(key);
auto [v_, sv] = ensure_4d(value);
std::optional<Tensor> mask_;
if (attn_mask) {
auto maskExpandedDims = query.sizes().vec();


@ -23,54 +23,13 @@
#endif
namespace at::native {
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = mps::MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/BinaryKernel_metallib.h>
#endif
namespace mps {
static void binary_mps_impl(TensorIteratorBase& iter, const std::string func_name, bool supports_dense = true) {
TORCH_CHECK(iter.common_dtype() != at::kDouble, "float64 is not supported on MPS");
Tensor input = iter.input(0);
Tensor other = iter.input(1);
Tensor out = iter.output();
id<MTLDevice> device = MPSDevice::getInstance()->device();
MPSStream* mpsStream = getCurrentMPSStream();
const uint32_t nDim = iter.ndim();
constexpr uint32_t nOffsets = 3;
const uint32_t numThreads = iter.numel();
dispatch_sync_with_rethrow(mpsStream->queue(), ^() {
@autoreleasepool {
auto computeEncoder = mpsStream->commandEncoder();
if (supports_dense && iter.is_contiguous()) {
const auto kernel_name = fmt::format("{}_dense_{}", func_name, scalarToMetalTypeString(input));
auto binaryPSO = lib.getPipelineStateForFunc(kernel_name);
[computeEncoder setComputePipelineState:binaryPSO];
mtl_setArgs(computeEncoder, input, other, out);
mtl_dispatch1DJob(computeEncoder, binaryPSO, numThreads);
return;
}
const std::string kernel = func_name + "_" + scalarToMetalTypeString(input);
auto kernelDataOffsets = generateKernelDataOffsets(computeEncoder, iter);
id<MTLComputePipelineState> binaryPSO = lib.getPipelineStateForFunc(kernel);
// this function call is a no-op if MPS Profiler is not enabled
getMPSProfiler().beginProfileKernel(binaryPSO, kernel, {input, other});
[computeEncoder setComputePipelineState:binaryPSO];
mtl_setArgs(computeEncoder, input, other, out);
[computeEncoder setBuffer:kernelDataOffsets offset:0 atIndex:3];
mtl_dispatch1DJob(computeEncoder, binaryPSO, numThreads);
getMPSProfiler().endProfileKernel(binaryPSO);
}
});
}
void complex_mul_out(const Tensor& input, const Tensor& other, const Tensor& output) {
TORCH_INTERNAL_ASSERT(c10::isComplexType(input.scalar_type()) || c10::isComplexType(other.scalar_type()));
@ -89,43 +48,43 @@ void complex_mul_out(const Tensor& input, const Tensor& other, const Tensor& out
auto iter =
TensorIteratorConfig().add_output(output_as_real).add_input(input_as_real).add_input(other_as_real).build();
lib.exec_binary_kernel(iter, "complex_mul", /*supports_dense=*/false);
}
} // namespace mps
static void fmax_mps_kernel(TensorIteratorBase& iter) {
if (isFloatingType(iter.common_dtype())) {
lib.exec_binary_kernel(iter, "fmax");
} else {
at::maximum_out(const_cast<Tensor&>(iter.output()), iter.input(0), iter.input(1));
}
}
static void fmin_mps_kernel(TensorIteratorBase& iter) {
if (isFloatingType(iter.common_dtype())) {
lib.exec_binary_kernel(iter, "fmin");
} else {
at::minimum_out(const_cast<Tensor&>(iter.output()), iter.input(0), iter.input(1));
}
}
static void copysign_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "copysign");
}
static void nextafter_mps_kernel(TensorIteratorBase& iter) {
TORCH_CHECK_TYPE(isFloatingType(iter.common_dtype()), "nextafter_mps not implemented for non-floating types");
lib.exec_binary_kernel(iter, "nextafter");
}
static void zeta_mps_kernel(TensorIteratorBase& iter) {
TORCH_CHECK_TYPE(isFloatingType(iter.common_dtype()), "zeta_mps not implemented for non-floating types");
lib.exec_binary_kernel(iter, "zeta");
}
static void xlog1py_mps_kernel(TensorIteratorBase& iter) {
TORCH_CHECK_TYPE(isFloatingType(iter.common_dtype()), "xlog1py_mps not implemented for non-floating types");
lib.exec_binary_kernel(iter, "xlog1py");
}
REGISTER_DISPATCH(fmax_stub, &fmax_mps_kernel)
@ -147,7 +106,7 @@ Tensor& polar_out_mps(const Tensor& abs, const Tensor& angle, Tensor& output) {
auto output_as_real = at::view_as_real(output).select(output.dim(), 0);
auto iter = TensorIteratorConfig().add_output(output_as_real).add_input(abs).add_input(angle).build();
lib.exec_binary_kernel(iter, "polar");
return output;
}
@ -163,7 +122,7 @@ Tensor& complex_out_mps(const Tensor& real, const Tensor& imag, Tensor& output)
auto output_as_real = at::view_as_real(output).select(output.dim(), 0);
auto iter = TensorIteratorConfig().add_output(output_as_real).add_input(real).add_input(imag).build();
lib.exec_binary_kernel(iter, "complex_kernel", /*supports_dense=*/false);
return output;
}
} // namespace at::native


@ -14,7 +14,6 @@
#include <ATen/ops/atan2_native.h>
#include <ATen/ops/div_native.h>
#include <ATen/ops/eq_native.h>
#include <ATen/ops/floor_divide_native.h>
#include <ATen/ops/fmod_native.h>
#include <ATen/ops/ge_native.h>
#include <ATen/ops/gt_native.h>
@ -447,19 +446,8 @@ TORCH_IMPL_FUNC(pow_Scalar_out_mps)(const Scalar& base, const Tensor& exp, const
}
}
Tensor& floor_divide_out_mps(const Tensor& self, const Tensor& other, Tensor& result) {
mps::div_mode_template(self, other, "floor", result, "floor_divide_out");
return result;
}
Tensor floor_divide_mps(const Tensor& self, const Tensor& other) {
Tensor output = at::empty_like(self);
mps::div_mode_template(self, other, "floor", output, "floor_divide");
return output;
}
Tensor& floor_divide_mps_(Tensor& self, const Tensor& other) {
return floor_divide_out_mps(self, other, self);
}
static void div_floor_kernel_mps(TensorIteratorBase& iter) {
mps::div_mode_template(iter.input(0), iter.input(1), "floor", iter.output(0), "floor_divide_out");
}
TORCH_IMPL_FUNC(remainder_out_mps)(const Tensor& self, const Tensor& other, const Tensor& output) {
@ -538,4 +526,6 @@ TORCH_IMPL_FUNC(xlogy_out_mps)(const Tensor& self, const Tensor& other, const Te
TORCH_IMPL_FUNC(lerp_Scalar_mps)(const Tensor& self, const Tensor& end, const Scalar& weight, const Tensor& out) {
mps::add_sub_lerp_template(self, end, weight, out, "lerp");
}
REGISTER_DISPATCH(div_floor_stub, &div_floor_kernel_mps);
} // namespace at::native


@ -60,9 +60,25 @@ static void _fused_sgd_with_momentum_kernel_mps_(TensorList params,
const bool is_first_step,
const std::optional<Tensor>& grad_scale,
const std::optional<Tensor>& found_inf) {
if (lr_tensor.is_cpu()) {
return _fused_sgd_with_momentum_kernel_mps_(params,
grads,
momentum_buffer_list,
weight_decay,
momentum,
lr_tensor.item<double>(),
dampening,
nesterov,
maximize,
is_first_step,
grad_scale,
found_inf);
}
TORCH_CHECK_GT(momentum, 0);
TORCH_CHECK(native::check_fast_path_restrictions({params, grads, momentum_buffer_list}));
TORCH_CHECK(lr_tensor.device() == params[0].device(), "lr must be on the same GPU device as the params");
std::vector<std::vector<Tensor>> tensor_lists{params.vec(), grads.vec(), momentum_buffer_list.vec()};
const auto kernel_name = "fused_sgd_momentum_" + scalarToMetalTypeString(params[0].scalar_type());


@ -16,10 +16,18 @@ static void i0_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "i0"); lib.exec_unary_kernel(iter, "i0");
} }
static void i0e_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "i0e");
}
static void i1_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "i1");
}
static void i1e_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "i1e");
}
static void spherical_bessel_j0_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "spherical_bessel_j0");
}
@ -28,8 +36,40 @@ static void entr_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "entr"); lib.exec_unary_kernel(iter, "entr");
} }
static void bessel_j0_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "bessel_j0_forward");
}
static void bessel_j1_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "bessel_j1_forward");
}
static void modified_bessel_i0_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "modified_bessel_i0_forward");
}
static void modified_bessel_i1_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "modified_bessel_i1_forward");
}
static void bessel_y0_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "bessel_y0_forward");
}
static void bessel_y1_kernel_mps(TensorIteratorBase& iter) {
lib.exec_unary_kernel(iter, "bessel_y1_forward");
}
REGISTER_DISPATCH(i0_stub, &i0_kernel_mps)
REGISTER_DISPATCH(special_i0e_stub, &i0e_kernel_mps)
REGISTER_DISPATCH(special_i1_stub, &i1_kernel_mps)
REGISTER_DISPATCH(special_i1e_stub, &i1e_kernel_mps)
REGISTER_DISPATCH(special_bessel_j0_stub, &bessel_j0_kernel_mps)
REGISTER_DISPATCH(special_bessel_j1_stub, &bessel_j1_kernel_mps)
REGISTER_DISPATCH(special_modified_bessel_i0_stub, &modified_bessel_i0_kernel_mps)
REGISTER_DISPATCH(special_modified_bessel_i1_stub, &modified_bessel_i1_kernel_mps)
REGISTER_DISPATCH(special_bessel_y0_stub, &bessel_y0_kernel_mps)
REGISTER_DISPATCH(special_bessel_y1_stub, &bessel_y1_kernel_mps)
REGISTER_DISPATCH(special_spherical_bessel_j0_stub, &spherical_bessel_j0_kernel_mps)
REGISTER_DISPATCH(special_entr_stub, &entr_kernel_mps)
} // namespace at::native
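With these stubs registered, the corresponding torch.special operators execute natively on MPS tensors. An illustrative ATen-level call (assumes a PyTorch build with the MPS backend; not part of this diff):

```cpp
#include <ATen/ATen.h>

int main() {
  auto x = at::rand({4}, at::device(at::kMPS).dtype(at::kFloat));
  auto y0 = at::special_bessel_y0(x); // -> bessel_y0_kernel_mps
  auto i0e = at::special_i0e(x);      // -> i0e_kernel_mps
}
```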

Some files were not shown because too many files have changed in this diff.