Compare commits

...

926 Commits

Author SHA1 Message Date
06c6a81a98 Revert "[ROCm] change preferred blas lib defaults (#150249)" (#150658)
This reverts commit 8b6bc59e9552689e115445649b76917b9487a181.
2025-04-03 20:39:27 -04:00
3b61d5d4e3 Update expected results for pr_time_benchmarks (#150620) 2025-04-03 10:14:13 -04:00
8b6bc59e95 [ROCm] change preferred blas lib defaults (#150249)
* [ROCm] change preferred blas lib defaults (#150212)

Fixes #148883
Fixes #150155

Also adds `at::BlasBackend::Default`. Instinct cards prefer hipBLASLt; everything else prefers rocBLAS.
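A minimal sketch of the knob this changes (my illustration, not from the PR; assumes a ROCm build):

```python
import torch

# query the current preferred BLAS backend, then opt into hipBLASLt explicitly;
# hipBLASLt is what this change makes the default on Instinct cards
print(torch.backends.cuda.preferred_blas_library())
torch.backends.cuda.preferred_blas_library("hipblaslt")
```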

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily

(cherry picked from commit 7a470c932060190b314fe18bc1cec75335e4831f)

* add unit test for preferred_blas_library settings

---------

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-03 09:26:58 -04:00
c2ccaa3c21 [inductor] Fix inductor windows linker error (#150447)
[inductor] Fix inductor windows linker error (#150256)

Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison

(cherry picked from commit 37ebb0b56a3af1a5e8083337b4d670fc70fe23a3)

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-04-02 17:36:04 -07:00
6569576c4e Don't exclude constant_pad_nd in prologue fusion (#150145)
Don't exclude constant_pad_nd in prologue fusion (#149947)

Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix for making a single, contiguous dep for prologues.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(input, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg

(cherry picked from commit 4c57aec5b9a37e23caedfe305fb4577e26254123)

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-04-02 20:24:52 -04:00
5416dff2b2 [Release/2.7][MPS] Warn that torch.compile is a prototype (#150550)
And reference https://github.com/pytorch/pytorch/issues/150121
2025-04-02 14:55:19 -07:00
791265114e Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150572)
This reverts commit 5d4e7d58b42623a9024a84f0050967ff0318dcdb.
2025-04-02 14:47:28 -07:00
7ad8bc7e8b [Windows][inductor] fix blank space break windows file path (#150448)
[Windows][inductor] fix blank space break windows file path (#149388)

Fixes #149310

From origin error message:
```cmd
Command:
cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib

Output:
Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed
coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp
C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
```
Python is installed under the `C:/Program Files/Python310` path, and the blank space breaks the file path.

Solution:
Add quotes around Windows file paths; after the fix:
```cmd
cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include"  /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D  C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512  /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental  C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp  /arch:AVX512  /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib"  "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388
Approved by: https://github.com/jansel

(cherry picked from commit bc1b8730a45e659dca83ec83995c17d4eec9c869)

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-04-02 13:21:44 -07:00
f2ee3f4847 [BE] Fix triton windows build (#150547)
[BE] Fix triton windows build (#150512)

Fixes #150480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150512
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
(cherry picked from commit 8102272d8c5b5a3063446ec67877eea495e6d323)

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2025-04-02 09:54:45 -07:00
dfd39fe14f [cherry-pick] [CI] Disable some tests that are failing in periodic #150059 (#150327)
* [CI] Disable some tests that are failing in periodic (#150059)

Disabling some tests to restore periodic

nogpu avx512 timeout:
59f14d19ae (38492953496-box)

profiler failure: 7ae0ce6360 (38461255009-box)

test_accelerator failure:
87bfd66c3c (39476723746-box)
origin: 146098

test_overrides failure:
bf752c36da (39484562957-box)
origin: 146098

inductor cpu repro:
bb9c426024 (38447525659-box)

functorch eager transforms:
8f858e226b (39488068620-box)
f2cea01f71 (39555064878)
b5281a4a18 (39599355600)
either 148288 or 148261?

2ec9aceaeb/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150059
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet

* disable_CompiledOptimizerParityTests

* Update test/inductor/test_compiled_optimizers.py

---------

Co-authored-by: Catherine Lee <csl@fb.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-01 23:05:14 -07:00
b766c0200a [Cherry-pick] Make PyTorch buildable with cmake-4 (#150460)
* [Cmake] Make PyTorch buildable by CMake-4.x (#150203)

By turning on compatibility mode for protobuf, nnpack, PSimd and FP16, ittapi, TensorPipe and Gloo
Update CMake requirements

 Revert 0ece461ccafe5649d2d0f058ff5477765fd56499 and b0901d62ae2c2e909f91401eacebf3731df20cbe to test that it actually works

TODO:
  - Update/get rid of those libraries

Fixes https://github.com/pytorch/pytorch/issues/150149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150203
Approved by: https://github.com/clee2000

(cherry picked from commit 493c7fa66f82cf781ee0f9d0cc9e305688f0a286)

* Make PyTorch buildable by CMake-4.x on s390x (#150294)

This is a continuation of
https://github.com/pytorch/pytorch/pull/150203
that fixes nightly build on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150294
Approved by: https://github.com/malfet

(cherry picked from commit ab342d3793472c65aaa0b007ca13a98fc9206dc5)

---------

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
2025-04-01 19:37:52 -07:00
a3cd7b0cc4 [MPS] tril op not handling infs correctly (#150479)
[MPS] tril op not handling infs correctly (#149866)

Fixes #149813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet

(cherry picked from commit ba46643df181f37efe594f9dd77b45436e08e6ec)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-04-01 16:30:38 -07:00
8522972133 torch.backends.mkldnn.flags() CM should not warn (#150416)
`torch.backends.mkldnn.flags()` CM should not warn (#150358)

By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829
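A minimal check in the spirit of the added regression test (my sketch, not the actual test):

```python
import warnings

import torch

# entering the context manager must not emit any warning;
# escalate warnings to errors so a stray warning fails loudly
with warnings.catch_warnings():
    warnings.simplefilter("error")
    with torch.backends.mkldnn.flags():
        pass
```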

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman

(cherry picked from commit 6470b373c16017f5cb8f1aa4060bb60632b18160)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-04-01 08:53:27 -07:00
c4b98c8364 [Build] Fix XPU builds inside venv (#150301)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)

 - Fix the build error when users build torch xpu in a python virtual environment. This was because torch-xpu-ops used `${PYTHON_EXECUTABLE}` to get the python path; however, `${PYTHON_EXECUTABLE}` is the system python path, while the pytorch root cmake uses `Python_EXECUTABLE` ([Here](420a9be743/tools/setup_helpers/cmake.py (L310))) https://github.com/intel/torch-xpu-ops/issues/1461
 - code diff (026b2c8c7c..3ee2bd2f13)
   - base commit: 026b2c8c7c92a7b2cec5d26334006e3423251cc6
   - new commit: 3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5

(cherry picked from commit f74d5d576aedf053b7574f3eb06d12417d80625a)

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2025-04-01 08:22:00 -07:00
d10ffd76db [Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#150395)
[Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863)

We found that `pip install cmake` and `conda install cmake` have different behavior.
The reason is that the pip-installed one doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` for alignment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet

Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
(cherry picked from commit ce52674b7651921630019de62323ee0bfd69516d)

Co-authored-by: Stonepia <tong.su@intel.com>
2025-04-01 10:56:57 -04:00
53a13e553d Enabling xpu in OffsetBasedRNGTracker. (#150389)
Enabling xpu in OffsetBasedRNGTracker. (#148360)

Else torch.distributed breaks on xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
(cherry picked from commit f0e1a0838c1245a8763d1c67318b23940a3e9246)

Co-authored-by: _githubsgi <zozoxoxo897@gmail.com>
2025-04-01 10:54:51 -04:00
5745d6a770 [ROCm] cmake 4 workaround for hiprtc (#150361)
[ROCm] cmake 4 workaround for hiprtc (#150324)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324
Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet

(cherry picked from commit 423e4a4568958845da52808e50d1cdd2ba7fa48d)

Co-authored-by: Faa Diallo <Faa.Diallo@amd.com>
2025-04-01 10:44:03 -04:00
60ddcd803e Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)" (#150352)
Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590)"

This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.
2025-03-31 15:25:34 -07:00
f2b3b5c453 [MPS] Fix dot/mm for conj_tensors (#150237)
[MPS] Fix dot/mm for conj_tensors (#150157)

- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve conj property when gathering the views, that fixes `cov` operator

Fixes https://github.com/pytorch/pytorch/issues/148156
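A repro sketch of the fixed behavior (my illustration; assumes an MPS build with complex support):

```python
import torch

a = torch.randn(4, 4, dtype=torch.cfloat, device="mps")
b = torch.randn(4, 4, dtype=torch.cfloat, device="mps")
# mm on a conj()-view must match mm on the materialized conjugate
torch.testing.assert_close(a.conj() @ b, a.conj().resolve_conj() @ b)
```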
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci

(cherry picked from commit 7c65911b11fc1cc7d93045f4cf923058e8a27782)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-03-31 16:13:11 -04:00
71fa7def26 Fix #149806 : Fix path lookup in _preload_cuda_deps (#150068)
Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808)

@pytorchbot label "bug"

Fixes #149806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808
Approved by: https://github.com/jansel

(cherry picked from commit 68b327341c748c869fdd7cb51cd05ab8ad6caaac)

Co-authored-by: Divain <fegnouche@hotmail.fr>
2025-03-31 13:07:56 -07:00
1a6c192dc4 Use schema as source of truth + support ones_like/empty_like (#149775)
Use schema as source of truth + support ones_like/empty_like (#149052)

This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema, which is important as IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
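For illustration (my sketch): the schema carries the type information that the IValue alone cannot disambiguate.

```python
import torch

# memory_format is an Optional[MemoryFormat] in the schema, even though the
# underlying IValue could just as easily be read back as an int
print(torch.ops.aten.ones_like.default._schema)
```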

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD

(cherry picked from commit 988827cdfb6d5946049cac7141a5ca04f2177c0a)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-31 11:03:26 -07:00
e691e92297 Update Doc for Intel XPU Profiling (#150272)
Update Doc for Intel XPU Profiling (#134515)

Updated below two pages for Intel XPU
https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html
https://pytorch.org/docs/stable/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515
Approved by: https://github.com/dvrogozh, https://github.com/malfet

(cherry picked from commit 7aacbab0b32596a3c334dca5d488e4620b79bb5e)

Co-authored-by: Louie Tsai <louie.tsai@intel.com>
2025-03-31 09:42:08 -07:00
2b73f403c7 Pin cmake to 3.31.2 for windows conda install (#150223)
Pin cmake to 3.31.2 for windows conda install (#150185)

Trying to fix nightly failures.
The cmake 4.0 update https://pypi.org/project/cmake/4.0.0/ broke nightly builds.
You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build
and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=
This fixes Windows builds; Linux and MacOS were already fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi

(cherry picked from commit b0901d62ae2c2e909f91401eacebf3731df20cbe)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-28 14:28:41 -07:00
697cd9bbb1 [inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149993)
[inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973)

Fixes #148938

Context:

In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.

But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the TMA-using kernel ran.

This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).

Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.
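A schematic sketch of the fix with hypothetical names (the real change lives in inductor's cpp_wrapper codegen):

```python
# only triton-generated kernels get the injected (null) global scratch arg;
# non-triton calls like init1DTMADescriptor must not receive it
def generate_args_decl(args, is_triton_kernel=True):
    decls = list(args)
    if is_triton_kernel:
        decls.append("nullptr /* global scratch */")
    return ", ".join(decls)

print(generate_args_decl(["buf0", "buf1"]))                           # triton kernel
print(generate_args_decl(["desc", "buf0"], is_triton_kernel=False))   # TMA descriptor init
```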

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg

(cherry picked from commit a8d0c5c92818186119d4a94d98999acc3f549a7e)

Co-authored-by: David Berard <dberard@fb.com>
2025-03-28 16:40:47 -04:00
64ca70f83c Pin cmake==3.31.6 (#150193)
Pin cmake==3.31.6 (#150158)

I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on pypi and our builds are failing with it

Example:
aa70d62041 (39555975425-box)

I guess we have to go change all the cmake_minimum_required to >=3.5?

Backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least manywheel binary builds got past where they were originally failing.

Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet

(cherry picked from commit 0ece461ccafe5649d2d0f058ff5477765fd56499)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-03-28 09:09:29 -07:00
1b84fd1503 Enable fast path for qlinear (static/dynamic) and qadd for AArch64 through ACL directly. (#149435)
* Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)

This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context lengths of 2^3 up to 2^9) of **~50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 08a644a4c4a0f74cf3277e85e265a44a192079c5)

* Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)

This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
Relative performance improvement using OMP_NUM_THREADS=1 is ~15x, using OMP_NUM_THREADS=32 it’s ~5.4x.
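A minimal sketch of the op this accelerates (my illustration, eager mode):

```python
import torch

a = torch.quantize_per_tensor(torch.randn(1024), 0.1, 0, torch.qint8)
b = torch.quantize_per_tensor(torch.randn(1024), 0.1, 0, torch.qint8)
# quantized add with the output's (scale, zero_point)
out = torch.ops.quantized.add(a, b, 0.1, 0)
```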

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585

(cherry picked from commit 6c2db8fab047b8a1d671c3c8dfbdd4c478c6d2e3)

* [Build] Guard per-op headers in ACLUtils.cpp (#149417)

To fix internal build failures, where per-op headers are not generated.
We really should have lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb

(cherry picked from commit 5db3a4ac88ad9a3062a9f64dc64741b820208a91)

---------

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-03-28 07:51:16 -07:00
6b27e11a5b [Release-only] Pin intel-oneapi-dnnl to 2025.0.1-6 (#150132)
[CI] Fix the XPU CI build environment
2025-03-27 16:07:33 -07:00
18a926f547 update release 2.7 xla pin (#150126)
* update release 2.7 xla pin

* fix

* fix
2025-03-27 18:29:03 -04:00
ecd434bea9 Revert "Parallelize sort" (#150128)
Revert "Parallelize sort (#149765)"

This reverts commit 8d2186cd7952336d4f8b3f73648a5c0714a832b9 as it causes an inductor test regression; see 5bed3fafc7/1
2025-03-27 12:04:15 -07:00
5bed3fafc7 [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149432)
[ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)

Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
(cherry picked from commit c0566e0dbf42f633624adb02015742509edcb444)

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-26 17:18:20 -07:00
9b4f085526 [MPS] fix attention enable_gqa crash on mps (#150067)
[MPS] fix attention enable_gqa crash on mps (#149147)

Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet

(cherry picked from commit dd6e9df3d00851c44fb76341f3113fe9223dcfca)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-03-26 17:17:11 -07:00
d29e4c81d9 update aotinductor doc for XPU support (#149935)
update aotinductor doc for XPU support (#149299)

as title. Since the AOTInductor feature starting from 2.7 works on Intel GPU, add the related contents into its doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299
Approved by: https://github.com/guangyey, https://github.com/desertfire

(cherry picked from commit 4ea580568a27e281b96d26d9380c786c2e2116e6)

Co-authored-by: Jing Xu <jing.xu@intel.com>
2025-03-26 18:42:56 -05:00
8d2186cd79 Parallelize sort (#149765)
Parallelize sort (#149505)

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007

(cherry picked from commit 842d51500be144d53f4d046d31169e8f46c063f6)

Co-authored-by: Annop Wongwathanarat <annop.wongwathanarat@arm.com>
2025-03-26 16:47:46 -05:00
b04d8358d9 ci/docker: use NCCL 2.26.2-1 (#149874)
ci/docker: use NCCL 2.26.2-1 (#149778)

Related to #149153

This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip.

Test plan:

After merging rerun nightly linux jobs and validate that nccl version matches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
(cherry picked from commit ddc0fe903f3043246103d71b60a4fff0aeeef9e8)

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 16:36:35 -04:00
d80afc07f0 [cherry-pick] Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540 (#149624)
Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443. After landing: https://github.com/pytorch/pytorch/pull/149351
TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds

2. Cleanup: Remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
2025-03-26 13:33:54 -07:00
84210a82ef [cherry-pick] nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351) (#149546)
* nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)

Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet

* fixed_regenerations

---------

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 13:32:51 -07:00
4268b2f40a [MPSInductor] Move threadfence at the right location (#150037)
[MPSInductor] Move threadfence at the right location (#149437)

Not sure how it worked in the past, but the fence should be before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969, which removed an unnecessary barrier before calling `threadgroup_reduce` functions.
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before this change it produced gibberish; now it works fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci

(cherry picked from commit 61a64c20c402e61027dad4a9e7a192ec0971d1d6)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-26 11:17:09 -07:00
12a6d2a0b8 Add triton as dependency to CUDA aarch64 build (#149945)
Add triton as dependency to CUDA aarch64 build (#149584)

Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705
Hence add the proper constraint to the CUDA 12.8 aarch64 build.

Please note we want to still use:
```platform_system == 'Linux' and platform_machine == 'x86_64'```
For all other builds.

These are prototype binaries only used by the CUDA 12.8 Linux aarch64 build, which we would like to serve from download.pytorch.org.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 9b1127437e6ccf0c55a87607d9f551cc6424ca67)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-26 07:28:27 -07:00
464432ec47 Automate stable CUDA update and linter using min Python verison (#149981)
Automate stable CUDA update and linter using min Python verison (#148912)

1. Fixes: https://github.com/pytorch/pytorch/issues/145571. CUDA stable is the same CUDA version that is published to pypi; it is also used to set the Metadata section in the rest of the whl scripts and to tag the docker releases with the latest tag.
2. Updates min python version used in linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet

(cherry picked from commit 29fd875bc125582f29abbdf5559d3941899680be)

Co-authored-by: atalman <atalman@fb.com>
2025-03-26 07:03:35 -07:00
1f612dafb5 Removing doc references to PRE_CXX11_ABI. (#149952)
Removing doc references to PRE_CXX11_ABI. (#149756)

Fixes #149550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756
Approved by: https://github.com/svekars, https://github.com/atalman

(cherry picked from commit 43ee67e8dc6827fbb7d12a5950ddf6a5c80771dc)

Co-authored-by: Alanna Burke <burkealanna@meta.com>
2025-03-25 14:04:12 -07:00
f63def6ac7 [XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149714)
[XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511
Approved by: https://github.com/EikanWang

(cherry picked from commit ee6a0291653cd507d59bda6bcf5d848099a804d1)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-25 16:04:51 -04:00
3a8e623a9b op should NOT be static in aoti_torch_call_dispatcher (#149644)
op should NOT be static in aoti_torch_call_dispatcher (#149208)

aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet

(cherry picked from commit 740ce0fa5f8c7e9e51422b614f8187ab93a60b8b)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-25 16:03:19 -04:00
bf727425a0 Symintify transpose_ (#149632)
Symintify transpose_ (#149057)

Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi

(cherry picked from commit 8d7c430e84f4ad439ebdc81f9ab496a3665033a4)

Co-authored-by: angelayi <yiangela7@gmail.com>
2025-03-25 15:59:42 -04:00
8c7dbc939f [cherry-pick] [CI] Don't clean workspace when fetching repo (#147994) (#149129)
Revert "[CI] Don't clean workspace when fetching repo (#147994)"

This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.
2025-03-25 15:25:19 -04:00
644fdbad95 [Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149631)
[Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473)

# Motivation
This PR fixes a bug that wrongly decides the zero-point mask setting. Specifically, the code deems the zero-point to be always non-zero because the scale, rather than the zero-point, is used for the judgement. Fortunately, the bug only affects performance; accuracy is not affected.
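Schematically (hypothetical helper name, my sketch of the described bug):

```python
# the mask must be derived from the zero-point itself, not from the scale
def needs_src_zp_mask(scale: float, zero_point: int) -> bool:
    return zero_point != 0  # the buggy version effectively tested `scale != 0`
```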

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473
Approved by: https://github.com/EikanWang, https://github.com/guangyey

(cherry picked from commit d67c1a027e61bd68908bc4c8e5275a983521366c)

Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
2025-03-25 10:07:25 -05:00
fb027c5692 [cherry-pick] Update ExecuTorch pin update (#149539) (#149630)
Update ExecuTorch pin update (#149539)

Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636

Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817

Test Plan:

Monitor  linux-jammy-py3-clang12-executorch test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539
Approved by: https://github.com/larryliu0820

(cherry picked from commit bc86b6c55a4f7e07548a92fe7c9b52ad2c88af35)
2025-03-25 10:03:16 -05:00
3b87bd8b82 Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#149878)
Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)

**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and clang promote it correctly, resolving compatibility issues with armv8 while still working correctly for SVE as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet

(cherry picked from commit 09f7f62cfebb0067b93d227c13fe9a94b51af762)

Co-authored-by: maajidkhann <maajidkhan.n@fujitsu.com>
2025-03-24 13:57:16 -07:00
89b098a677 Add release branch push triggers to inductor-rocm-mi300.yml (#149871)
Add release branch push triggers to inductor-rocm-mi300.yml (#149672)

In similar vein as https://github.com/pytorch/pytorch/pull/149517

When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily

(cherry picked from commit 1eab841185cb2d68a11e2e0604fd96d110778960)

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-03-24 13:24:10 -07:00
4cc4302b32 Do not depend on numpy during the import (#149731)
Do not depend on numpy during the import (#149683)

But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere

(cherry picked from commit 68dfd44e50f59c53698a24985039a27351862963)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-21 11:13:52 -07:00
c632e4fdb8 [ONNX] Expose verification utilities (#149375)
* [ONNX] Expose verification utilities (#148603)

Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms

(cherry picked from commit ebabd0efdddd91e11364e42227b746c419a39be4)

* [ONNX] Update types in VerificationInfo (#149377)

torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-03-20 08:14:38 -07:00
b23bfae9f7 Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031) (#149386)
**Summary**
Previous implementation of shim did not align with the design and it was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the files of MKLDNN backend and re-enable the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-20 08:08:09 -07:00
1b8f496f87 Pin auditwheel to 6.2.0 (#149525) 2025-03-19 16:13:43 -07:00
c236b602ff Add release branch push triggers to rocm-mi300.yml (#149526) 2025-03-19 16:12:59 -07:00
6926f30654 BC fix for AOTIModelPackageLoader() constructor defaults (#149214)
BC fix for AOTIModelPackageLoader() constructor defaults (#149082)

The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire

(cherry picked from commit 5e1b715dda813d8c545378291261b565649df8e5)

Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
2025-03-17 17:16:54 -05:00
483980d7f3 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149242)
[AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)

The omission of model_container_runner_xpu.cpp causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel

(cherry picked from commit 9ad6265d044075d1ceb27cf0f2af7495e586003c)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-17 17:15:39 -05:00
7173a73cf4 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149183)
[regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)

PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking if the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when the tensor is first created and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.

With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, ensuring the device is initialized before we pin a tensor.
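A repro sketch of the regression (my illustration; assumes a CUDA build):

```python
import torch

t = torch.randn(16)   # plain CPU tensor; the device is not yet initialized
p = t.pin_memory()    # must itself trigger lazy device initialization
assert p.is_pinned()  # returned False before this fix
```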

Fixes #149032

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD

(cherry picked from commit 420a9be743f8dd5d6296a32a1351c1baced12f1f)

Co-authored-by: Bartlomiej Stemborowski <bstemborowskix@habana.ai>
2025-03-17 15:37:24 -04:00
7bab7354df [cherry-pick] Revert #148823 - Make dynamism code robust to NotImplementedException (#149160)
Revert "Make dynamism code robust to NotImplementedException (#148823)"

This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverting from RC since it was reverted from the main branch
2025-03-14 10:50:24 -05:00
b1940b5867 Remove runtime dependency on packaging (#149125)
Remove runtime dependency on packaging (#149092)

Looks like after https://github.com/pytorch/pytorch/pull/148924
we are seeing this error in the nightly test:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence removing runtime dependency on packaging since it may not be installed by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98

(cherry picked from commit 65d19a5699afbb0b123b6b264188f5610b925c5e)

Co-authored-by: atalman <atalman@fb.com>
2025-03-13 12:28:50 -04:00
abebbd5113 [Profiler][HPU] Fix incorrect availabilities for HPU (#149115)
[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)

Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD

(cherry picked from commit 75c8b7d9725af59d8348379a4165d1252c4ac208)

Co-authored-by: wdziurdz <witold.dziurdz@intel.com>
2025-03-13 08:09:45 -07:00
cdd7a2c72b [RELEASE ONLY CHANGES] Apply release only changes to release 2.7 (#149056)
* [RELEASE ONLY CHANGES] Apply release only changes to release 2.7

* fix_lint_workflow

* docker_release

* fix_check_binary
2025-03-12 15:44:15 -04:00
d94ea2647c [inductor] Fix profiler tests with latest Triton (#149059)
[inductor] Fix profiler tests with latest Triton (#149025)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang

(cherry picked from commit 488c4480f97e9d537905b0fbcb7236f88dca47f7)

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-03-12 15:21:51 -04:00
924a247fbb [MPS] Enable angle and atan2 for torch.long (#149017)
This check was added by https://github.com/pytorch/pytorch/pull/85817, which introduced no unit tests, and its content seems to be totally unrelated to the title/subject of that PR. Anyway, right now it seems to be working fine on MacOS-13+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017
Approved by: https://github.com/dcci
2025-03-12 04:48:52 +00:00
7b78a2c415 [MPSInductor] Fix argmin/argmax long reductions (#149021)
By adding an additional indexes array for aggregates and populating it when performing partial reductions.

And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager
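A smoke-test sketch of the fixed workload shape (my illustration; assumes an MPS-enabled build):

```python
import torch

# a long argmax reduction through torch.compile
f = torch.compile(lambda x: x.argmax(dim=0))
x = torch.randn(4096, device="mps")
assert f(x) == x.argmax(dim=0)
```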

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004, #149020
2025-03-12 04:39:29 +00:00
758522d56a [MPSInductor][EZ] Fix argmin/max signatures (#149020)
threadgroup_argmin used to return the input type, which is wrong; it should have returned `int` or `long`.

Change the signatures of both threadgroup_argmin and threadgroup_argmax to return int; as the group size is small, there is no need to carry over large integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004
2025-03-12 04:39:29 +00:00
fe22db9cc3 [MPSInductor] Fix min/max reductions over large dims (#149004)
Simple followup after sum/prod

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975
2025-03-12 04:39:19 +00:00
2a7e997b3f test/dynamo/test_utils: Fix one broken test on different python versions (#148987)
We correctly handled different python versions in the explicit ir_nodes test, but
didn't handle them in the dynamo_timed test. Just explicitly delete the fields
there so the dynamo_timed test passes on all python versions.

(I noticed it breaking on 3.13).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
2025-03-12 02:11:08 +00:00
e40a9e602b Add the max_autotune tests in the periodic jobs. (#143560)
To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-12 01:47:46 +00:00
60576419a2 Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
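A repro sketch of the pattern (my illustration, not the PR's test):

```python
import torch

class Model(torch.nn.Module):
    @property
    def feature_dim(self):  # common prod pattern: an unimplemented property
        raise NotImplementedError

    def forward(self, x):
        return x + 1

# dynamo's dynamism inspection must tolerate the raising property
torch.compile(Model())(torch.ones(3))
```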

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-12 01:01:57 +00:00
46f096bba6 Explicitly set use-ephemeral runners for windows nightly cpu test jobs (#149001)
This PR migrated windows builds to use ephemeral runners: https://github.com/pytorch/pytorch/pull/134463 however missed test jobs.

Explicitly set use-ephemeral runners for windows nightly cpu tests.
Please note we should be using already ephemeral runners for these after: https://github.com/pytorch/test-infra/pull/6377 (recently migrated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149001
Approved by: https://github.com/malfet
2025-03-11 23:51:39 +00:00
5b60749e9e [cudagraph] add log for skip reasons (#148797)
Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph

Test Plan: {F1975906635}

Differential Revision: D70820791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797
Approved by: https://github.com/masnesral
2025-03-11 23:31:48 +00:00
98a2d905bf [MPSInductor] Fix large prod and sum reductions (#148975)
After this change, if the reduction dimension is larger than `max_threadgroup_size`, a `for` loop is emitted from `codegen_iteration_ranges_entry` and wrapped up in `codegen_body()`.
I.e., after this change the following command
```
% TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86-
```
will emit following shader
```metal
#include <c10/metal/random.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <c10/metal/reduction_utils.h>
kernel void generated_kernel(
    device float* out_ptr1,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[1024];
    tmp_acc_0[r0_index] = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        if (r0_0 >= 2047) break;
        auto tmp0 = in_ptr0[2*r0_0];
        auto tmp2 = in_ptr0[1 + 2*r0_0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp3 = 0.4;
        auto tmp4 = tmp2 + tmp3;
        auto tmp5 = metal::precise::cos(tmp4);
        auto tmp6 = tmp1 + tmp5;
        tmp_acc_0[r0_index] += tmp6;
    }
    auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024);
    auto tmp8 = 3.14;
    auto tmp9 = tmp7 - tmp8;
    out_ptr1[0] = static_cast<float>(tmp9);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #148969
2025-03-11 22:46:41 +00:00
2dcdb4ba78 [ez] include config as part of __all__ in torch.compiler (#148978)
Right now we are susceptible to a race condition where, if torch.compiler.config is not implicitly imported via dynamo/builder.py, we will throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.

Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```

Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
2025-03-11 21:58:38 +00:00
a6459afb0e [dynamic shapes] add backed_size_oblivious option (#148696)
Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size backed symbols, opting into size-oblivious semantics for them.

Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols when we make an unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids end-of-bounds, sys.maxsize constraint violations when exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)

May look towards turning this on globally for export.
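Usage is a one-line opt-in (per the description above):

```python
import torch.fx.experimental._config as fx_config

# backed size symbols now get [0, inf] ranges and size-oblivious semantics
fx_config.backed_size_oblivious = True
```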

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
2025-03-11 21:52:34 +00:00
53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with the 2d input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
b98af95401 Fix DCP link (#148974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974
Approved by: https://github.com/svekars
2025-03-11 21:26:37 +00:00
6119ffc711 [ROCm][TunableOp] Fix TunableOp BLAS logging for online tuning case. (#148979)
In a previous PR https://github.com/pytorch/pytorch/pull/147034, there was a bad merge at the last minute.
BLAS logging works for offline tuning, but does not currently work for online tuning.

This PR fixes BLAS logging for online tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148979
Approved by: https://github.com/jeffdaily
2025-03-11 21:20:04 +00:00
e5fef8a08e [CI] Don't clean workspace when fetching repo (#147994)
Tested on 874c5dc4c98cc63a06bfc900d03683b02f110d7c
Also tested on https://github.com/pytorch/pytorch/actions/runs/13798178199/job/38594767529?pr=148995#step:4:12

Don't remove the workspace when fetching. The checkout action performs `git clean -ffdx` to remove untracked files and files in gitignore.

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-11 21:10:56 +00:00
72d9f88ef2 [release] Move triton pin to latest triton release/3.3.x (#148971)
This branch contains latest AMD cherry-picks:
https://github.com/triton-lang/triton/pull/6171
https://github.com/triton-lang/triton/pull/6165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148971
Approved by: https://github.com/danzimm
2025-03-11 21:10:42 +00:00
e6ef0620cc Add shim.h C API to call dispatcher on our own aten ops (#148832)
This PR still needs testing through some cpp extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #148124
2025-03-11 21:02:04 +00:00
cf19efd3d9 Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506)
Summary:
**Codegen**

- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in the proxy executor, so we do not need to declare torchbind args in cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.

**Serialization**
Add support for torchbind object in serialization

- For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, *args, **kwargs)`.
- ir.TorchBindObject inputs are serialized to `as_custom_obj` Argument.

**Packaging**

Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.

The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.

The torchbind objects are stored in data/constants/ folder in pt2 archive.
The format of torchbind objects are `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`. e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.

Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.

This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.

**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69490718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
2025-03-11 20:55:18 +00:00
f69e58e8e8 [CI] Update crossvit_9_240 as pass (#148989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
b54cf1a281 Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)"
This reverts commit 73c8068cf889829fb811fc75baac03163c9a42ee.

Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](73c8068cf8) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))
2025-03-11 20:50:23 +00:00
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7,
and it will likely remain the same in 2.8, so it's time to remove the
underscore from its name. A short usage sketch follows.
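
A short usage sketch (the kernel source is illustrative, and attribute-style kernel dispatch on the compiled library is an assumption here):

```py
import torch

lib = torch.mps.compile_shader("""
kernel void fill_ones(device float* out [[buffer(0)]],
                      uint idx [[thread_position_in_grid]]) {
    out[idx] = 1.0;
}
""")
x = torch.zeros(16, device="mps")
lib.fill_ones(x)  # assumed: kernels are exposed as attributes on the library
```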


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
abcec55532 gracefully handle tokenize.TokenError in funcname parser. Adds support for non-Python source (#148737)
This change allows defining python functions in non-python source and having them be compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-Python source code too (see the Python sketch after the example below).

Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)

(defn f {:decorators [torch/compile]} [x]
  (* (F/relu x) x))

(f (-> (torch/randn 100)
       (.cuda)))
```
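
A rough Python sketch of the behavior change (a hypothetical helper, not the actual dynamo code): tokenization errors now yield an empty funcname cache, just like an unreadable file.

```py
import io
import tokenize

def build_funcname_cache(source: str) -> dict:
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    except (tokenize.TokenError, SyntaxError, IndentationError):
        return {}  # non-Python source: behave as if the file couldn't be read
    # map line number -> name tokens, loosely mimicking a funcname cache
    return {tok.start[0]: tok.string
            for tok in tokens if tok.type == tokenize.NAME}
```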

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
2025-03-11 19:49:28 +00:00
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
5b8da17681 [cutlass backend] Add addmm and bmm tests for AOTI (#148929)
Needs to do:
1. Expand addmm tests to cover all 4 shapes
2. Add dynamic shape support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148929
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-03-11 19:38:24 +00:00
7b2ecb80eb [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148928)
Differential Revision: D70908557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148928
Approved by: https://github.com/angelayi
2025-03-11 19:36:30 +00:00
61f9b50e09 [ROCm] Fix TORCH_CHECK for hdim 512 support added in AOTriton 0.9b (#148967)
Fixes #148850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148967
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-03-11 19:21:10 +00:00
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
4d10da731b [ROCm] CK Memory-Efficient Attention (attention bias support) (#147778)
Implements CK as the backend for memory-efficient attention with a couple of caveats:

- Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")`
- Does NOT support Nested Tensors

Using the mem_eff path allows us to use an attention bias with a CK sdpa backend; a short sketch follows.
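
A minimal sketch (shapes are illustrative; assumes a ROCm build where this backend is available):

```py
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

torch.backends.cuda.preferred_rocm_fa_library("ck")

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.half)
k, v = torch.randn_like(q), torch.randn_like(q)
bias = torch.randn(2, 8, 128, 128, device="cuda", dtype=torch.half)

with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```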

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778
Approved by: https://github.com/houseroad
2025-03-11 19:02:59 +00:00
a1cb67b69e [ROCm] Improve backwards indexing when stride is not one (#147630)
Improve backwards indexing when stride is not one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147630
Approved by: https://github.com/jeffdaily
2025-03-11 19:02:48 +00:00
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
fb53e9e514 Add __context__/__cause__/__suppress_context__/__traceback__ to Exception (#146499)
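A rough sketch of the kind of user code this enables (the function body is illustrative):

```py
import torch

@torch.compile
def f(x):
    try:
        try:
            raise ValueError("inner")
        except ValueError as e:
            raise RuntimeError("outer") from e
    except RuntimeError as e:
        # __cause__/__context__ are now modeled on dynamo's exception objects
        if isinstance(e.__cause__, ValueError):
            return x + 1
    return x

f(torch.randn(3))
```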
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
57ee821a41 fix dynamo ide (#148849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849
Approved by: https://github.com/bobrenjc93
2025-03-11 18:43:30 +00:00
883fb78c7e Update jinja2 version in requirements-gha-cache.txt
As the previous version is vulnerable to CVE-2025-27516.
This closes the Dependabot report.
2025-03-11 11:42:38 -07:00
5ee9dbc0a1 Bump jinja2 from 3.1.5 to 3.1.6 in /.ci/docker (#148812)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-11 11:39:55 -07:00
cyy
a5f6b24d87 Remove outdated skipIfRocmVersionLessThan decorations (#148941)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941
Approved by: https://github.com/jeffdaily
2025-03-11 18:37:40 +00:00
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
b366f33606 [MPSInductor] Prep for multistage reductions (#148969)
----

- Move reduction variable initialization from `loads` to  `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops perform barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations

Checked that the number of before/after failures stays at `164 failed, 616 passed, 53 skipped`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
2025-03-11 18:35:23 +00:00
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
e0d4c43ad1 Add env for disabling meta reference on functionalization. (#148822)
Fix: https://github.com/pytorch/xla/issues/8755

This PR introduces `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE`
environment variable. Setting this variable makes it so the
functionalization kernels won't run the meta reference, which is used to
propagate expected sizes and strides.

Currently, PyTorch/XLA doesn't actually propagate the correct strides
to its tensors. It was also shown that calling these meta functions may
incur significant overhead (see the sketch after the list below).

Running the provided minimal reproducer (see issue), we see a speedup
close to 4.3x:

- Baseline: 0.0747s
- `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s
- `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s

In summary, this PR:

- Creates the `disable_meta_reference()` function, which checks whether
  the environment variable is set
- Modifies codegen for functionalization kernels, adding the call to
  `disable_meta_reference()` function to the appropriate conditions
- Creates a new bash function for running `lazy/test_ts_opinfo.py` with
  the environment variable set
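
A minimal sketch of opting in; setting the variable before importing torch is an assumption made here to be safe about when it is read:

```py
import os

os.environ["TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE"] = "1"

import torch  # functionalization kernels now skip the meta reference
```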
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822
Approved by: https://github.com/bdhirsh
2025-03-11 16:13:35 +00:00
09029010e5 [inductor] Fix create_specialize_impl error in latest Triton (#148933)
```py
$ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1
WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir
    specialization = _get_specialization(ordered_args.values())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization
    specialize_impl = triton.runtime.jit.create_specialize_impl()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933
Approved by: https://github.com/yanboliang, https://github.com/davidberard98
2025-03-11 15:54:47 +00:00
16560d4e8f Revert "Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)"
This reverts commit 0fa0a740958ffc474843ceb1d19ee43c4bff4c09.

Reverted https://github.com/pytorch/pytorch/pull/148875 on behalf of https://github.com/ZainRizvi due to That torch.version failure you got in CI was a legitimate failure and is now breaking trunk. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13778023702/job/38534207536) [HUD commit link](0fa0a74095) ([comment](https://github.com/pytorch/pytorch/pull/148875#issuecomment-2714757288))
2025-03-11 15:27:25 +00:00
3945954741 Bump triton pin. Add aarch64 triton build (#148705)
1. Bumps pin for triton to release/3.3.x branch
2. Bump pin for triton-xpu
3. Remove ROCm xfail tests
4. Add aarch64 triton build:
	* Depends on: https://github.com/pytorch/pytorch/pull/148768
	* Fixes: https://github.com/pytorch/pytorch/issues/130558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148705
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/EikanWang
2025-03-11 15:12:21 +00:00
c983e1124c Revert "[WIP] Initial implementation of Grouped Gemm API (#148531)"
This reverts commit ff29791ed8f815bdbca1a5606de046380baca69d.

Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))
2025-03-11 14:40:58 +00:00
f1787ee0f7 [dynamo] Remove L scoping for recompilation messages (#148917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917
Approved by: https://github.com/williamwen42
2025-03-11 14:26:26 +00:00
992838e702 [dynamo][guards] Do not ID_MATCH on numpy tensors (#148923)
Might help with https://github.com/pytorch/pytorch/issues/148535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923
Approved by: https://github.com/jansel
2025-03-11 14:20:26 +00:00
ee21ccc816 Skip ao_sparsity TestComposability for missing FBGEMM (#144146)
Those tests (from test_ao_sparsity) require FBGEMM, which may not be available, so add the skip decorator.

Fixes #87364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144146
Approved by: https://github.com/jerryzh168, https://github.com/jcaip
2025-03-11 13:02:18 +00:00
da4bb72a71 Backout D70075331 (#148824)
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"

So we revert D70075331 as a workaround now.

Test Plan: The model could be lowered and published successfully. e.g. 702869739_16

Differential Revision: D70823254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
9ad64ce795 [triton 3.3] Forward-fix mm template selection logic (#148924)
Follow-up from https://github.com/pytorch/pytorch/pull/148662.

The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second, AMD-specific template only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3".

Tested locally to verify that it fixes the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison
2025-03-11 09:05:44 +00:00
2bcc3acb90 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2025-03-11 08:02:30 +00:00
41e4728f74 update types on dynamo configs (#146873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873
Approved by: https://github.com/williamwen42
2025-03-11 05:33:48 +00:00
1fcc4bc109 Don't look at TESTING_ONLY in fuzzer (#146870)
Lots of configs aren't meant to be set because they're for testing only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870
Approved by: https://github.com/masnesral
2025-03-11 05:32:25 +00:00
bed92a8523 [Windows][Inductor UT] Fix tempfile.NamedTemporaryFile(delete=True) not working on Windows. (#148632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632
Approved by: https://github.com/jansel
2025-03-11 05:05:15 +00:00
ecfbfe1603 [AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907)
Summary: shim.h is only meant for generic tensor util shim functions. We should switch to using the auto fallback generation, but it will need some extra care on the op schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907
Approved by: https://github.com/janeyx99
2025-03-11 04:41:05 +00:00
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
ff29791ed8 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100; I don't know how the cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 02:41:09 +00:00
621dadd4ca partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097)
Fixes https://github.com/pytorch/pytorch/issues/144095

open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:

(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.

I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?

Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] == y.shape[0] - 1
    ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
2025-03-11 02:11:57 +00:00
8c45d44abb Skip distributed subprocess test internally as they don't work (#148909)
Follow up from https://github.com/pytorch/pytorch/pull/146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148909
Approved by: https://github.com/janeyx99
2025-03-11 02:07:45 +00:00
457ff9b7ae [reland][ca] side-effect free inital trace: compiled_args (#148376)
This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc.
Reland of https://github.com/pytorch/pytorch/pull/147804; the original had a bad import inserted by my linter.

Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
9fddbf3417 Update the comment (#148726)
Differential Revision: D70747931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726
Approved by: https://github.com/yf225
2025-03-11 01:19:14 +00:00
0fa0a74095 Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)
Fix `FIXME` in `test_torch.py` by moving test-cases to `test_indexing.py`

```python
# FIXME: move to test indexing
# FIXME: move to indexing test suite
```

- Move tests in `test/test_torch.py` to `test_indexing.py`
- Remove `FIXME` comments

## TestResult

```bash
pytest test/test_torch.py -k TestTorchDeviceType -vv
pytest test/test_indexing.py  -k TestIndexing -vv
```

![image](https://github.com/user-attachments/assets/49a80985-e74a-4da6-a063-476e87e6aa8a)

![image](https://github.com/user-attachments/assets/77afa936-5dba-480c-b293-eb1f7bc74420)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148875
Approved by: https://github.com/soulitzer
2025-03-11 01:01:59 +00:00
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
cyy
295f2ed4d1 Fix "invalid application of 'sizeof' to an incomplete type" (#148854)
Fixes builds with C++23 and constexpr std::unique_ptr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854
Approved by: https://github.com/Skylion007
2025-03-11 00:40:00 +00:00
cyy
a6e71dbc88 Enable ASAN on inductor CUDA tests (#148749)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148749
Approved by: https://github.com/jansel
2025-03-10 23:53:40 +00:00
b215841ebb [MM] Add sm carveout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using the following to verify; we still need to figure out how to do proper guarding.

This does do the correct thing if we compile with the SM carveout already set, but since we don't guard on it just yet, we don't recompile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
5bbca7d328 [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097)
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) to be provided, and clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored, as they are interpreted as unrecognized compiler options.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
2025-03-10 22:47:15 +00:00
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
12a95390ae [Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784)
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copies over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior of existing classes does not change

Test Plan: CI

Differential Revision: D70799942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
2025-03-10 22:22:16 +00:00
fcb633fafa Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892)
Importable https://github.com/pytorch/pytorch/pull/148836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892
Approved by: https://github.com/albanD
2025-03-10 22:22:10 +00:00
98b3f1db9f [Flex Attention] support num_heads > 1 in block_mask (#148857)
Previously, flex decoding errored when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.

This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel.

When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to the flex attention kernel, so users don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.

Why fallback? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires supporting different kv num blocks for different query heads in the same thread block, leading to lots of redundant work. So we should rather use the main flex_attention kernel, where each query head is handled by a separate block. A sketch of the newly supported case follows.
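
A minimal sketch of the newly supported case (shapes are illustrative; assumes a CUDA device, which `create_block_mask` defaults to):

```py
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, H, Q, KV, D = 2, 8, 128, 128, 64  # Hq == Hkv

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

mask = create_block_mask(causal, B, H, Q, KV, device="cuda")  # num_heads = Hq, not 1
q = torch.randn(B, H, Q, D, device="cuda")
k = torch.randn(B, H, KV, D, device="cuda")
v = torch.randn(B, H, KV, D, device="cuda")
out = flex_attention(q, k, v, block_mask=mask)
```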

Fixes #148527
Fixes #147267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
2025-03-10 22:02:50 +00:00
6ef15c7f46 [pytorch] Update flexattention bwd config generation (#148600)
Summary: Currently `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, returned at the end.

Test Plan: CI. Existing tests.

Differential Revision: D70649316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
2025-03-10 22:00:56 +00:00
8701b302cc setuptools pinning (#148879)
Fixes #148877

---

On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version, and it is causing an issue in `pytorch` with the following error:

```
AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'?
```

Last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/)

Currently it is affecting the Windows ARM64 nightly build; however, it might soon affect Windows x64 builds as well. (The conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools).)

Locally, both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2)

---

This PR pins the `setuptools` version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879
Approved by: https://github.com/seemethere
2025-03-10 21:29:32 +00:00
c652772af7 [aarch64] install ninja for docker to build triton on arm (#148768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148768
Approved by: https://github.com/atalman, https://github.com/Skylion007

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 21:28:53 +00:00
b706044cca [ROCm][Windows] Enable hipblaslt for Windows (#148563)
This PR adds hipblaslt library as one of the Windows' dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563
Approved by: https://github.com/jeffdaily
2025-03-10 21:07:16 +00:00
2a1eeaeed8 Remove 12.4 x86 builds and 12.6 sbsa builds from nightly (#148895)
https://github.com/pytorch/pytorch/issues/145570

redo https://github.com/pytorch/pytorch/pull/148625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148895
Approved by: https://github.com/atalman
2025-03-10 20:55:09 +00:00
4a2173d9a0 [cutlass backend][ez] Incorporate AOTI dynamic shape test into main test of MM (#148786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148786
Approved by: https://github.com/jingsh
2025-03-10 20:35:10 +00:00
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), which includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
ed969d1236 [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825)
Summary:
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825
Approved by: https://github.com/fduwjj, https://github.com/mori360
2025-03-10 20:04:36 +00:00
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegments` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
2e4874e48d Update RELEASE.md with latest changes to release process and release 2.7 information (#148888)
1. Update for Release 2.7 compatibility matrix
2. Remove mention of builder project, the scripts for release management were migrated to test-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148888
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-03-10 19:20:27 +00:00
clr
6b0fd741d1 dynamo: Count number of opcodes processes (#147149)
This gives us a decent proxy for how big of a graph we functionally had to parse.

Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well. A sketch of reading the counters follows.
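
A rough sketch of reading dynamo's cumulative counters after a compile (which bucket/key holds the opcode count is an assumption, so this just dumps everything):

```py
import torch
from torch._dynamo.utils import counters

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))
for bucket, values in counters.items():
    print(bucket, dict(values))  # cumulative across compilations
```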

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
2025-03-10 19:20:09 +00:00
3129faf8be Optimize shard_dim_alltoall to use alltoall_single (#148868)
As titled. Previously shard_dim_alltoall used `all_to_all`, which could incur lots of copies if the tensor becomes non-contiguous during splits, and alltoall itself also incurs copies.

This PR uses alltoall_single instead, so that we minimize tensor copies; a rough sketch of the swap follows.
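
A rough sketch of the swap (heavily simplified; the real DTensor code also handles splits, reshapes, and uneven shards):

```py
import torch
import torch.distributed as dist

def shard_dim_alltoall_sketch(t: torch.Tensor) -> torch.Tensor:
    # assumes an initialized process group with equal-sized shards per rank
    inp = t.contiguous()              # one contiguous send buffer
    out = torch.empty_like(inp)
    dist.all_to_all_single(out, inp)  # single collective, no per-split copies
    return out
```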

tested on all the shard dim change tests and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
2025-03-10 18:38:12 +00:00
ed7e964f2b codecache.py: use str.format rather than % formatting (#148691)
Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691
Approved by: https://github.com/desertfire
2025-03-10 18:33:58 +00:00
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly, without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
00cabd4235 [Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733)
For timeout reasons, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
And without the UTs, we can't ensure Windows inductor quality.

The Intel team will do some local testing for Windows inductor, but we still need to add a switch to turn on the full Windows inductor UTs.

The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setting this environment variable, we can turn on all Windows inductor UTs. It will not affect PyTorch CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-03-10 18:25:29 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting and view/memory operators, treating them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.

The one question I'm not sure of is whether we need to explicitly disable triton template fusion, since it would fuse in these dtypes as uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
cyy
203dd18c5c Bump Clang-tidy to 19.1.4 (#148648)
Clang-tidy 19 has more powerful clang-analyzer checks to detect subtle bugs. New checks such as misc-use-internal-linkage can help identify potential static variables or functions, thus reducing binary sizes.

Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648
Approved by: https://github.com/Skylion007
2025-03-10 17:32:30 +00:00
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
2ec9aceaeb Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)"
This reverts commit 3680e666d8ceaa43069555f821d1e8a5de01d5ab.

Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))
2025-03-10 16:29:40 +00:00
9dbc2527dc Disable some SVE autovec (#148489)
Summary: autovec miscompiles on patterns of the type:
```cpp
for (const auto i : c10::irange())
```
Same issue as described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 and addressed by https://github.com/pytorch/pytorch/pull/137795 for gcc, but not clang

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

Differential Revision: D70422723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148489
Approved by: https://github.com/malfet
2025-03-10 16:25:00 +00:00
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
cyy
b8b1b364c9 Fix invalid format string in libfmt calls (#148855)
Wrap shaderSource inside fmt::runtime because the format string is not a string literal and can't pass libfmt's compile-time check in C++23.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148855
Approved by: https://github.com/Skylion007
2025-03-10 14:47:52 +00:00
a81751d8b7 [CD] Annotate linux/arm64 cuda wheels with consistent nvidia dependencies (#145021)
This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv`

The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml`

`uv` expects consistency:

https://github.com/astral-sh/uv/issues/10693
>Frankly, it's really not ideal that they change their dependencies from wheel to wheel.
>They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷‍♀

https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792
>I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent.

To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions.

I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs.

(The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 14:39:39 +00:00
4fdd076907 [CD] Add triton xpu as dependency of torch xpu windows whl (#148755)
Depends on PR #147637 land

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148755
Approved by: https://github.com/atalman
2025-03-10 14:04:30 +00:00
31625b08b8 Add ccode for FloorDiv (#148727)
Summary: Add ccode for FloorDiv

Test Plan: CIs

Differential Revision: D70749021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727
Approved by: https://github.com/bobrenjc93
2025-03-10 14:00:18 +00:00
2068235c0a Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788)
After https://github.com/pytorch/pytorch/pull/148612, this model has become flaky.

Tracking this regression in an issue : https://github.com/pytorch/pytorch/issues/148699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-03-10 13:40:41 +00:00
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline docs for their exact semantics
- Use the newly added isBuilt for the accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
098494e9cb [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.
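
A minimal sketch of the now-supported pattern:

```py
from collections import deque

import torch

@torch.compile
def f(x):
    d = deque([x, x + 1])  # module-level import used inside compiled code
    d.appendleft(x - 1)
    return d[0] + d[-1]

f(torch.randn(4))
```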

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-10 13:14:05 +00:00
59f14d19ae Implement gradient for the residuals of torch.linalg.lstsq (#148526)
Fixes #147543.

I have written some tests in Python using `gradcheck`; a minimal sketch of the pattern follows. Please advise where I should put these tests.
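
A rough sketch of such a test (the driver choice and shapes are assumptions; residuals are only populated for overdetermined systems with a driver that computes them):

```py
import torch
from torch.autograd import gradcheck

A = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(5, 2, dtype=torch.double, requires_grad=True)

def residuals(A, b):
    # gradients flow through the residuals output of lstsq, per this PR
    return torch.linalg.lstsq(A, b, driver="gelsd").residuals

gradcheck(residuals, (A, b))
```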
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
ea86b8d315 Fix redistribution cost for all-reduce (#148761)
This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce.

Thanks @lw for the helpful discussions on getting this PR out!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin
2025-03-10 12:13:11 +00:00
526524b489 Update slow tests (#148873)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148873
Approved by: https://github.com/pytorchbot
2025-03-10 11:46:30 +00:00
74da76f67c [ROCm][Windows] Fix ROCm/HIP version header (#148560)
On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves this problem by utilising `<hip/hip_version.h>`  from HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560
Approved by: https://github.com/jeffdaily
2025-03-10 11:28:13 +00:00
00199acdb8 [inductor][triton] Block ptr analysis fix assert on matched index expression (#148446)
If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index, which can lead to an assertion failure when the matched index is compared with the original index. For example, the assertion below fails despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR.

```
       torch._inductor.exc.InductorError: AssertionError:
E       Invalid match!
E       Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E       Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```

This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
2025-03-10 05:26:55 +00:00
3680e666d8 Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.

Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do so in case there's something I haven't considered, so this PR does option 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
2025-03-10 03:23:48 +00:00
7ae0ce6360 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same, say `A @ A`. In that case, when writing the stride of node B, we have to figure out whether we want lda or ldb, which point to the same node, and we have no way to differentiate them.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:36 +00:00
b47d81682d [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check that config name filtering still works.
Tested; it works.

2. Do we get C++ compile errors?
Yes; potentially we need to filter them out manually.

Here is what we get:
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:24 +00:00
cyy
aac230a511 [MPS] Fix Wreorder-init-list (#148839)
Fixes the following warning:
```
 warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list]
    662 |       return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839
Approved by: https://github.com/Skylion007
2025-03-09 23:45:46 +00:00
b95889042c [MPS] Introduce strides unary op (#148468)
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra testing is needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`.
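
For illustration, a tiny sketch of the kind of case the strided variant handles (shapes are illustrative):

```python
import torch

# A transposed view is non-contiguous, so it exercises the strided kernels.
x = torch.randn(4, 8, device="mps").t()
y = torch.exp(x)  # previously required gathering into a contiguous intermediate
torch.testing.assert_close(y.cpu(), torch.exp(x.cpu()))
```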
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
275a7c5dbb Revert "Add a stable TORCH_LIBRARY to C shim (#148124)"
This reverts commit 327e07ac1dc3351bb5f0ad436760b83590c400aa.

Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))
2025-03-09 20:44:56 +00:00
19a39a7a06 Revert "[dynamo] allow global import from collections import deque in user code (#148676)"
This reverts commit 685fb377131cc684633dc5471e77038988db53f6.

Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see f1444f006c/1(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))
2025-03-09 20:42:03 +00:00
f1444f006c [caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833)
Latest LLVM introduced two changes related to `Triple` usage that cause build failures when building pytorch.

## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of as a string: 979c275097

## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
2025-03-09 18:58:36 +00:00
9a1a2e1516 Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
a8e3d1984a [Inductor UT][XPU] Skip test case test_cat_max_autotune_triton for known issue. (#148734)
The mm triton template/configs have not been tuned for XPU; we observe that the epilogue fusion cannot provide a speedup on XPU because of register spills. So XPU fails on the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip after #146568 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734
Approved by: https://github.com/jansel
2025-03-09 15:09:43 +00:00
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
327e07ac1d Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32-bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-03-09 10:07:25 +00:00
685fb37713 [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-09 09:35:29 +00:00
6566d67bd3 [dynamo] show stack above dynamo in graph break user tracebacks (#148401)
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).

Also collapses resume frames together (use config.verbose to see full stack trace - for developers).
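
A small sketch of what this affects (the function is illustrative): the data-dependent `if` below causes a graph break, and the reported traceback now points at that line, including the user frames above the compiled frame:

```python
import torch
import torch._dynamo.config

torch._dynamo.config.verbose = True  # developers: show the full, uncollapsed stack

@torch.compile
def f(x):
    if x.sum() > 0:  # data-dependent jump -> graph break reported at this line
        return x + 1
    return x - 1

f(torch.randn(4))
```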

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-03-09 07:37:38 +00:00
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
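
A minimal usage sketch of the two modes described above, assuming a two-GPU NCCL job launched via torchrun:

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

t = torch.ones(1024, device="cuda")
dist.all_reduce(t, async_op=False)  # launched directly on the current stream
y = t * 2  # ordered after the collective on the same stream; no extra sync

work = dist.all_reduce(t, async_op=True)  # async path keeps a ref to t alive
work.wait()  # orders the current stream after the collective

dist.destroy_process_group()
```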

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
85fe576ee3 [set_linter] allow x in {...} (#148422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148422
Approved by: https://github.com/Skylion007
2025-03-09 06:43:11 +00:00
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
5245304f1e Update decompositions_for_jvp.py (#148821)
A small typo that caught my eye.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821
Approved by: https://github.com/Skylion007
2025-03-08 19:08:42 +00:00
148eb735ee Change nvcc arch flags for sm100 (#148774)
### Summary
- Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012

### Test plan
- Verified building from source w/ B200s is successful
- Verified B200 tensorcores are still being utilized properly via benchmarking script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774
Approved by: https://github.com/Skylion007
2025-03-08 19:05:53 +00:00
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation.

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
9841f0ddcf Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566)
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.

It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.

For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`.
Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
2025-03-08 18:00:49 +00:00
439782960c Fix typos in SpectralOps.cpp (#148818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148818
Approved by: https://github.com/Skylion007
2025-03-08 17:34:59 +00:00
eqy
849cc058ee [CUDA][TF32] Account for tf32 in test_efficient_conv_bn_eval (#148802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148802
Approved by: https://github.com/Skylion007
2025-03-08 16:17:04 +00:00
c3b05c4a27 [triton 3.3] support both specialize_impl and create_specialize_impl (#148806)
After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806
Approved by: https://github.com/drisspg
2025-03-08 09:31:52 +00:00
118c9e501a [ONNX] Remove inaccurate test comment (#148813)
Remove the comment that says jit trace strategy doesn't support dynamic shapes as dict because it does support it (which is what the test is testing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148813
Approved by: https://github.com/cyyever, https://github.com/titaiwangms
2025-03-08 08:55:56 +00:00
3745da18f4 [AOTI] Switch to local cpp compile for fbcode (#148592)
Summary: as title; otherwise we cannot find lamdhip64

Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431

Differential Revision: D70637798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592
Approved by: https://github.com/hl475
2025-03-08 08:38:26 +00:00
666508eb17 [aot cache][ca] remove restriction on caching ca's aot inference graph (#148491)
but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148381
2025-03-08 06:08:26 +00:00
c16cd25cf5 [ca] remove compiled_autograd_tracing (#148381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381
Approved by: https://github.com/jansel
2025-03-08 06:08:26 +00:00
5f1c79ba2b [CD] Enable triton xpu windows build (#147637)
Depends on #147727, which introduces triton xpu windows support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147637
Approved by: https://github.com/atalman
2025-03-08 05:28:46 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
75179fd6e6 [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148781)
Differential Revision: D70575053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148781
Approved by: https://github.com/SherlockNoMad
2025-03-08 04:43:32 +00:00
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path does the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid the intermediate down-casting, so this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncation.
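
A reference sketch of the intended cast chain (illustrative eager-mode code, not the actual kernel):

```python
import torch

def rms_norm_ref(x, weight, eps=1e-6):
    # fp16 -> fp32 once, compute everything in fp32, downcast exactly once.
    xf = x.float()
    normed = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps)
    return (normed * weight.float()).to(x.dtype)

x = torch.randn(8, 16, dtype=torch.float16)
w = torch.ones(16, dtype=torch.float16)
print(rms_norm_ref(x, w).dtype)  # torch.float16
```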
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
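
A hypothetical end-to-end sketch combining the new flag with manual CUDA graph capture (the package path and shapes are illustrative):

```python
import torch

loaded = torch._inductor.package.load_package(
    "model.pt2", run_single_threaded=True
)

x = torch.randn(8, 64, device="cuda")
loaded(x)  # warm-up outside the capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = loaded(x)  # the single-threaded runner is safe to capture
g.replay()
```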

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
9f170d9d13 [Triton 3.3] Remove ROCm specific mm gemm template (#148662)
Fixes: https://github.com/pytorch/pytorch/issues/147121
Since triton 3.3.x fixes the problem

Needs to be handled in a non-BC-breaking way, so we will conditionalise this change on the triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-03-08 01:24:40 +00:00
a89e7c2da9 [Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785
Approved by: https://github.com/davidberard98
2025-03-08 01:09:28 +00:00
179b7a0abc Do not crash when compiling quantized LORA models (#148435)
Fixes #148072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel
2025-03-08 00:02:08 +00:00
24085db082 Don't clear feedback_saver_fns after cache clear (#148723)
Summary:
Since feedback_saver_fns are used for logging, I don't think it makes sense to clear them; clearing them resulted in weird behavior in user code where disabling caches caused logging code to break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-03-07 23:43:59 +00:00
d96c85558a [ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627)
Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`.

-	**Prior:** The exporter will produce a **static graph silently** even when dynamic_shapes are provided.
-	**Proposed:** When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed

## Why are we still keeping the JIT strategy?

It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
2025-03-07 23:41:50 +00:00
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which will help with model quantization via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
187d5c0eb1 [logging] Log cudagraphify timings to dynamo_timed (#143220)
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz

`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
2025-03-07 23:07:13 +00:00
f2dfe2d99c [Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619)
Fixes issue https://github.com/pytorch/pytorch/issues/133228

Enabled split_scan support for ROCm builds.

Must be handled in a non-BC-breaking way, so this functionality is enabled conditionally on the triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: David Berard <davidberard98@gmail.com>
2025-03-07 23:06:21 +00:00
0f852641c2 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit d35a4ddae2345e639001bfee58a0932e96597f2d.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))
2025-03-07 22:42:13 +00:00
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.
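
A minimal sketch of the pattern being fixed (whether the epilogue fusion actually fires depends on autotuning choices):

```python
import torch

boundaries = torch.linspace(0, 1, 16, device="cuda")

@torch.compile
def mm_bucketize(a, b):
    # bucketize may be fused as an epilogue of the matmul
    return torch.bucketize(a @ b, boundaries)

a = torch.randn(128, 64, device="cuda")
b = torch.randn(64, 128, device="cuda")
mm_bucketize(a, b)
```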

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions no longer need padding to a power of two.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`
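
For illustration, an SDPA call with one of the newly optimized non-power-of-two head dims (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# Head dim 96 no longer requires padding to a power of two.
q = torch.randn(2, 8, 1024, 96, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, q, q, is_causal=True)
```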

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
7b79e17275 [BE] Move cuda12.6 builds to gcc11 (#148740)
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

Which accidentally fixes undefined symbol reference errors, namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
Which happens because `libmagma.a`, which was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135), contains symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-03-07 21:21:12 +00:00
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
bb94b65da7 Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 2fb654676f6291f6e27c6bab2761f170516598dd.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))
2025-03-07 20:58:28 +00:00
d8dc700e25 Delete duplicate entry from docker-builds.yml (#148782)
Regression introduced by merge conflict of https://github.com/pytorch/pytorch/pull/148612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148782
Approved by: https://github.com/atalman
2025-03-07 20:55:46 +00:00
99da439d10 Revert "Remove Cuda 12.4 from nightly Binaries (#148625)"
This reverts commit 1239176fe717839ca5612ac03a4806051225f381.

Reverted https://github.com/pytorch/pytorch/pull/148625 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/148625#issuecomment-2707415005))
2025-03-07 20:47:45 +00:00
6602e632cd Suppress build warnings when gcc-11 is used (#148763)
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
which suppresses the following (when building against ancient llvm-9):
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
 1581 |     return Insert(new LoadInst(Ty, Ptr), Name);
      |                   ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```

Probably a reasonable followup would be to disable NNC testing altogether, as the project has been in maintenance mode for a while now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
2025-03-07 20:43:35 +00:00
d36391307f [ONNX] Handle error in verification interpreter (#148730)
Use a simple try/catch to handle ONNX Runtime errors in the verification interpreter when they occur. One example is that ORT will sometimes produce a list of None for some nodes. I am not sure how that happens yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
2025-03-07 20:24:49 +00:00
aebd2e411f [pytree][easy] lock global registry containers properly for thread-safety (#148750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750
Approved by: https://github.com/StrongerXi
2025-03-07 20:04:52 +00:00
6b44a91a62 use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557)
We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif
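
A sketch of the difference (the predicate is illustrative): `statically_known_true` returns False when the answer cannot be decided from existing facts, instead of installing a new guard the way a guarding check would:

```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

def can_fold(t: torch.Tensor) -> bool:
    # False on "unknown" rather than guarding on the symbolic shape.
    return statically_known_true(t.shape[0] == 1)
```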

Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557
Approved by: https://github.com/eellison
2025-03-07 19:17:25 +00:00
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
1239176fe7 Remove Cuda 12.4 from nightly Binaries (#148625)
https://github.com/pytorch/pytorch/issues/145570

removes cuda 12.4 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148625
Approved by: https://github.com/atalman
2025-03-07 18:56:04 +00:00
61c4074df7 Add Windows Arm64 Nightly Builds (#139760)
This PR creates 3 new worklflows for Windows Arm64 target. The workflows and outputs can be reviewed at the following links:
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760
Approved by: https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-03-07 18:53:56 +00:00
cyy
e839e4f5bd Fix Wc++98-compat-extra-semi (#148757)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148757
Approved by: https://github.com/Skylion007
2025-03-07 18:49:12 +00:00
0a7ccee1e0 [ROCm][Windows] Disable Composable Kernels and Triton for Windows builds (#147334)
Currently, Composible Kernels and Triton aren't available on Windows. This PR ensures that the files relating to this dependency are not included during the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147334
Approved by: https://github.com/jeffdaily
2025-03-07 18:40:49 +00:00
eqy
18c6e00c7b [CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in ProcessGroupNCCL.cpp (#148594)
Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594
Approved by: https://github.com/kwen2501
2025-03-07 18:39:02 +00:00
9769618d35 [CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock

It seems many inductor ymls were added after the initial change was prepared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
da923afdc7 [MPS][BE] Align bitshift behavior with CPU (#148719)
By casting the argument to output type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719
Approved by: https://github.com/Skylion007
ghstack dependencies: #148685, #148686
2025-03-07 18:28:14 +00:00
f84710aef4 [MPS] Fix scalar to tensors bitshifts (#148686)
By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`

Add regression tests

Please note, that for some undefined values MPS and CPU behaviors are different, for example
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127,  63,  31,  15,   7,   3,   1],
       device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127,  63,  31,  15,   7,   3,   1,   0,   0,   0,   0],
       dtype=torch.uint8)
```
Because on CPU the scalar is cast to the output dtype before the operation is performed, while on MPS this happens after the op is done

Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
2025-03-07 18:28:14 +00:00
cyy
116c1e42c5 Re-enable tests (#148732)
No UBSAN failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148732
Approved by: https://github.com/Skylion007
2025-03-07 18:11:57 +00:00
8059ead823 [ROCm] Incorporate ROCm triton specific tuning parameters (#148437)
Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above.

A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437
Approved by: https://github.com/eellison, https://github.com/jansel
2025-03-07 18:09:47 +00:00
a3b77d434a Subprocess compile (attempt 2) (#148635)
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test based which runs the test_torchinductor tests with subprocess compiling turned on.

Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
2025-03-07 17:50:14 +00:00
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache`, removing pyd files raises a permission error
on Windows because they are still in use by the process.
So we clear the references to the loaded pyd library objects and unload them
from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
d05694807d [XPU][Inductor] Update Intel triton for release 2.7. (#147727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147727
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
ghstack dependencies: #148534, #148538
2025-03-07 17:19:59 +00:00
136b8165d1 [DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577)
Summary: Save Plan Caching: Fix the missing all_plans update in the cache.

Test Plan:
```
buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test
```

https://www.internalfb.com/intern/testinfra/testrun/17451448626323264

Reviewed By: MeetVadakkanchery

Differential Revision: D70229019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577
Approved by: https://github.com/MeetVadakkanchery
2025-03-07 17:00:59 +00:00
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
d54b2b7fa7 [BE] Delete split builds (#148739)
They has been disabled since Oct 2024, perhaps time to remove them from the workflows

See https://github.com/pytorch/pytorch/issues/138750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148739
Approved by: https://github.com/atalman
2025-03-07 15:10:50 +00:00
372ad7b181 Enable FSDP2 on HPU device (#148667)
The motivation of this PR is to enable FSDP2 collectives for HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667
Approved by: https://github.com/wconstab
2025-03-07 14:33:43 +00:00
81847d08cf [Intel GPU][quant] Refine zero-point memory creation (#148640)
# Motivation
This PR skips zero-point GPU memory creation when zero-point=0, as it would not be used by the oneDNN library. This could help save the 1~3 H2D copy overhead per QLinear/QConv kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148640
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-03-07 13:49:19 +00:00
f80aad62fa Improve Pareto frontier plot for AutoAC (#148678)
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range and thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points); I also add the rank to the filename and store the file in a user-defined directory.
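
A minimal sketch of the threshold-discovery idea (not the actual implementation; `f` stands in for the predicted-memory-vs-budget step function):

```python
def find_thresholds(f, lo, hi, eps=1e-3):
    # Assumes f is a step function of the budget on [lo, hi].
    if f(lo) == f(hi):
        return []        # no threshold inside this segment
    if hi - lo <= eps:
        return [hi]      # threshold localized to resolution eps
    mid = (lo + hi) / 2.0
    return find_thresholds(f, lo, mid, eps) + find_thresholds(f, mid, hi, eps)

print(find_thresholds(lambda budget: budget >= 0.37, 0.0, 1.0))  # ~[0.37]
```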

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
2025-03-07 13:22:29 +00:00
d4d7d813fa Update CURL url for manywheel images (#148343)
It looks like it was moved on the site it was downloaded from.
Switch to the official site while updating the URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148343
Approved by: https://github.com/dr4gon01, https://github.com/janeyx99, https://github.com/atalman, https://github.com/seemethere
2025-03-07 11:41:12 +00:00
6cf360be04 fix lost input mutations with export_tracepoint (#148709)
Preserving module call signatures in the presence of input mutation caused incorrect results. The root cause turned out to be that export tracepoints would unwrap/wrap functional args, which lost mutation info on those args.

Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709
Approved by: https://github.com/angelayi
2025-03-07 09:36:18 +00:00
bb84a23c22 [ROCm] [TunableOp] Enable logging of BLAS parameters (#147034)
This PR supports a logging feature that has been requested.
```
PYTORCH_TUNABLEOP_BLAS_LOG=1
```
Enables the logging of BLAS parameters with either offline or online (in-situ) tuning.

The BLAS parameters are written to the CSV file.
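
A usage sketch (the GEMM is illustrative; the variables must be set before the tuned ops run):

```python
import os
os.environ["PYTORCH_TUNABLEOP_ENABLE"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_BLAS_LOG"] = "1"  # log BLAS params to the CSV

import torch
a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 512, device="cuda", dtype=torch.float16)
c = a @ b  # the tuned GEMM's BLAS parameters are recorded in the tuning CSV
```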

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034
Approved by: https://github.com/jeffdaily
2025-03-07 09:32:59 +00:00
243b47e2ec [Intel GPU] Fix SDPA dummy LSE output to match meta function (#148652)
To fix XPU patched UTs including
```bash
pytest -vs third_party/torch-xpu-ops/test/xpu/test_meta_xpu.py::TestMetaXPU::test_dispatch_symbolic_meta_outplace_nn_functional_scaled_dot_product_attention_xpu_bfloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148652
Approved by: https://github.com/EikanWang
2025-03-07 08:36:18 +00:00
416ea1c71c Code Clean: Remove unnecessary code (#148735)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735
Approved by: https://github.com/jingsh, https://github.com/cyyever
2025-03-07 08:15:37 +00:00
4075646bd8 Use oneDNN v3.7.1 for Intel GPU (#148403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403
Approved by: https://github.com/EikanWang

Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-03-07 08:03:49 +00:00
cyy
3d854ea9bd Remove deprecated std::aligned_storage_t (#148660)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148660
Approved by: https://github.com/swolchok
2025-03-07 07:29:42 +00:00
3f069e7679 [mm_logs] enhance the printing for overview info (#148716)
Summary:
Previously, the dynamo counters did not print the count information automatically.

Explicitly added a log message, printed after lowering, with overview info for inductor aten mms.

It will look like the following:

The name is in the format `{aten_op_name}_{m}_{n}_{k}`:
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```


Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D70739912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
2025-03-07 05:23:49 +00:00
5f392ae560 Throws error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
2025-03-07 05:22:03 +00:00
c0f1557285 [FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715)
We have been asked a few times about FSDP2's equivalent of no_sync. Highlight
set_requires_gradient_sync as the equivalent in the docstring.
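
A minimal sketch of gradient accumulation with FSDP2, assuming `model` was wrapped with `fully_shard` and `micro_batches` is a list of inputs:

```python
model.set_requires_gradient_sync(False)    # like no_sync(): skip reduce-scatter
for batch in micro_batches[:-1]:
    model(batch).sum().backward()          # gradients accumulate locally
model.set_requires_gradient_sync(True)
model(micro_batches[-1]).sum().backward()  # sync on the last micro-batch
```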

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715
Approved by: https://github.com/mori360
2025-03-07 04:34:46 +00:00
fe4b88f6aa [HPU] Add hpu to fused kernels supported devices (#148666)
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with the HPU backend.

Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:

RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
2025-03-07 04:28:33 +00:00
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
cdb4fd0d29 Update win-vs2022-cuda12.1-py3 -> win-vs2022-cuda12.6-py3 (#148717)
Should have been migrated long ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148717
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-03-07 03:21:29 +00:00
389b496062 [XPU] Add test/kernel.errors.txt to .gitignore. (#148538)
Intel GPU user mode driver may generate kernel.errors.txt files in the
current working directory in certain scenarios. The file includes diagnostic
information but does not necessarily indicate an issue with the
application. This is a known issue and will be fixed in a newer version of the driver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #148534
2025-03-07 03:12:50 +00:00
9c9b05bc4f Expose functions used in custom backend in torch_python dll (#148213)
Fixes #148208. There are solutions for exposing symbols used implicitly by inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but would slow down dispatching a lot --- drop the inline keyword and move the implementation to a .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
2025-03-07 02:34:37 +00:00
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch module buffers will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced.

In addition to skipping constants, we also need to skip buffers that are not found in the user-provided delta weights list.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
00cd6c07b9 [Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522)
# Motivation&Details
This PR fixes a bug that previously blocked quantized grouped convolution. The bug was caused by the fact that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT.

# UT
` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`

# Runtime exemplification
```
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785
```
The verbose output shows that we successfully hit the quantized convolution, where the weight is in `abcde` format (grouped conv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
2025-03-07 01:57:45 +00:00
127bd5a02d Add sparsity (#148513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513
Approved by: https://github.com/danielvegamyhre
2025-03-07 01:47:52 +00:00
b4430c3a6d [Intel GPU][pt2e]: Collapse 3D input to 2D for matmul in qlinear_pointwise_binary fusion (#148423)
# Motivation
During the `qlinear_pointwise_binary` lowering pass, dim collapsing only occurs when the post-op is `add`. It is the responsibility of the C++ kernels to handle dimensions for the post-op `sum`.

# Details
This PR explicitly reshapes the input from 3D to 2D in the op `qlinear_pointwise_binary`. Besides, we refactor the implementation of `qlinear_pointwise_binary.tensor` to call `qlinear_pointwise_binary`, removing duplicated code.

# UT testing
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-03-07 01:47:33 +00:00
c8cd8f68bd [dynamo] Properly account for non-list instances in list comparison (#148470)
As title; this patch also removes an unused `list_compare` method.

Fixes #148179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470
Approved by: https://github.com/anijain2305
2025-03-07 01:29:30 +00:00
a7fe685be8 Add cpp wrapper skip to cudagraph logs (#148700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700
Approved by: https://github.com/jbschlosser
2025-03-07 01:02:40 +00:00
e3087f6d76 [ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706)
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.

I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
2025-03-07 00:40:54 +00:00
cyy
50eb4f3990 Enable UBSAN test (#147511)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147511
Approved by: https://github.com/colesbury
2025-03-07 00:35:32 +00:00
33a285379a [codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
2025-03-07 00:33:39 +00:00
a0bc6d81bb [CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 00:15:04 +00:00
e2a0296e80 [dtensor] add CuDNN SDPA op support to DTensor (#148537)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537
Approved by: https://github.com/drisspg, https://github.com/fegin
2025-03-06 23:44:40 +00:00
3960f97832 Documents torch.cuda.MemPool API (#148374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-03-06 23:18:43 +00:00
ed9c8a5d13 ROCm: Disable torch check for Multiplication of two Float8_e5m2 matrices (#148228)
ROCm supports multiplication of two Float8_e5m2 matrices, hence disabling the torch check for ROCm.
Test command (on ROCm hardware supporting fp8):
```
python test/test_matmul_cuda.py TestFP8MatmulCudaCUDA.test_float8_basics_cuda -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148228
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-03-06 22:12:45 +00:00
e6800bda7f [Test][Linalg][CUDA] Increase niter in test_svd_lowrank_cuda_float64 (#145930)
A recent PR #143049 attempted to increase tolerances to make the test pass. However, we are still seeing errors like:
```
Traceback (most recent call last):
  File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
    run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
  File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
    self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
  File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
2025-03-06 22:10:53 +00:00
75d29443e7 [Docs] update bucketize documentation (#148400)
Fixes #144504

Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation of the `right` kwarg, whereas the existing table is much clearer; both modes are illustrated below.
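
For illustration, a small example of the two `right` modes following that table (expected values per the documented semantics):

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
v = torch.tensor([3, 6, 9])

# right=False: returns i such that boundaries[i-1] < v <= boundaries[i]
print(torch.bucketize(v, boundaries))              # tensor([1, 3, 4])
# right=True:  returns i such that boundaries[i-1] <= v < boundaries[i]
print(torch.bucketize(v, boundaries, right=True))  # tensor([2, 3, 5])
```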
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400
Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD
2025-03-06 22:07:52 +00:00
2fb654676f [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same, say `A @ A`. In that case, when writing the stride of node B, we have to figure out whether we want lda or ldb, which point to the same node, and we have no way to differentiate which one we mean.

# Solution
Just use either one, since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.
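
A minimal sketch of the scenario (illustrative only; the guard behavior on the second call is per the answer above):

```python
import torch

@torch.compile(mode="max-autotune")
def mm(a, b):
    return a @ b

a = torch.randn(64, 64, device="cuda")
mm(a, a)  # both operands are the same node, so lda and ldb coincide

b = torch.randn(64, 64, device="cuda")
mm(a, b)  # different aliasing than what was compiled; guards catch this
```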

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:26 +00:00
d35a4ddae2 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check whether config name filtering still works.
Tested; it works.

2. Do we get a C++ compile error?
Yes; potentially we need to filter these out manually.

Here we get this:
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:19 +00:00
5a5ac98918 [aarch64] add libcufile for cu126 and cu128 (#148465)
Seeing the following with the ARM cu128 nightly:
```
  File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
```
Related to https://github.com/pytorch/pytorch/pull/148137; we need to copy the dependency for the ARM build as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148465
Approved by: https://github.com/atalman, https://github.com/abhilash1910
2025-03-06 21:39:43 +00:00
3d62e81a1e [DCP] fix dcp gather_object/scatter_object_list (#147675)
The `dst` of gather_object/scatter_object_list is the destination rank on the global process group (regardless of the group argument).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675
Approved by: https://github.com/MeetVadakkanchery
2025-03-06 21:20:38 +00:00
1d7fc0c681 [dynamo] Remove dead code path around functools.partial objects (#148683)
This removes the code paths added in #98120, which have since been
superseded by #108846.

More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016)
easier to reason about, i.e., no need to worry about `dict` types, which
was only needed for #98120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683
Approved by: https://github.com/yanboliang
2025-03-06 21:20:04 +00:00
262411e48b [inductor] online softmax (#127011)
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute the amax of each row
- compute (x - amax).exp().sum() for each row

When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.

Online softmax uses a customized reduction to compute the max and sum at the same time by accessing the data in one pass. Check this paper for more details (https://arxiv.org/abs/1805.02867).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2 ms here; we save about 1.84 ms in practice.
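
For reference, a minimal Python sketch of the online update rule from the paper (not the Triton code Inductor generates; see the gist above for that):

```python
import torch

def online_softmax_row(x: torch.Tensor) -> torch.Tensor:
    m = torch.tensor(float("-inf"))  # running max
    s = torch.tensor(0.0)            # running sum of exp(x - m)
    for xi in x:                     # single streaming pass over the row
        m_new = torch.maximum(m, xi)
        # rescale the running sum whenever the running max changes
        s = s * torch.exp(m - m_new) + torch.exp(xi - m_new)
        m = m_new
    return torch.exp(x - m) / s      # one more pass only to write outputs

print(online_softmax_row(torch.randn(8)))
```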

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af406239881acfeb4f9e4f62373faca8b.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
1add61c242 Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069)
Fixes #147913

- replace `unimplemented` in `codegen.py`
- remove unused import `unimplemented`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-03-06 20:42:37 +00:00
edd640a95a [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190)
Often makes the code more readable, more efficient, and adds support for infinite iterables.
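
A quick illustration of the difference (standard library behavior, nothing PyTorch-specific):

```python
import itertools

nested = [[1, 2], [3], [4, 5]]
print(list(itertools.chain.from_iterable(nested)))  # [1, 2, 3, 4, 5]

# Unlike chain(*nested), from_iterable consumes the outer iterable lazily,
# so it also works when the outer iterable is infinite:
stream = itertools.chain.from_iterable(itertools.repeat([0, 1]))
print(list(itertools.islice(stream, 6)))  # [0, 1, 0, 1, 0, 1]
```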

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190
Approved by: https://github.com/jansel, https://github.com/malfet
2025-03-06 20:37:06 +00:00
65dbc3b454 [BE][MPS] Remove redundant handle_tensor_scalar_binary_op (#148685)
After https://github.com/pytorch/pytorch/pull/143934 `mtl_setBuffer` can handle scalar tensors correctly, so no need to have a specialized function here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148685
Approved by: https://github.com/dcci
2025-03-06 19:24:46 +00:00
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient computation condition to match [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable CUDA for test `test_hardswish_grad_corner`
- Add a test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class.

Resolves #75982. New tests are included in this PR; a small usage sketch follows below.

- #75982
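
A small sketch of the new checks, assuming the helpers are exposed on `torch.utils._pytree` as this PR describes:

```python
import collections
import time
import torch.utils._pytree as pytree

Point = collections.namedtuple("Point", ["x", "y"])

class NamedPoint(Point):  # a subclass of a namedtuple class
    pass

# After this PR, subclasses of namedtuple classes also count as namedtuples:
print(pytree.is_namedtuple(NamedPoint(1, 2)))       # True

# time.struct_time is a PyStructSequence type:
print(pytree.is_structseq(time.localtime()))        # True
print(pytree.is_structseq_class(time.struct_time))  # True
```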

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
b85ae06bed Update CPU tolerance for f16 triplet margin loss (#147742)
Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64; this PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this, as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 18:09:43 +00:00
d10bacd4ce [AOTI][dashboard] Skip torchbench models not supported by export (#148359)
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi, https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
d91a634edf [c10d] Make getDefaultBackend more fault tolerant (#148596)
This is a forward fix for #135338.
It hits an error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
    if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```

When users call `init_process_group()` without specifying a backend, the default backend is not set, or is set to `undefined`; hence the error above, which is triggered by the `_has_hooks()` call.

The fix wraps `getDefaultBackend` with a try-catch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
2025-03-06 18:07:43 +00:00
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which will help with quantizing models via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
097b0d372a [pytree] fix previously failed dynamo tests (#148669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148669
Approved by: https://github.com/zou3519
2025-03-06 17:59:29 +00:00
28b68b46bc Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 4aeca28137dcee74b5fcd0c0636d0ee1f113d5fb.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))
2025-03-06 17:45:49 +00:00
3cde4c3069 [BE] Remove onlyCPU decorator from test_local_scalar_dense (#148559)
Follow-up from https://github.com/pytorch/pytorch/pull/145717; it is not clear why those tests should be limited to one architecture.
And fixed similar crashes for CUDA and MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148559
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/seemethere
2025-03-06 17:43:02 +00:00
841451af9f Revert "[Inductor] Avoid tensor slice overflow for large step (#147433)"
This reverts commit 1d7397a2d04a4d636559f41511a20f7dadbe5777.

Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))
2025-03-06 17:33:08 +00:00
679e7d257e [mm_logs] follow up to add count info based on shape for inductor aten.mms (#148623)
Summary:
as title.

When `TORCH_LOGS="+inductor"` is enabled, you can get logs at the end such as:

```
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), (('aten_addmm', (16, 6, 16)), 1), ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []
```

Note the new shape-keyed entry `(('aten_addmm', (16, 6, 16)), 1)`.

Test Plan: follow up to add proper logging test.

Differential Revision: D70665104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
2025-03-06 16:20:04 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for the majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
5fb0f45d3b [triton 3.3] test_triton_kernel_constants fix (#148626)
Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505

TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead.

This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass).
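
Roughly, the change in the tests looks like this (illustrative sketch):

```python
import triton.language as tl

# Old style (deprecated in Triton 3.3): the annotation alone marks the value
CONSTANT: tl.constexpr = 3.14

# New style: wrap the value in tl.constexpr(); keeping the annotation
# means the same code still passes on older Triton versions
CONSTANT: tl.constexpr = tl.constexpr(3.14)
```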

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626
Approved by: https://github.com/FindHao
2025-03-06 14:18:21 +00:00
d5184901c4 Make torch.serialization.skip_data work with torch.load (#148018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787, #147788
2025-03-06 12:04:46 +00:00
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
209977e6e5 Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787
Approved by: https://github.com/albanD
ghstack dependencies: #147786
2025-03-06 12:04:39 +00:00
bdcc1b579b Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786)
This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786
Approved by: https://github.com/albanD
2025-03-06 12:04:32 +00:00
79aa17489c [dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570
Approved by: https://github.com/williamwen42
ghstack dependencies: #148454
2025-03-06 07:46:31 +00:00
e7bc1d1791 [ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617)
Prior to this PR, if the export passes ran run_decomposition(), the report still showed the exported program before decomposition, which made it harder for users to check the exported program actually used to translate to the ONNX graph.

The following example is what we see before this PR:

```
# PyTorch ONNX Conversion Report

Obtain model graph with `torch.export.export(..., strict=False)`
Obtain model graph with `torch.export.export(..., strict=True)`
Obtain model graph with `torch.jit.trace`
Decompose operators for ONNX compatibility
Translate the graph into ONNX
Run `onnx.checker` on the ONNX model
Execute the model with ONNX Runtime
Validate model output accuracy
```

## Error messages

```pytb

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
    _handle_call_function_node_with_lowering(

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
    raise _errors.DispatchError(

torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
    onnx_program = _exported_program_to_onnx_program(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
    values = _translate_fx_graph(
             ^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
    raise _errors.ConversionError(

torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.

```

## Exported program

```python
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[3, 4]"):
             # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
            to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64);  x = None

             # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
            slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807);  to = None
            slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2);  slice_1 = None
            return (slice_2,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}

```

## Analysis

PyTorch ONNX Conversion Analysis

## Model Information

The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```

Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`

Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`

The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1

Of the call_function nodes, the counts of operators used are:

- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1

## ONNX Conversion Information

The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.

Errors grouped by operator:

- `aten.to.dtype`:     No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`:     No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`

## Decomposition comparison

Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`

Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
2025-03-06 07:03:45 +00:00
ae6bb58483 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit ad49cfc9f0a8a4d8881b3734edd8c33a087c8b97.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](ad49cfc9f0) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))
2025-03-06 06:59:39 +00:00
4dc956a1d8 [Inductor][Triton] Fix test_autotune_inplace_kernel to work with newer Triton version (#148595)
For the new Triton version 3.3, constexprs are included as part of the signature. Update the failing test to reflect this change; additional context in https://github.com/pytorch/pytorch/pull/145051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148595
Approved by: https://github.com/davidberard98
2025-03-06 05:37:08 +00:00
1fac47702e [Break XPU][Inductor UT] Generalize device-bias code introduced by #146866. (#148534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534
Approved by: https://github.com/nandesuka
2025-03-06 04:39:50 +00:00
f057206fca [ONNX] Support complex comparison when verify=True (#148619)
Previously, the comparison of complex numbers was not supported when `verify=True`.

NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
2025-03-06 04:38:43 +00:00
8b65d522e1 refactor delayed compile to use code context (#148530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530
Approved by: https://github.com/williamwen42
ghstack dependencies: #148509
2025-03-06 04:02:30 +00:00
ad49cfc9f0 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check whether config name filtering still works.
Tested; it works.

2. Do we get a C++ compile error?
Yes; potentially we need to filter these out manually.

Here we get this:
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 03:42:55 +00:00
02e1580e39 [MPS] fix crash for mse loss with 0 numel inputs (#148608)
Fixes #148589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608
Approved by: https://github.com/malfet
2025-03-06 03:32:34 +00:00
8728d4b815 Clear triton kernels after parent make_launcher (#148604)
Before, we were clearing the cache only after the inductor compile. But inductor may not **always** compile, e.g. on an AOTAutogradCache hit.

So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856.

Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604
Approved by: https://github.com/masnesral
2025-03-06 03:28:38 +00:00
cyy
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
43e1284c96 Fix empty matrix handling of addmv in inductor (#143792)
This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience caused. Below are the details of this PR.

Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce:

```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])
origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)  # tensor([2.])
print(optimized_out)  # tensor([])
```

According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me.
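
For reference, the documented formula (with β and α defaulting to 1):

```latex
\mathrm{out} = \beta \cdot \mathrm{input} + \alpha \cdot (\mathrm{mat} \mathbin{@} \mathrm{vec})
```

When `mat @ vec` is empty, the α term contributes nothing, so `out` reduces to β·input, i.e. `[2.]`.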

Following the CPU implementation of this API: e97b97af56/aten/src/ATen/native/Blas.cpp (L62)

I added an additional branch to handle the empty-matrix case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 02:09:27 +00:00
38b3375a81 [MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565)
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.

Test Plan: CI

Reviewed By: mortzur

Differential Revision: D70072064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
2025-03-06 02:07:18 +00:00
32715a2311 [inductor][ck] add kBatch_sweep to config.rocm (#148223)
Summary:
# Why

Enable testing and users to specify a set of kBatches to try, rather than relying on our hand-written heuristic.

# What

Add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime.

Test Plan: n/a

Reviewed By: chenyang78

Differential Revision: D70226055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
2025-03-06 01:14:33 +00:00
63fbc738dc [Easy/Profiler] Add last entry to truncated values (#148576)
Summary: Since the ranks of a PG are usually in a consecutive range, it is useful to print the last values when truncating metadata.
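
A hypothetical illustration of the idea (the function name and truncation length below are made up):

```python
def format_ranks(ranks, max_shown=2):
    # When truncating, also keep the last entry so a consecutive range
    # like 0..7 still reveals where it ends.
    if len(ranks) <= max_shown:
        return ", ".join(map(str, ranks))
    head = ", ".join(map(str, ranks[:max_shown]))
    return f"{head}, ..., {ranks[-1]}"

print(format_ranks(list(range(8))))  # "0, 1, ..., 7"
```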

Test Plan:
Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D70637461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576
Approved by: https://github.com/davidberard98
2025-03-06 01:14:15 +00:00
23441492f6 [scan] Refactoring of input checking and dynamo invocation (#142125)
This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125
Approved by: https://github.com/ydwu4
2025-03-06 01:06:54 +00:00
6cc3e69103 [inductor] use eager stride for custom op if no tags (#148367)
Fix https://github.com/pytorch/pytorch/issues/148356

This is a short-term fix to restore the default behavior of applying layout constraints to custom ops when there are no tags.

A longer-term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-03-06 00:58:00 +00:00
703176e538 [ROCm] Fix sort for non-standard bool (#147459)
When converting from uint8 to bool using the `view` op, we get a bool that stores 0 for false and a non-zero value for true. However, such bools have undefined behavior: we only read the last bit as 0 or 1 to convert to false or true.

In this fix, we convert bools to uint8, which converts false to 0 and any non-zero value to 1, essentially converting a non-standard bool to a standard bool and fixing the sort op for non-standard bools.

Fixes #139972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-06 00:23:02 +00:00
690fc2c876 Add aot_eager_then_compile stance (#148509)
Sometimes the `eager_then_compile` stance isn't enough: some models are so close to the memory limit that going to eager will OOM, since we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile`, which avoids the expensive inductor compile but still runs aot_eager to get the benefits of memory reduction on the first invocation.
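
A usage sketch, assuming the new stance is selected the same way as the existing stances via `torch.compiler.set_stance`:

```python
import torch

torch.compiler.set_stance("aot_eager_then_compile")

@torch.compile
def f(x):
    return x.relu().sum()

x = torch.randn(8)
f(x)  # early invocation(s) run via aot_eager: memory savings, no inductor cost
f(x)  # subsequent invocations use the fully compiled artifact
```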

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509
Approved by: https://github.com/williamwen42
2025-03-05 23:23:45 +00:00
d6d670ab4d [AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587)
In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits).

Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587
Approved by: https://github.com/desertfire
2025-03-05 22:47:46 +00:00
897fd9b514 Revert "Subprocess compile (#146134)"
This reverts commit 07f876e9602ec6881df2360ab4817e129b563b7c.

Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see e1dee4ccb3/3 ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))
2025-03-05 22:41:19 +00:00
e1dee4ccb3 [ONNX] Assert capture strategy in tests (#148348)
Previously, the strategy used for obtaining the exported program was not asserted. This led to silent errors if torch.export broke something and a fallback strategy was used. This change adds a _capture_strategy field to ONNXProgram and enables unit tests to assert the strategy used, preventing fallbacks from happening silently.

Fixes #147674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 22:31:54 +00:00
5ccd659c0e Fix decomp for linspace (#147997)
In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization.

Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997
Approved by: https://github.com/zou3519
2025-03-05 22:10:08 +00:00
9e755a1c03 [ROCm] add gfx12 to nightly wheels (#148562)
Adds gfx1200 and gfx1201 to PYTORCH_ROCM_ARCH for wheels and libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148562
Approved by: https://github.com/jeffdaily
2025-03-05 21:56:22 +00:00
2a639ce1d7 Add new hf storage class to torch.distributed package (#148361)
Summary:
title - Add the new HF storage class to the torch.distributed package so that it can be imported by customers.
The HF storage reader/writer were added as DCP storage components so that DCP load and save can directly interact with the Hugging Face format and storage.

Test Plan: ensure signals pass

Differential Revision: D70495399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
2025-03-05 21:52:06 +00:00
10354e146f Re-enable test_torchinductor:test_buffer_batch_norm (#148573)
Summary: Per https://github.com/pytorch/pytorch/issues/128198, it seems like this is working now.
Fixes #128198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148573
Approved by: https://github.com/StrongerXi
2025-03-05 21:51:24 +00:00
87bd3471ff [c10d] Move record param for init to the right place (#148571)
The place where we record the init log does not look correct. We move it to the beginning of comm init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148571
Approved by: https://github.com/kwen2501
2025-03-05 21:43:30 +00:00
ad9a10aff0 [dynamo] Make nonstrict_trace work with some pytree.register_constant-ed instances (#148007)
As title, this enables a `nonstrict_trace`-ed function to take in an object
whose type has been `pytree.register_constant`-ed, as long as the object
existed outside the `torch.compile` region. This also forces Dynamo to
emit an `EQUALS_MATCH` guard on the object.
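
A sketch of the enabled pattern; the module paths for `nonstrict_trace` and `register_constant` below are assumptions based on the names in this commit:

```python
import torch
import torch.utils._pytree as pytree

class ScaleConfig:  # hypothetical user-defined type
    def __init__(self, scale):
        self.scale = scale

pytree.register_constant(ScaleConfig)

cfg = ScaleConfig(2.0)  # created outside the torch.compile region

@torch._dynamo.nonstrict_trace
def scaled(x, cfg):
    return x * cfg.scale

@torch.compile(fullgraph=True)
def f(x):
    return scaled(x, cfg)  # cfg gets an EQUALS_MATCH guard

print(f(torch.ones(2)))
```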

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007
Approved by: https://github.com/zou3519
ghstack dependencies: #148385
2025-03-05 21:28:26 +00:00
a10f577ee0 [dynamo] Account for function id reuse in relevant Dynamo decorators (#148385)
This fixes a recent series of flaky failures from `nonstrict_trace` unit
tests: #148166, #148056, #148055, #148054, #148034, #148033, #148032, #148031.

For now we don't need to worry about the other decorators because they
are either meant for builtin/numpy functions (which should never
deallocate in practice), or used for polyfills which keeps the function
object in `get_torch_obj_rule_map()`.

Fixes #147777.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148385
Approved by: https://github.com/zou3519
2025-03-05 21:28:26 +00:00
4aeca28137 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same, say `A @ A`. In that case, when writing the stride of node B, we have to figure out whether we want lda or ldb, which point to the same node, and we have no way to differentiate which one we mean.

# Solution
Just use either one, since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-05 21:26:22 +00:00
ed9624ee60 [export] Fix AttrProxy slicing (#148507)
Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1159599265750221/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148507
Approved by: https://github.com/zhxchen17
2025-03-05 21:03:15 +00:00
dd6ec8706e [BE] Relax sympy dependency to 1.13.3 or newer (#148575)
Fixes https://github.com/pytorch/pytorch/issues/145225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148575
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-03-05 20:51:16 +00:00
9efa9c73f6 [Dyamo] Replace unimplemented with unimplemented_v2 for variables/distributed (#148500)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148500
Approved by: https://github.com/williamwen42
2025-03-05 20:41:43 +00:00
98458e5c81 Add a docstring to build.sh (#144566)
Add a little blurb to explain what build.sh is doing.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-03-05 15:26:37 -05:00
c6a05df174 [ONNX] Use onnxscript apis for 2.7 (#148453)
Use onnxscript apis for 2.7.

Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 20:10:00 +00:00
c9edd37ffb Revert "[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)"
This reverts commit 9eef457c0241f87097a2ca7625f9961e31f3adcd.

Reverted https://github.com/pytorch/pytorch/pull/148377 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/13683650448/job/38261818684) [HUD commit link](9eef457c02) probably landrace ([comment](https://github.com/pytorch/pytorch/pull/148377#issuecomment-2701903810))
2025-03-05 19:45:16 +00:00
c5d92edd5a [dynamo] WeakRefVar reconstruct (#148083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148083
Approved by: https://github.com/anijain2305
2025-03-05 19:34:17 +00:00
50e827b3df [ONNX] Create VerificationInterpreter (#148396)
An fx interpreter for comparing ONNX values with pytorch ones.

```py
import torch
from torch.onnx._internal.exporter._verification import VerificationInterpreter

class Model(torch.nn.Module):
    def forward(self, query, key, value):
        res = torch.nn.functional.scaled_dot_product_attention(
            query, key, value
        )
        rest = res.transpose(0, 1)
        return rest.view(8, 32, 128 * 64)

model = Model()

query = torch.rand(32, 8, 128, 64, dtype=torch.float16)
key = torch.rand(32, 8, 128, 64, dtype=torch.float16)
value = torch.rand(32, 8, 128, 64, dtype=torch.float16)

onnx_program = torch.onnx.export(model, (query, key, value), dynamo=True)
interpreter = VerificationInterpreter(onnx_program)
interpreter.run(query, key, value)
for info in interpreter.verification_infos:
    print(info)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148396
Approved by: https://github.com/titaiwangms
2025-03-05 19:18:52 +00:00
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
9eef457c02 [dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with unit test. This should allow Context Parallel and Tensor Parallel to use cudnn SDPA.

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
2025-03-05 19:09:52 +00:00
9dd46a9233 Deprecate sm70 for cuda 12.8 binary (#147607)
follow up for https://github.com/pytorch/pytorch/pull/146265/files, dropping sm_70 as well, since "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release."

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147607
Approved by: https://github.com/atalman
2025-03-05 18:54:17 +00:00
3f4311d589 [CD] Upgrade xpu runtime pypi packages version and enable windows kineto again (#148319)
Fixes https://github.com/pytorch/pytorch/issues/145155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148319
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-03-05 18:39:55 +00:00
9db9593bba Add some more meta kernels (#147862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862
Approved by: https://github.com/zou3519
2025-03-05 18:33:00 +00:00
e555c4d8ae Fix bug in AOTI lowering (#148364)
Fixes: https://github.com/pytorch/pytorch/issues/148370

Differential Revision: [D70514480](https://our.internmc.facebook.com/intern/diff/D70514480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148364
Approved by: https://github.com/desertfire
2025-03-05 18:27:15 +00:00
38479e495e Add note to get start xpu (#148168)
Installing PyTorch from binaries will automatically install the runtime packages of Intel® Deep Learning Essentials. In this case, if we activate oneAPI from a standalone installation of Intel® Deep Learning Essentials, there will be an environment issue. Therefore, add a note to remind users to avoid this situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148168
Approved by: https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-05 18:11:14 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
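
A usage sketch; the accessor name below is an assumption chosen for symmetry with `torch.cuda.memory_stats`, not a confirmed API:

```python
import torch

x = torch.empty(1 << 20, pin_memory=True)  # served by CachingHostAllocator

stats = torch.cuda.host_memory_stats()  # hypothetical accessor
for key, value in sorted(stats.items()):
    print(key, value)
```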

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
70c5edb697 [ROCm] fix CK compile for gfx1200 (#148496)
gfx1200 causes the CK-based GEMM to fail to compile because CK chooses an incorrect FP8 interpretation. CK assumes the FP8 interpretation is static and chosen prior to compilation. This PR is a workaround that makes the selection dynamic during hipclang compilation passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496
Approved by: https://github.com/jeffdaily
2025-03-05 16:11:03 +00:00
864b75dd50 [MPS] Fix unary_kernel_strided logic (#148512)
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[  0.0000,   1.4142,   2.0000,   2.4495],
        [ 80.0000,  82.0000,  84.0000,  86.0000],
        [ 96.0000,  98.0000, 100.0000, 102.0000],
        [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
        [4.0000, 4.2426, 4.4721, 4.6904],
        [5.6569, 5.8310, 6.0000, 6.1644],
        [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
One cannot avoid copies just because the input and output tensors have the same strides; one also needs to make sure that they are dense in storage (a transposed tensor would be dense, but, say, selecting every odd or even column wouldn't be)

Add regression test to prevent those from happening again

Also, no need to check that sizes match, luckily it is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)

Revived the needs_copy logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
2025-03-05 15:57:54 +00:00
8274da9312 [c10d][PGNCCL] Fix capturability of isend and irecv (#148462)
This PR fixes the inability to capture `isend`/`irecv` ops in async mode.

<details>
<summary>The repro code</summary>

```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()

    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
    print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>

It fails with an error stating that the work handle is `None`:
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/repro.py", line 54, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/repro.py", line 51, in main
[rank1]:     run(local_rank)
[rank1]:   File "/workspace/repro.py", line 38, in run
[rank1]:     y = test_func(x, rank)
[rank1]:         ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/repro.py", line 22, in test_func
[rank1]:     a.wait()
[rank1]:     ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
2025-03-05 15:49:53 +00:00
19a6cf35f6 add input shape check for _local_scalar_dense (#145717)
Fix https://github.com/pytorch/pytorch/issues/145066.
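
A minimal illustration of the guarded path (`.item()` lowers to `_local_scalar_dense`):

```python
import torch

t = torch.ones(2)
try:
    t.item()  # _local_scalar_dense requires a single-element tensor
except RuntimeError as e:
    print(e)  # e.g. "a Tensor with 2 elements cannot be converted to Scalar"
```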

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145717
Approved by: https://github.com/malfet
2025-03-05 15:24:08 +00:00
96afa8a2bb [TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318)
With the introduction of new CUDA versions, the branching has become more complicated. This PR simplifies the branching in `test_cusparselt_backend` to avoid checking each and every CUDA version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318
Approved by: https://github.com/jcaip
2025-03-05 10:17:00 +00:00
0ef2e938d0 [ROCm] [TunableOp] Track top solutions during tuning process (#147243)
For each set of GEMM parameters that is evaluated by TunableOp, keep track of the top 5 solutions. Print the top 5 solutions when `PYTORCH_TUNABLEOP_VERBOSE=2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147243
Approved by: https://github.com/jeffdaily
2025-03-05 09:35:02 +00:00
6c3492b491 [ROCm] Enable mi300-specific workflows to be triggered on PRs (#147904)
This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label.

* inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300.
* rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)) on many PRs.
* inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904
Approved by: https://github.com/huydhn
2025-03-05 06:00:37 +00:00
2295efa1b3 Fix only logging ir_post_fusion with torch_compile_debug enabled (#148499)
Because we were invoking the logs through `V.debug`, they were not emitted if TORCH_COMPILE_DEBUG was not set. This is because there is some magic in the debug [getattr](d789c22712/torch/_inductor/debug.py (L468-L480)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148499
Approved by: https://github.com/shunting314
2025-03-05 05:35:09 +00:00
fb1b7ec173 Remove deprecated method and attribute in LRScheduler (#147301)
Following the [#99270 suggestion](https://github.com/pytorch/pytorch/issues/99270#issuecomment-1511656408), remove the deprecated method `LRScheduler.print_lr`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147301
Approved by: https://github.com/janeyx99
2025-03-05 05:30:19 +00:00
df7e43e5d4 [AOTI] Fix aot_inductor_package test errors (#148279)
Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory.

Differential Revision: D70454149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279
Approved by: https://github.com/zoranzhao
2025-03-05 05:22:48 +00:00
b020d166f2 Stage 1 of deprecating silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-03-05 05:15:59 +00:00
913356fb41 Fix recent regression in evaluate_expr that affects cache lookups (#147836)
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we initially thought was due to calling id on the symnode.

I dug deeper and found that although adding that argument does not affect the results of evaluate_expr, it messes up cache lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup, and I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
ed8ec0cb98 [cutlass backend][BE] Fix two small things in cutlass backend standalone debugger (#148493)
Differential Revision: [D70583777](https://our.internmc.facebook.com/intern/diff/D70583777/)

Two really small things:
* The bits in BlockFillRandomUniform would round floats to ints
* when bias exists, the order of args is C, A, B, D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148493
Approved by: https://github.com/chenyang78
2025-03-05 04:01:36 +00:00
e0ea593974 [CD] Upgrade Windows xpu support package to 2025.0.1 for binary compression (#148313)
The binary compression feature can reduce the size of the Torch XPU Windows wheel packages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148313
Approved by: https://github.com/atalman
2025-03-05 03:00:27 +00:00
1673bc7610 [mm_logs][ez] dump tuned mm info at lowering stage (#148363)
Summary:
As title. It would be beneficial for judging e2e perf improvements.

Easy first step to dump mm info at lowering stage.

e.g.

```
fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1])
```

Next step:

Dump overview info at `post_grad_graph` stage such as
overall count of `aten.mm` in the graph & visualize to a table structure.

Test Plan: by looking very hard in aot inductor bmm and mm UTs.

Differential Revision: D70507880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363
Approved by: https://github.com/henrylhtsang
2025-03-05 02:21:27 +00:00
edc3ca577e [Profiler] Add profiler activity for HPU devices (#148182)
Fixes #148181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148182
Approved by: https://github.com/sraikund16
2025-03-05 01:37:48 +00:00
3985ce0b88 [dynamo] rename test_graph_break_messages -> test_error_messages (#148220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148220
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #148205
2025-03-05 01:16:53 +00:00
b28cbe5db3 [dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205
Approved by: https://github.com/zou3519
2025-03-05 01:16:53 +00:00
2927a64357 [inductor][cpu] Fix error with FlexibleLayout weights in BMM (#148188)
Fixes #148074

When node A is reshaped (is a `ReinterpretView`) and node B has a `FlexibleLayout`, then the layout of node B *may* be changed during the `kernel.select(options["W"], 0, self.b_index)` call, which could cause the assertion in `kernel.select` to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148188
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-03-05 01:05:05 +00:00
713a504a82 [dynamo][guards] Fix mem leak caused be refcount increment (#148480)
Should help [internalfb.com/sevmanager/view/491701](https://www.internalfb.com/sevmanager/view/491701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148480
Approved by: https://github.com/xmfan, https://github.com/StrongerXi, https://github.com/williamwen42, https://github.com/zou3519
2025-03-05 01:04:08 +00:00
b5873292c6 Add overload names to profiler trace (#143114)
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
             [call] op=[aten::add.Tensor], key=[AutogradCPU]
               [redispatch] op=[aten::add.Tensor], key=[Undefined]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                   [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::add.out], key=[CPU]
```

In this case, aten::add.out is a child of aten::add.Tensor; however, the current profiler trace provides no way to differentiate these aten op calls.

See the added unit test for a more detailed example.
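
A hedged sketch of how the overload name might surface in a trace (the exact event-name format is an assumption; the PR's unit test is authoritative):

```python
import torch
from torch.profiler import profile, ProfilerActivity

a, b = torch.rand(8), torch.rand(8)
with profile(activities=[ProfilerActivity.CPU]) as prof:
    a + b  # dispatches aten::add.Tensor under the hood

for evt in prof.events():
    if "add" in evt.name:
        # With overload names recorded, this can print "aten::add.Tensor"
        # instead of just "aten::add", disambiguating it from aten::add.out.
        print(evt.name)
```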

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
2025-03-05 01:00:29 +00:00
cf5e3f3cea Add cutlass kernel for rowwise scaled mm on sm100 (#148421)
### Important
- Previous PR in stack https://github.com/pytorch/pytorch/pull/148274
- Despite the changes between sm90 and sm100 being fairly minimal, I created a separate kernel since we'll be making various arch-specific perf optimizations to the sm100 kernel next.
- This kernel has not been optimized yet. However, initial perf testing shows numbers which indicate the tensor cores are being utilized as expected (not just CUDA cores).

### Summary of changes
- This PR adds a new cutlass kernel for rowwise GEMM on sm100.
- sm100 kernel is based on sm90 kernel, with the following changes:
  - Use new arch tag `cutlass::arch::Sm100`
  - Do not use [large tile](4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors)
- SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/

### Next steps
- Arch specific performance optimization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421
Approved by: https://github.com/drisspg
2025-03-05 00:46:01 +00:00
a907b6abae [compiled_autograd] workaround windows compilation issue (#148454)
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
2025-03-05 00:18:20 +00:00
e02a2ca07a Fix dist.init_process_group on windows (#148266)
Fix https://github.com/pytorch/pytorch/issues/139990

We don't build libuv on windows so anything that creates `TCPStore` which includes `init_process_group()` will fail, which is a bad experience. We should just default to `USE_LIBUV=0` for windows. There were a decent amount of hits for this [error on google ](https://www.google.com/search?q=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&sca_esv=921f59ac5f8bd98a&sxsrf=AHTn8zpG3PxdKoomFHkclOc451rBhoc3jw%3A1740854890873&source=hp&ei=albDZ5GHM-uIptQP4NTikQw&iflsig=ACkRmUkAAAAAZ8Nkei9H-aB2IBCk3pUOK3yFl5xBLZUt&ved=0ahUKEwiR5P7qxemLAxVrhIkEHWCqOMIQ4dUDCBg&uact=5&oq=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&gs_lp=Egdnd3Mtd2l6IkN1c2VfbGlidXYgd2FzIHJlcXVlc3RlZCBidXQgUHlUb3JjaCB3YXMgYnVpbGQgd2l0aG91dCBsaWJ1diBzdXBwb3J0SABQAFgAcAB4AJABAJgBAKABAKoBALgBA8gBAPgBAvgBAZgCAKACAJgDAJIHAKAHAA&sclient=gws-wiz) and https://github.com/pytorch/pytorch/issues/139579, so I figured we should add a more helpful message as well.

We don't have CI for windows and our support is just best effort, so I just tested these changes on my windows machine.
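
Until the new default lands, a common workaround on Windows is to disable the libuv-backed TCPStore explicitly (a sketch; assumes the standard `USE_LIBUV` environment variable honored by TCPStore):

```python
import os
import torch.distributed as dist

# PyTorch Windows wheels are built without libuv, so opt out before init:
os.environ["USE_LIBUV"] = "0"

dist.init_process_group(
    backend="gloo",  # gloo is the usual CPU/Windows backend
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)
dist.destroy_process_group()
```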

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148266
Approved by: https://github.com/d4l3k
2025-03-05 00:07:56 +00:00
84b58bd63e Enable FSDP tests on XPU device (#147518)
**Motivation:**

Enable FSDP tests on XPU device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518
Approved by: https://github.com/weifengpy
2025-03-04 23:49:37 +00:00
c98c3af421 Add a couple config options to compiler bisector (#148450)
These are commonly source of bugs/divergence (through bad interactions etc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148450
Approved by: https://github.com/shunting314
2025-03-04 23:23:21 +00:00
0c0a4baddd [MPS] unary kernels - avoid copying tensors if they have same stride (#148350)
I was a bit concerned when I saw in #148272 that the metal unary kernel had 0.02x the performance of what we had with MPS Graphs for sqrt (for non-contiguous tensors). This change makes it so that copying is only done if the input and output tensors do not have the same strides. So if the out tensor is not provided, we don't copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script listed at the end of the above PR has the same execution time as the non-transposed one.

Times for reference (on a transposed NxN matrix):

| N     | time_old           | time_new           |
|-------|--------------------|--------------------|
| 100   | 0.0002241021       | 0.0001548659       |
| 1000  | 0.0005934822       | 0.0002150342       |
| 10000 | 0.3242016407       | 0.0045755033       |
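
A minimal sketch of the kind of timing harness that could produce numbers like these (methodology and sizes are assumptions, not the script from #148272):

```python
import time
import torch

for n in (100, 1000, 10000):
    x = torch.rand(n, n, device="mps").t()  # transpose => non-contiguous input
    torch.mps.synchronize()
    start = time.perf_counter()
    torch.sqrt(x)
    torch.mps.synchronize()
    print(f"N={n}: {time.perf_counter() - start:.10f}s")
```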

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
2025-03-04 23:20:26 +00:00
ade4af8c95 [MPS][BE] Fix c10:🤘:sinc implementation (#148471)
Restrict the scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make the complex implementation a template for `is_complex_v` types.
This makes the eager kernel implementation for both real and complex types a trivial call to the template.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471
Approved by: https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448, #148449
2025-03-04 23:14:03 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
d789c22712 Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469)
The GitHub-provided ubuntu-20.04 GHA runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101), so upgrade workflows using them to the latest runner, 24.04.

They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115
```
[do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885)
This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101
```

Should we be using ubuntu-latest instead?

I attempted to upgrade actionlint to 1.7.7, but on my local checkout of test-infra it seems to add a lot of new checks, and on test-infra's CI I seem to have uploaded the wrong executable or something, so it failed. I'll try again later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-04 22:29:04 +00:00
5f47b7e268 [ROCm][TunableOp] Unit test for offline tuning of GEMM with bias (#148371)
One more unit test for the offline version of TunableOp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148371
Approved by: https://github.com/jeffdaily
2025-03-04 22:24:27 +00:00
842ffea445 [MPS][BE] Towards strided unary ops support (#148449)
Add generic functors kernels and rewrite all existing implementations into functors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448
2025-03-04 22:22:39 +00:00
70d0e1b96a Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 22:09:50 +00:00
c677f3251f [export] don't use unbacked_renamings in export (#147574)
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).

For de/serialization, we don't store unbacked bindings, and just rerun the pass.

Involved a refactor of compute_unbacked_bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
2025-03-04 21:43:49 +00:00
84961a0c17 ci: Add workflow dispatch for commit hash update (#148486)
Maybe this should also be split into its own workflow instead of piggybacking
off of nightly?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148486
Approved by: https://github.com/clee2000
ghstack dependencies: #148466, #148472
2025-03-04 21:26:23 +00:00
d290186ed3 ci: Add triton to update hash workflow (#148472)
Adds triton to our auto-update workflows so that PRs can be
automatically made and the triton team can follow up to fix any issues
that may arise.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148472
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #148466
2025-03-04 21:26:23 +00:00
9be8f74156 ci: Consolidate commit hash updates into a matrix (#148466)
Consolidates all of our commit hash update jobs into a single matrix to
make it easier to add more jobs later on.

Side note: How do I even test if this works?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148466
Approved by: https://github.com/Camyll, https://github.com/clee2000, https://github.com/atalman
2025-03-04 21:26:13 +00:00
d1abde11ec [dynamo] Support passing arguments to DeviceMesh.get_group (#147741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147741
Approved by: https://github.com/StrongerXi
2025-03-04 21:19:47 +00:00
f30776c37a [BE] Upgrade to mypy 1.14 (#145966)
Upgrade mypy version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966
Approved by: https://github.com/Skylion007
2025-03-04 20:58:26 +00:00
60205b0eb2 [export] Fix logging so that it doesn't result in max recursion error (#148231)
Test Plan:
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id=487493491 --test_suite ads_all --mode test_full_model

Produces https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp2wsjQH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D70416613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148231
Approved by: https://github.com/yiming0416
2025-03-04 20:47:25 +00:00
e4c558be1d [scan] Corrections for scan (#146110)
This PR resolves some minor issues with the scan HOP and unifies the handling of the additional_inputs in the same way as for associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146110
Approved by: https://github.com/ydwu4
2025-03-04 20:29:08 +00:00
439395c0ae [MPS] add slogdet and logdet implementations to mps (#148287)
Low-hanging fruit: all the ops these need are already implemented, so just adding them to native functions adds the functionality on MPS. The next op I add should probably be lu_solve, seeing as how many ops need it for the grad calculation.
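
A quick usage sketch once these land (assuming standard `torch.slogdet`/`torch.logdet` semantics carry over to MPS):

```python
import torch

# Diagonally dominant matrix, so det(m) > 0 and logdet is well defined.
m = torch.eye(3, device="mps") + 0.1 * torch.rand(3, 3, device="mps")

sign, logabsdet = torch.slogdet(m)  # now runs natively on MPS
print(sign.item(), logabsdet.item())
print(torch.logdet(m).item())
```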

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
2025-03-04 19:49:23 +00:00
92beda54c8 Revert "[fx] Move map_aggregate to C++ (#148243)"
This reverts commit edaff88f69f069d517b72ea23fd5eb04702eb0b5.

Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
17d003fe75 Revert "[fx] Move Node._update_args_kwargs to C++ (#148260)"
This reverts commit 0135f57f4aaeaba8d720f551eab6dca6fcede8cd.

Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
97b9e68bc6 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)"
This reverts commit 29c2de9ae16f1673f3f44363243294d403e53d37.

Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
6fb18ff685 Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303)"
This reverts commit a3d69e6e1a530ae2b91cd549ea26aac51ffc7566.

Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
63778cb8a0 Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)"
This reverts commit e3e45d90d8578083da8b51a3b1d911e9a4523e5b.

Reverted https://github.com/pytorch/pytorch/pull/147019 on behalf of https://github.com/clee2000 due to broke inductor test inductor/test_max_autotune.py::TestMaxAutotune::test_cat_max_autotune_extern [GH job link](https://github.com/pytorch/pytorch/actions/runs/13653495421/job/38171259603) [HUD commit link](e3e45d90d8) on inductor workflow and rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/147019#issuecomment-2698677222))
2025-03-04 19:20:15 +00:00
9d196edb7d Revert "Bump onnxscript to 0.2.2 in CI (#148388)"
This reverts commit 7ab6749ec7db32e0b3cdfd19db087f15dd0bebe2.

Reverted https://github.com/pytorch/pytorch/pull/148388 on behalf of https://github.com/clee2000 due to broke libtorch debug build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/13646179239/job/38152039312) [HUD commit link](7ab6749ec7) ([comment](https://github.com/pytorch/pytorch/pull/148388#issuecomment-2698665495))
2025-03-04 19:16:34 +00:00
c219c5ca38 Fix code descriptions in the test package. (#148145)
The parameter and function description have something wrong and make them correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145
Approved by: https://github.com/janeyx99
2025-03-04 19:14:41 +00:00
e8900fbe4f [MPS] Add some useful utils (#148448)
Like `is_complex_v`, `is_scalar_integral_v`, `result_of`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148448
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399
2025-03-04 19:09:17 +00:00
f859722f70 [dtensor] refactor sharding prop to handle cross mesh computation (#147869)
As titled, this PR moves the same-mesh check from the sharding propagation level to the individual operator level.

This is to allow each individual operator more flexibility to check whether it can be run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run the `for_each` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes match in a "zipped" way.

This should also fix https://github.com/pytorch/pytorch/issues/134212

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
2025-03-04 18:30:44 +00:00
eea54a55f6 ci: Switch manywheel build.sh to just use dev (#148310)
To avoid annoying error messages like:

> fatal: no tag exactly matches 'a6520c85bd85875b09f2c68e51622699d7d07595'

These were popping up when GITHUB_REF is not set so let's just assume
that if someone is building without directly setting GITHUB_REF then
they're probably doing a dev build.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148310
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-03-04 18:27:44 +00:00
611b0e9bc4 Revert "[fx] Optimizations for node name generation (#148288)"
This reverts commit 5eb0337cfd5e7c2cdf4a2d4829609e391467270f.

Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
ed9055c303 Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)"
This reverts commit 8531d247ba411993f9a10686d70514f6945f9960.

Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
67937be673 [BE] Move sinc kernels to the same OP family (#148399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148399
Approved by: https://github.com/dcci
ghstack dependencies: #148398
2025-03-04 15:49:20 +00:00
7fcbaff206 [BE] Remove stale arg for complex ops (#148398)
No need to pass DTYPE0 and DTYPE1 if only one DTYPE is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148398
Approved by: https://github.com/dcci
2025-03-04 14:35:43 +00:00
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR is to upgrade submodule oneDNN to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
f339e41a38 [inductor][triton] Fix average pool nd for int64 dtype (#146061)
The eager mode implementation of average pool nd returns an integer tensor if the input is also an integer tensor. This should also be preserved in inductor.
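
A hedged repro sketch of the dtype-preservation expectation (shapes and values are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 10, (1, 1, 4, 4), dtype=torch.int64)
eager = F.avg_pool2d(x, kernel_size=2)

compiled = torch.compile(lambda t: F.avg_pool2d(t, kernel_size=2))(x)
# Inductor should now match eager and keep the integer dtype:
assert compiled.dtype == eager.dtype == torch.int64
```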

Fixes pytest -k test_comprehensive_nn_functional_avg_pool2d_cpu_int64 error: Triton compilation failed: triton_poi_fused_avg_pool2d_0

See WIP https://github.com/pytorch/pytorch/pull/145865#issuecomment-26200289890 to potentially enable such tests as they aren't enabled yet.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146061
Approved by: https://github.com/eellison
2025-03-04 13:53:50 +00:00
fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.
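
A minimal usage sketch of DCP's async save entry point (`dcp.async_save` is the existing public API; exactly how the process-based path is enabled, by default or via an explicit option, isn't stated in this summary):

```python
import torch
import torch.distributed.checkpoint as dcp

state_dict = {"model": torch.nn.Linear(8, 8).state_dict()}

# Returns immediately with a Future; serialization and upload proceed in the
# background (per this PR, in a long-lived daemon process after the first save).
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt")

# ... training continues here ...

future.result()  # block only when the checkpoint must be durable
```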

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
16d07988fc add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and requires the use of the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I added a `supports_coalescing` property in `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
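
For reference, a sketch of the coalescing-dependent call in question (standard `torch.distributed` P2P API; ranks and shapes are illustrative):

```python
import torch
import torch.distributed as dist

# Assumes an initialized process group (e.g. via dist.init_process_group).
rank = dist.get_rank()
peer = (rank + 1) % dist.get_world_size()

send_buf = torch.ones(4) * rank
recv_buf = torch.empty(4)

ops = [
    dist.P2POp(dist.isend, send_buf, peer),
    dist.P2POp(dist.irecv, recv_buf, peer),
]
# Backends that support coalescing can batch these into one launch.
for req in dist.batch_isend_irecv(ops):
    req.wait()
```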

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-03-04 12:37:06 +00:00
e3e45d90d8 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)
Modified  TorchInductor’s autotuning flow so that each `best_config` JSON file also includes the Triton “base32” (or base64) cache key.

**Motivation**

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help advanced performance tuning or kernel-level debugging.

Also, since Triton already stores cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`, since they can get the cubin/hsaco from the Triton cache and, with the code provided in this PR, easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019
Approved by: https://github.com/davidberard98
2025-03-04 12:16:38 +00:00
f1cce0951b Create unique test report files for distributed tests (#148325)
The distributed tests are executed once for each backend and for each init method.
`$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files.
The same needs to be done for the init method.

Move the setting of the variable into `test_distributed` and incorporate the init method into the name.

Useful for e.g. https://github.com/pytorch/pytorch/issues/126523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325
Approved by: https://github.com/clee2000
2025-03-04 10:45:33 +00:00
0b0d28accd Optimize param prepend class reference torch.nn.Module (#148304)
Fixes #147696

## Changes

Change `prepend` description  `torch.nn.modules.Module` to `torch.nn.Module`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/054f54b7-9487-4505-a926-3e17a84bd2f9)

### After

![image](https://github.com/user-attachments/assets/1d2a5708-62d1-428e-b136-bcaa35e5e6da)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148304
Approved by: https://github.com/Skylion007
2025-03-04 08:46:14 +00:00
da2688f624 Introduce delayed compile via eager_then_compile stance (#147983)
Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic).

Another benefit of this is that most users no longer need to annotate their inputs with mark_dynamic and mark_unbacked calls, since we can derive the dynamism on the very first call. Additionally we get dynamic ints out of the box in this new regime.

This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance.
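
A usage sketch of the new stance (assuming it plugs into the existing `torch.compiler.set_stance` API):

```python
import torch

@torch.compile
def f(x):
    return x.sin() + x.cos()

torch.compiler.set_stance("eager_then_compile")

f(torch.randn(8))   # first call runs eager; example inputs are recorded
f(torch.randn(16))  # second call compiles, deriving dynamism from both calls
```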

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983
Approved by: https://github.com/williamwen42
2025-03-04 07:46:31 +00:00
e0f0db0105 updates to benchmarks (#144831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831
Approved by: https://github.com/danielvegamyhre
2025-03-04 06:21:12 +00:00
ac99fc7e57 Updates to build rowwise scaled mm kernel on SM10.0a (#148274)
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.

**NOTE**: performance optimization will be done in separate follow up PRs

## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA tookit 12.8
    - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
    - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below

**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.

## Test plan
- `python test/distributed/tensor/test_matrix_ops.py  -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
2025-03-04 05:23:41 +00:00
7ab6749ec7 Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 04:21:58 +00:00
d54cab78e1 [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70472663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393
Approved by: https://github.com/Skylion007
2025-03-04 04:20:04 +00:00
70410f93f2 doc/xpu: align description of SyclExtension with CPP/CUDA (#147988)
This commit just aligns the description of the `py_limited_api` feature in SyclExtension with CPP/CUDA. We missed this change when doing SyclExtension due to parallel work on the changes. For CPP/CUDA the change was done in 515e55e6927ad5f57ec222d7779712630341acf3.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-03-04 04:17:36 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since Torch requires CUDA>=11, some checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
cyy
b7832f0339 Enable ASAN in CUDA tests (#147812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147812
Approved by: https://github.com/janeyx99
2025-03-04 02:50:39 +00:00
8531d247ba [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303, #148288
2025-03-04 02:42:23 +00:00
5eb0337cfd [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303
2025-03-04 02:42:23 +00:00
a3d69e6e1a Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243, #148260, #148261
2025-03-04 02:42:23 +00:00
17518007b2 [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 12.759539298713207 |  2.7271360370796174  |         NA          |
|        triton         | 10.573655366897583 |  1.8661278090439737  | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 |  0.5315794269554317  | -14.698873781600327 |
|  cutlass_lvl_default  | 13.09632882475853  |  0.5520401500398293  | 2.6395116481931873  |
|   cutlass_lvl_1111    | 11.05172373354435  |  0.569593315012753   | -13.384617776451302 |
|   cutlass_lvl_2222    | 11.371277272701263 |  133.58984916994814  | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.472318813204765 |  1.5445372510002926  |         NA          |
|        triton         | 10.568295605480671 |  16.583424195996486  | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  |  5.830657540936954   | -27.764770809729562 |
|  cutlass_lvl_default  | 12.742593884468079 |  28.994930602959357  | -11.951954286402668 |
|   cutlass_lvl_1111    | 11.522261425852776 |  79.85037935699802   | -20.38413764531163  |
|   cutlass_lvl_2222    | 10.993581265211105 |  132.86601971101481  | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.700622126460075 |  2.225986961973831   |         NA          |
|        triton         | 29.17378954589367  |  38.571991189033724  |  -4.97329524553989  |
| triton_persistent_tma | 29.642896726727486 |   7.2848734309664    | -3.4452897904663744 |
|  cutlass_lvl_default  | 29.514770954847336 |  29.819900761009194  | -3.8626291243482167 |
|   cutlass_lvl_1111    | 29.411429539322853 |  23.82907024596352   |  -4.19923929172139  |
|   cutlass_lvl_2222    | 29.57325428724289  |  134.31008586101234  | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 30.858177691698074 |  1.181898436974734   |         NA         |
|        triton         | 28.630023822188377 |  39.24473957403097   | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 |  5.275042273919098   | -7.181929126210897 |
|  cutlass_lvl_default  | 29.16003204882145  |  29.934022572939284  | -5.503065216107967 |
|   cutlass_lvl_1111    | 28.79570797085762  |  23.948012012057006  | -6.683705504085324 |
|   cutlass_lvl_2222    | 29.02756631374359  |  136.25560767308343  | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1456.143856048584  |  1.020197194069624   |         NA         |
|        triton         | 1708.2737684249878 |  5.766509635956027   | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  |  7.455113030038774   | 1.3969195302177155 |
|  cutlass_lvl_default  | 1583.3594799041748 |  50.408804678940214  | 8.736473620182366  |
|   cutlass_lvl_1111    | 1636.4418268203735 |  82.82403108896688   | 12.381879030898025 |
|   cutlass_lvl_2222    | 1507.5665712356567 |  260.03901409788523  | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1382.230520248413  |  1.2586536260787398  |         NA         |
|        triton         | 1646.9683647155762 |  5.442052865982987   | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 |  6.515797697938979   | 3.016069871556595  |
|  cutlass_lvl_default  | 1500.9030103683472 |  51.36402789200656   |  8.58557877152115  |
|   cutlass_lvl_1111    | 1446.9740390777588 |  30.65435610699933   | 4.683988515729638  |
|   cutlass_lvl_2222    | 1419.661521911621  |  205.1948991640238   | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```

Differential Revision: D70147589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78
2025-03-04 01:45:36 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the plan of OneDNN Upstreaming, as #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203) stated. SDPA support is provided via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for OneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608
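
A usage sketch of what this enables (standard SDPA API; the `xpu` device placement is the new part):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
           for _ in range(3))

# With this PR, SDPA dispatches to the oneDNN-Graph-backed XPU kernel.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```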

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
b17f5223a4 Generate AOTI input check by default (#148005)
Summary:
Generate AOTI size and stride input checks by default. The checks are only run if the `AOTI_RUNTIME_CHECK_INPUTS` env variable is set (to avoid slowing down performance), as shown in the generated code below.

Example output:

```cpp
            bool _check_aoti_runtime_check_inputs_env() {
                const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS");
                const static bool result = env_var_value != nullptr && env_var_value[0] != '\0';
                return result;
            }

            AOTI_NOINLINE static void __check_inputs_outputs(
                AtenTensorHandle* input_handles,
                AtenTensorHandle* output_handles) {
                if (!_check_aoti_runtime_check_inputs_env()){
                    return;
                }
//rest of the check
}

```

Test Plan: CI

Differential Revision: D70260490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005
Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh
2025-03-04 00:55:14 +00:00
0bd2caac55 Docker release - pin buildkit to v0.19.0 (#148372)
Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851

Error:
```
#10 73.62 Segmentation fault (core dumped)
#10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
#10 73.85 Segmentation fault (core dumped)
#10 73.85 dpkg: error processing package libc-bin (--configure):
#10 73.85  installed libc-bin package post-installation script subprocess returned error exit status 139
```
Looks like we are hitting: https://github.com/moby/buildkit/issues/5783

Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0

Please note: the CUDA 12.8 error is not related to this failure in nightly cpu arm64. Looks like we are trying to install release torch when running on PR. The CUDA 12.8 build is not released yet, hence the failure. Will send a follow-up to make sure we are using nightly torch when running on PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372
Approved by: https://github.com/seemethere
2025-03-03 23:55:30 +00:00
d43c6f0033 [invoke_subgraph] Run joint passes on the hop graphs (#139325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139325
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #147559
2025-03-03 23:38:14 +00:00
216a108aaf [ROCm] Add rocm-mi300 and inductor-rocm-mi300 to upload-test-stats.yml (#148365)
We currently run MI300X machines on rocm-mi300 and inductor-rocm-mi300 but we don't have artifacts for the results:
e.g.
6e10471966 (rocm-mi300)
![image](https://github.com/user-attachments/assets/f5588072-b818-4f54-a348-0e6ac7e96829)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148365
Approved by: https://github.com/jeffdaily
2025-03-03 23:22:56 +00:00
586d8df651 Fix condition for CONVERT_NON_VECTORIZED_INIT invocation (#148362)
Yet another regression caused by https://github.com/pytorch/pytorch/pull/146596 that breaks builds if PyTorch is compiled for Android or on NVIDIA Grace Hopper systems.

Not sure why the author was trying to change the condition to begin with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148362
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #148354
2025-03-03 23:13:37 +00:00
5887a2d8de [BE] Use C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED (#148354)
Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"`
Also limit the scope to just `Vectorized::map` that has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods

Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16 as it results in an unbalanced pop warning, for push that is defined in vec256_16bit_float, which will be included only once
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
  232 | #pragma GCC diagnostic pop
      |                           ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354
Approved by: https://github.com/izaitsevfb
2025-03-03 23:00:47 +00:00
d0b23e661d [cutlass backend] Add main tests for mm, addmm and bmm - step 1 (#148229)
This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}.

There are some parts that are less tested. For example:
* different layout combo
* shapes that are less aligned

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229
Approved by: https://github.com/chenyang78
2025-03-03 22:31:46 +00:00
a41413829c Use release notes label for module: distributed_checkpoint (#148352)
module: distributed_checkpoint is redundant with oncall: distributed checkpointing.

@fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352
Approved by: https://github.com/fegin
2025-03-03 21:33:28 +00:00
e45040b1d3 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator having its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The Out-of-tree backends are registered calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We are binding the process group creator constructs at run-time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary.

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

And add another entry to the dictionary with the same backend name ( but different device name )
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration  eg: APIs like ```is_hccl_available```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
2025-03-03 21:32:21 +00:00
52078154f2 Add support for no-op concat with padded output (#146866)
Add support for no-op concat with padded output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866
Approved by: https://github.com/shunting314
2025-03-03 21:10:46 +00:00
07f876e960 Subprocess compile (#146134)
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
2025-03-03 21:10:12 +00:00
8f361c808b [dynamo] run-only recursively on recompile limit exceeded (#148021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148021
Approved by: https://github.com/anijain2305
2025-03-03 21:01:08 +00:00
1bbe57336b Replace unimplemented with unimplemented_v2 for dynamo (#148158)
torch/_dynamo/variables/constant.py

https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148158
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-03-03 21:00:17 +00:00
b162b1600b [Inductor] Hot fix after #148011 (#148270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148270
Approved by: https://github.com/davidberard98
2025-03-03 20:18:21 +00:00
d260d4fc55 HSDP custom hook UTs are multi-threaded - can't set device rank (#148099)
HSDP custom hook UTs are multi-threaded and use a single physical GPU. If we set a rank in each thread, then we are referencing the same GPU with multiple ranks, which isn't right. Therefore, the rank setting is removed from these UTs. Now they pass with 1, 2, and 4 GPUs.

Fixes #147767 and #147769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099
Approved by: https://github.com/jeffdaily
2025-03-03 19:48:49 +00:00
302c660298 Consistently use load_torchbind_test_lib in tests (#148082)
The same code is repeated multiple times with slightly different implementations.
Use the existing function for brevity and consistency.

In the function the code from `test_export` is used which does a single `load_library` with cleaner conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082
Approved by: https://github.com/angelayi
2025-03-03 19:37:28 +00:00
40c2505f16 [logging] Log individual Triton kernel compilation times to dynamo_compile (#147022)
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
2025-03-03 19:32:17 +00:00
aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
a929e11e4f [dynamic shapes][export] ignore when real-tensor fallback fails (#147779)
Summary: uninspired solution to https://github.com/pytorch/pytorch/issues/147402

Test Plan: test_draft_export

Differential Revision: D70132269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147779
Approved by: https://github.com/bobrenjc93
2025-03-03 19:09:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
1c544a9ddd [Inductor-CPP] If all of the activation scale dims are 1, make it a 0D tensor (#147033)
For int8 dynamically quantized activations & int8 quantized weights, add a workaround for an indexing issue that expected an empty index (i.e., a 0D tensor) in the epilogue creator when the activation scale was sized [1, 1], by converting it into a 0D tensor.

The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).

The final hidden states tensor that is the activation to the LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, Inductor's epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-03-03 18:32:27 +00:00
57addfcd58 Significantly speed up save_cache_artifacts (#148227)
While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning the serialization algorithm.

Essentially, what we want is to be able to call serialize many times without paying the full serialization cost from scratch each time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
2025-03-03 17:28:41 +00:00
3ca1a2564d [BE][MPS] Use copysign for imaginary part of sqrt (#148286)
Also, it's tempting to replace `a*a + b*b` with `dot(input[index])`, but for some reason that produces a slightly different output.
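
For reference, a Python sketch of the standard copysign-based complex square root that this change alludes to (not the Metal source): for z = a + bi, sqrt(z) = sqrt((|z| + a)/2) + i * copysign(sqrt((|z| - a)/2), b).

```python
import math

def complex_sqrt(a: float, b: float) -> tuple[float, float]:
    m = math.hypot(a, b)  # |z| = sqrt(a*a + b*b)
    re = math.sqrt((m + a) / 2.0)
    im = math.copysign(math.sqrt((m - a) / 2.0), b)  # imaginary part takes b's sign
    return re, im

print(complex_sqrt(-4.0, 0.0))   # (0.0, 2.0): sqrt(-4) = 2i
print(complex_sqrt(-4.0, -0.0))  # (0.0, -2.0): copysign keeps the branch cut right
```
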
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148286
Approved by: https://github.com/dcci
ghstack dependencies: #148285
2025-03-03 16:03:54 +00:00
84502baaff [MPS] Fix sqrt and other for torch.chalf (#148285)
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
failing with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```

As sqrt is not implemented for ComplexHalf on CPU, add an explicit test to `test_sqrt`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
2025-03-03 16:03:54 +00:00
d57f617844 [Inductor][CPP] Avoid transpose with cpp micro-gemm for FlexAttention (#147069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147069
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #147068
2025-03-03 15:22:11 +00:00
6c089f5da3 ci: move xpu triton build to manylinux 2.28 (#148195)
Follow-up to PR #148129, removing the old manylinux builds for triton xpu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148195
Approved by: https://github.com/seemethere
2025-03-03 12:31:08 +00:00
165e33531c [Inductor][CPP] Fix the vec codegen for tanh (#148254)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case, after `tanh` the eager output at `[0,0,11,47]` was `-5.820766091346741e-10` while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
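
An illustrative (standalone) demonstration of why `atan2` amplifies such tiny differences: near zero, the sign of the first argument selects the branch, so an ~1e-8 disagreement can become an O(pi) disagreement downstream.

```python
import math

x = -1.0
print(math.atan2(-5.8e-10, x))  # ~ -3.14159, approaches -pi
print(math.atan2( 1.4e-08, x))  # ~ +3.14159, approaches +pi
```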

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-03-03 11:46:57 +00:00
118a165ac5 [Inductor][CPP] Add transposed B matrix support for CppMicroGemmFP32Vec (#147068)
* Add transposed B support for CppMicroGemmFP32Vec.
* Add support for cases where N is not divisible by `block_n`.
Expand CppMicroGemmFP32Vec to generate a gemm kernel that supports transposed B and N of arbitrary size.
This is the basis for https://github.com/pytorch/pytorch/pull/147069 to achieve better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147068
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-03 11:08:23 +00:00
6a3a1f96ce Enable XPU for Inductor MM Triton Kernel Benchmark (#148237)
#147620 enabled `force_shape_pad` for the triton kernel benchmark. Intel GPU supports this scenario, so we need to enable the case in this PR; otherwise, there would be a test-case regression for Intel GPU now that #147620 has landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237
Approved by: https://github.com/jansel
2025-03-03 10:09:06 +00:00
b3bb73e11c Separate transpose from memory load/store and add load size support for convert_to_int32 (#147067)
Separate transpose from memory load/store and add load size support for convert_to_int32 to facilitate the expansion for CppMicroGemmFP32Vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147067
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 02:56:16 +00:00
ab81ca5053 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten_weight_int4pack_mm_for_cpu` with minor changes.

Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 00:56:29 +00:00
608377d341 Revert "[import][inductor] Simplify grid handling (#147583)"
This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e.

Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))
2025-03-03 00:49:32 +00:00
29c2de9ae1 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-02 22:42:31 +00:00
0135f57f4a [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
edaff88f69 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
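
For reference, a sketch of how such call-count/time figures can be reproduced with cProfile (exact numbers will of course vary by machine and build):

```python
import cProfile
import functools
import operator

import torch.fx as fx

cProfile.run(
    "fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))",
    sort="cumulative",
)
```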

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
94afb165d9 Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478)"
This reverts commit dae3fbfe9720e83e7e81d41430fb5067221bbed7.

Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see dae3fbfe97 ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))
2025-03-02 21:22:04 +00:00
1106eb0212 [BE] Fix extra semicolon warning (#148284)
Introduced by https://github.com/pytorch/pytorch/pull/146596

I.e. while building locally my log was littered with
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16);
      |                                          ^
2 warnings generated.
[230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284
Approved by: https://github.com/Skylion007
2025-03-02 19:06:46 +00:00
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
95d81d21a6 [MPS] Speedup interpolation (#148277)
First of all, the perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true, due to a bug in the benchmark script that did not call `torch.mps.synchronize` at the end; the results are still slightly better than MPS, probably due to the launch overhead.

And while measuring performance correctly, I've noticed that a lot of time is spent on 64-bit integer division of thread_index to get the spatial coordinates. Simply downcasting the divisor to a 32-bit integer (which is also the type of the thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script
```python
import torch
import time
import subprocess
import itertools

def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)

    # define kwargs
    kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf}

    # Skip for unimplemented flavors
    if antialias and mode == "bicubic" and device == "mps":
       return None, "Skip"
    elif antialias and dtype != torch.float32:
       if device == "cpu":
           return None, "Skip"
       outputs_match = None
    else:
        # Check output
        y = torch.nn.functional.interpolate(x, **kwargs)
        z = torch.nn.functional.interpolate(x.cpu(), **kwargs)
        outputs_match = torch.allclose(y.cpu(), z)
        if not outputs_match:
           atol = (y.cpu() - z).abs().max()
           rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
           print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, **kwargs)
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
for mode,antialias in itertools.product(["bilinear", "bicubic"], [False, True]):
    outputs_match_list = []
    average_time_list = []
    for device in ["mps", "cpu"]:
      for dtype in [torch.float32, torch.float16, torch.bfloat16]:
          outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias)
          outputs_match_list.append(str(outputs_match))
          average_time_list.append(average_time)

    print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:")
    print("-"*40)
    print("Device            :                MPS        |               CPU")
    print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
    print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
    print(f"Average Time (us) :", "  |".join(average_time_list))
```

Before
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  292.0  | 264.7  | 267.9  | 289.1  | 230.9  | 309.1
atol=1.430511474609375e-06 rtol=0.11363636702299118

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  698.3  | 684.2  | 683.8  | 851.0  |Skip  |Skip
atol=2.086162567138672e-06 rtol=0.019750799983739853

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

After
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  119.9  |  98.9  |  98.6  | 289.8  | 231.9  | 308.5
atol=1.430511474609375e-06 rtol=0.05681818351149559

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  541.9  | 531.1  | 531.0  | 846.8  |Skip  |Skip
atol=2.0265579223632812e-06 rtol=0.008604463189840317

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

TODO:
 - Figure out if these ops make more sense as 3D jobs, with the n and c channels dispatched as one more dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277
Approved by: https://github.com/Skylion007
2025-03-02 17:13:52 +00:00
cyy
9aa897b992 Remove unnecessary tensor clone (#148159)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159
Approved by: https://github.com/Skylion007
2025-03-02 16:21:39 +00:00
1d7397a2d0 [Inductor] Avoid tensor slice overflow for large step (#147433)
Fixes #147071

Currently, if step is a value very close to INT64_MAX, the calculation of the slice output length overflows. This PR fixes that problem and thus fixes #147071.
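
An illustrative sketch of the failure mode (Python ints don't wrap, so the 64-bit wraparound is simulated explicitly; this is not the PyTorch source): the naive `(end - start + step - 1) / step` formula overflows for `step` near INT64_MAX, while dividing the span first avoids it.

```python
INT64_MAX = 2**63 - 1

def wrap_i64(v: int) -> int:
    # Simulate C++ int64 two's-complement wraparound.
    return (v + 2**63) % 2**64 - 2**63

def slice_len_naive(start: int, end: int, step: int) -> int:
    return wrap_i64(end - start + step - 1) // step  # intermediate overflows

def slice_len_safe(start: int, end: int, step: int) -> int:
    span = end - start
    return 0 if span <= 0 else (span - 1) // step + 1

print(slice_len_naive(0, 10, INT64_MAX - 1))  # -1: nonsensical, overflowed
print(slice_len_safe(0, 10, INT64_MAX - 1))   # 1: a huge step still yields one element
```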

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147433
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-02 16:07:15 +00:00
9c506aa8a6 [aotinductor] add option to disable runtime assertions (#146462)
A recent user experience went like this:
* The user runs AOTI lowering, and it succeeds.
* They take the AOTI model and run it with some sample inputs. Everything runs well.
* Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests.
* They see that some of the requests fail. The logs show them this:
  * AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ...
  * Error: u45 >= 2
* To the untrained eye, "AOTInductorModel run failed" is all they see. But the true reason is `Error: u45 >= 2`.

However, the assertion isn't always correct.
* In fact, u45 can actually be 0.
* So, why did AOTI say u45 ≥ 2? It's a two-piece combo:
  * With 0/1 specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value range of [2, inf].
  * In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45).
  * Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion.

So, the motivation for this PR is to add an option to disable these runtime assertions if you run into a situation like this. Another way to avoid it, however, is to call `mark_unbacked()` on all the dynamic dims.
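
A hedged sketch of the `mark_unbacked()` alternative mentioned above (assumes `torch._dynamo.mark_unbacked` is available in your build; treat this as illustrative): marking a dim as unbacked keeps it from inheriting the [2, inf] range that 0/1 specialization gives backed symbols.

```python
import torch

inp = torch.randn(8, 32)
torch._dynamo.mark_unbacked(inp, 0)  # dim 0 may legitimately be 0 or 1 at runtime

@torch.compile(fullgraph=True)
def f(x):
    return x * 2

f(inp)
```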

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-03-02 09:14:58 +00:00
26358fa2d8 Add AppendingByteSerializer class (#148226)
This PR adds a new util class that enables efficient appending of sequential byte data with custom serialization and deserialization.
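
A rough sketch of the shape of such a utility (hypothetical names, not the actual implementation): each append writes a length-prefixed record into a growing buffer, so serializing again after new data arrives is incremental rather than from scratch.

```python
import struct

class AppendingByteSerializerSketch:
    def __init__(self) -> None:
        self._buf = bytearray()

    def append(self, data: bytes) -> None:
        self._buf += struct.pack("<Q", len(data))  # 8-byte little-endian length prefix
        self._buf += data

    def to_bytes(self) -> bytes:
        return bytes(self._buf)

    @staticmethod
    def read_all(blob: bytes) -> list[bytes]:
        out, i = [], 0
        while i < len(blob):
            (n,) = struct.unpack_from("<Q", blob, i)
            out.append(blob[i + 8 : i + 8 + n])
            i += 8 + n
        return out

s = AppendingByteSerializerSketch()
s.append(b"artifact-1")
s.append(b"artifact-2")
print(AppendingByteSerializerSketch.read_all(s.to_bytes()))
```
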
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148226
Approved by: https://github.com/aorenste
2025-03-02 08:20:58 +00:00
b59776d857 [import][inductor] Simplify grid handling (#147583)
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Note the attached diff contains some minor fbcode-only changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-03-02 07:31:07 +00:00
dae3fbfe97 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The Out-of-tree backends are registered calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We bind the process-group creator constructs at run time, so if there are other distributed backends with the same backend name they can safely add their device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

And add another entry to the dictionary with the same backend name (but a different device name)
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, out-of-tree devices can use the ```backend_list``` to check for successful backend registration, e.g. in APIs like ```is_hccl_available```.
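
A hedged usage sketch of that registration path (the creator body is a hypothetical stub; `register_backend` is the entry point linked above):

```python
import torch.distributed as dist

def _hccl_pg_creator(store, rank, world_size, timeout):
    # Out-of-tree code would construct and return its ProcessGroup here.
    raise NotImplementedError("illustrative stub")

dist.Backend.register_backend("hccl", _hccl_pg_creator, devices=["hpu"])
print("hccl" in dist.Backend.backend_list)  # successful registration adds the name
```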

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang, https://github.com/guangyey
2025-03-02 05:13:48 +00:00
6e10471966 [ci] disable cudagraph for tts_angular on dashboard (#148221)
tts_angular with cudagraph is flaky: its speedup varies from 0.05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x, skipping tts_angular entirely would wrongly bump the average cudagraph speedup, so this PR only disables cudagraph for tts_angular instead of skipping the model.

[Dashboard](https://github.com/pytorch/pytorch/actions/runs/13597394087)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
de7af81f18 [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node references the tensor from _before_ the reshape op, but references the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape it had before the reshape (see the fx sketch after the diagram).
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
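
A minimal fx sketch of the short-term fix (hypothetical helper and node names, not the actual post-grad pass): once a reshape -> reciprocal pair is detected, insert a reshape that restores the reciprocal output to its pre-reshape shape and reroute consumers through it.

```python
import torch

def restore_scale_shape(graph, reciprocal_node, orig_shape):
    # Insert reshape(reciprocal, orig_shape) right after the reciprocal node.
    with graph.inserting_after(reciprocal_node):
        restored = graph.call_function(
            torch.ops.aten.reshape.default, (reciprocal_node, orig_shape)
        )
    # Route existing consumers through the restored-shape node.
    reciprocal_node.replace_all_uses_with(
        restored, delete_user_cb=lambda n: n is not restored
    )
    return restored
```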

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-02 03:25:28 +00:00
ce2f680e00 [fr] Added protection against missing stack frames in fr (#148203)
Summary: We have had quite a few failures due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf

Test Plan:

Reviewed By: fduwjj

Differential Revision: D70358287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203
Approved by: https://github.com/fduwjj
2025-03-02 01:03:49 +00:00
19de523de6 [MPS] metal unary kernel for sqrt (#148272)
Issue #148219 highlighted the high dispatch times of ops which ran via MPSGraph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue.

## Speedups:

Matrix size means NxN matrix here.
![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81)

Code to generate the times (requires building torch before and after this change):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3

def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
2025-03-02 00:45:45 +00:00
1a6883759d Fix macro for bit_cast in c10/util/bit_cast.h - one line change (#148265)
Fixes #148263.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148265
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-01 20:55:31 +00:00
1919e0de9a Revert "stage 1 of depreate silent fallback of tuning gemm (#147798)"
This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8.

Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))
2025-03-01 20:04:23 +00:00
82603fd7d2 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
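
A hedged sketch of the core idea (hypothetical function, not the library's actual API): compare shapes across two sets of example inputs and report which dimensions vary.

```python
import torch

def detect_dynamism(inputs_a, inputs_b):
    report = {}
    for name in inputs_a.keys() & inputs_b.keys():
        a, b = inputs_a[name], inputs_b[name]
        report[name] = [
            "dynamic" if da != db else "static"
            for da, db in zip(a.shape, b.shape)
        ]
    return report

ex1 = {"x": torch.randn(4, 16), "y": torch.randn(3)}
ex2 = {"x": torch.randn(7, 16), "y": torch.randn(3)}
print(detect_dynamism(ex1, ex2))  # {'x': ['dynamic', 'static'], 'y': ['static']}
```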

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk, https://github.com/wdvr
2025-03-01 19:57:54 +00:00
5301710b15 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69945678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-01 19:46:13 +00:00
0ff2e6a85a Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102)
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
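
A simplified sketch of that localization (hypothetical structure, not the AOTI source): recompute the post-filter `equal_to_1` indices from the original signature on every call, instead of mutating shared state.

```python
def filtered_equal_to_1(arg_names, none_args, equal_to_1):
    # Derive everything from the *original* indices, locally, on each call.
    kept = [i for i, name in enumerate(arg_names) if name not in none_args]
    remap = {old: new for new, old in enumerate(kept)}
    return [remap[i] for i in equal_to_1 if i in remap]

arg_names = ["in_ptr0", "in_ptr1", "n_elements", "out_ptr"]
none_args = {"in_ptr1"}      # argument passed as None, to be dropped
equal_to_1 = [2]             # n_elements == 1 in the original signature
print(filtered_equal_to_1(arg_names, none_args, equal_to_1))  # [1], on every call
```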

Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```

Differential Revision: D69998314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-03-01 18:38:33 +00:00
2b86309da3 separate f16 vectorized class from bf16 (#146596)
Separate the f16 vectorized class into a different file from the bf16 vectorized class, in order to be able to add a new bf16 SVE vectorized class in https://github.com/pytorch/pytorch/pull/143666. This is required because we need to exclude the current bf16 class (in order to use the SVE bf16 class) while still including the current f16 vectorized class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146596
Approved by: https://github.com/malfet
2025-03-01 18:22:32 +00:00
8e004865dd Revert "introduce dynamism library (#147981)"
This reverts commit 1c1bf410ecdeac8d240e15bf8c33c0f00fab0673.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/malfet due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2692351017))
2025-03-01 18:16:52 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
d23051f29b [Inductor] Support parallel reduction for GroupNorm (#144020)
Summary:
Support parallel reduction for GroupNorm by optimizing the parallelization heuristics: When the range of the first inner loop is much larger than the range of all outer loops, change the starting depth of parallelization to the first inner loop.
I tested the Inductor benchmark with this PR on CPU. One torchbench model (pytorch_CycleGAN_and_pix2pix) achieved ~45% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) achieved ~2% performance improvement.
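
An illustrative sketch of the heuristic in plain Python (the threshold `factor` and the function name are hypothetical, not the Inductor source); the concrete GroupNorm example follows below.

```python
import math

def pick_parallel_depth(outer_ranges, first_inner_range, factor=16):
    outer_work = math.prod(outer_ranges) if outer_ranges else 1
    if first_inner_range > factor * outer_work:
        return len(outer_ranges)  # start parallelizing at the first inner loop
    return 0                      # parallelize from the outermost loop

# GroupNorm-like case from the example below: outer loops 2 x 2, inner range 28224.
print(pick_parallel_depth([2, 2], 28224))  # 2 -> parallelize the inner reduction
```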

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

x = torch.randn(2, 64, 168, 168).to(memory_format=torch.channels_last)
m = GN(2, 64).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    out = compiled_m(x)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(56448L));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        #pragma omp single
        {
            {
                #pragma GCC ivdep
                for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                    {
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                        {
                            {
                                if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                                {
                                    auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp1 = static_cast<float>(903168.0);
                                    auto tmp2 = tmp0 / tmp1;
                                    auto tmp3 = static_cast<float>(1e-05);
                                    auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                                    auto tmp5 = 1 / std::sqrt(tmp4);
                                    auto tmp7 = at::vec::Vectorized<float>(tmp5);
                                    auto tmp8 = tmp7 * tmp6;
                                    auto tmp10 = decltype(tmp9)(-tmp9);
                                    auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                    auto tmp12 = tmp11 * tmp8;
                                    auto tmp14 = tmp12 + tmp13;
                                    tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                    tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                }
                            }
                        }
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    Welford<float> tmp_acc0_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_arr[i] = Welford<float>();
                    }
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        masked_tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    #pragma omp parallel num_threads(56)
                    {
                        int tid = omp_get_thread_num();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(1008L));
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        Welford<float> tmp_acc0_local = Welford<float>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        #pragma omp for
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec_local = welford_combine(tmp_acc0_vec_local, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
                        tmp_acc0_arr[tid] = tmp_acc0_local;
                        masked_tmp_acc0_vec_arr[tid] = masked_tmp_acc0_vec_local;
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp_acc0_vec_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, masked_tmp_acc0_vec_arr[tid]);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp1 = static_cast<float>(903168.0);
                            auto tmp2 = tmp0 / tmp1;
                            auto tmp3 = static_cast<float>(1e-05);
                            auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                            auto tmp5 = 1 / std::sqrt(tmp4);
                            auto tmp7 = at::vec::Vectorized<float>(tmp5);
                            auto tmp8 = tmp7 * tmp6;
                            auto tmp10 = decltype(tmp9)(-tmp9);
                            auto tmp11 = at::vec::Vectorized<float>(tmp10);
                            auto tmp12 = tmp11 * tmp8;
                            auto tmp14 = tmp12 + tmp13;
                            tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                            tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                        }
                    }
                }
            }
        }
    }
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144020
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 17:11:50 +00:00
cyy
8bf3920279 Remove unneeded Clang-tidy suppression (#148246)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148246
Approved by: https://github.com/Skylion007
2025-03-01 16:51:54 +00:00
1c1bf410ec introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 14:57:06 +00:00
3a0c9f7f9d [MPS] Fix SDPA crash (#148239)
If the operation is invoked with a mask twice, it will crash, as the mask-expansion logic was implemented inside the cache-creation block, which is executed only once for all shapes.

Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239
Approved by: https://github.com/dcci
2025-03-01 13:06:51 +00:00
735d7b1af6 [EZ][BE] Increase tolerances for interpolate op (#148224)
Not sure why tolerances were set like that; this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation.
But if I were to guess, it's likely due to the inaccuracy of the bilinear op, which has since been replaced by a shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148154, #148187, #148211
2025-03-01 13:03:59 +00:00
762724f3d0 [Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178)
This PR generalizes the device-bias code introduced by #147038 and aligns the behavior between XPU and CUDA on the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178
Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #148155
2025-03-01 10:59:55 +00:00
ab78bf5c66 [Break XPU][Inductor UT] Avoid custom op registration conflicts in test_auto_functionalize.py. (#148155)
Fix #148148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148155
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-01 10:59:55 +00:00
2f1b8e0fe2 [DTensor][Test] Add a test to demonstrate current dtensor view behavior if redistribution happens (#148015)
This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015
Approved by: https://github.com/XilunWu
2025-03-01 10:24:40 +00:00
191c9bd013 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit b8efebe57d05a87be5b0f304218d2af7bb2bf6c6.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/davidberard98 due to looks like another lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2692042859))
2025-03-01 07:43:58 +00:00
fe3b9e3764 [Inductor] optimize the heuristics of outer loop fusion (#147523)
Summary:
Optimize the heuristics of outer loop fusion: When the range of the first inner loop is much larger than the range of all outer loops, do not fuse the outer loops and fallback to standard codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147523
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 06:50:04 +00:00
b8efebe57d [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node references the tensor from _before_ the reshape op, but references the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape it had before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-01 06:38:39 +00:00
fd16311e7f [inductor][subgraph] Plumbing to get ShapeAsConstantBuffer from subgraph to main graph output (#147559)
I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559
Approved by: https://github.com/eellison
2025-03-01 06:17:11 +00:00
c87097e74a [triton 3.3] Fix inductor/test_profiler.py test (#148230)
test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230
Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO
2025-03-01 04:27:49 +00:00
9377a32cd1 [Inductor][NFC] Remove unused functions from compile_tasks.py (#147564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147564
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-03-01 03:44:43 +00:00
baf1c8fcdc Revert "introduce dynamism library (#147981)"
This reverts commit 6eff6b28e4d09cbf632f79502a8e317bf5b53c34.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2691906065))
2025-03-01 03:43:01 +00:00
493cd97af5 add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134)
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
2025-03-01 02:59:48 +00:00
6eff6b28e4 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 02:49:16 +00:00
08434df1f2 [MPS] fix empty place holder error for smooth l1 loss (#148133)
Fixes #123171

And parametrizes the tests for it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-01 02:32:45 +00:00
02c5f21541 [Inductor] fix AOTInductorTestABICompatibleGpu.test_triton_kernel_weird_param_order with new Triton (#148011)
In this case, the parameters have already been filtered [here](201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`.

For example:
```python
(Pdb) triton_meta["signature"]
{'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'}
(Pdb) call_args
['arg0_1', 'arg0_1', '256L', 'buf0']
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011
Approved by: https://github.com/davidberard98
2025-03-01 01:21:20 +00:00
338ed67a1e [inductor] Implement max_pool2d_with_indices as a reduction for large window sizes (#147876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147876
Approved by: https://github.com/eellison
2025-03-01 01:07:01 +00:00
230a3b0f83 Add cuda 11.8 guard for cufile preload (#148184)
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8

Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
2025-03-01 01:01:04 +00:00
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
5d297f7a34 [MPS][BE] Combine two upsample_kernel_out_template into one (#148211)
- First, by no longer inverting sizes and strides, i.e. passing them as is but reading them in inverse order in the shader, as the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels, and the 3rd and 4th for spatial coordinates
- Pass `scales` as float2 even for the linear case

The above allows collapsing the two flavors of `upsample_kernel_out_template` into one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211
Approved by: https://github.com/dcci
ghstack dependencies: #148154, #148187
2025-03-01 00:39:26 +00:00
clr
83fb974b5d scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906)
It's not fully clear why these are not being created, but you can definitely reproduce this in code. `__name__` is fun, since there appears to be no way to explicitly set it on the pybind11 layer or c++ layer. I've set this in the python wrapper code (which works correctly). But let me know if people feel strongly and want us to go explicitly cast to python within the cpp functions and set it there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906
Approved by: https://github.com/jansel
ghstack dependencies: #147894
2025-02-28 23:25:47 +00:00
1ae7cc41ca Define __all__ for torch.utils.tensorboard (#147550)
Fixes the issue:

```python
import torch.utils.tensorboard
torch.utils.tensorboard.FileWriter  # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.RecordWriter  # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.SummaryWriter  # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard"
```

The [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html) documents these classes as part of the public API.
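
A minimal sketch of the fix (assuming the three names above are the intended public surface; the import paths mirror what the module already re-exports):

```python
# torch/utils/tensorboard/__init__.py (sketch): an explicit __all__ tells
# type checkers that the re-exported names are part of the public API.
from torch.utils.tensorboard.writer import FileWriter, SummaryWriter
from tensorboard.summary.writer.record_writer import RecordWriter

__all__ = ["FileWriter", "RecordWriter", "SummaryWriter"]
```
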
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550
Approved by: https://github.com/albanD
2025-02-28 23:06:11 +00:00
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
83ec7cdcd4 Fix recompile reason logging (#148200)
for the following test case

```
        @torch.compile(dynamic=False, backend=cnts)
        def fn(x, y, z):
            return x * y * z[0]

        fn(1, torch.randn(1), {0: torch.randn(1)})
        fn(2, torch.randn(2), {0: torch.randn(2)})
        fn(3, torch.randn(3), {0: torch.randn(3)})
        fn(4, torch.randn(4), {0: torch.randn(4)})
        fn(5, torch.randn(5), {0: torch.randn(5)})
```

previously we would log

```
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
```

but after this change we now log

```
0/0: L['x'] == 1
0/1: L['x'] == 2
0/2: L['x'] == 3
0/3: L['x'] == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148200
Approved by: https://github.com/xmfan
2025-02-28 22:33:37 +00:00
40b3e4a358 [dynamo] expose code execution strategy to python (#148020)
@anijain2305 this can be used to mark a code object to be skipped/run-only (recursively) while tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148020
Approved by: https://github.com/jansel
2025-02-28 21:59:12 +00:00
e74fdbe6d0 [inductor] ignore block ptr advancements for removed buffers (#148087)
Follow up to https://github.com/pytorch/pytorch/pull/147193. Some buffers are removed only when the kernel context is exited so defer the lines instead.

Added `use_block_ptr` as a parameter to test case that fails if run with block ptrs enabled.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148087
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-28 21:31:15 +00:00
d174562487 [MPS][BE][EZ] Aggregate macros (#148187)
Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` to use a common `INSTANTIATE_UPSAMPLE2D`,
then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`.

I.e. functionally it's a no-op, but achieves the same with fewer lines of code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187
Approved by: https://github.com/Skylion007
ghstack dependencies: #148154
2025-02-28 21:30:00 +00:00
4995e058bf [user-triton] handle inline_asm_case (#148043)
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case where "is_pure" is set to True, since it indicates the operation doesn't mutate its inputs

Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel

```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok

----------------------------------------------------------------------
Ran 2 tests in 0.706s

OK
```

Differential Revision: D69878591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
2025-02-28 20:52:51 +00:00
6f91720e1c [inductor][ck] manual kBatch heuristic (#148118)
Summary:
# Why

Leverage kBatch parameter for large splitK examples for CK for better than ATEN performance

# What

replace default kBatch = 1 with a manual heuristic

- if K > 16 * max(M, N)
- leverage k_per_block, K, and the number of SMs on the chip
- upper bound of 128, lower bound of 1

This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN

This is of course subject to change and improvement
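
An illustrative sketch of the heuristic described above (the function and parameter names are made up for illustration, not the actual inductor/CK code):

```python
def choose_kbatch(M: int, N: int, K: int, k_per_block: int, num_sms: int) -> int:
    # Only consider splitting K for heavily K-dominated (splitK) shapes.
    if K <= 16 * max(M, N):
        return 1
    # Spread K across the chip in units of k_per_block per SM,
    # clamped to the stated bounds [1, 128].
    kbatch = K // (k_per_block * num_sms)
    return max(1, min(128, kbatch))
```
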

Test Plan:
with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288`

```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

```
AUTOTUNE mm(2048x524288, 524288x2048)
  rocm_ck_gemm_template_49 10.4972 ms 100.0%
  rocm_ck_gemm_template_8 10.6132 ms 98.9%
  rocm_ck_gemm_template_9 10.6907 ms 98.2%
[...]
  mm 18.9880 ms 55.3%
```

Reviewed By: ColinPeppler

Differential Revision: D70224591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118
Approved by: https://github.com/ColinPeppler
2025-02-28 20:36:16 +00:00
48c55a66ec [ROCm] Move ROCm unstable MI300 jobs back to stable (#146675)
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](https://github.com/pytorch/pytorch/pull/145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-28 20:34:27 +00:00
6778084531 [inductor][cutlass] Environment variables for allow/denylist (#148161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148161
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-28 20:33:10 +00:00
5a1954eb93 [Inductor-CPU] Fix broken int8 WoQ GEMM AMX implementation in main (#147895)
#146843 broke the int8 WoQ GEMM AMX ISA implementation (for BF16 activation) in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`

The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA.

`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.

@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
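
A sketch of how such a counter can be exercised in a test (the counter name comes from the description above; the model and input are placeholders, and a plain Linear may not actually hit the templated path):

```python
import torch
from torch._dynamo.utils import counters

model = torch.nn.Linear(64, 64)   # stand-in for a module hitting the templated GEMM
example_input = torch.randn(8, 64)

counters.clear()
compiled = torch.compile(model)
compiled(example_input)
# In the real UT, a zero here would mean the templated kernel failed to
# compile and autotuning silently fell back to the ATen op.
print(counters["inductor"]["cpp_templated_kernel_counter"])
```
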

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-02-28 20:20:45 +00:00
clr
e0e516c554 Don't crash when we call __qualname__ on torch._C.ScriptFunction (#147894)
We've root-caused this to correctly throwing AttributeError on ScriptFunction
when missing attributes are accessed. This PR fixes the crashes that are showing
up. I'm going to stack a second PR to fix torch._C.ScriptFunction being a
very badly behaved python object (which should also fix this).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894
Approved by: https://github.com/jansel
2025-02-28 20:15:38 +00:00
297c00264e stage 1 of deprecating silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-02-28 19:51:55 +00:00
ebc3f27bf4 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit 6e037ac41c095dfdb37fdd4b36bf8ec2ebf84bf1.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/wdvr due to lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2691421540))
2025-02-28 19:44:54 +00:00
42aeb5d259 Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners (#145504)
E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145504
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-02-28 19:16:28 +00:00
6e037ac41c [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape from before the reshape (a minimal sketch follows this list).
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
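
For illustration, a minimal fx-graph sketch of the short-term fix (this is not the actual pass; `graph`, `recip_node`, and `orig_shape` are assumed to come from the pattern matcher):

```python
import torch
import torch.fx as fx

def reshape_reciprocal_output(graph: fx.Graph, recip_node: fx.Node, orig_shape) -> fx.Node:
    # Insert a reshape right after the reciprocal so downstream users
    # (e.g. the fused scaled-mm node) see the original shape again.
    with graph.inserting_after(recip_node):
        reshaped = graph.call_function(
            torch.ops.aten.reshape.default, args=(recip_node, list(orig_shape))
        )
    # Point all consumers of the reciprocal at the reshaped value, then
    # restore the reshape's own input (replace_all_uses_with rewrote it too).
    recip_node.replace_all_uses_with(reshaped)
    reshaped.update_arg(0, recip_node)
    graph.lint()
    return reshaped
```
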

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-02-28 18:51:42 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, introducing the concept and seeing if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
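
If exposed like the CUDA caching allocator's stats, usage might look like the following (a sketch only; the accessor name is an assumption, not confirmed by this PR description):

```python
import torch

# Hypothetical accessor mirroring torch.cuda.memory_stats() for the
# pinned-host-memory allocator.
stats = torch.cuda.host_memory_stats()
for key, value in sorted(stats.items()):
    print(key, value)
```
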

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
982d7ba3ef [while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019)
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on CPU while others live on GPU, e.g. keeping a loop index on CPU saves device copies. This PR relaxes the constraint and only checks that the carry and the input at the same position share a device.
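
A sketch of the newly allowed mixed-device carry (illustrative values; exact eager/compile behavior is per the while_loop HOP):

```python
import torch
from torch._higher_order_ops.while_loop import while_loop

def cond_fn(i, x):
    return i < 5          # scalar bool tensor computed on CPU

def body_fn(i, x):
    return i + 1, x * 2.0

i0 = torch.zeros((), dtype=torch.int64)   # loop index stays on CPU
x0 = torch.ones(4, device="cuda")         # data stays on GPU
i_final, x_final = while_loop(cond_fn, body_fn, (i0, x0))
```
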

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-28 18:27:03 +00:00
2d2f60bdda [cond] support mismatched output in inductor (#147567)
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in the wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for others (e.g. while_loop) as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
2025-02-28 18:26:48 +00:00
d765077004 [cutlass backend] Sort the list of ops for better repro (#148047)
Differential Revision: [D70298051](https://our.internmc.facebook.com/intern/diff/D70298051/)

This only affects anything if `cutlass_max_profiling_configs` is used. I believe cutlass_max_profiling_configs is more of a testing config.

The problem is that when we get the configs from cutlass_library, the ops can come in different orders.

The motivation is to make reproducing small issues easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148047
Approved by: https://github.com/chenyang78, https://github.com/coconutruben
2025-02-28 18:04:10 +00:00
790ec756ee [cutlass backend] Check if len(timings) == len(choices) before skipping precompile (#148050)
Differential Revision: [D70298908](https://our.internmc.facebook.com/intern/diff/D70298908/)

Mostly from @coconutruben's observation. Right now, we skip precompilation if we find **some** timings. That sounds like a bug. Most of the time it is fine, since we don't change the number of configs and triton compilation doesn't take too long. But it is devastating for the cutlass backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148050
Approved by: https://github.com/coconutruben
2025-02-28 17:58:58 +00:00
e5e31050d3 [MPS] Implement linear1d as shader (#148154)
And get rid of MPS call, as for some reason implementation via MPSGraph
API call is 100x+ times slower that Metal shader, at least according to
the following benchmark
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    torch.mps.synchronize()  # must be called, not just referenced
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS        |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16  ")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results after the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    2.5  |   2.1  |   2.2  | 161.4  | 115.0  | 161.1
```
And before the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  354.0  | 336.0  | 332.4  | 145.5  | 114.7  | 148.3
```

Fixes https://github.com/pytorch/pytorch/issues/144245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154
Approved by: https://github.com/dcci
2025-02-28 16:47:42 +00:00
b5cd4ac950 [torchgen] Add support for schema with namespace (#148038)
Fixes https://github.com/pytorch/executorch/issues/8711

In ExecuTorch when we try to parse the following schema:

```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:

```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it as an
inplace operator.

I understand we are not supposed to pass in namespace "aten::" into
`FunctionSchema.parse()` but unfortunately ExecuTorch requires this
feature to work.

This PR adds a new `namespace` attribute to `BaseOperatorName` and makes
sure the rest of the stack works as before if a schema without
a namespace is passed in
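
With the change, the namespaced form parses cleanly; a sketch of what can now be asserted (the attribute layout follows torchgen's existing model, with the new `namespace` attribute living on `BaseOperatorName`):

```python
from torchgen.model import FunctionSchema

schema = FunctionSchema.parse(
    "aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor"
)
# schema.name is an OperatorName; its .name is the BaseOperatorName that
# now carries the namespace and is no longer misparsed as inplace.
assert schema.name.name.base == "__lshift__"
assert schema.name.overload_name == "Scalar"
assert not schema.name.name.inplace
```
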
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
2025-02-28 16:41:50 +00:00
e593288859 ci: Remove manylinux builds for triton, except for XPU (#148129)
We're dropping regular old manylinux so let's drop it here too

Relates to #123649

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148129
Approved by: https://github.com/Camyll, https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #148126
2025-02-28 16:23:18 +00:00
4708cfdbd9 Support whitelist of dynamic sources (#147979)
This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic.

NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (e.g. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979
Approved by: https://github.com/Mingming-Ding
2025-02-28 15:43:14 +00:00
0a948f705b [Dynamo] Fix AssertionError when dynamo traces torch.functional.xxx() functions (#148075)
Fixes #147840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075
Approved by: https://github.com/yanboliang
2025-02-28 15:09:11 +00:00
1db3c58fab Remove manylinux 2014 artifacts (#148135)
1. Switch Magma build to Manylinux 2.28 base
2. Use manylinux 2.28 as default in populate_binary_env.sh
3. Remove manylinux 2014 docker builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148135
Approved by: https://github.com/malfet
2025-02-28 13:43:14 +00:00
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
34d726011f [dynamo] update data-dependent branching graph break messages (#147912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147912
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147494, #147872
2025-02-28 12:30:06 +00:00
4106aa33eb [dtensor][fix] fix _scaled_dot_product_flash_attention sharding (#148125)
### Summary
https://github.com/pytorch/pytorch/pull/146372/ changed the op signature of `_scaled_dot_product_flash_attention` and as a consequence DTensor needs to change its sharding defined at 40ad5e01df/torch/distributed/tensor/_ops/_matrix_ops.py (L232)

### Test
`pytest test/distributed/tensor/test_attention.py`

### Follow-up
It's still unclear why the CP unit tests were not run over the original PR which is BC-breaking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148125
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-02-28 09:26:43 +00:00
af720cd5a7 [Intel GPU] Decouple Intel GPU oneDNN from other backends (#147926)
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so as to quickly adopt the optimizations and bug fixes provided by the oneDNN team.

This PR does not change the behaviors of other backends like Intel CPU, ARM. They can keep using the stable version contained in `third_party/ideep`.

# Detail

At compilation time, we will `git clone` oneDNN via the URL `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This feature is supported by the CMake `Externalproject_add` command.
Following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The source file of oneDNN is located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`

# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```

Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```

We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang

Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
2025-02-28 07:42:06 +00:00
3a58a04898 Build a storage reader/writer to write checkpoints in HF format (#148089)
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py, so that HFStorage components don't get imported internally and in turn no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere

Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D70324090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
2025-02-28 07:38:10 +00:00
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
4e160d5fd9 [triton 3.3] Fix aoti cpp wrapper remaining 5 issue. (following #148051) (#148117)
Summary:
Fix the following 5 on a100:

- test_foreach_cpp_wrapper_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_dynamic_shapes_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_dynamic_shapes_gpu_wrapper

Test Plan:
oss :

```
TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCH_LOGS="+inductor, output_code" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CPLUS_INCLUDE_PATH=/usr/local/cuda-12.6/include:$CPLUS_INCLUDE_PATH python test/inductor/test_gpu_cpp_wrapper.py -k test_foreach_cpp_wrapper_cuda_gpu_wrapper
```

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148117
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-02-28 06:56:30 +00:00
ea12fc8a9f Revert D70262395 (#148164)
Summary:

This reverts #147804 due to internal revert.

---
This diff reverts D70262395

Reviewed By: RossMcKenzie

Differential Revision: D70318024

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164
Approved by: https://github.com/xmfan
2025-02-28 06:39:48 +00:00
baba7beed2 [dynamo] add context manager debug information to graph breaks (#147872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147872
Approved by: https://github.com/zou3519
ghstack dependencies: #147494
2025-02-28 06:23:28 +00:00
4caeede799 [dynamo] more better error messages [3/N] (#147494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147494
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-28 06:23:28 +00:00
bc362cc15a Move expanded dim require_exact_stride handling to api from sdpa lowering (#148101)
See issue: https://github.com/pytorch/pytorch/issues/147156#issue-2852362217.

Original tests from https://github.com/pytorch/pytorch/pull/146054 should cover these changes, and I tested that the perf on https://github.com/pytorch/pytorch/issues/145760 remains fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148101
Approved by: https://github.com/zou3519
2025-02-28 06:02:18 +00:00
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
3b4b23ab0b [BE][Ez]: Remove extra copy in dtensor parallel loss (#148096)
Remove an extra copy of the input to `_log_softmax` when there is a dtype and memory format change. Fuse the copies instead.
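
Illustratively, the fused form looks like this (a sketch with placeholder shapes and dtypes):

```python
import torch

x = torch.randn(8, 16, dtype=torch.bfloat16)

# Before: two materializations (dtype cast, then layout fix).
y = x.to(torch.float32).contiguous()

# After: one fused copy handles both the dtype and the memory format.
y = x.to(dtype=torch.float32, memory_format=torch.contiguous_format)
```
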

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148096
Approved by: https://github.com/jansel, https://github.com/wconstab
2025-02-28 05:42:32 +00:00
9b7130b8db Clean temporary directory at exit (#147813)
Issue: A temporary directory is created in [pytorch/torch/distributed/nn/jit/instantiator.py](https://github.com/arthurlw/pytorch/blob/clean-temp-directory-at-exit/torch/distributed/nn/jit/instantiator.py) but is never cleaned up, leading to a ResourceWarning on program exit.

Solution: Registered an `atexit` handler to properly clean up the temporary directory when the program exits.

Fixes #147744

**Line 23 in [0a49f8f](0a49f8fd3d)**
```python
23  atexit.register(_TEMP_DIR.cleanup)
```
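
For context, the full pattern in a self-contained form (a sketch; the real module creates the directory at import time):

```python
import atexit
import tempfile

# Created once at module import; cleaned up automatically at interpreter
# exit instead of leaking and triggering a ResourceWarning.
_TEMP_DIR = tempfile.TemporaryDirectory()
atexit.register(_TEMP_DIR.cleanup)
```
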

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147813
Approved by: https://github.com/H-Huang
2025-02-28 04:12:23 +00:00
760921a7d8 [MPS] Add inductor support for the entr() operator. (#148128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148128
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-28 03:33:22 +00:00
eb9c127341 [dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824)
Earlier, with the inline flag, we were lifting id-guarded tensors to be inputs to the Fx graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.

This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has large number of optimizer parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-02-28 03:22:11 +00:00
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
edc5bf91d2 [Intel GPU] Add synchronize() in torch.utils.benchmark (#147835)
When following https://pytorch.org/tutorials/recipes/recipes/benchmark.html on XPU, I noticed that the device is not synchronized in the benchmark. This PR tries to fix this and align the behavior with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147835
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-02-28 02:58:17 +00:00
0edb2da4a4 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-28 02:30:04 +00:00
30375cb326 Fix minor typo in python_nccl (#148088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088
Approved by: https://github.com/Skylion007
2025-02-28 00:47:09 +00:00
481a57bc37 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from the graph position, etc. Not sure. Let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-28 00:47:03 +00:00
c6d1038aaa only print GraphModule during fx.Interpreter errors if valid (#148090)
Came up in https://www.internalfb.com/diff/D69057074?dst_version_fbid=970771615000938&transaction_fbid=1723357345264461 - we need to make sure the GraphModule is valid before calling `print_readable` on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148090
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #147749
2025-02-28 00:44:27 +00:00
5a14ff8ace Add cufile to list of libraries to preload (#148137)
Fixes: https://github.com/pytorch/pytorch/issues/148120

Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
2025-02-28 00:35:47 +00:00
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
2978771c9d [CI] test upload: better check for if job is rerun disabled tests (#148027)
Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing.  This makes the script think that there are fewer retries than there actually are and that the job is not a rerun disabled tests job.  Instead, query for the job name to see if it contains rerun disabled tests, and fall back to counting the number of retries if querying fails

Alternate options: relax the check for the number of tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027
Approved by: https://github.com/huydhn
2025-02-28 00:04:33 +00:00
fc78192b1d ci: Only run CI specific things when in CI (#148126)
This was blocking me from running this locally so don't run it like this

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148126
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/atalman
2025-02-27 23:27:57 +00:00
f4235310e8 [BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978)
`empty_like` includes a memory_format arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862
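
For illustration, the shape of the change (a sketch with a placeholder tensor and format):

```python
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

# empty_like can produce the desired memory format directly, avoiding a
# separate allocation followed by a layout-fixing copy.
out = torch.empty_like(x, memory_format=torch.channels_last)
```
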

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978
Approved by: https://github.com/jansel
2025-02-27 23:16:44 +00:00
915b9c80ab [export] Sync aoti schema to schema.py (#148017)
Summary: Synchronizing internal AOTI schema to OSS schema.py

Test Plan: CI

Differential Revision: D70271151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148017
Approved by: https://github.com/yiming0416
2025-02-27 21:46:11 +00:00
871b3909fc ci: Remove manylinux 2014 remnants (#148028)
These are the only remaining references I could find to manylinux2014,
we should probably look to remove these a bit quicker since it made it
difficult to know which Dockerfiles were important in
.ci/docker/manywheel/

> [!TIP]
> I checked if we were using these by running
> `rg 2014 .github/`

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148028
Approved by: https://github.com/wdvr, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:37:00 +00:00
10ffd94216 Reference the commit explicitly (#148026)
Reference the commit tested by CI explicitly, and fail the merge if the PR was updated.

Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148026
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:06:34 +00:00
783d83c5d8 [PT2] Port fuse_split_getitem_squeeze to PT2 pre_grad passes (#148059)
Summary: put it as an add_pass option

Reviewed By: frank-wei

Differential Revision: D68909559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148059
Approved by: https://github.com/frank-wei
2025-02-27 21:03:51 +00:00
d48eb58d1d [BE][CI] bump ruff to 0.9.8 (#145606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145606
Approved by: https://github.com/malfet
ghstack dependencies: #144546
2025-02-27 21:01:10 +00:00
644d84d594 Revert "optimize the decomposition of aten.native_group_norm (#144733)"
This reverts commit b533bb4b133c36767270bd8a24f11d5c37f8dd5c.

Reverted https://github.com/pytorch/pytorch/pull/144733 on behalf of https://github.com/desertfire due to Cause TIMM pass rate regression on H100, see https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2020%20Feb%202025%2020%3A53%3A55%20GMT&stopTime=Thu%2C%2027%20Feb%202025%2020%3A53%3A55%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=main&lCommit=4216478250e08e950fdd090fc23a1b270c520cc4&rBranch=main&rCommit=4986f0f52eb871cdb91b8124ee162cfe622b8688 ([comment](https://github.com/pytorch/pytorch/pull/144733#issuecomment-2689092714))
2025-02-27 20:57:25 +00:00
1845e7d1f5 Use nightly-wheel-upload env for triton wheel publishing (#148108)
Required for publishing triton builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148108
Approved by: https://github.com/malfet
2025-02-27 20:47:40 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
f0d00421cf [inductor][ck] kBatch filtering with gen_ops (#148004)
Summary:
# Why

not all choices of kBatch are valid; invalid ones lead to a runtime error (when CK checks the validity of the args)

c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)

# What

- move kBatch inside the gen_ops to have more control over it, and be able to filter it
- expand filtering based on the cpp logic
- refactor the padding checks to be more readable

Test Plan:
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

with

kBatch = 128: some filtering
kBatch = 1: no filtering
kBatch = 1738: all options filtered out

Reviewed By: henrylhtsang

Differential Revision: D70211442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004
Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent
2025-02-27 20:13:58 +00:00
ce805a5ba5 [BE/metal] Rename REGISTER_I0_I1 to REGISTER_SPECIAL. (#148036)
Now that it's used for other ops as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148036
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-27 17:56:26 +00:00
9a1f720a72 Validate inputs to _nested_view_from_buffer to prevent overflows (#147356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356
Approved by: https://github.com/albanD, https://github.com/jbschlosser
ghstack dependencies: #147352, #147354
2025-02-27 15:48:58 +00:00
536bce5a04 Make Tensor.set_ validate storage_offset when sizes/strides are unchanged (#147354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147354
Approved by: https://github.com/albanD
ghstack dependencies: #147352
2025-02-27 15:48:58 +00:00
e64441915f Fix overflow in checkInBoundsForStorage (#147352)
Use `computeStorageNbytes` (which checks for overflows) to include the computation re the storage_offset

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147352
Approved by: https://github.com/albanD
2025-02-27 15:48:50 +00:00
6ccbff1450 [Inductor] Fix inductor/test_kernel_benchmark.py for new Triton; do not duplicate parameters in _dump_launch_params (#147746)
The problem is that the new Triton uses the following code branch, which does not filter the call parameters, which may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)

Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499):

```bash
Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b
dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module>
    compiled_module_main('None', benchmark_compiled_module)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main
    wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module
    return print_performance(fn, times=times, repeat=repeat)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp>
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed
    result = model(*example_inputs)
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda>
    fn = lambda: call([arg0_1, arg1_1])
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call
    triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run
    bound_args, specialization, options = binder(*args, **kwargs)
TypeError: dynamic_func() got multiple values for argument 'XBLOCK'
```

Reproduce:
`python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps`

Triton: c4a79a1960
Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732

@davidberard98 @etaf please take a look
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746
Approved by: https://github.com/jansel
2025-02-27 14:40:22 +00:00
2c35af4def [Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969)
XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`.
f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)

 So, `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is useless for XPU. `XPU_MKLDNN_INCLUDE` is good enough. Meanwhile, it may mess up the included files if the version of XPU oneDNN differs from other backends.

* __->__ #147969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969
Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman
2025-02-27 14:22:17 +00:00
71ee17baa1 Smoke Test skip cuda.gds on windows (#148060)
Follow up after : https://github.com/pytorch/pytorch/pull/147120
Cufile was enabled only on Linux: https://pypi.org/project/nvidia-cufile-cu12/#files
Fixes validation workflow failures: https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37896578837

```
 File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 105, in __init__
    raise RuntimeError("GdsFile is not supported on this platform.")
RuntimeError: GdsFile is not supported on this platform.
Exception ignored in: <function GdsFile.__del__ at 0x000001772B5003A0>
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 113, in __del__
    if self.handle is not None:
AttributeError: 'GdsFile' object has no attribute 'handle'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148060
Approved by: https://github.com/mikaylagawarecki
2025-02-27 14:00:49 +00:00
7ae0e0b2ea [aotd] Log torch._functorch.config in tlparse (#147883)
Adding torch._functorch.config to tlparse for better debuggability.
E.g. https://github.com/pytorch/pytorch/pull/147638 happened only with `torch._functorch.config.view_replay_for_aliased_outputs=False`, which is True by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147883
Approved by: https://github.com/bdhirsh, https://github.com/jamesjwu
2025-02-27 11:22:45 +00:00
c5bf9aaf1c Log graph breaks (#146537)
Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537
Approved by: https://github.com/c00w
2025-02-27 11:06:33 +00:00
0489a349e7 Skip the logging if the pass cannot be pickled (#148053)
Summary:
Skip the logging for vllm at this moment; we can add some pickle logic later.

The log is only for debugging purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148053
Approved by: https://github.com/chenyang78
2025-02-27 10:54:34 +00:00
26f19539ad [triton 3.3] cpp_wrapper: add a global_scratch arg (#148051)
Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
2025-02-27 10:13:57 +00:00
91e7c7945c [Intel GPU] Avoid unnecessary copy when the dst of Matmul is non-contiguous (#144759)
We should not always call contiguous on the dst of matmul; we already removed the copy of the matmul input in https://github.com/pytorch/pytorch/pull/143784

I also fixed an accuracy issue by using the oneDNN sum post-op instead of binary add in the in-place case, to avoid a UT failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144759
Approved by: https://github.com/EikanWang
2025-02-27 08:04:34 +00:00
8ee84aa703 [ONNX] Fix missed None type support in dynamic shapes string cases (#148025)
In `_any_str_or_dim_in_dynamic_shapes`, we strictly guard the `dynamic_shapes` to make sure the flattened shapes are valid, but the code failed to consider that None could appear in the shapes.

NOTE: Found in benchmarking with Olive.
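
Below is a hedged sketch of a spec that mixes a string-named dim with a `None` entry; the toy module and shapes are illustrative assumptions, not the PR's actual test case:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x, y):
        return x * 2, y + 1

# "batch" names a dynamic dim for x; None marks y as fully static.
# The None entry is what the guard previously failed to account for.
onnx_program = torch.onnx.export(
    M(),
    (torch.randn(2, 3), torch.randn(2, 3)),
    dynamic_shapes={"x": {0: "batch"}, "y": None},
    dynamo=True,
)
```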

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148025
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-02-27 07:57:47 +00:00
fd43c36aa9 [ca] side-effect free initial trace: RAII PyCompilerInterface (#147891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147891
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796, #147804
2025-02-27 07:17:30 +00:00
9017becf1d Add unique kernel name support for user defined triton kernel (#147587)
Summary:
Add unique_user_kernel_names, which mimics what unique_kernel_names does, but for user-defined Triton kernels.
This rewrites the copied kernel src and modifies non-Inductor-generated code, so we split it out from unique_kernel_names, where we have more control over all naming and generation.

Test Plan: Only used for debugging purposes

Differential Revision: D69966608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587
Approved by: https://github.com/desertfire
2025-02-27 06:00:50 +00:00
c622796cde Revert "Build a storage reader/writer to write checkpoints in HF format (#147622)"
This reverts commit 6a658d983e84f7bcb8e67328b00661ec49db78c5.

Reverted https://github.com/pytorch/pytorch/pull/147622 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147622#issuecomment-2686932514))
2025-02-27 05:14:28 +00:00
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), which includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
b6fe28ff02 [Inductor] Graph Partition (#147038)
This PR implements inductor graph partition. Previously, 1 dynamo graph was mapped to 1 inductor graph, which was further mapped to 1 call function. In this PR, we allow 1 dynamo graph to map to multiple inductor graphs and multiple `graph_partition` functions in the generated code. This allows applying different further optimizations to each `graph_partition`.

Design Doc: [link](https://docs.google.com/document/d/1qPgOfy25l7SIYnrQrvU-TO1mdHMslCwv_SLmeXID6tM/edit?usp=sharing)
Example: [Generated code before and after this diff](https://www.internalfb.com/intern/diffing/?paste_number=1737334601)

In the follow-up PR, we will extend the work to cudagraph, which allows applying cudagraph to parts of the generated code (#125864).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147038
Approved by: https://github.com/eellison
2025-02-27 04:50:43 +00:00
e0b93082f1 Remove HuggingFace reader and writer from __init__.py (#148030)
Summary: This causes HFStorageReader/Writer to be imported, which imports fsspec; dependent builds don't have fsspec, so they fail.

Differential Revision: D70286926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
2025-02-27 04:50:14 +00:00
8cb8722979 [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and the associated block ptr that would be created for `op0` in isolation are no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-27 03:37:33 +00:00
17358ce778 Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878)"
This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f.

Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))
2025-02-27 03:36:16 +00:00
9d3636283b [Inductor] Use generic GPU device in test_preserves_strides (#148006)
#147861 added a new test tagged for the generic GPU but used the cuda GPU type when creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install; presumably it will work for the current pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006
Approved by: https://github.com/eellison, https://github.com/etaf
2025-02-27 02:52:51 +00:00
07b7b3ed4e torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.
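
A rough usage sketch of the new dispatch; the per-32-element block-scale shapes and the `torch.ones` e8m0 initialization are assumptions based on the MXFP8 format, not details confirmed by the PR:

```python
import torch

M, K, N = 128, 128, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# mat2 must be column-major for torch._scaled_mm
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# one e8m0 scale per 32-element block along K (layout is an assumption)
scale_a = torch.ones(M, K // 32, device="cuda", dtype=torch.float8_e8m0fnu)
scale_b = torch.ones(N, K // 32, device="cuda", dtype=torch.float8_e8m0fnu)

# e8m0 scales dispatch to the blockwise cuBLAS kernel (CUDA capability >= 10.0)
out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
```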

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-27 02:44:39 +00:00
84c89a4527 [cutlass backend] cache_clear algorithm select cache on fresh inductor cache (#147590)
Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/)

AlgorithmSelectorCache is a cache. The expectation is that when we force-disable caches and clear inductor caches, it would be cleared. However, that is not the case.

The reason why this is a problem can be seen by following this repro:
What we will see is
```
SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices
SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices
```

The root cause is that while precompiling is skipped (because it is cached), autotuning isn't skipped, since we force-disable caches.

repro:
```
import logging
import os

os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor"

import torch

import torch._inductor.config
from torch._inductor.utils import clear_inductor_caches

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.autotune_num_choices_displayed = None
torch._inductor.config.max_autotune_gemm_backends = "CUTLASS"
torch._inductor.config.autotune_fallback_to_aten = False
torch._inductor.config.cuda.cutlass_instantiation_level = "0001"

def main():
    M, N, K = 2048, 2048, 2048
    dtype = torch.bfloat16
    A = torch.randn(M, K, device="cuda", dtype=dtype)
    B = torch.randn(K, N, device="cuda", dtype=dtype)
    for _ in range(2):
        torch._dynamo.reset()
        clear_inductor_caches()
        compiled_model = torch.compile(torch.mm, fullgraph=True)
        _ = compiled_model(A, B)

    print("done")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590
Approved by: https://github.com/eellison, https://github.com/chenyang78
2025-02-27 02:30:49 +00:00
97ebccaa91 Add _fft_r2c as core ATen (#147998)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147998
Approved by: https://github.com/tugsbayasgalan
2025-02-27 02:29:59 +00:00
ad0c879e22 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x] Add handling for when forward is invoked multiple times without invoking backward, in which case the fwd/backward states get out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states, one for the fwd and one for the backward. In both forward and backward, the rng op is run with `graphsafe_run_with_rng_state`, which takes in an RNG state and hooks it onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will see the same rng states for the op as were observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked, the bwd rng states would not equal the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graphs, but I'd prefer not to have an rng divergence just for cudagraphs. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position, etc. Not sure; let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-27 02:08:29 +00:00
784902983e Remove +PTX from cuda 12.6 builds (#148000)
Similar to: https://github.com/pytorch/pytorch/pull/141142

Ahead of the release 2.7
I see following validation failure: https://github.com/pytorch/test-infra/actions/runs/13552433445/job/37879041739?pr=6339
```
RuntimeError: Binary size of torch-2.7.0.dev20250226+cu126-cp310-cp310-manylinux_2_28_x86_64.whl 1076.45 MB exceeds the threshold 750 MB
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148000
Approved by: https://github.com/clee2000, https://github.com/ngimel, https://github.com/tinglvv
2025-02-27 02:02:11 +00:00
20ce67cd06 Update hw requirement for FP64 on "Getting Started on Intel GPU" (#147802)
Fixes #147731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147802
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:54:19 +00:00
9ca871f32b Remove binaries/benchmark_args.h (#147920)
It's not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147920
Approved by: https://github.com/Skylion007
2025-02-27 01:16:28 +00:00
ea5d40db73 Address source code building command for Intel GPU support (#143476)
As the title says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143476
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Xu Han <xu.han@outlook.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:07:40 +00:00
f104ef1248 [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: D70141808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-02-27 00:35:12 +00:00
f98cd84b04 cpp_wrapper: use largeTensorTest for test memory checks (#146991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991
Approved by: https://github.com/desertfire
2025-02-27 00:30:21 +00:00
723f3a9eab torch.utils._content_store: fix error in hash_storage on XPU (#147785)
See https://github.com/pytorch/pytorch/actions/runs/13508573465/job/37745227468 for an example error. This is triggering after the merge of #147541, which enabled Dynamo compilation on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147785
Approved by: https://github.com/jansel
2025-02-26 23:57:59 +00:00
915eb012e1 Revert "[dynamo] add sourceless builder for types.MethodType (#147880)"
This reverts commit 08f4c1a2332921e57c782c80a66b2adc9cdc0575.

Reverted https://github.com/pytorch/pytorch/pull/147880 on behalf of https://github.com/wdvr due to failing trunk tests ([comment](https://github.com/pytorch/pytorch/pull/147880#issuecomment-2686436432))
2025-02-26 23:29:58 +00:00
84e60eece8 [ROCm] [TunableOp] Unit tests for scaled GEMM and GEMM with bias (#147890)
Two more unit tests for TunableOp:
- Scaled GEMM
- GEMM with bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147890
Approved by: https://github.com/jeffdaily
2025-02-26 22:41:24 +00:00
b13ad1a193 [ROCm][TunableOp] Remove extra transpose characters in hipBLASLt signature. (#147900)
Clean up extra transpose characters in the TunableOp hipBLASLt signature.

Tested manually; no new regressions found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147900
Approved by: https://github.com/jeffdaily
2025-02-26 22:28:00 +00:00
7e7d05bf85 Revert "[do not merge yet] update grammar (#147996)"
This reverts commit 6e129a697f86425d0682ed30ffc9b3f8abe00e9e.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686291282))
2025-02-26 22:01:12 +00:00
6e129a697f [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:52:58 +00:00
dc7556f1bd Revert "[do not merge yet] update grammar (#147996)"
This reverts commit a1ee2c3a08c3bf3d83c4e9f352ea179c107edb13.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686266052))
2025-02-26 21:43:06 +00:00
a1ee2c3a08 [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:39:08 +00:00
201666d77d [cutlass backend] turn autotuning logs off by default + rename log to autotuning log (#147922)
things we did:
* turn off autotuning logs by default
* rename autotuning logs from log to autotuning_log, so people are aware that it is a special artifact log.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147922
Approved by: https://github.com/eellison
2025-02-26 21:02:04 +00:00
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
Per title.

Sometimes it's hard for cmake to find NVTX3 without the cuda include path hint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
6a658d983e Build a storage reader/writer to write checkpoints in HF format (#147622)
Title - we want to write checkpoints in HF format with DCP; this diff allows this for the non-distributed use case.
Copy of [D68444967](https://www.internalfb.com/diff/D68444967) (https://github.com/pytorch/pytorch/pull/146352). That diff got reverted because of lint errors. The lint error was due to imports of uninstalled libraries; this was on purpose, because we don't want to install safetensors and huggingface. This new diff explicitly ignores that lint so that we don't hit the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147622
Approved by: https://github.com/saumishr
2025-02-26 20:47:54 +00:00
7c71ab1d40 [scan] User-facing reverse flag handling (#147886)
This PR removes the reverse flag from the backend implementation and resolves it via `torch.flip` in the frontend.
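
A minimal runnable sketch of that frontend resolution, with a toy forward-only scan standing in for the real backend (names here are illustrative, not the actual internals):

```python
import torch

def backend_scan(combine_fn, init, xs):
    # stand-in for the real backend: always scans forward over dim 0
    carry, ys = init, []
    for x in xs:
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, torch.stack(ys)

def scan_frontend(combine_fn, init, xs, reverse=False):
    # reverse=True is resolved here via torch.flip; the backend never sees it
    if reverse:
        xs = torch.flip(xs, dims=[0])
    carry, ys = backend_scan(combine_fn, init, xs)
    if reverse:
        ys = torch.flip(ys, dims=[0])
    return carry, ys

# reverse cumulative sum, computed by a forward-only scan
carry, ys = scan_frontend(lambda c, x: (c + x, c + x), torch.zeros(3), torch.ones(4, 3), reverse=True)
```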

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147886
Approved by: https://github.com/ydwu4
2025-02-26 20:04:57 +00:00
683e083e8d [MPS] Add support for entr() in eager. (#147948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948
Approved by: https://github.com/malfet
2025-02-26 19:55:02 +00:00
eb08ada5d3 [dynamo] Support reads to global/captured tensors in nonstrict_trace-ed function (#147572)
As title. Without this patch we get the following error:

Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite
work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.

So we can't easily, temporarily enable the `allow_non_fake_inputs`
flag on the current fake mode when AOTAutograd processes a `flat_apply`
node from Dynamo's `nonstrict_trace` handling. After discussing
with zou3519, I decided to add a global `FakeTensorTLS` that contains an
`allow_non_fake_inputs_override` flag, and to patch the `nonstrict_trace`-ed
function to temporarily tweak this flag during its execution.
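
A minimal sketch of the TLS-override idea, using hypothetical names (the real flag lives in the fake tensor implementation; this is not its actual code):

```python
import contextlib
import threading

class _FakeTensorTLS(threading.local):
    # None means "defer to the fake mode's own allow_non_fake_inputs flag"
    allow_non_fake_inputs_override = None

_tls = _FakeTensorTLS()

@contextlib.contextmanager
def _temporarily_allow_non_fake_inputs():
    # wraps the nonstrict_trace-ed call so fake-tensor propagation
    # tolerates real (non-fake) tensor inputs just for that call
    prev = _tls.allow_non_fake_inputs_override
    _tls.allow_non_fake_inputs_override = True
    try:
        yield
    finally:
        _tls.allow_non_fake_inputs_override = prev
```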

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
2025-02-26 19:47:39 +00:00
73e963459e [dynamo] Support nonstrict_trace on class method (#147571)
As title, also see
1. new test `test_nonstrict_trace_on_method` for example.
2. newly added comments for why we need special treatment on methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147571
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950
2025-02-26 19:47:39 +00:00
7e0ef2c844 [dynamo] Use the new get_unique_name_wrt helper when applicable (#146950)
This patch removes some duplicated name generation logic in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146950
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367
2025-02-26 19:47:39 +00:00
f46f0e465c [dynamo] Initial support for nonstrict_trace (#146367)
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.

1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)

## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.

The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards).  This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
   input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
   buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
   `torch._higher_order_ops.flat_apply` (which unflattens the inputs and
   invokes the underlying function).
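
As a quick illustration of the UI (a toy sketch; this simple tensor-only case wouldn't actually need the pytree registration described in step 1):

```python
import torch

@torch._dynamo.nonstrict_trace
def renorm(x, scale):
    # traced non-strictly: Dynamo places a flat_apply(...) node in the
    # graph instead of tracing through this body
    return x / scale

@torch.compile(fullgraph=True)
def f(x):
    return renorm(x.sin(), 2.0) + 1

print(f(torch.randn(4)))
```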

## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contains `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
2025-02-26 19:47:39 +00:00
bab84f0bd9 [hop] Support more output types for flat_apply (#146714)
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.

The patch also exposes a `func_to_graphable` rather than having the
users calling the lower level `pytree.flatten(ConstantFunction(...))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
2025-02-26 19:47:39 +00:00
8594856651 [aotd] Alias of intermediate unwrap TensorAlias (#147638)
Bug was reported by internal user.

AOTD classifies outputs that are aliases of graph intermediates into different categories.

...
- output is an alias of an intermediate whose base is already an output
- output is an alias of an intermediate whose base is not an output

If we look at the fn:
```
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```

output 0: a detach view of alias `a`, where `a` is already an output
output 1: an alias of intermediate `ix`; an additional output `ix` will be added internally

output 0's base is `TensorAlias(a)` in this case, but it could be a plain Tensor.
Adding runtime unwrapping solves this problem.

Alternatively, we could track the base of `a.detach()` all the way to `ix`; in that case the base would always be a Tensor, not a TensorAlias.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
2025-02-26 19:42:21 +00:00
30db64bf51 [PT2] Support add/remove passes in pre_grad (#146064)
Summary:
Support the same functionality with acc_tracer disabled: add a new config for pre_grad add/remove passes; at the front end it still uses the same interface.

Some minor updates in pre_grad passes to make sure the passes run in the desired order; after adding passes, we still run passes like remove_noops at the end.

Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)

Differential Revision: D68909278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
2025-02-26 18:46:43 +00:00
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow.

- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, used when both input and mask are dense tensors of the same size; `broadcast`, used when `mask`'s leading dimensions are expandable into the input tensor; and `strided`, a general-purpose fallback that still computes positions in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.

Performance measured on an M2 Pro through different iterations of the same shader

| dtype    | MPS       | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
|----------|-----------|-----------|---------------|----------------|--------------------|
| float32  | 2.8 msec  | 41.6 msec | 26.9 msec     | 5 msec         | 2.4 msec           |
| float16  | 1.86 msec | 38.2 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |

And benchmark script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00
ebf6b9839c [MPS] faster integer batched matmul (#147877)
Followup to #147526
Tiled matmul for bmm as well.

## Speed ups:
![speedups_bmm](https://github.com/user-attachments/assets/02501145-7d64-4bbe-9dcc-994f004b4829)

Script to record times:
```python
import torch
import numpy as np
import time
import csv

batch_sizes = [1, 2, 4, 8]
matrix_sizes = [256, 512, 1024, 2048]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'B': [],
    'mean_time': [],
    'std_time': []
}

for b in batch_sizes:
    for n in matrix_sizes:
        print(f"\nBenchmarking N={n} and B={b}")

        try:
            A_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")
            B_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")

            for _ in range(warmup_runs):
                _, _ = run_int_mm(A_mps, B_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_int_mm(A_mps, B_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['B'].append(b)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}: {e}")
            continue

with open('int_bmm_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['B'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147877
Approved by: https://github.com/Skylion007
2025-02-26 18:37:13 +00:00
cfb293ee02 [inductor] Add logs for precompile and autotuning (#147923)
Differential Revision: D70222645

I want to add more logs around precompile, especially around why it sometimes returns early. See https://github.com/pytorch/pytorch/pull/147590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147923
Approved by: https://github.com/Skylion007
2025-02-26 18:26:07 +00:00
0ea5d1067b ROCm: Remove static specifier for allow_tf32 variable. (#147186)
Since the env variable HIPBLASLT_ALLOW_TF32 can change at runtime, remove the static specifier from the allow_tf32 variable so that it captures the current value of HIPBLASLT_ALLOW_TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186
Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd
2025-02-26 18:24:02 +00:00
4e4191854b [logs][qol] Print log options alphabetically (#147888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147888
Approved by: https://github.com/jansel
2025-02-26 18:15:39 +00:00
fb566c5aea Fix auto_functionalize x inference_mode (#147925)
Fixes #147924

We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
2025-02-26 18:05:30 +00:00
678435c443 [FlexAttention] Fix IMA bug (#147918)
# Summary
Fixes: https://github.com/pytorch/pytorch/issues/147268

I got this right for the backward and somehow forgot to do the flip in the forward; not sure how this wasn't found earlier.

Testing IMAs is tough in pytest, so no test was added, but the fix was verified on the reproducer:

```py
❯ sanitize python flex/maurice_ima.py --setting 0
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(1.0078, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.7994, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 1
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(2.8297, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.9714, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 2
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(3.2232, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(2.2095, device='cuda:0')
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147918
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-02-26 17:59:05 +00:00
3f7e242c86 [CI] Checkout with more processes (#147652)
The default action doesn't use more processes, possibly because most GitHub-provided runners only have 2 CPUs, but we have more than that, so we might as well use them.

Generally cuts maybe 1 min off of checkout time?

Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
2025-02-26 17:51:28 +00:00
ef61c290e1 [DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)
Resolves https://github.com/pytorch/pytorch/issues/146767.

May also resolve https://github.com/pytorch/pytorch/issues/147584.

### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG, which currently requires a CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument in its constructor.

### Consequence

DTensor RNG initialization is delayed until the first DTensor random op call or a `torch.distributed.tensor.random.manual_seed` call.
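
A hedged sketch of the new behavior, assuming a 2-rank job with an initialized process group (import paths follow the public `torch.distributed.tensor` namespace):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard
from torch.distributed.tensor.random import manual_seed

mesh = init_device_mesh("cuda", (2,))
# no RNG tracker init here anymore (and no communication for src_data_rank=None)
x = distribute_tensor(torch.empty(8, 8), mesh, [Shard(0)])

manual_seed(1234, mesh)   # RNG tracker is initialized lazily, here...
y = torch.rand_like(x)    # ...or at the first DTensor random op
```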

### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
2025-02-26 17:33:22 +00:00
5ef94ca816 [BE] Do not copy arguments in variadic template (#147977)
By adding the missing `std::forward<Args>(args)...` and declaring the template as passing args by reference.

Noticed while working on creating an `mtl_setBytes` specialization that takes `MPSScalar` as an argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147977
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-26 17:20:16 +00:00
ba9ed856e0 [FlexAttention] Improve error msg for embedding < 16 (#147765)
flex_attention uses tl.dot, which [does not support embedding < 16](https://github.com/triton-lang/triton/issues/2266) on input shapes. This PR adds an explicit error message for users who are prototyping with small tensors.
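
A minimal sketch of the shape class that now fails fast (running this is expected to raise the new descriptive error rather than a Triton failure; exact wording not shown here):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# head_dim = 8 < 16, which tl.dot cannot handle
q = k = v = torch.randn(1, 1, 128, 8, device="cuda")
flex_attention(q, k, v)  # now raises an explicit error up front
```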

Fixes #147701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147765
Approved by: https://github.com/drisspg
2025-02-26 17:06:35 +00:00
ac926f81cc [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-26 16:56:17 +00:00
fd1220e386 [ca] side-effect free inital trace: compiled_args (#147804)
const methods to prevent accidental mutation. changes mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-26 16:37:27 +00:00
5e3069dde8 [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-26 16:37:27 +00:00
0a2da008f8 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order wrt other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.

All tests pass when running the CA graph directly, the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-26 16:37:17 +00:00
08f4c1a233 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-26 15:43:47 +00:00
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
be830c8b1c [Inductor][CPP] fix store mode atomic add (#147961)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147848 and https://github.com/pytorch/pytorch/issues/146390. While addressing these issues, 2 problems were encountered:

- In `CppVecKernel`, when the number of threads is 1 and the mode is `atomic_add`, `store` did not `load/add` before storing. This has been fixed in this PR.

- In `CppTile2DKernel`, `store` did not support `atomic_add` mode. Support for this has been added in this PR.
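
For context, a minimal sketch of the kind of op that exercises the atomic-add store path (inferred from the `test_nn_fold` test name; shapes are illustrative):

```python
import torch

# nn.Fold overlap-adds patches, which lowers to atomic_add stores
# in the CPP backend; compare eager vs compiled results
fold = torch.nn.Fold(output_size=(4, 4), kernel_size=(2, 2))
x = torch.randn(1, 4, 9)
compiled_fold = torch.compile(fold)
torch.testing.assert_close(compiled_fold(x), fold(x))
```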

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_nn_fold
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147961
Approved by: https://github.com/malfet
2025-02-26 14:04:34 +00:00
f522d899fb Add MSVC version condition to "Fix for MSVC problem on Windows Arm64 (#136765)" (#145076)
This PR adds MSVC version guards around the if block presented on f7e36d8d6f9706ee9b9653538c4c8d2ba375a181. This commit was to provide a workaround for the problem reported here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 .
The issue is fixed now and only appears between versions 19.36 and 19.42.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145076
Approved by: https://github.com/malfet, https://github.com/alinpahontu2912

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-02-26 12:08:24 +00:00
60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974 which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists in launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
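
A hedged usage sketch; the private setter name below is my reading of the experimental hook this PR introduces, so treat it as an assumption rather than a documented API:

```python
import torch

# hypothetical/private: leave 8 SMs free for a concurrent (e.g. NCCL) kernel
torch._C._set_sm_carveout_experimental(8)

a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
c = a @ b  # the persistent cuBLAS kernel now targets (num_SMs - 8) blocks
```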

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
7ffae2c028 Split test_transformers.py (#147441)
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
2025-02-26 11:54:24 +00:00
cf6d1e6824 [dynamo] add generic graph break hints (#147429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147429
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147385
2025-02-26 09:20:28 +00:00
3fd68e4e2f [dynamo] make some more graph break messages readable in English [2/N] (#147385)
This is for the goal "for some large number Z, make sure the error messages are readable English": beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
2025-02-26 09:20:28 +00:00
7a06bfdd1c [inductor][ck] kBatch parametrized (#147885)
Summary:
# Why

Enables us to set the kBatch parameter rather than baking it in.

Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests)

## Why like this

The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code.

The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, as now multiple choices of kBatch can be used with the exact same `.so` (the shared library does not depend on kBatch, but takes it as a parameter).

# What

- copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent
- pipe through everything from template and kernel
- hard-code it to be kBatch=1 for now (same as before, just now settable)

This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch that don't work
2. how to set this better for splitK scenarios (hand-written heuristic)

Test Plan:
(with minor modifications)

```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```

```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

Differential Revision: D70200008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler
2025-02-26 07:28:19 +00:00
a84db75e1b Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit 12b9674cb603438639298d6c9757ea93e18a7289.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - similar to previous, see below ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2684134336))
2025-02-26 07:17:24 +00:00
4216478250 Fix the benchmark config name from H100 benchmark (#147947)
When using the wrong benchmark configs, the benchmark jobs will be skipped.  The name should have the `_cuda_h100` suffix as used in the test matrix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147947
Approved by: https://github.com/wdvr
2025-02-26 06:40:07 +00:00
4ec6c1d1ec Fix test_halide.py report invocation to re-run failed tests (#147640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147640
Approved by: https://github.com/jansel
2025-02-26 06:32:22 +00:00
acca9b9cb0 Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)"
This reverts commit 0b9da1ae0ad30ef228f132354b875bcaec214ace.

Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))
2025-02-26 05:32:17 +00:00
12b9674cb6 torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-26 05:21:26 +00:00
9ed40af917 [BE][EZ] Delete MacOS-12.3 xfail list (#147905)
As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacoS13 checks from test script
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905
Approved by: https://github.com/dcci
ghstack dependencies: #147892
2025-02-26 05:08:09 +00:00
a2399c9b44 [BE] Switch index_variable to torch.testing.make_tensor (#147892)
As it was a long-time TODO, and it actually unblocks using this function for MPS devices (which do not support double).
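
For reference, a quick sketch of the replacement helper (arguments shown are illustrative):

```python
import torch

# make_tensor builds a tensor with values bounded by low/high in the
# requested dtype/device, without forcing float64 like the old helper
t = torch.testing.make_tensor(3, 4, dtype=torch.float32, device="cpu", low=-1, high=1)
```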
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147892
Approved by: https://github.com/dcci
2025-02-26 05:08:09 +00:00
c839fa4dd2 [Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861)
Resubmit of https://github.com/pytorch/pytorch/pull/145448. it lost its changes on rebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861
Approved by: https://github.com/zou3519
2025-02-26 05:05:06 +00:00
ba25e26baa [ROCm] Use IPT=8 for block radix sort (#147657)
Improve performance for shapes that use block radix sort by decreasing the items-per-thread (IPT) to 8.
This will increase the thread block size, leading to higher occupancy.

Co-author: @amd-sushetty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657
Approved by: https://github.com/jeffdaily
2025-02-26 04:22:16 +00:00
f211818bc0 [c10d] Restrict use condition of NCCL mem pool (#147764)
Add a check to see if the CUDA driver supports multicast, as is done in Symmetric Memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764
Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang
2025-02-26 03:40:00 +00:00
d3fc583ff0 [cutlass backend] force_disable_caches for test_number_mm_precompiles (#147901)
Summary: Test is flaky right now.

Differential Revision: D70209511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147901
Approved by: https://github.com/ColinPeppler
2025-02-26 03:22:49 +00:00
9ad0ad6497 [MPS] Introduce a shader for entr(). (#147914)
To be used in eager/inductor in order to implement the missing operation.
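
For context, the mathematical definition the shader implements, matching `torch.special.entr`:

```python
import torch

# entr(x) = -x * log(x) for x > 0, 0 at x == 0, and -inf for x < 0
x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
print(torch.special.entr(x))  # tensor([  -inf, 0.0000, 0.3466, 0.0000])
```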

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147914
Approved by: https://github.com/malfet
2025-02-26 02:54:44 +00:00
805f7d97f7 [Inductor][Optimus] Fix a corner case in split cat aten pass (#147784)
Summary: We need to further check the inputs of the cat to make sure all of them come from the same split node.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/c875cbdd-5374-46cf-811c-45f91cf6ba3e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10977524161964655
Network: Up: 64KiB  Down: 27KiB  (reSessionID-2e5915cb-4894-48f6-ab1c-3981adb42dab)
Executing actions. Remaining     0/3                                                                         1.5s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:52.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

before
aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

after
aps-recgpt_ig_emb_pt2_comment_out-c03f74e353

Differential Revision: D70132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147784
Approved by: https://github.com/Microve
2025-02-26 02:19:48 +00:00
b533bb4b13 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *].
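
A small numerical check of the refactoring, assuming num_groups == num_channels for simplicity (the algebra is identical for grouped channels):

```python
import torch

x = torch.randn(2, 4, 8, 8)
w, b = torch.randn(4), torch.randn(4)
mean = x.mean(dim=(2, 3), keepdim=True)
rstd = (x.var(dim=(2, 3), unbiased=False, keepdim=True) + 1e-5).rsqrt()

old = (x - mean) * rstd * w.view(1, -1, 1, 1) + b.view(1, -1, 1, 1)

new_w = rstd * w.view(1, -1, 1, 1)           # cheap [N, C] work
new_b = -mean * new_w + b.view(1, -1, 1, 1)  # cheap [N, C] work
new = x * new_w + new_b                      # single fused [N, C, *] pass

torch.testing.assert_close(old, new, rtol=1e-4, atol=1e-5)
```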

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about a 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about a 2% performance improvement. On A100, no performance gains or regressions were seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-26 01:42:46 +00:00
12112fd198 Fix bug in FSDP wrapped module with zero argument (#147771)
Fixes https://github.com/pytorch/pytorch/issues/147531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147771
Approved by: https://github.com/awgu
2025-02-26 01:40:53 +00:00
8de6fe8c0b [docs] fix numpy docs reference (#147697)
Fix a link to numpy documentation that has moved and now 404's

I"ve checked other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697
Approved by: https://github.com/soulitzer
2025-02-26 01:30:03 +00:00
90e3a3d86d Revert "[ca] trace saved variable unpacking (#147242)"
This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca.

Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))
2025-02-26 00:40:16 +00:00
4d614baa30 Revert "[ca] side-effect free initial trace: GraphTask (#147796)"
This reverts commit 5758743f3c92f9cd9b61bc435602f13dd19c13d7.

Reverted https://github.com/pytorch/pytorch/pull/147796 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147796#issuecomment-2683599896))
2025-02-26 00:36:08 +00:00
143f0f0006 Revert "[ca] side-effect free inital trace: compiled_args (#147804)"
This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67.

Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740))
2025-02-26 00:31:40 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from the original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC.

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
276dfe8150 [dynamo][cpp-guards] Disable dict-tag optim if the guard_manager has child accessors (#147694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147694
Approved by: https://github.com/isuruf
2025-02-26 00:02:08 +00:00
8e7e5ba182 Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759)
This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls.

The reason I didn't land that one is that in `legacy_load`, the constructor would be called before the storages of indices/values were set, so the tensor would not actually be validated.

Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process: afaict the first PR that added support for sparse tensor serialization, https://github.com/pytorch/pytorch/pull/27062, already uses `_rebuild_sparse_tensor` (which adds the rebuilt tensor to the list to validate). However, torch.sparse.FooTensor is allowlisted, so it can still be reached.

This PR adds tensors constructed as such to the list to validate at the end of torch.load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759
Approved by: https://github.com/albanD
2025-02-25 23:51:12 +00:00
c82c1411c6 Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit e34c15a05b027b9da0962c971d448138fcf94926.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2683517851))
2025-02-25 23:28:15 +00:00
0633f63f0d [cutlass backend] try to fix standalone runner test (#147811)
Differential Revision: [D70147859](https://our.internmc.facebook.com/intern/diff/D70147859/)

Trying to fix this test one last time, especially now that mixed mm is being removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147811
Approved by: https://github.com/chenyang78
2025-02-25 23:27:02 +00:00
05bc8fe62e Revert "follow up to #147548, fix regression on MI300 (#147878)"
This reverts commit cc444e75d540daff127f0210b7f8965a5c2b8d2a.

Reverted https://github.com/pytorch/pytorch/pull/147878 on behalf of https://github.com/wdvr due to temporary reverting to revert an older one in the stack ([comment](https://github.com/pytorch/pytorch/pull/147878#issuecomment-2683515567))
2025-02-25 23:25:59 +00:00
2df9a8d72d [Inductor][Tests] Update get_divisible_by_16 function in test_torchinductor.py to work correctly with new Triton (#147865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147865
Approved by: https://github.com/davidberard98
2025-02-25 23:14:13 +00:00
1e894d2635 Revert "Add option to limit number of SMs used by matmul kernels (#144974)"
This reverts commit af2d63637ed025789679a17c241e6bb466508a1d.

Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))
2025-02-25 22:46:38 +00:00
cc444e75d5 follow up to #147548, fix regression on MI300 (#147878)
Removing curly braces seemed superficial but broke MI300 rowwise matmul.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147878
Approved by: https://github.com/drisspg
2025-02-25 22:16:28 +00:00
a821d69d92 Fix register constant to be usable in exportz (#147533)
Differential Revision: [D69939737](https://our.internmc.facebook.com/intern/diff/D69939737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147533
Approved by: https://github.com/zou3519
2025-02-25 21:10:47 +00:00
0d31c621a3 Revert "[inductor][triton] Ignore block ptr advances for removed buffers (#147193)"
This reverts commit 17766b7aad0d9931bb6b3485fcf3d4c7532c3557.

Reverted https://github.com/pytorch/pytorch/pull/147193 on behalf of https://github.com/wdvr due to failing tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147193#issuecomment-2683286358))
2025-02-25 21:04:04 +00:00
6eb3d1e762 [DCP] Cache save plans in default planner (#147343)
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high-level approach:
- Create the local plan and cache it.
- In the next iteration, compare the local plan with the cached plan metadata. If nothing changed, do not send that local plan in the collective.
- The global plan step will only create the global plan from the new delta plans, with empty plans standing in for the cached ones.
- The finish plan step will check for empty plans. If a plan is empty, it grabs the cached plan; if not, it uses the new plan provided.

Test Plan: UTs

Differential Revision: D69224491

## How to enable the caching:
DefaultSavePlanner introduces `enable_plan_caching`, which is set to `False` by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set this to `True` to enable caching; we should see a significant speed-up in subsequent checkpoint save attempts, especially for larger-scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695
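
A minimal usage sketch of the flag (the `enable_plan_caching` kwarg comes from this PR; the model and checkpoint ids are placeholders):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

model = torch.nn.Linear(8, 8)  # placeholder model
planner = DefaultSavePlanner(enable_plan_caching=True)  # flag added by this PR

for step in range(3):
    dcp.save({"model": model.state_dict()}, checkpoint_id=f"ckpt_step_{step}", planner=planner)
    # After the first save, unchanged local plans are skipped in the collective.
```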

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
2025-02-25 20:59:25 +00:00
8d921eb97f export method (#147573)
The `export` API takes a `nn.Module` and traces its `forward` method. However, sometimes it is useful to export different methods of a `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance.

This PR adds a couple of utils in `torch._export.utils` for this workflow.

The `wrap_method` util wraps a method as a `nn.Module` that can then be exported. See the included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state. A hedged sketch of the workflow follows.
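
A sketch under the assumption that `wrap_method` accepts a bound method and returns an exportable `nn.Module` (the exact signature is not shown in this message):
```python
import torch
from torch._export.utils import wrap_method  # util added by this PR

class Codec(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)  # state shared by both methods

    def encode(self, x):
        return self.proj(x)

    def decode(self, z):
        return self.proj(z)

m = Codec()
args = (torch.randn(2, 4),)
ep_encode = torch.export.export(wrap_method(m.encode), args)
ep_decode = torch.export.export(wrap_method(m.decode), args)
# Both exported programs come from the same instance `m`, so they share state;
# after serde that sharing is lost, and sync_state re-syncs it per the description above.
```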

These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde.

Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573
Approved by: https://github.com/tugsbayasgalan
2025-02-25 20:58:54 +00:00
687fe64667 Fix crash in -[PTMCoreMLCompiler _compileModel:atPath:] (#147809)
Summary:
We could hit one of those exceptions:
https://github.com/apple/coremltools/blob/main/modelpackage/src/ModelPackage.cpp#L205-L225

And it would make this code path crash.

Test Plan: build.

Differential Revision: D70122378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147809
Approved by: https://github.com/mcr229
2025-02-25 20:56:16 +00:00
ec768d8dc0 [ca] side-effect free initial trace: compiled_args (#147804)
Const methods to prevent accidental mutation; changes are mainly in the Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-25 20:38:51 +00:00
5758743f3c [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-25 20:38:51 +00:00
68ddca9449 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved-tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making the whole thing a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compile-ing it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than the eager autograd engine's, since we can now delay unpacking of saved activations into the AOT backward's execution. A sketch of this setup follows.
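
A minimal sketch of that setup, running the CA graph directly via an identity "compiler" (the checkpointed block is a placeholder):
```python
import torch
from torch.utils.checkpoint import checkpoint

lin = torch.nn.Linear(512, 512)

def block(x):
    return torch.nn.functional.relu(lin(x))

x = torch.randn(64, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False).sum()

# `lambda gm: gm` runs the captured CA graph directly, without torch.compile;
# with this PR, the unpack (recompute) hooks fire progressively inside it.
with torch._dynamo.compiled_autograd.enable(lambda gm: gm):
    y.backward()
```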

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-25 20:38:51 +00:00
adf0f4ffd2 [custom op] fix inductor cpp codegen when returning a list of single tensor (#147649)
For a custom op that returns a list of a single tensor with unbacked symint shape:
```python

@torch.library.custom_op(
    "aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={}
)
def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]:
    s = x.sum().to(torch.int64)
    return [torch.randn(s.item())]

@fn_ret_list_of_single_tensor.register_fake
def _(x):
    ctx = torch._custom_op.impl.get_ctx()
    i0 = ctx.new_dynamic_size()
    return [torch.randn(i0)]
```

Before the fix, we have the following error:
```
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14,
                 from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5,
                 from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3,
                 from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39,
                 from /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366:
/usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
 1145 |     constexpr const _Tp&& get(const variant<_Types...>&& __v)
      |                           ^~~
/usr/include/c++/11/variant:1145:27: note:   template argument deduction/substitution failed:
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649
Approved by: https://github.com/angelayi
ghstack dependencies: #147130
2025-02-25 20:28:41 +00:00
824474cb35 [cond] support output sizes mismatch in front end (#147130)
This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130
Approved by: https://github.com/zou3519
2025-02-25 20:28:41 +00:00
de80b6f0d3 Updated test_cuda.py to rerun tests (#147040)
Initially, test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw were skipped in this [commit](d4871750d9).

Pulled ROCm nightly image and verified these two tests run fine locally. Filed this PR to enable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-25 19:58:42 +00:00
361b6c97cd cpp_wrapper: Fixup output code indentation (#147215)
Closes #142165.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147215
Approved by: https://github.com/desertfire
ghstack dependencies: #146109, #146424
2025-02-25 19:50:37 +00:00
7c515b2da4 cpp_wrapper: fix test_torchinductor* tests (#146424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146424
Approved by: https://github.com/desertfire
ghstack dependencies: #146109
2025-02-25 19:50:37 +00:00
46d1422afd cpp_wrapper: fix inductor triton tests (#146109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109
Approved by: https://github.com/desertfire
2025-02-25 19:50:37 +00:00
9740d69e78 [logging] Add toplevel dynamo_compile / tlparse logging for AOTI (#147760)
Summary:
This adds the proper context managers in `compile_fx_aot` such that we get:
1) A toplevel chromium event (i.e., tlparse)
2) A single `dynamo_compile` log entry

Test Plan:
Before:
* Scuba (we only log the dynamo event): https://fburl.com/scuba/dynamo_compile/sandbox/gaqowzrd
* Perfetto trace: https://fburl.com/vol7r6w1

After:
* Scuba (we log the dynamo _and_ compile_fx_aot event): https://fburl.com/scuba/dynamo_compile/sandbox/cx2we8w8
* Perfetto trace (click on the toplevel event to see the additional metadata): https://fburl.com/sziy40r9

Differential Revision: D70113859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147760
Approved by: https://github.com/desertfire
2025-02-25 19:41:39 +00:00
14b9f7f7bc Remove link to search survey (#147751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147751
Approved by: https://github.com/malfet
2025-02-25 19:26:59 +00:00
17766b7aad [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation are no longer needed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 19:14:55 +00:00
ea6938a1f7 Add XuehaiPan to CODEOWNERS for C++ PyTree utilities (#137408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137408
Approved by: https://github.com/zou3519
2025-02-25 18:48:32 +00:00
580f1183b4 Enable ruff rule S324 (#147665)
Fixes #147627

- Add `S324` in `pyproject.toml `
- Run the check and clean up warnings

```bash
lintrunner --take RUFF --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
6061664266 Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620)
During ROCm runs, these tests naturally show that the padding path would be slower for our archs, so pad_mm opts out of padding, which fails the tests.

My understanding is that these tests don't check IF the operation should be padded in the first place, but HOW it is padded and whether it's done correctly. Moreover, the tests shouldn't really be hardware-dependent or conditional.

Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620
Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314
2025-02-25 18:06:48 +00:00
651e6aacf9 [ROCm] Remove benign warning about missing amdgpu.ids (#147791)
Fixes #144203.

We build a custom libdrm when preparing our docker image.  We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink.  Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250".  The libdrm warning is noisy, so we are removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791
Approved by: https://github.com/jeffdaily
2025-02-25 17:17:25 +00:00
e5a13410cd Fix the tiny doc descriptions (#147319)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147319
Approved by: https://github.com/zou3519
2025-02-25 17:10:16 +00:00
346bbefa63 [BE] Parameterize TestSDPA in test_mps.py (#147856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856
Approved by: https://github.com/Skylion007
2025-02-25 16:07:24 +00:00
810d2a3dbd [ARM] Fix bug in _ref_test_helper in test_ops and fix failing test on Aarch64 (#146597)
We have a failing unit test on Aarch64

```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

After debugging, I found that the `ex` variable was not being reset to None on each loop iteration inside `_ref_test_helper`. Fixing this highlighted another expectedFailure to re-enable, `nn.functional.hinge_embedding_loss`, which was incorrectly being skipped due to the same problem.

At 4a545eb85d/test/test_ops.py (L546), the `ex` variable is not reset for the next loop iteration.
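
A minimal, self-contained illustration of the bug pattern (names are illustrative, not the actual test_ops.py code):
```python
def run_all(samples):
    ex = None
    failures = []
    for sample in samples:
        try:
            assert sample >= 0, f"bad sample {sample}"
        except Exception as e:
            ex = e
        # BUG: without resetting `ex = None` each iteration, one failing
        # sample makes every later sample look like a failure too.
        if ex is not None:
            failures.append(sample)
    return failures

print(run_all([1, -2, 3, 4]))  # buggy: [-2, 3, 4]; after the fix: [-2]
```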

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
2025-02-25 14:15:10 +00:00
a695aae89b [MPS] fix attention for >4d tensors (#147545)
Fixes #147443

and adds tests for >4d tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:55:28 +00:00
0b9da1ae0a [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806, #147807
2025-02-25 13:33:12 +00:00
cc1c9826d4 [AOTI][refactor] Fix a typo (#147807)
Summary: defination -> definition

Differential Revision: [D70146182](https://our.internmc.facebook.com/intern/diff/D70146182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147807
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806
2025-02-25 13:33:12 +00:00
7ed0670e21 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806)
Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680

Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806
Approved by: https://github.com/malfet
ghstack dependencies: #147805
2025-02-25 13:33:03 +00:00
2680e835c8 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805)
Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland of https://github.com/pytorch/pytorch/pull/147679.

Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805
Approved by: https://github.com/malfet
2025-02-25 13:32:54 +00:00
7e37fb0a4c [MPS] faster integer matmul for mps (#147526)
There is a naive matmul kernel written for MPS which is used when the input types are integer (and in some other cases on older macOS versions). The old kernel relies on global memory accesses, which really tanks performance, especially when the matrix is sufficiently large.

This PR optimizes it (there may be further gains from using simdgroup matrices, which I'll cover in a follow-up since writing that kernel will take more time).

## Performance comparison on M1 Pro:
![performance_comparison](https://github.com/user-attachments/assets/6ea8de5a-8231-4c5b-8dc9-caa79ea6879a)

You can get these numbers by running this script once with the old kernel compiled and once with the new kernel compiled (make sure to change the CSV filename each output is written to):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [32, 128, 512, 1024, 2048, 4096]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
        B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_int_mm(A_mps, B_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_int_mm(A_mps, B_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:15:18 +00:00
b63c601614 Update merge rules for oneDNN part (#147615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147615
Approved by: https://github.com/atalman
2025-02-25 11:26:59 +00:00
af2d63637e Add option to limit number of SMs used by matmul kernels (#144974)
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule, which consists in launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
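
A minimal sketch of the knob described above; the private binding name below is an assumption, since the commit itself only guarantees a global `sm_carveout` flag:
```python
import torch

# Assumed private binding for the sm_carveout flag described above.
torch._C._set_sm_carveout_experimental(8)  # leave 8 SMs free for, e.g., NCCL

a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
c = a @ b  # the persistent cuBLAS kernel should now launch on (num_SMs - 8) blocks
```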

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
2025-02-25 10:19:19 +00:00
94969d0a40 [inductor][user triton] Handle scf.yield more accurately (#147762)
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.

**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)

```python
        @triton.jit
        def branch_with_multiple_yield_args(
            in_ptr0,
            in_ptr1,
            out_ptr,
            conditional_ptr,
            n_elements,
            BLOCK_SIZE: "tl.constexpr",
        ):
            pid = tl.program_id(axis=0)
            block_start = pid * BLOCK_SIZE
            offsets = block_start + tl.arange(0, BLOCK_SIZE)
            mask = offsets < n_elements
            conditional = tl.load(conditional_ptr)
            if conditional:
                in0 = in_ptr0 + 1
                in1 = in_ptr1 + 1
                out = out_ptr + 1
            else:
                in0 = in_ptr0
                in1 = in_ptr1
                out = out_ptr
            x = tl.load(in0 + offsets, mask=mask)
            y = tl.load(in1 + offsets, mask=mask)
            tl.store(out + offsets, x + y, mask=mask)
```

The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters.  When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.

The if branch gets converted to TTIR like this:

```mlir
    %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
      ...
      scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
    } else {
      scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
    } loc(#loc7)
```

and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).

**This PR** duplicates the `scf.yield`, adding one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.

Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-02-25 08:41:00 +00:00
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), which includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
866dc45d3c [Inductor][ROCm][CK] Unhardcoded kernel shapes for ck_conv_template codegen (#147504)
## [Inductor][ROCm][CK] Parameterize `ck_conv_template` Codegen

### Description
Previously, ROCm CK kernel codegen templates were hardcoded with fixed values for convolution parameters:

- `index_t GroupCount`
- `index_t NBatch`
- `index_t NOutChannels`
- `index_t NInChannels`
- `vector<index_t> FilterSize`
- `vector<index_t> InputSize`
- `vector<index_t> ConvolutionStrides`
- `vector<index_t> Dilations`
- `vector<index_t> LeftPads`
- `vector<index_t> RightPads`

This PR updates `ck_conv_template` to accept these parameters dynamically from Inductor. By doing so, we reduce the number of generated templates, improving flexibility and maintainability.

### Testing
- Verified correctness by running relevant test cases, i.e `test/inductor/test_ck_backend.py`
- Ensured generated kernels reflect the updated parameterization, i.e generated templates in `/tmp/torchinductor_root/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147504
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/tenpercent

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-02-25 07:48:07 +00:00
d73b927662 [DSD] Fixes issue when there is a PG without parameters (#147730)
Fixes https://github.com/pytorch/pytorch/issues/143828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147730
Approved by: https://github.com/mori360
2025-02-25 07:25:38 +00:00
fb73b0c7c5 Revert "use copy2d in h2d/d2h copy when possible (#146256)"
This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0.

Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))
2025-02-25 07:06:38 +00:00
bb7e8fbd66 [CacheBench] Add hf_T5 llama moco to cachebench (#147783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781, #147782
2025-02-25 04:34:45 +00:00
895564d6b6 [CacheBench] Add huggingface (#147782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781
2025-02-25 04:34:45 +00:00
c4fb6ae55d [CacheBench] Separate dynamic into its own option (#147781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780
2025-02-25 04:34:34 +00:00
60d4cbfc06 [CacheBench] Add repeat option so that we can have more accurate cache results (#147780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
ab3b814af3 [CacheBench] Add ciflow/trunk test (#147688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
4b7604ec10 Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion, we can delete all the mixed mm code. When I benchmarked int8 weight-only mm in the new path against int8 mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7), I got a 1.244x geomean speedup on Huggingface linear shapes with bias. There are a couple of reasons for the speedup:

- prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm dispatches to instead of the deferred epilogue tuning in the current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf-neutral, it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, though not exactly as in the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare against a cuBLAS baseline, and I found it makes things slower (35% slower on a not particularly big 1024x1024x1024 mm shape on H100), so this should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-25 04:29:54 +00:00
890213f65f Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)"
This reverts commit 0b52d801d2297ad6c38e631eedfd4dead9360e1b.

Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))
2025-02-25 04:11:13 +00:00
9b06b30468 Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)"
This reverts commit 22fae0d948ac14c72b510fafc2283072d744dff9.

Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))
2025-02-25 04:06:40 +00:00
9478c90e2b [Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430)
Fixes #147208

**Summary**
The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs because the TensorIterator-based implementation does not support multiple elements per byte. Since `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch, we add a check to throw a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`.

**Test plan**
```
pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2025-02-25 03:47:40 +00:00
20295c017e Fix import of getArtifactLogger for ir_pre_fusion and ir_post_fusion (#147560)
Fixes #147002

There was an issue with the previous PR https://github.com/pytorch/pytorch/pull/147248 that didn't show up in CI: a logging import was not complete before torch/_inductor/debug.py was imported. This only surfaced when someone imported the file directly without doing any other imports first.

Also set the artifact to off_by_default by request, to reduce log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147560
Approved by: https://github.com/Skylion007
2025-02-25 03:36:08 +00:00
e34c15a05b torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-25 03:32:22 +00:00
8f728e28dd Enable ASAN in CUDA tests (#147512)
It should work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147512
Approved by: https://github.com/soulitzer
2025-02-25 02:58:39 +00:00
b0fa92042b Fix torch.mean out dtype check (#147188)
**For CPU**:
Type promotion is supported for torch.mean

**For Meta**:
Not supported for torch.mean

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
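
A hedged illustration of the mismatch, per the description above (exact behavior and error text may differ):
```python
import torch

x = torch.ones(4, dtype=torch.float16)
out_cpu = torch.empty(0, dtype=torch.float32)
torch.mean(x, dim=0, out=out_cpu)  # OK on CPU: the fp16 input promotes to the fp32 out

x_meta = torch.ones(4, dtype=torch.float16, device="meta")
out_meta = torch.empty(0, dtype=torch.float32, device="meta")
torch.mean(x_meta, dim=0, out=out_meta)  # per this fix, the meta dtype check matches CPU
```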
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147188
Approved by: https://github.com/albanD
2025-02-25 02:50:03 +00:00
33ff96b3f9 cpp_builder: unbreak clang++ detection (#147775)
Fixes an issue where `_is_gcc` would match on `clang++` due to the string ending with `g++`.
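
A small illustration of the bug class (illustrative code, not the actual cpp_builder implementation):
```python
import re

def is_gcc_buggy(cpp_compiler: str) -> bool:
    # "clang++" ends with "g++", so a plain suffix check misfires
    return cpp_compiler.endswith("g++")

def is_gcc_fixed(cpp_compiler: str) -> bool:
    # require gcc/g++ to be its own token (optionally versioned), not a suffix
    return bool(re.search(r"(^|[^\w+-])(gcc|g\+\+)(-\d+)?$", cpp_compiler))

print(is_gcc_buggy("clang++"), is_gcc_fixed("clang++"))  # True False
print(is_gcc_buggy("g++"), is_gcc_fixed("g++"))          # True True
```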

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147775
Approved by: https://github.com/desertfire
2025-02-25 02:33:01 +00:00
dacdc9782b [Inductor] Add input value checking to randint meta function (#147191)
Fixes #147070

Adding value checking for the range to the meta function, similar to that in the CUDA/CPU aten op.
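
A quick illustration of the check (the exact error message is hedged):
```python
import torch

try:
    torch.randint(5, 3, (2,), device="meta")  # empty range: low=5 >= high=3
except RuntimeError as e:
    print(e)  # the meta function now rejects this, like the CPU/CUDA ops do
```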

Test with
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-25 02:18:16 +00:00
c644f4c5fe [Inductor] Fix the decompositions of torch isin (#147519)
**Summary**
Fixed two decomposition issues in `torch.isin`:

- Issue 1: As reported in [#147329](https://github.com/pytorch/pytorch/issues/147329), the current decomposition does not support cases where test_elements is a scalar. This is now implemented by referring to the ead970c8d0/aten/src/ATen/native/TensorCompare.cpp (L1004-L1008)

- Issue 2: Found while enabling a unit test with `elements = 1` and `test_elements = torch.tensor([1, 2, 3, 4])`, where Inductor produced different results compared to eager mode. This issue is fixed by referring to ead970c8d0/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L329-L338); both cases are exercised in the sketch after this list.
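
Both cases, as plain eager calls:
```python
import torch

x = torch.tensor([1, 2, 3])
print(torch.isin(x, 2))                           # scalar test_elements: tensor([False,  True, False])
print(torch.isin(1, torch.tensor([1, 2, 3, 4])))  # scalar elements: tensor(True)
```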

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_isin_tensor_scalar
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147519
Approved by: https://github.com/jgong5, https://github.com/FFFrog, https://github.com/peterbell10
2025-02-25 01:49:44 +00:00
2c8cd41c1f Delete unused conda-aws-upload environment (#147792)
As this environment only contains keys for Anaconda uploads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147792
Approved by: https://github.com/atalman
2025-02-25 01:42:52 +00:00
43074680b5 [ROCm] Add support for gfx1102 arch to wheel builds. (#147761)
[gfx1102 is not officially supported](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) but most ROCm libs have gfx1102 code objects available since ROCm 5.5.  Now that we're using `--offload-compress` we can fit another gfx target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147761
Approved by: https://github.com/jeffdaily
2025-02-25 01:35:52 +00:00
97557b9833 [Inductor] Update set_driver_to_gpu code to avoid backend re-initialization with new Triton (#147621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147621
Approved by: https://github.com/jansel
2025-02-25 00:04:54 +00:00
55bf3ff3a5 [Docs] Add OpDTypes.any_common_cpu_cuda_one (#147605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147605
Approved by: https://github.com/soulitzer
2025-02-24 23:23:43 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
a71d8b7246 Fix ReferenceError: weakly-referenced object no longer exists in cycle detector (#146922)
Summary: weakref.proxy objects throw errors when they're dead. We just do not bother visualizing them. They are weak, so they aren't relevant to cycles anyway.
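
The failure mode being avoided, in isolation:
```python
import weakref

class Node:
    pass

n = Node()
p = weakref.proxy(n)
del n  # the referent is gone; the proxy is now dead

try:
    str(p)  # dereferencing a dead proxy raises
except ReferenceError as e:
    print(e)  # "weakly-referenced object no longer exists"
```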

Differential Revision: D69270429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146922
Approved by: https://github.com/tianfengfrank, https://github.com/Chillee
2025-02-24 22:27:39 +00:00
22fae0d948 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)
Consolidate cpp compilation action to CppBuilder

Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680
Approved by: https://github.com/yushangdi, https://github.com/angelayi
ghstack dependencies: #147679
2025-02-24 21:45:15 +00:00
0b52d801d2 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)
The option really means to compile a cpp file using its basename instead of its full path.

Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679
Approved by: https://github.com/angelayi
2025-02-24 21:44:33 +00:00
96acb56626 [ROCm] Optimize the stride one indexing backwards kernel (#146420)
This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel across all the lanes in the warp; this means the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: this optimization ensures that for large numbers of duplicates we can skip 32 values at a time to speed up the count.
- for a low number of duplicates, i.e. fewer than `warp-size` duplicates, just perform the tail reduction, which avoids the wasteful parallel reduction across the warp for this case (it would only add zero values).
- for a high number of duplicates, i.e. more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` with a large number of duplicates (i.e. a duplicate spike). For this to happen, we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative-reduction part of the kernel.

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-24 21:19:06 +00:00
89b9c12de8 remove prints from partitioner (#147749)
See c57894cd74..22d8f9a657 (r1968015955)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147749
Approved by: https://github.com/Skylion007, https://github.com/laithsakka
2025-02-24 21:03:45 +00:00
8eb400ef66 [BE] TCPStore: use typed errors for assertions (#147647)
This is a follow-up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend to use typed exceptions instead of generic `TORCH_CHECK` calls, which end up as RuntimeErrors in Python.

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
2025-02-24 20:58:10 +00:00
19fd21fe7e [Inductor] Hot fix after #146917 (#147639)
This pull request reverts the changes to the `torch/_inductor/ir.py` file that were added in #146917.

Where I tested, there were changes only from `torch/_inductor/codegen/cpp_wrapper_gpu.py`; it turns out the changes in the `torch/_inductor/ir.py` file are not really needed. So it's my fault: I didn't sync the environments (between several machines) correctly.

@davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639
Approved by: https://github.com/etaf, https://github.com/davidberard98
2025-02-24 20:34:48 +00:00
754fb834db [BE][CI] bump ruff to 0.9.0: string quote styles (#144569)
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting

- Change the outer quotes to double quotes for nested f-strings

```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```

- Change the inner quotes to double quotes for triple f-strings

```diff
  string = """
-     {', '.join(args)}
+     {", ".join(args)}
  """
```

- Join implicitly concatenated strings

```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
52f6d4aa30 [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146509
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-24 19:56:08 +00:00
9605c5063b [ROCm][TunableOp] Speed-up matmul_small_brute_force_tunableop unit test (#147659)
This PR has a UT speed-up and some refactoring of tests.

A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed this matmul_small_brute_force_tunableop for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR *reduces* the execution time by 20+ minutes.

We also move a hipBLASLt version check to a different tunableop UT for simplicity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
2025-02-24 19:44:38 +00:00
69c4f6ff13 [Minor] Fix minor mistake in docstring of replace_pattern (#147611)
Fixes #147610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147611
Approved by: https://github.com/soulitzer
2025-02-24 19:33:44 +00:00
b9b1fd9b93 [Intel GPU] qlinear.pointwise with mixed dtype support (#136753)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qlinear_pointwise` op. With this PR, we allow `qlinear` kernels to output a Tensor that is BF16, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_int8_mixed_bf16_xpu \
    -k test_qlinear_relu_int8_mixed_bf16_xpu \
    -k test_qlinear_add_int8_mixed_bf16_xpu
```

# Runtime exemplification
```bash
#qlinear+bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277
```
As shown in the oneDNN verbose output, the attribute `dst_bf16::blocked:ab::f0` demonstrates that we can successfully output a bf16 tensor from int8 gemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
075b91bef1 [Intel GPU] qconv.pointwise with mixed dtype XPU support (#135465)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qconv_pointwise` op. With this PR, we allow `qconv` kernels to output a Tensor that is BF16, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qconv2d_int8_mixed_bf16_xpu \
    -k test_qconv2d_relu_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \
    -k test_qconv2d_silu_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_relu_int8_mixed_bf16_xpu
```

# Runtime verification
```bash
#qconv + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551
# qconv_silu + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379
# qconv_hardswish + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848
```
The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is successfully computed as bf16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135465
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
ffa19b9024 [ROCm][Windows] Fix unrecognized constexpr std::memcpy for HIP-clang (#147316)
Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 STL implementation, the HIP clang compiler on Windows cannot evaluate the following memcpy as one that can be resolved at compile time. To resolve this, `__builtin_memcpy`, which doesn't have this limitation, is used instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316
Approved by: https://github.com/jeffdaily
2025-02-24 18:28:59 +00:00
900a774781 Revert "[ROCm] Update periodic.yml to use 2GPU runners (#146839)"
This reverts commit b6273d7f4ba4fbb126eb96816287641ca1e4efc6.

Reverted https://github.com/pytorch/pytorch/pull/146839 on behalf of https://github.com/jithunnair-amd due to This change is not needed anymore since our 4-GPU runners are back online and stable so far ([comment](https://github.com/pytorch/pytorch/pull/146839#issuecomment-2679145448))
2025-02-24 17:17:58 +00:00
cde12207a0 [Intel GPU] Add SDPA implementation on XPU with OneDNN (#147612)
Add the XPU implementation of the oneDNN-based SDPA operator. It will be integrated and enabled later.

Depends on the BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147612
Approved by: https://github.com/EikanWang
2025-02-24 16:12:04 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR is to upgrade submodule oneDNN to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
80d3afc698 [inductor] Improve type annotations in _inductor/pattern_matcher.py (#146626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146626
Approved by: https://github.com/Skylion007
2025-02-24 14:30:35 +00:00
d0f08dc3eb Update slow tests (#147728)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147728
Approved by: https://github.com/pytorchbot
2025-02-24 11:48:19 +00:00
cba14212e6 [FX] micro-optimization map_aggregate(immutable_dict) (#147691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147691
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #147699, #144640
2025-02-24 09:14:08 +00:00
a50af71fb6 [FX] Refactor immutable collections implementation (#144640)
Get rid of dynamic class creation via `type(name, bases, ...)`. Convert it to classic static class definition for better readability and static analysis support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144640
Approved by: https://github.com/jansel
ghstack dependencies: #147699
2025-02-24 09:14:08 +00:00
dc9a03d30c [Window] Fix invalid file path on windows. (#147708)
This PR aims to fix the invalid path for windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock`
Windows does not allow chars `\ / : * ? " < > |` in a path.

This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because `os.replace` allows the target file to exist on Windows, while `os.rename` does not.
| Function                        | `os.rename()`          | `os.replace()`         |
|---------------------------------|------------------------|------------------------|
| Rename a file                   | Yes                    | Yes                    |
| Move a file                     | Yes                    | Yes                    |
| Overwrite an existing file      | No (Error on Windows)  | Yes (Will overwrite)   |
| Overwrite an existing directory | No (Error on Windows)  | No (Error on Windows)  |
| Move across disks               | No                     | No                     |
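
As a minimal runnable sketch of the difference (file names are illustrative):

```python
import os
import tempfile

# Minimal sketch: os.rename raises FileExistsError on Windows when the
# target exists, while os.replace overwrites it on all platforms.
d = tempfile.mkdtemp()
src = os.path.join(d, "incoming.pkl")
dst = os.path.join(d, "code_state.pkl")
for p in (src, dst):
    with open(p, "w") as f:
        f.write("state")
os.replace(src, dst)  # succeeds even though dst already exists
```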

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708
Approved by: https://github.com/jansel
2025-02-24 08:31:11 +00:00
5b6ad682bc Revert "[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)"
This reverts commit 85ea67983421acc30ccc76f7a159042e75c6ea08.

Reverted https://github.com/pytorch/pytorch/pull/147254 on behalf of https://github.com/jeanschmidt due to introduced reds on main ([comment](https://github.com/pytorch/pytorch/pull/147254#issuecomment-2677700862))
2025-02-24 08:20:16 +00:00
8d618f3da7 [AOTI][XPU] Suppress multi-line comment warning for XPU. (#147710)
This PR aims to suppress a multi-line comment warning in the sycl header when building the Inductor cpp_wrapper.
```
/intel/oneapi/compiler/2025.0/include/sycl/detail/builtins/builtins.hpp:235:1: warning: multi-line comment [-Wcomment]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147710
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-02-24 07:28:59 +00:00
cee03b7746 [Inductor] Update should_decompose_mm condition for CPU (#147673)
Summary:
Previously, for CPU we decomposed addmm if
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 16
```
We have a new case where `mat2.shape[1] = 304`, and benchmarks show that it is beneficial to decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 512
```

Differential Revision: D70033166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673
Approved by: https://github.com/houseroad
2025-02-24 05:51:50 +00:00
8b65dbad13 [MPS/Inductor] Add support for xlog1py. (#147709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147709
Approved by: https://github.com/jansel
2025-02-24 05:28:52 +00:00
baccadb2f1 xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431)
Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131

Previously `torch.xpu.get_arch_list()` was relaxed to work even if an XPU device is not available. However, we overlooked the case where PyTorch is not compiled with XPU support; in that case the function throws an exception. This commit adjusts that behavior and makes the function return `[]` even if PyTorch is not compiled with XPU support.
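
A quick sketch of the intended behavior:

```python
import torch

# On a build without XPU support (or with no XPU device available),
# this now returns an empty list instead of raising.
print(torch.xpu.get_arch_list())  # -> []
```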

CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-02-24 01:35:54 +00:00
7c52ef2424 Add XPU to is_compile_supported to support roi_align op in torchvision (#147541)
Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264.

To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices.

The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
2025-02-24 01:32:36 +00:00
4e934ee5a7 [MPS] Add eager support for xlog1py. (#147687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687
Approved by: https://github.com/malfet
2025-02-24 01:23:59 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt` too)
+ a "free" workspace size bump for `cuBLASLt`: its workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt`, without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider (see the sketch below)
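
A hedged usage sketch; the `:4096:8` value is a standard documented `CUBLAS_WORKSPACE_CONFIG` setting, used here purely for illustration:

```python
import os

# Set before the first CUDA/BLAS call; with unified workspaces this
# single knob now also governs cuBLASLt workspace sizing.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

a = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
torch.mm(a, b)  # cuBLAS and cuBLASLt draw from the same cached workspace
```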

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
b5d7aefa57 [BE] add missing overload annotations for tree_map_only (#147699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147699
Approved by: https://github.com/Skylion007
2025-02-23 20:21:07 +00:00
f47573f70d Add super().setUp() to some test cases (#147651)
I saw that their disable issues were getting spammed with comments, meaning the tests were still running in CI despite having a disable issue. They were missing the super().setUp() call, so I added it so the framework checks whether a disable issue exists for them.
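
A minimal sketch of the fix (class and test names are hypothetical):

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    def setUp(self):
        # Without this call, the disable-issue check in TestCase.setUp
        # never runs, so the test keeps executing in CI even when a
        # disable issue is open for it.
        super().setUp()

    def test_something(self):
        self.assertTrue(True)

if __name__ == "__main__":
    run_tests()
```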

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651
Approved by: https://github.com/huydhn
2025-02-23 18:21:17 +00:00
f03e7f3801 [MPS] Workaround rng bug for 5D tensors (#147667)
For some reason MPSGraph returns repeated values if the tensor dimension is
larger than 4, which can be clearly seen by running the following
```swift
import Metal
import MetalPerformanceShadersGraph

func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) {
  let graph = MPSGraph()
  var dims = Array(repeating: 1, count: ndim)
  dims[0] = nelem
  let shape = dims.map { NSNumber(value: $0) }
  let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil)
  let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer])
}

func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) {
  let buf_data = buf.contents().assumingMemoryBound(to: Float.self)
  print(prefix)
  for i in 0..<nelem {
      print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n")
  }
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let nelem = 2
guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4)
printBuf("4D uniform", buf: buf, nelem: nelem)

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5)
printBuf("5D uniform", buf: buf, nelem: nelem)
```

Workaround: flatten the tensor if it's contiguous.
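
A rough Python sketch of the workaround (the helper name is hypothetical, and it assumes an Apple GPU is available):

```python
import torch

def mps_uniform_(t: torch.Tensor) -> torch.Tensor:
    # Run the RNG on a flattened (1-D) view when the tensor has more
    # than 4 dims, then restore the original shape.
    if t.dim() > 4 and t.is_contiguous():
        return t.view(-1).uniform_().view(t.shape)
    return t.uniform_()

x = torch.empty(2, 1, 1, 1, 1, device="mps")
print(mps_uniform_(x))  # distinct values instead of repeats
```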

Fixes https://github.com/pytorch/pytorch/issues/147624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667
Approved by: https://github.com/dcci
2025-02-23 16:52:01 +00:00
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797e1f3e8cc48ce2f45e4f6985132fb19.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
d0adff761e Propagate AttributeError to user code in user_defined.py (#146497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146497
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146496
2025-02-23 01:18:28 +00:00
8c761ac7e3 Handle is/is not (#146496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146496
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-23 01:18:28 +00:00
b084635735 [MPS/inductor] Adjust more tests that depends on non-divisible input sizes (#147681)
Also adjust a comment while I'm at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147681
Approved by: https://github.com/jansel
2025-02-23 00:33:26 +00:00
6a5e3917a7 [MPS] Add inductor support for spherical_bessel_j0. (#147650)
Counterpart to my previous patch that added support for the op in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147650
Approved by: https://github.com/jansel
2025-02-23 00:32:36 +00:00
f9c117f859 [mps/inductor] XFAIL adaptive_avg_pool_with_output_size_0. (#147676)
Non-divisible input sizes are not implemented on the MPS device yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147676
Approved by: https://github.com/malfet
2025-02-22 20:17:33 +00:00
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
85ea679834 [TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)
[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)

Summary:

# context
* more details in the [post](https://fb.workplace.com/groups/1075192433118967/permalink/1587079018596970/)
* disable contextlib with PT2

Test Plan:
* run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+dynamo,+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_ultra_mini training.pipeline_type=pt2 data_loader.dataset.table_ds=[2024-12-02] 2>&1 | tee -a output.log
```
* old tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYYAS3o/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
* new tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUJhCGZ/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Reviewed By: Microve

Differential Revision: D68480678
2025-02-22 18:57:55 +01:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
bea72180ed Revert "[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)"
This reverts commit e7bf490c430ac5a70ccb7ab8e954d3386fd29413.

Reverted https://github.com/pytorch/pytorch/pull/144572 on behalf of https://github.com/jeanschmidt due to Broke internal signals, D69994027, I'll find someone to help get this change merged ([comment](https://github.com/pytorch/pytorch/pull/144572#issuecomment-2676314308))
2025-02-22 17:19:38 +00:00
3409cbd177 Revert "Delete Mixed MM Special Casing (#147151)"
This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52.

Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))
2025-02-22 17:14:32 +00:00
72b4f35cb5 [CI] Reduce the AOT target list to reduce build time (#147601)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147601
Approved by: https://github.com/atalman
2025-02-22 14:43:26 +00:00
3cc3d7e08f Also support non-contiguous activation for torch._weight_int8pack_mm on CPU (#147588)
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, for batch size 2 and above, an assertion is hit because non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.

### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it's contiguous in the last dimension, and remove the assertion that requires contiguous activation.
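
A small sketch of the now-supported case (shapes and dtypes are illustrative):

```python
import torch

# Activation is a non-contiguous view (e.g. a strided slice of hidden
# states) but still contiguous in its last dimension: stride(-1) == 1.
A = torch.randn(4, 8, dtype=torch.bfloat16)[::2]
assert not A.is_contiguous() and A.stride(-1) == 1

B = torch.randint(-128, 127, (16, 8), dtype=torch.int8)   # (N, K) int8 weight
scales = torch.ones(16, dtype=torch.bfloat16)             # per-row scales
out = torch._weight_int8pack_mm(A, B, scales)             # previously asserted
print(out.shape)  # torch.Size([2, 16])
```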

### Alternative solutions considered
Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code.

Another aspect to this issue: is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? We need some data points to analyze this part, although I think the performance should be good enough with this patch, since the first cache lines of the rows of A are explicitly prefetched in the existing code (and it also avoids the copy that a `contiguous` call would do).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
2025-02-22 08:29:07 +00:00
e1bf892d90 [DDP] Temporarily disable comm mem (#147663)
For fear that it incurs slightly more memory usage and causes some applications at a tight memory margin to OOM
(because the comm mem pool is a separate pool from the regular pool?).

Differential Revision: [D70026681](https://our.internmc.facebook.com/intern/diff/D70026681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147663
Approved by: https://github.com/d4l3k
2025-02-22 05:55:43 +00:00
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
77d2780657 Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549)
running python test/strobelight/examples/compile_time_profile_example.py
```
strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com
strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i

strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0
strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996
strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running
strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds
strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e
strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #147547
2025-02-22 03:44:53 +00:00
fc095a885c move _strobelight/example to avoid graph breaks (#147547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147547
Approved by: https://github.com/bobrenjc93
2025-02-22 03:44:53 +00:00
fecd3f7ecb [ROCm] change is_hip_clang() to always return True (#147646)
hipify replaces kernel launches <<< >>> with the hipLaunchKernelGGL() macro; this is a regression caused by /opt/rocm/hip/.hipinfo no longer existing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147646
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-02-22 03:26:55 +00:00
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixes all the Inductor UT failures for the XPU backend on Windows that we found on a local machine. (Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
2d433cf1ad [Inductor UT][Windows][XPU] Enable Inductor UT on XPU Windows. (#147347)
This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows.
Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-02-22 02:53:16 +00:00
84fcf1bb11 constexpr all the things in irange.h (#147633)
I got complaints while irangeifying some files in ExecuTorch
that irange could not be used in a constexpr function. This made the
complaints go away.

I added a constexpr function in irange_test that used to fail to build
with `error: variable of non-literal type 'iterator' (aka
'integer_iterator<int, true>') cannot be defined in a constexpr
function before C++23` and now builds fine.

Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633
Approved by: https://github.com/albanD
2025-02-22 01:51:51 +00:00
6e0b09728a [export] Remove report from draft-export output (#147558)
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D69689154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
2025-02-22 00:54:29 +00:00
1c334893dc [CacheBench] Refactor code to prepare for mode benchmarks (#147641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00
5d26b7108f [PP] Remove extra code and docs BE (#147636)
current docs:
<img width="746" alt="image" src="https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636
Approved by: https://github.com/awgu
2025-02-22 00:10:31 +00:00
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow-up to / built on top of https://github.com/pytorch/pytorch/pull/144476; adds OCP FP8 support for gfx950.
refer to https://github.com/pytorch/ao/pull/1677
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
b1a81a4a65 Don't use '-e' when installing Triton (#147228)
Currently the install_triton.sh script uses "pip install -e ." to install Triton.
Using -e is sometimes appropriate for development work but is less appropriate for delivery.
To make matters worse, the behavior of -e seems to vary depending on the version of pip involved.

This PR removes the -e and installs Triton normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-21 23:00:12 +00:00
995b125cdd [CI] Build sm89 with more procs experiment (#147487)
Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge. Currently it's set to 2 because it would OOM, but I'm curious as to how often people's builds OOM. I can't test this on my own because of caching, so it has to run on pull requests.

This might result in a failing job on many people's PRs and I'm not sure how to get around it. I named it stable to make it automatically get sorted into the stable group for Dr. CI, but it'll still show up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487
Approved by: https://github.com/huydhn
2025-02-21 22:07:00 +00:00
7c8c82cd64 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 22:05:00 +00:00
698f6f9fae specify only some dimensions in shapes collection (#147534)
Differential Revision: [D69936316](https://our.internmc.facebook.com/intern/diff/D69936316/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147534
Approved by: https://github.com/bobrenjc93
2025-02-21 22:02:42 +00:00
2fb9416e6f [inductor][cpu] Move VNNI weight packing into AMX GEMM kernel for contiguous BMM weights (#146843)
Currently, the bfloat16 microkernel that uses AMX vectorization requires that the weights be in an interleaved VNNI format. For GEMM code this hasn't been an issue, because GEMM currently only supports constant weights, so the VNNI weight packing is done at compile time and saved as a constant tensor in the graph. But for BMM ops, where weights are not required to be constant, the current code does an expensive reshape/VNNI packing for all BMM weights.

This PR removes the need for the reshape/packing of non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporarily packed weights. Weight packing involves interleaving 2 rows of weights, as illustrated below.
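
For intuition, a tiny Python illustration of the 2-row interleave (not the actual kernel code):

```python
import torch

# Pack a (K, N) bf16 weight into VNNI layout: rows k and k+1 are
# interleaved element-wise so each 32-bit lane holds a pair of
# consecutive-K elements.
K, N = 4, 3
w = torch.arange(K * N, dtype=torch.bfloat16).reshape(K, N)
packed = w.reshape(K // 2, 2, N).permute(0, 2, 1).reshape(K // 2, 2 * N)
print(packed)
# row 0: w[0,0] w[1,0] w[0,1] w[1,1] w[0,2] w[1,2]
```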

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-21 21:46:00 +00:00
d91be786cb [cutlass backend] clear_on_fresh_inductor_cache when generatings cutlass ops (#147586)
Differential Revision: [D69966732](https://our.internmc.facebook.com/intern/diff/D69966732/)

This is needed if we want to generate cutlass ops with different instantiation levels in one session.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147586
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-21 21:28:41 +00:00
ef6b16ea9d Revert "[trymerge] Post initial starting merge comment on stacked PRs (#147028)"
This reverts commit 0295aabf6071c7da62325e6a29e04ed09a3e34ef.

Reverted https://github.com/pytorch/pytorch/pull/147028 on behalf of https://github.com/clee2000 due to I think this broke merge for non ghstack prs ([comment](https://github.com/pytorch/pytorch/pull/147028#issuecomment-2675532017))
2025-02-21 21:02:19 +00:00
05e6f15966 Revert "[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)"
This reverts commit e758d8b4d1632ea765bf8bc8e87b6039ae708b9f.

Reverted https://github.com/pytorch/pytorch/pull/147395 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D69890757 - servicelab_benchmark_pyper_local_runner, @eellison please help the author get this change landed ([comment](https://github.com/pytorch/pytorch/pull/147395#issuecomment-2675521966))
2025-02-21 20:56:40 +00:00
6eb795c9e8 [associative_scan] compile backend change to "eager" (#146973)
This PR fixes some issues with torch export discussed here: https://github.com/pytorch/pytorch/pull/140043#discussion_r1941932960

However, this backend change still does not resolve the failure for specific shapes mentioned here: https://github.com/pytorch/pytorch/issues/137943#issuecomment-2649564994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146973
Approved by: https://github.com/ydwu4
2025-02-21 20:21:41 +00:00
5ed1e23e3a Fix type stubs for SymmetricMemory (#146310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146310
Approved by: https://github.com/yifuwang
2025-02-21 19:59:43 +00:00
fd8ae1aa04 [ROCm] gfx940 and gfx941 cleanup (#147394)
Removing gfx architectures not supported by ROCm.

NOTE: For users wanting to build PyTorch for gfx archs that are *not* supported by the official wheels on download.pytorch.org, you can build PyTorch from source for your desired gfx arch [using the PYTORCH_ROCM_ARCH env var](https://github.com/pytorch/pytorch/blob/main/README.md#amd-rocm-support).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147394
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-21 19:42:12 +00:00
c0ee62573a [Easy][optim] Add LBFGS params optional desc (#147579)
The [LBFGS docs](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS) are missing the `optional` description for params, compared with other optimizer docs like [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).

## Test Result

### Before

![image](https://github.com/user-attachments/assets/34877490-16b4-4c68-bf6c-405bae563352)

### After

![image](https://github.com/user-attachments/assets/7fba94c8-7091-47b8-bdf1-ca7d779a027f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147579
Approved by: https://github.com/janeyx99
2025-02-21 19:38:10 +00:00
b5c3bb6185 Add continuous run for cachebench (#147546)
This PR adds a continuous run for cache bench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147546
Approved by: https://github.com/huydhn
ghstack dependencies: #147537
2025-02-21 19:02:17 +00:00
76ce194b8e For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148)
This PR also fixes BMM, which was silently failing for a while.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148
Approved by: https://github.com/eellison
2025-02-21 18:41:52 +00:00
0295aabf60 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 18:05:05 +00:00
2190ca7f47 Use __qualname__ in add_safe_globals and update Unpickling error raised for Unsupported GLOBAL (#146815)
- Fixes #146814

Change
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__name__
```
to

```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__qualname__
```
to avoid the same key string being overwritten.
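
The motivating collision, as a short illustration:

```python
# Two nested classes can share __name__ but never __qualname__, so
# keying safe globals on __qualname__ avoids one entry overwriting
# the other.
class A:
    class Inner: ...

class B:
    class Inner: ...

print(A.Inner.__name__, A.Inner.__qualname__)  # Inner A.Inner
print(B.Inner.__name__, B.Inner.__qualname__)  # Inner B.Inner
```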

A test is also added.
```
python test/test_serialization.py TestSerialization.test_serialization_nested_class
```

- Fixes #146886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815
Approved by: https://github.com/mikaylagawarecki
2025-02-21 18:04:59 +00:00
f4e4cfcb91 [caffe2] Ignore compiler option when building using clang (#147556)
Summary:
Skip adding unrecognized option optimize("-fno-tree-loop-vectorize") when building using clang

This piece of code began to be compiled after armv9a was set as the default compilation profile.

Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee aimp_697790515.log

Reviewed By: andrewjcg

Differential Revision: D69947027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556
Approved by: https://github.com/janeyx99
2025-02-21 17:46:04 +00:00
a0c7d96028 [Easy] Add Delimeter To Show Where Allocation Addr Begins (#147461)
Summary: When we print the addr we prepend an "s" or a "b" to it. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up.

Test Plan: CI

Differential Revision: D69828538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461
Approved by: https://github.com/zdevito
2025-02-21 17:19:53 +00:00
784f64bb05 [inductor] triton support port-#5512, update cpp wrapper for gpu (#146917)
In short, this pull request enhances `constexprs` expression filtering.

Note: I tested the changes on xpu backend.

Part of https://github.com/pytorch/pytorch/issues/144103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146917
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/davidberard98, https://github.com/YUNQIUGUO
2025-02-21 17:10:53 +00:00
6a6de0e09d better error message (#147532)
Differential Revision: [D69939736](https://our.internmc.facebook.com/intern/diff/D69939736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147532
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-02-21 17:08:47 +00:00
a8ce4d1846 Add cachebench (#147537)
This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching.
```
python benchmarks/dynamo/cachebench.py --output="result.json"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537
Approved by: https://github.com/jamesjwu
2025-02-21 17:06:45 +00:00
af1072ffb6 [Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608)
In preparation for enabling OneDNN-based XPU SDPA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-02-21 16:12:30 +00:00
d6bb1d7f0a Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion, we can delete all the mixed mm code. When I benchmarked int8 weight-only mm on the new path against int8 mm on the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7), I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There are a couple of reasons for the speedup:

- prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to, instead of the deferred epilogue tuning in the current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, though not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare against a cublas baseline, so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on H100). This should be fine to delete.
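
A hedged sketch of the int8 weight-only pattern that now flows through the generic prologue-fusion path (shapes and compile mode are illustrative):

```python
import torch

w_int8 = torch.randint(-128, 127, (4096, 4096), dtype=torch.int8, device="cuda")
x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")

@torch.compile(mode="max-autotune")
def int8_weight_only_mm(x, w):
    # The upcast is a prologue that can now fuse into the matmul
    # template instead of hitting special-cased mixed-mm kernels.
    return x @ w.to(torch.bfloat16).t()

int8_weight_only_mm(x, w_int8)
```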

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-21 16:02:40 +00:00
36c461af95 Support SymmetricMemory's signaling kernels on sm60 and sm70 (#146308)
By leveraging libcudacxx's utilities: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/atomic_ref.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146308
Approved by: https://github.com/yifuwang
2025-02-21 15:29:02 +00:00
7ce4974e50 Fix PEP585 update (#147536)
Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable.  For now revert that change until it can be dealt with on its own.

Test Plan:
failures from D69920347 pass locally
unit tests pass

Reviewed By: oulgen

Differential Revision: D69936518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536
Approved by: https://github.com/jeanschmidt
2025-02-21 14:37:03 +00:00
654f2666d9 Increase memory for linux binary builds (#147542)
Recently I detected that some Linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`.

This change depends on https://github.com/pytorch/test-infra/pull/6316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-21 14:15:40 +00:00
51748a5d1a Update OpenBLAS to 0.3.29 (#144857)
* Improvements for GEMM to GEMV kernels
 * Improvements for SVE kernels for SGEMV and DGEMV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144857
Approved by: https://github.com/malfet
2025-02-21 10:07:06 +00:00
71d2827eeb Code Refactoring for getting start and stride from global ranks (#147230)
Summary: Code refactoring for getting the start and stride from global ranks; this function can be used by different collective backends.

Differential Revision: D69555405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230
Approved by: https://github.com/kwen2501
2025-02-21 10:02:50 +00:00
e7bf490c43 [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of a new RNN descriptor for the MIOpen backend that takes the dropout rate into account via a dropout descriptor. This fixes the associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 10:01:27 +00:00
cffe7183f1 [cutlass backend] Fix standalone runner test after swizzle became a runtime parameter (#147554)
Differential Revision: [D69945114](https://our.internmc.facebook.com/intern/diff/D69945114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147554
Approved by: https://github.com/mlazos
2025-02-21 09:27:44 +00:00
cyy
b61a556427 Turn onnx functions into static (#147598)
To avoid exposing ONNX symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598
Approved by: https://github.com/justinchuby
2025-02-21 07:40:28 +00:00
3395da7f7c Revert "Build a storage reader/writer to write checkpoints in HF format (#146352)"
This reverts commit c615b8c174c80936d365d19d8b8f4d9ad9a195f3.

Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))
2025-02-21 07:30:52 +00:00
e5da9df421 Revert "Increase memory for linux binary builds (#147542)"
This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830.

Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))
2025-02-21 07:14:57 +00:00
4986f0f52e [PT2]: allow empty dict to pass type check (#147167) (#147480)
Summary:

Seeing errors like the following when testing sigmoid for the inline_cvr and perevent_cvr models.
```
terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let an empty dict pass the type check.
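
A rough, hypothetical illustration of the failure mode using a TorchScript-style signature adapted from the error message above (the internal predictor path differs):

```python
from typing import Dict, Tuple

import torch
from torch import Tensor

@torch.jit.script
def forward(event_based_features: Dict[int, Tuple[Tensor, Tensor, Tensor]]) -> int:
    return len(event_based_features)

# An empty Python dict only carries type Dict[Any, Any]; a strict
# check used to reject it for the typed argument above, and it is
# now allowed through.
print(forward({}))  # 0
```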

Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
    --loadMode=BenchmarkAB \
    --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
    --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
    --moduleName=${module} \
    --submodToDevice "" \
    --benchmarkDontRebatchSamples=true \
    --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```

Reviewed By: yjhao

Differential Revision: D69871393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
2025-02-21 07:00:46 +00:00
c74b59fc1f [ROCm][TunableOp] resolve the rocBLAS version dynamically (#147363)
Dynamically gets the rocBLAS version instead of relying on preprocessing-time definitions, which may be stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147363
Approved by: https://github.com/pruthvistony, https://github.com/naromero77amd, https://github.com/jeffdaily
2025-02-21 06:50:21 +00:00
86ae672b6a Use has_triton_package in _inductor.runtime.hints (#147442)
Fixes #ISSUE_NUMBER
Use the existing method for the triton check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442
Approved by: https://github.com/Skylion007
2025-02-21 05:52:00 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
a2c3a2c5c4 Support serialization for uintx/intx in weights_only (#147500)
Summary:
Fixing the issue reported by huggingface

Test Plan:
python test/test_serialization.py -k test_serialization_uintx_intx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500
Approved by: https://github.com/mikaylagawarecki
2025-02-21 04:38:44 +00:00
c615b8c174 Build a storage reader/writer to write checkpoints in HF format (#146352)
Summary: As the title says, we want to write checkpoints in HF format with DCP; this diff enables that for the non-distributed use case.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage

N6476188 --> able to save and load tensor in hf format

Differential Revision: D68444967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
2025-02-21 03:31:21 +00:00
fe100c3c5b Add libtorch nightly build for CUDA 12.8 (#146265)
Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error

"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note.

Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support.

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265
Approved by: https://github.com/atalman
2025-02-21 03:04:06 +00:00
ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe: if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect, so we end up with a slightly different error message, but success/failure behavior is identical.

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
8a5265cb37 [Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337)
# Motivation
This PR intends to enable quantized fusion `qlinear+add` at Intel GPU backend.

At the backend level, we register the ops via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are the ones already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification to `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the code existing in x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_add_xpu
```

# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose output is collected from the UT. We can see the attribute `attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu`: the post-op add and ReLU are successfully fused into the GEMM computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-21 02:09:28 +00:00
8b818ab58f Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353)
Fixes #147303.
Use the float data type for the Half sum in the fallback implementation of batchnorm backward on CPU, as the representation range of Half is small.
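
A small numeric illustration of why:

```python
import torch

# A running Half accumulator stalls once its ulp exceeds the addend
# (at 2048 for +1.0 steps); accumulating in float reaches the true sum.
acc_half = torch.zeros((), dtype=torch.half)
acc_float = torch.zeros((), dtype=torch.float)
one = torch.ones((), dtype=torch.half)
for _ in range(10_000):
    acc_half = acc_half + one
    acc_float = acc_float + one.float()
print(acc_half)   # tensor(2048., dtype=torch.float16)
print(acc_float)  # tensor(10000.)
```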

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353
Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch
2025-02-21 01:33:33 +00:00
ac88a6c00d [fx] demote node prepend to self log from warning to debug (#147538)
FIXES https://github.com/pytorch/pytorch/issues/147175

This is harmless, not sure why this is a user warning. Writing reordering graph passes is more concise when we ignore this warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538
Approved by: https://github.com/yanboliang
2025-02-21 01:32:34 +00:00
4b35139a46 [ROCm][TunableOp] Fix TunableOp warmup environment variable. (#147412)
This PR corrects the behavior of the TunableOp warmup variables:
```
PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
```

See the updated comments, which describe how the environment variables are intended to work. Previously, if you set only one of the two environment variables, the warmup iters would always be zero.

Manually tested the four possible combinations to make sure things still behave as intended.
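
A hedged usage sketch (values are illustrative):

```python
import os

# Set before the first GEMM. Either bound alone should now yield a
# non-zero warmup; previously, setting only one of the two left the
# warmup iterations at zero.
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS"] = "100"

import torch

torch.cuda.tunable.enable(True)
a = torch.randn(512, 512, device="cuda")
torch.mm(a, a)  # tuned with warmup bounded by the duration above
```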

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412
Approved by: https://github.com/jeffdaily
2025-02-21 00:29:58 +00:00
fdb1305ace reland "[sigmoid] Test OSS model runner with test_export.py" (#147535)
Summary: There are ~260 tests for all the corner cases of export in test_export.py; utilize them to test sigmoid in the OSS setting.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid

Differential Revision: D69937387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
2025-02-20 23:45:13 +00:00
87e6e2924e Increase memory for linux binary builds (#147542)
Recently I detected that some Linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-20 23:02:45 +00:00
be0df96b50 Fix c++ implementation of strip_function_call (#147436)
#143063 was missing handling for a couple of UCS cases and had some bugs in the way it dealt with errors.

- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
2025-02-20 20:41:21 +00:00
af31640391 [cutlass backend] enable mixed mm test (cutlass2x) for H100 (#147474)
I am okay with not landing this as well. The motivation is to make developing on H100 smoother.

The reason the current test works on A100 but not H100 is an alignment issue, which was caused by arch-specific filtering logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147474
Approved by: https://github.com/alexsamardzic, https://github.com/ColinPeppler
2025-02-20 20:28:44 +00:00
d068141c3b [cutlass backend] add subproc tests (#147173)
I want to separate subproc autotuning from the main tests. I also observed that addmm can work without subproc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147173
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #147169
2025-02-20 20:07:42 +00:00
2565951f8a [cutlass backend] remove triton from most tests and add an integration test (#147169)
Removing aten and triton from the list of backends for the tests that have it. Instead, add a small integration test to make sure autotuning works fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147169
Approved by: https://github.com/ColinPeppler
2025-02-20 20:07:42 +00:00
fb1f7f6a09 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/native/miopen/Conv_miopen.cpp +1 (#147496)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69755123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496
Approved by: https://github.com/Skylion007
2025-02-20 19:00:38 +00:00
6971b77510 [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935)
Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops.

Test Plan: CI

Differential Revision: D68833927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935
Approved by: https://github.com/Skylion007
2025-02-20 18:50:55 +00:00
863ac20659 [CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484)
Do not overwrite the return code of a single file when it fails. This allows the log to be printed to stdout and the GHA logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484
Approved by: https://github.com/ZainRizvi
2025-02-20 17:51:58 +00:00
83bb921a5a [ROCm] Update meta_registration for efficient attention (#146979)
Fixes a series of failing and skipped unit tests.

For NVIDIA hardware, the logsumexp last dimension is required to be a multiple of 32. This is not the case for ROCm.

A related issue: https://github.com/pytorch/pytorch/issues/146848

The unit tests in question:
```bash
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_6_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_6_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979
Approved by: https://github.com/shunting314
2025-02-20 15:05:13 +00:00
382fbcc1e4 add the torch.float8_e8m0fnu dtype to PyTorch (#147466)
Summary:

Continuing the work from https://github.com/pytorch/pytorch/pull/146427

Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format.  Example of basic functionality:

```python
import torch

# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu)  # RNE rounding
x2 = x1.to(torch.float32)  # 2 ** exponent

# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)

# printing
print(x0)
```

Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works

For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)

Test Plan:

```
pytest test/quantization/core/experimental/test_float8.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
2025-02-20 13:55:42 +00:00
574371d828 Add current cuda device index to FXGraphCache key (#147464)
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.

Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
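
A hedged illustration of the no-tensor-input case described above; the function and the two-GPU setup are ours, not from the PR:

```python
import torch

@torch.compile
def make_ones():
    # No tensor inputs: example-input metadata carries no device info,
    # so the current device must be part of the cache key.
    return torch.ones(4, device="cuda")

torch.cuda.set_device(0)
a = make_ones()  # compiled and cached while cuda:0 is current
torch.cuda.set_device(1)
b = make_ones()  # without the fix, this could reuse the cuda:0 artifact
```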

A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.

I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.

Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-02-20 12:38:21 +00:00
ead970c8d0 Revert "Add ciflow/riscv64 label"
This reverts commit 5116b27792d37c38039459c922a466581e219fc2.
(I've pushed to the wrong branch by accident)
2025-02-20 11:55:52 +01:00
5116b27792 Add ciflow/riscv64 label 2025-02-20 11:09:06 +01:00
6beba8dcce Optimize graph.py typing (#147099)
Optimize the type annotations of `graph.py` methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099
Approved by: https://github.com/cyyever, https://github.com/aorenste
2025-02-20 09:32:30 +00:00
f9b8121350 Make Inductor scheduler aware of _scaled_mm (#146992)
This is used, for example, to estimate runtime when doing comms overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314
2025-02-20 09:02:31 +00:00
9da250aada type fully_shard so that the return value can be chained with typing enabled (#147489)
This allows for

```
fsdped = fully_shard(model)
fsdped.set_xyz()
```

same applies if `model` is actually a list of modules

Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489
Approved by: https://github.com/Skylion007
ghstack dependencies: #147488
2025-02-20 08:43:16 +00:00
6a72aaadae Fix torch.max optional args dim, keepdim description (#147177)
The optional args `dim` and `keepdim` of [`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max) are not described in the documentation, so users may overlook them.

```python
>>> import torch
>>> a = torch.randn(3,1,3)
>>> a.max()
tensor(1.9145)
>>> a.max(dim=1)
torch.return_types.max(
values=tensor([[ 1.1436, -0.0728,  1.3312],
        [-0.4049,  0.1792, -1.2247],
        [ 0.8767, -0.7888,  1.9145]]),
indices=tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]))

```

## Changes

- Add `optional` description for `dim`, `keepdim`
- Add example of using `dim`, `keepdim`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087)

### After

![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177
Approved by: https://github.com/colesbury
2025-02-20 08:18:09 +00:00
452315c84f Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492)
The exact call is coming from here:

78a94c9114/torch/_inductor/memory.py (L161)

I have no idea why this error is being thrown or which mode/modes might be failing because of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492
Approved by: https://github.com/eellison
2025-02-20 08:00:26 +00:00
a000c7e6d2 Add hint message for pack_padded_sequence (#146747)
Fixes #144207

Add a truncation hint message in the docs for [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html).

## Test Result

![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747
Approved by: https://github.com/mikaylagawarecki
2025-02-20 06:27:07 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
76ad19a549 [dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425)
This reduces fixed overhead seen in a few internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-02-20 05:42:52 +00:00
8f6b9403c1 [audio hash update] update the pinned audio hash (#147423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147423
Approved by: https://github.com/pytorchbot
2025-02-20 05:39:46 +00:00
77aa602871 [torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399)
Summary:
This PR adds an _is_script_object method to differentiate ScriptModule and ScriptObject: the former inherits from ScriptObject in C++, so both pass the isinstance(obj, torch.ScriptObject) check.

The qualified name of a ScriptObject (i.e. a custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base.
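
A minimal sketch of that prefix convention; only the "__torch__.torch.classes" prefix comes from the summary above, the helper itself is illustrative:

```python
CUSTOM_CLASS_PREFIX = "__torch__.torch.classes"

def is_custom_class_qualname(qualified_name: str) -> bool:
    # ScriptObject (custom class) qualified names carry this prefix;
    # ScriptModule qualified names do not.
    return qualified_name.startswith(CUSTOM_CLASS_PREFIX)

assert is_custom_class_qualname("__torch__.torch.classes._TorchScriptTesting._Foo")
```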

Test Plan: Add new test.

Differential Revision: D69685316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399
Approved by: https://github.com/yushangdi
2025-02-20 04:57:57 +00:00
7185ca8348 [Cutlass] Add test verifying number of precompiles (#147477)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477
Approved by: https://github.com/henrylhtsang
2025-02-20 04:47:57 +00:00
5f5b44f6bf [ROCm] Update inductor-periodic.yml to use the correct label (#147473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147473
Approved by: https://github.com/jeffdaily
2025-02-20 04:44:18 +00:00
0d56b7e665 Support size oblivious max equation (#147344)
Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints.

The basic idea is max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like since their value ranges are [2, inf].
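
A tiny numeric check of that reasoning over a finite slice of the [2, inf) range (plain Python, just to make the argument concrete):

```python
# Both summands are size-like, i.e. at least 2, so the sum is at least 4
# and max(1, u0 + u1) can never pick the constant 1.
for u0 in range(2, 10):
    for u1 in range(2, 10):
        assert max(1, u0 + u1) == u0 + u1
```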

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344
Approved by: https://github.com/angelayi
2025-02-20 04:33:19 +00:00
0b0da81021 Support static method of torchbind attributes in torch.compile with inductor backend (#146927)
As title.

Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.

Also, this diff only covers *static* methods of torchbind *attributes*. Some cases that are not supported/tested:
- dynamic torchbind objects
- torchbind objects as an input to the module.

Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.

Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):

```python
async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
2025-02-20 03:33:19 +00:00
de1cb0f351 capture the return value in the contract typing (#147488)
----

* the existing typing makes the return type `Optional[nn.Module]`
* this doesn't seem to be what the decorator actually does as it does
  not alter the original return type
* This PR aims to fix the typing

Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488
Approved by: https://github.com/Skylion007
2025-02-20 03:32:34 +00:00
fea718f062 [BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730)
Our three main users are OK with this, with two of them (foreach_map,
invoke_quant) preferring it like this.

I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.
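
A minimal before/after sketch of the call-site change; `my_hop` and the operand names are illustrative, not a real HOP definition:

```python
# Old convention: operands packed into a single tuple argument.
def call_before(my_hop, subgraph, x, y):
    return my_hop(subgraph, (x, y))

# New convention: operands splatted, i.e. hop(subgraph, *operands).
def call_after(my_hop, subgraph, x, y):
    return my_hop(subgraph, x, y)
```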

Test Plan
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
2025-02-20 02:30:36 +00:00
f79b352f5a [Intel GPU] qconv_pointwise.binary XPU support (#135189)
# Motivation
This PR intends to enable the quantized fusions `qconv+add` and `qconv+add+relu` on the Intel GPU backend.

At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined in `x86InductorQuantizer`.

At the Inductor level, we made a small modification to `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for pattern matching, we largely reuse the existing code from x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_add_xpu \
   -k test_qconv2d_add_relu_xpu 2>&1
```

# Runtime exemplification
The following oneDNN verbose log was collected from the UT:
```bash
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168
ghstack dependencies: #133307

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-20 02:02:54 +00:00
93316cfe94 Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248)
Fixes #147002

Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG.
Updated tests of these logs as well.
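
A hedged usage sketch; the artifact names are assumed to match the ones this PR registers, and TORCH_LOGS is set before importing torch to be safe about when the logging config is read:

```python
import os

os.environ["TORCH_LOGS"] = "ir_pre_fusion,ir_post_fusion"

import torch

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))  # the IR dumps now show up via TORCH_LOGS
```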

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248
Approved by: https://github.com/eellison
2025-02-20 00:26:17 +00:00
16e202a38e [dynamo] improved graph break messages for some common graph break sites [1/N] (#146525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525
Approved by: https://github.com/jansel
2025-02-20 00:08:13 +00:00
1e94c7aaa4 [draft_export] only clear pending unbacked symbols for overwritten kernels (#147427)
This was wrong; we were doing this in all cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427
Approved by: https://github.com/angelayi
2025-02-20 00:07:54 +00:00
3986c3e4a6 [reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434)
Reland of https://github.com/pytorch/pytorch/pull/146877
incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185

Summary:
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" failures. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69825865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434
Approved by: https://github.com/ColinPeppler
2025-02-20 00:07:07 +00:00
a88d7d4268 [util] fetch logical count cpu (#147413)
To match with Vcpu count with aws:

after (96), before (48)
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
004d65aeb0 Add type hints to cuda kernel (#147471)
Missed this in a previous PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471
Approved by: https://github.com/eellison
2025-02-19 23:35:10 +00:00
48203bec63 [BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409)
Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but it has been a while since then. See if CI complains.

Differential Revision: D69573185

See also  https://github.com/pytorch/pytorch/pull/147158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409
Approved by: https://github.com/chenyang78
2025-02-19 23:04:22 +00:00
f63db6255f Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153)
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue, and this should now be safe to re-land.

Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.

In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.

Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).

```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```

Differential Revision: D69625112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
2025-02-19 23:03:29 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we did make it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which a Python Enum cannot support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the fr analysis script.
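
A hedged sketch of the refactoring direction: a plain class keeps the enum-like state but can also carry the dynamic culprit info (names are illustrative):

```python
class MatchInfo:
    def __init__(self, state: str, culprit: str | None = None):
        self.state = state      # the former Enum member
        self.culprit = culprit  # dynamic info an Enum member can't hold

    def __call__(self, culprit: str) -> "MatchInfo":
        # Stay callable like the old Enum, but return a new instance
        # instead of consuming the culprit right away.
        return MatchInfo(self.state, culprit)

FULLY_MATCHED = MatchInfo("FULLY_MATCHED")
mismatch = MatchInfo("STATE_MISMATCH")("rank 3 never completed")
```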

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
41ae15faa3 [ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392)
Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests.

https://github.com/pytorch/pytorch/issues/139301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147396
2025-02-19 21:55:12 +00:00
24738768a8 more dist ops in non strict (#147417)
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.

Test Plan: added unit tests

Differential Revision: D69813991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
2025-02-19 21:29:16 +00:00
394676759d ci: Add h100 nightly perf testing (#146868)
This infrastructure has been up for a while so add a workflow to actually run things on it.

> [!IMPORTANT]
> We only have **14** linux.aws.h100 runners so it might be beneficial for us to actually pare this list down.
> Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868
Approved by: https://github.com/eellison, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-02-19 21:13:17 +00:00
8bea08e5bc [BE] Fix tensor stub (#147384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman
2025-02-19 19:47:03 +00:00
e758d8b4d1 [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-19 19:45:01 +00:00
279c7f262e [ONNX] Refactor dispatcher and registry (#147396)
This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301).

The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list.

After the migration, the hooks for loading ops from onnx script will be removed.

1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts
2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure
3. Update the registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396
Approved by: https://github.com/titaiwangms
2025-02-19 19:38:28 +00:00
4f3c070b25 [inductor] GraphLowering code movement (#147335)
Moved these methods under `__init__` to be more idiomatic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335
Approved by: https://github.com/eellison
ghstack dependencies: #147331
2025-02-19 19:32:30 +00:00
5a3a50c791 Update Arm Compute Library (ACL) to v25.02 (#147454)
Among other things, this version of ACL fixes the redundant declaration warning that we're blocked on in (#145942, #146620, #147337) and introduces better scheduling heuristics for GEMMs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147454
Approved by: https://github.com/malfet
2025-02-19 18:51:08 +00:00
9fee408daa [caffe2] disable warning for unused arguments (#147411)
Summary: Disable warnings on unused command line arguments for ukernels_asm.

Test Plan:
On top of D69602077:
```
$ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac
```

Differential Revision: D69807977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411
Approved by: https://github.com/kimishpatel
2025-02-19 17:54:31 +00:00
5220d402b5 [ROCm] TopK optimizations for AMD GPUs (#146387)
TopK on ROCm performs better across the test suite with the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146387
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-02-19 17:10:59 +00:00
e6c86952c6 Add CUDA 12.8 windows nightly build (#147037)
https://github.com/pytorch/pytorch/issues/145570

The Windows AMI is deployed to prod today; prepping the Windows CUDA 12.8 build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147037
Approved by: https://github.com/atalman
2025-02-19 16:59:32 +00:00
8cbf7d0d6e [Inductor UT][XPU] Skip fft_c2c case since it's not implemented on XPU. (#147351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147351
Approved by: https://github.com/jansel
2025-02-19 16:03:03 +00:00
ed83b0b70b [ddp] decouple python reducer from compilation mode (#147123)
The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we silently fall back to the C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
2025-02-19 15:51:40 +00:00
303ad1916f [FlexAttention] Fix weird generate stride call in flex decode (#147435)
# Summary
Seems like we had a redundant tuple unpack, which doesn't appear to be supported in new Triton.

Fixes https://github.com/pytorch/pytorch/issues/147373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435
Approved by: https://github.com/BoyuanFeng
2025-02-19 12:12:27 +00:00
77dbd28535 [Cutlass] Restore search space for swizzle (#147224)
This restores the previous search space; since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224
Approved by: https://github.com/eellison
ghstack dependencies: #147222, #147223
2025-02-19 09:22:51 +00:00
e9b3ff0570 [Cutlass] Add support for runtime param choices, starting with swizzle (#147223)
This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)) list method and then possible choices can be [looped over similarly to swizzle](933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)). For precompile, we now filter choices by hash to only compile each distinct kernel source once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223
Approved by: https://github.com/Chillee, https://github.com/eellison
ghstack dependencies: #147222
2025-02-19 09:22:51 +00:00
81eb2a78ad [Inductor] Add autotuning artifact logging (#147222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-19 09:22:42 +00:00
655b061ef0 [inductor] Freeze runtime asserts after shape prop but before codegen (#147331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331
Approved by: https://github.com/eellison
2025-02-19 06:29:13 +00:00
454fbd5bbe realize stride symbols in estimate_runtime (#146752)
Unfortunately, I could not create a local repro or unit test.
Fixes https://github.com/pytorch/pytorch/issues/146686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146752
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-02-19 06:02:49 +00:00
2c3680ce38 [apf] Fix input adapter (#147238)
Summary: Add support for inputs that no longer exist in `input_fields` but are not actually used by the original program. In this case, we just give it a dummy input based on the node's metadata.
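
A hedged sketch of the dummy-input construction; `node.meta["val"]` holding a fake tensor with shape/dtype/device is an assumption about the metadata layout:

```python
import torch

def make_dummy_input(node):
    # Fabricate a placeholder for a stale-but-unused graph input from
    # the tensor metadata recorded on the node.
    fake = node.meta["val"]
    return torch.zeros(fake.shape, dtype=fake.dtype, device=fake.device)
```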

Test Plan: Verified for S488841

Differential Revision: D69328093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
2025-02-19 04:49:58 +00:00
465930ee81 Revert "[ROCm] ROCm-specific gemm tuning parameters" (#147388)
Summary:
This diff reverts D69573225 / https://github.com/pytorch/pytorch/pull/143286

15% cold compile time regression, see https://fb.workplace.com/groups/1075192433118967/permalink/1608559059782299/

Test Plan: NA

Differential Revision: D69790102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147388
Approved by: https://github.com/yanboliang
2025-02-19 04:47:35 +00:00
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``.
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``.
We use a pinned version of NCCL rather than a submodule.
Moved the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
bd370c138a fix pt2e block wise quantization unit test (#147406)
Differential Revision: D69806596

https://github.com/pytorch/pytorch/pull/146946 breaks the unit test, because the quant nodes are folded by default now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147406
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-02-19 02:40:27 +00:00
5006932cbc [cutlass backend] forward fix of standalone runner for fbcode (#147158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147158
Approved by: https://github.com/chenyang78
2025-02-19 02:02:10 +00:00
f16d30137c [OSS] Update FileSystem methods to properly handle a string argument (#145751)
Summary: When testing, I tried to pass a string argument to the FileSystem class' methods, which is a valid input, but the cast() that converted the string to a path wasn't working as likely expected and was causing all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.

Test Plan: N6475361; the methods no longer throw an error with a string arg like they did previously.

Differential Revision: D68713937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
2025-02-19 01:50:24 +00:00
953f7834cc [ONNX] Pick up missing types in dynamic shapes renaming (#147407)
Found in `_check_dynamic_shapes` that int and None are valid input types for dynamic_shapes.
This PR adds support for these two types and adds tests to keep the ONNX flattening logic in sync with the one in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407
Approved by: https://github.com/justinchuby
2025-02-19 01:49:53 +00:00
757d7f28d1 [CD] Increase timeout for windows binary builds (#147390)
Mitigates https://github.com/pytorch/pytorch/issues/147376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147390
Approved by: https://github.com/huydhn, https://github.com/jeanschmidt, https://github.com/malfet
2025-02-19 01:15:04 +00:00
959d79f85f [ONNX] Move and improve error reproduction logic in test (#147391)
https://github.com/pytorch/pytorch/issues/139301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147391
Approved by: https://github.com/titaiwangms
2025-02-19 00:00:11 +00:00
babb2dc2af Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 6f7e67c43c13b5675b4ff60cbaa71e5083a22481.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))
2025-02-18 23:58:31 +00:00
525ca80f53 add unbacked strict mode (#147333)
fixes #145775

This is the first step in introducing a "strict" mode where we don't silently specialize and don't silently graph break. At a high level, when we call mark_unbacked(..., strict=True), anytime we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value.
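
A hedged usage sketch; only the strict flag comes from this PR, the surrounding code is illustrative:

```python
import torch

x = torch.randn(8)
# Any specialization of dim 0 now raises explicitly instead of silently
# burning the unbacked symint into a constant.
torch._dynamo.mark_unbacked(x, 0, strict=True)

@torch.compile(fullgraph=True)
def f(t):
    return t * 2

f(x)
```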

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333
Approved by: https://github.com/laithsakka
2025-02-18 23:33:55 +00:00
5d547d82e6 Add no_data_dependent_graph_break mode (#147342)
This adds a strict mode `TORCHDYNAMO_UNBACKED_STRICT` to prevent graph breaking when we guard on data-dependent expressions. This is a better UX for those who are actively trying to make their model more dynamic but aren't close enough to a full graph to use that flag directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342
Approved by: https://github.com/laithsakka
2025-02-18 23:33:47 +00:00
bae049b439 Update addr doc (#146482)
Fixes https://github.com/pytorch/pytorch/issues/146399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146482
Approved by: https://github.com/janeyx99
2025-02-18 23:25:38 +00:00
ca397d82a6 [Sigmoid] Fix issues with constant folding and fba_ops (#146948)
Summary:
There are 2 issues:

- `skip_folding_node_fn` isn't considered when propagating constant values. So given a skipped node with constant inputs, it outputs a constant, its users can output constant values, and they are then included in the constant graph. However, the skipped node itself is not included when extracting the constant graph. This issue is fixed by checking for skipped nodes during constant propagation and making a skipped node output an unknown value (not a constant) so that its users cannot output constants (see the sketch after this list).

- The `fba_linear` op can be included in the constant graph, but it is not implemented for CPU, so the constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`.

- A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops).
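
A hedged sketch of the propagation fix from the first bullet; the node API (`node.inputs`, `node.evaluate`) is illustrative, not the real FX interface:

```python
UNKNOWN = object()

def propagate_constants(nodes, skip_folding_node_fn):
    # nodes are assumed to be in topological order.
    env = {}
    for node in nodes:
        inputs = [env.get(inp, UNKNOWN) for inp in node.inputs]
        if skip_folding_node_fn(node) or UNKNOWN in inputs:
            # A skipped node yields "unknown", so its users cannot be
            # folded into the constant graph.
            env[node] = UNKNOWN
        else:
            env[node] = node.evaluate(*inputs)
    return {n: v for n, v in env.items() if v is not UNKNOWN}
```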

Reviewed By: StellarrZ

Differential Revision: D68716393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948
Approved by: https://github.com/zhxchen17
2025-02-18 23:17:47 +00:00
c9a15d980f [FSDP2] Simplify shard_placement_fn in test (#146847)
Summary: Found this while checking `shard_placement_fn` for the Shampoo shard-independent implementation.

Test Plan: OSS CI & tests

Differential Revision: D69412878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146847
Approved by: https://github.com/awgu
2025-02-18 23:01:26 +00:00
c8433c2c6c [BE] correct docs for clock_rate to MHz, fixes #147098 (#147393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393
Approved by: https://github.com/andrewor14
2025-02-18 22:59:58 +00:00
a21a123fd5 Add fqn_modifier at loading_state_dict and unit test (#146557)
In a fusion model, users might change the state_dict keys via a state_dict hook.
The load_state_dict APIs here don't call model.state_dict(), so the hooks aren't invoked to change the keys, causing a mismatch between FQNs and state_dict keys.

This PR lets users declare how they change the state_dict key prefix (they can name it; here we call it "fqn_modifiers").
While loading the state_dict, we apply the prefix change when computing the FQN so that keys are processed the same way as through the state_dict hook.

For example:
There's a state_dict_hook:

```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```

In DSD after this PR, we skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module:
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
2025-02-18 22:54:41 +00:00
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
3f35664ee8 More precise check for shared storage check in inductor/reinplace pass (#147050)
Currently, if two tensors share storage, we have some logic to avoid re-inplacing. Before this PR, two tensors were considered to share storage if they used the same underlying storage, even if they did not overlap. This diff enhances the checks to avoid the cases where we can easily tell the tensors do not overlap.
Mitigates https://github.com/pytorch/pytorch/issues/139628 but does not fix the inductor issue in it.
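
A small illustration of the distinction; the overlap check below is our sketch for contiguous views, not the PR's implementation:

```python
import torch

buf = torch.empty(10)
a, b = buf[:5], buf[5:]
# Same underlying storage...
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()

def contiguous_views_overlap(x, y):
    # Compare the [offset, offset + numel) element ranges within the
    # shared storage; valid only for contiguous views.
    xs, ys = x.storage_offset(), y.storage_offset()
    return xs < ys + y.numel() and ys < xs + x.numel()

# Disjoint ranges: in-place writes to one view cannot clobber the other,
# so a precise check can still allow reinplacing here.
assert not contiguous_views_overlap(a, b)
```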

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147050
Approved by: https://github.com/zou3519
2025-02-18 21:55:34 +00:00
63e8ad49b8 [dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355)
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
2025-02-18 21:37:12 +00:00
75db0fd8a0 [dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-02-18 21:37:12 +00:00
eb892cd768 [codegen] enable SORT and TUPLE_REDUCTION for AMD Triton (#147340)
Looks like Triton's AMD backend supports multiple inputs already.
Let's enable SORT and TUPLE_REDUCTION for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147340
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/eellison
2025-02-18 21:15:23 +00:00
1b047d5d7a Add link to non_blocking/pinmem tutorial in Tensor.to docstrings (#145651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145651
Approved by: https://github.com/svekars
2025-02-18 20:38:01 +00:00
clr
166419b9c1 dynamo: Don't crash when encountering a object with no __name__ (#147246)
This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test rather than a full repro. Let me know if people feel strongly and want a full reproduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
2025-02-18 20:35:49 +00:00
12v
74682e8595 Fix typo (#147330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147330
Approved by: https://github.com/srinivasreddy, https://github.com/Skylion007
2025-02-18 20:20:34 +00:00
d9b3d76b85 Fix linter warnings (#147386)
https://github.com/pytorch/pytorch/pull/145866 accidentally introduced a warning about const casts and also comparison of unsigned long int with signed long int.

This PR fixes both of those warnings.

Tested by running:

```
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail 
-DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

And I got no warnings or errors. Same with `python setup.py develop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147386
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-02-18 20:03:16 +00:00
302f56a1f2 Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)"
This reverts commit 59b7e52ad8f6146b4364515a7f3e54d6f3edd6da.

Reverted https://github.com/pytorch/pytorch/pull/146845 on behalf of https://github.com/jeanschmidt due to Seems to break a few code dependencies in multiple places ([comment](https://github.com/pytorch/pytorch/pull/146845#issuecomment-2666656834))
2025-02-18 19:01:27 +00:00
1527 changed files with 59113 additions and 32230 deletions

View File

@@ -20,7 +20,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel
pip install auditwheel==6.2.0
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

View File

@ -39,7 +39,7 @@ def build_ArmComputeLibrary() -> None:
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v24.09",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
@@ -99,10 +99,14 @@ def update_wheel(wheel_path, desired_cuda) -> None:
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
@@ -204,7 +208,7 @@ if __name__ == "__main__":
else:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
elif branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()

View File

@@ -329,7 +329,7 @@ def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v24.09 {git_clone_flags}"
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
@@ -761,7 +761,7 @@ def start_build(
version = host.check_output("cat pytorch/version.txt").strip()[:-2]
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1"
if branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:

View File

@@ -1,4 +1,8 @@
#!/bin/bash
# The purpose of this script is to:
# 1. Extract the set of parameters to be used for a docker build based on the provided image name.
# 2. Run docker build with the parameters found in step 1.
# 3. Run the built image and print out the expected and actual versions of packages installed.
set -ex
@@ -95,11 +99,11 @@ fi
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
@@ -154,6 +158,65 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9
@@ -322,7 +385,7 @@ case "$image" in
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
@@ -330,7 +393,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes

View File

@@ -1 +1 @@
5e4d6b6380d575e48e37e9d987fded4ec588e7bc
01a22b6f16d117454b7d21ebdc691b0785b84a7f

View File

@@ -1 +1 @@
v2.25.1-1
v2.26.2-1

View File

@@ -1 +1 @@
e98b6fcb8df5b44eb0d0addb6767c573d37ba024
0bcc8265e677e5321606a3311bf71470f14456a8

View File

@@ -1 +1 @@
4b3bb1f8da0ded6ccd572dd1358ef45af5a1befe
96316ce50fade7e209553aba4898cd9b82aab83b

View File

@@ -1,6 +1,6 @@
set -euo pipefail
readonly version=v24.04
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary

View File

@@ -37,7 +37,7 @@ install_ubuntu() {
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.25.1-1+cuda12.4 libnccl-dev=2.25.1-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi

View File

@@ -66,7 +66,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.28=*openmp*"
conda_install "openblas==0.3.29=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi

View File

@@ -2,7 +2,7 @@
set -ex
NCCL_VERSION=v2.25.1-1
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_040 {

View File

@@ -3,19 +3,8 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.8.0.87
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
@@ -28,140 +17,7 @@ function install_cusparselt_063 {
rm -rf tmp_cusparselt
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run
./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.7.1.26
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
@@ -198,10 +54,6 @@ function install_128 {
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1


@@ -53,7 +53,7 @@ setup_executorch() {
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake || true
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}


@@ -4,10 +4,15 @@ set -ex
[ -n "$NINJA_VERSION" ]
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
arch=$(uname -m)
if [ "$arch" == "aarch64" ]; then
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux-aarch64.zip"
else
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
fi
pushd /tmp
wget --no-verbose --output-document=ninja-linux.zip "$url"
unzip ninja-linux.zip -d /usr/local/bin
rm -f ninja-linux.zip
popd


@@ -32,7 +32,7 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.17.0
pip_install onnxscript==0.1.0 --no-deps
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes


@@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="


@@ -115,7 +115,7 @@ index a5007ffc..13fa07fc 100644
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
+ //fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}


@@ -60,15 +60,15 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
else
pip_install -e .
pip_install .
fi
if [ -n "${CONDA_CMAKE}" ]; then


@@ -47,6 +47,9 @@ function install_ubuntu() {
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl=2025.0.1-6"
fi
apt-get install -y ${XPU_PACKAGES}
# Cleanup
@@ -82,6 +85,9 @@ gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.
EOF
# Install Intel Support Packages
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl-2025.0.1-6"
fi
yum install -y ${XPU_PACKAGES}
# The xpu-smi packages
dnf install -y xpu-smi


@@ -39,7 +39,7 @@ case ${GPU_ARCH_TYPE} in
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)


@@ -1,153 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh


@@ -38,6 +38,12 @@ RUN yum install -y \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH


@@ -48,7 +48,7 @@ case ${GPU_ARCH_TYPE} in
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
@@ -97,7 +97,7 @@ case ${GPU_ARCH_TYPE} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
@@ -121,7 +121,8 @@ fi
(
set -x
if [ "$(uname -m)" != "s390x" ]; then
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
@@ -139,7 +140,7 @@ fi
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GITHUB_REF=${GITHUB_REF:-"dev")}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
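The `${GITHUB_REF##*/}` expansion above keeps only the text after the last slash when deriving the branch tag. A minimal Python sketch of the same transformation (the ref value is hypothetical):

```python
# Mirror of bash's ${GITHUB_REF##*/}: drop the longest prefix ending in "/".
github_ref = "refs/heads/release/2.7"  # hypothetical ref value
branch_name = github_ref.rsplit("/", 1)[-1]
print(branch_name)  # -> 2.7
```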


@@ -3,7 +3,7 @@
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf


@@ -90,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.13.0
mypy==1.14.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
#Pinned versions: 1.14.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@@ -294,7 +294,7 @@ ghstack==0.8.0
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.5
jinja2==3.1.6
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@@ -339,7 +339,7 @@ onnx==1.17.0
#Pinned versions:
#test that import:
onnxscript==0.1.0
onnxscript==0.2.2
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:


@@ -1 +1 @@
3.2.0
3.3.0


@@ -12,7 +12,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux-builder:cuda${DESIRED_CUDA}-main" \
"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all


@@ -54,11 +54,11 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)


@@ -173,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@@ -191,7 +192,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
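As a worked example of the arithmetic above — a sketch assuming a hypothetical 16-CPU runner, where `nproc --ignore=2` reports 14:

```python
# MAX_JOBS=$(( $(nproc --ignore=2) / 3 )) on a hypothetical 16-CPU runner:
nproc = 16              # total CPUs
usable = nproc - 2      # what `nproc --ignore=2` reports
max_jobs = usable // 3  # bash $(( ... )) truncates like integer division
print(max_jobs)         # -> 4
```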
@@ -377,8 +378,10 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPUs minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.


@@ -202,7 +202,7 @@ function install_torchrec_and_fbgemm() {
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.7 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"


@@ -46,7 +46,9 @@ def train(args, model, device, train_loader, optimizer, epoch):
optimizer.step()
if batch_idx % args.log_interval == 0:
print(
f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}" # noqa: B950
f"Train Epoch: {epoch} "
f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
f"({100.0 * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}"
)
if args.dry_run:
break
@@ -71,7 +73,9 @@ def test(model, device, test_loader):
test_loss /= len(test_loader.dataset)
print(
f"\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n" # noqa: B950
f"\nTest set: Average loss: {test_loss:.4f}, "
f"Accuracy: {correct}/{len(test_loader.dataset)} "
f"({100.0 * correct / len(test_loader.dataset):.0f}%)\n"
)
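The reflow in this file leans on Python's implicit concatenation of adjacent string literals, so the split f-strings still print as one line. A self-contained sketch with made-up values:

```python
# Adjacent f-strings concatenate at compile time, so this prints one line.
epoch, seen, total, loss = 1, 640, 60000, 0.123456  # hypothetical values
print(
    f"Train Epoch: {epoch} "
    f"[{seen}/{total} "
    f"({100.0 * seen / total:.0f}%)]\tLoss: {loss:.6f}"
)
```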


@@ -166,6 +166,10 @@ def test_cuda_gds_errors_captured() -> None:
major_version = int(torch.version.cuda.split(".")[0])
minor_version = int(torch.version.cuda.split(".")[1])
if target_os == "windows":
print(f"{target_os} is not supported for GDS smoke test")
return
if major_version < 12 or (major_version == 12 and minor_version < 6):
print("CUDA version is not supported for GDS smoke test")
return
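The gate above requires CUDA 12.6 or newer; an equivalent sketch using tuple comparison (assuming `torch.version.cuda` is a dotted "major.minor" string):

```python
# Same gate via tuple comparison; assumes a "major.minor[.patch]" string.
cuda_version = "12.6"  # hypothetical torch.version.cuda
major, minor = (int(p) for p in cuda_version.split(".")[:2])
unsupported = major < 12 or (major == 12 and minor < 6)
print(not unsupported)  # True, and equivalent to (major, minor) >= (12, 6)
```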


@@ -314,6 +314,13 @@ test_python() {
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
time python test/run_test.py --include lazy/test_ts_opinfo.py --verbose
export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE
}
test_dynamo_wrapped_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
@@ -476,6 +483,8 @@ elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--export-aot-inductor)
elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--inductor --inductor-compile-mode max-autotune)
elif [[ "${TEST_CONFIG}" == *inductor* && "${TEST_CONFIG}" != *perf* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--inductor)
fi
@@ -490,6 +499,59 @@ else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
fi
test_cachebench() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local BENCHMARK
if [[ "${SHARD_NUMBER}" == 1 ]]; then
local BENCHMARK=torchbench
elif [[ "${SHARD_NUMBER}" == 2 ]]; then
local BENCHMARK=huggingface
else
echo "invalid SHARD_NUMBER: ${SHARD_NUMBER}"
exit 1
fi
local mode_options=("training" "inference")
for mode in "${mode_options[@]}"; do
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode "$mode" \
--device cuda \
--benchmark "$BENCHMARK" \
--repeat 3 \
--output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}.json"
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode "$mode" \
--dynamic \
--device cuda \
--benchmark "$BENCHMARK" \
--repeat 3 \
--output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}_dynamic.json"
done
}
test_verify_cachebench() {
TMP_TEST_REPORTS_DIR=$(mktemp -d)
TEST_OUTPUT="$TMP_TEST_REPORTS_DIR/test.json"
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode training \
--device cpu \
--model nanogpt \
--benchmark torchbench \
--output "$TEST_OUTPUT"
# -s checks that the file exists and is non-empty
if [[ ! -s "$TEST_OUTPUT" ]]; then
echo "Cachebench failed to produce an output."
echo "Run 'python benchmarks/dynamo/cachebench.py' to make sure it works"
exit 1
fi
}
test_perf_for_dashboard() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
@@ -518,6 +580,8 @@ test_perf_for_dashboard() {
test_inductor_set_cpu_affinity
elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then
device=cuda_a10g
elif [[ "${TEST_CONFIG}" == *h100* ]]; then
device=cuda_h100
elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
device=rocm
fi
@@ -698,6 +762,8 @@ test_dynamo_benchmark() {
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
@@ -1107,8 +1173,9 @@ build_xla() {
apply_patches
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
# These functions are defined in .circleci/common.sh in pytorch/xla repo
retry install_deps_pytorch_xla $XLA_DIR $USE_CACHE
retry install_pre_deps_pytorch_xla $XLA_DIR $USE_CACHE
CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch:${CMAKE_PREFIX_PATH}" XLA_SANDBOX_BUILD=1 build_torch_xla $XLA_DIR
retry install_post_deps_pytorch_xla
assert_git_not_dirty
}
@@ -1415,7 +1482,7 @@ test_executorch() {
bash examples/models/llama3_2_vision/install_requirements.sh
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
bash .ci/scripts/setup-linux.sh cmake
bash .ci/scripts/setup-linux.sh --build-tool cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
@@ -1439,7 +1506,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@@ -1507,6 +1574,16 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
install_torchvision
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio cuda
install_torchvision
checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio cpu
install_torchvision
checkout_install_torchbench nanogpt
PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
@@ -1562,6 +1639,7 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
test_python_shard "$SHARD_NUMBER"
test_aten
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_lazy_tensor_meta_reference_disabled
test_without_numpy
install_torchvision
test_python_shard 1


@@ -17,32 +17,24 @@ curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install the Visual Studio Build Tools with C++ components
echo Installing Visual Studio Build Tools with C++ components...
echo Installing MSVC %MSVC_VERSION%
if "%MSVC_VERSION%" == "latest" (
"%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^
--add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
--add Microsoft.VisualStudio.Component.VC.ASAN ^
--add Microsoft.VisualStudio.Component.VC.CMake.Project ^
--add Microsoft.VisualStudio.Component.VC.Tools.ARM64 ^
--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64
) else if "%MSVC_VERSION%" == "14.40" (
"%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^
--add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
--add Microsoft.VisualStudio.Component.VC.ASAN ^
--add Microsoft.VisualStudio.Component.VC.CMake.Project ^
--add Microsoft.VisualStudio.Component.VC.14.40.17.10.ARM64 ^
--add Microsoft.VisualStudio.Component.VC.14.40.17.10.x86.x64
) else if "%MSVC_VERSION%" == "14.36" (
"%INSTALLER_FILE%" --norestart --nocache --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^
--add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
--add Microsoft.VisualStudio.Component.VC.ASAN ^
--add Microsoft.VisualStudio.Component.VC.CMake.Project ^
--add Microsoft.VisualStudio.Component.VC.14.36.17.6.ARM64 ^
--add Microsoft.VisualStudio.Component.VC.14.36.17.6.x86.x64
)
"%INSTALLER_FILE%" --norestart --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^
--add Microsoft.VisualStudio.Workload.VCTools ^
--add Microsoft.VisualStudio.Component.Windows10SDK ^
--add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
--add Microsoft.VisualStudio.Component.VC.ASAN ^
--add Microsoft.VisualStudio.Component.VC.CMake.Project ^
--add Microsoft.VisualStudio.Component.VC.CoreBuildTools ^
--add Microsoft.VisualStudio.Component.VC.CoreIde ^
--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest ^
--add Microsoft.VisualStudio.Component.VC.Tools.ARM64EC ^
--add Microsoft.VisualStudio.Component.VC.Tools.ARM64 ^
--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64
echo exitcode = %errorlevel%
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install Visual Studio Build Tools with C++ components. (exitcode = %errorlevel%)"
echo Failed to install Visual Studio Build Tools with C++ components.
exit /b 1
)


@@ -6,22 +6,25 @@ echo Dependency Python installation started.
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
if "%PYTHON_VERSION%"=="Python312" (
echo Python version is set to Python312
set DOWNLOAD_URL="https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe"
) else if "%PYTHON_VERSION%"=="Python311" (
echo Python version is set to Python311
set DOWNLOAD_URL="https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe"
if "%DESIRED_PYTHON%" == "3.13" (
echo Python version is set to 3.13
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.13.2/python-3.13.2-arm64.exe
) else if "%DESIRED_PYTHON%" == "3.12" (
echo Python version is set to 3.12
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe
) else if "%DESIRED_PYTHON%" == "3.11" (
echo Python version is set to 3.11
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe
) else (
echo PYTHON_VERSION not defined, Python version is set to Python312
set DOWNLOAD_URL="https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe"
echo DESIRED_PYTHON not defined, Python version is set to 3.12
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe
)
set INSTALLER_FILE=%DOWNLOADS_DIR%\python-installer.exe
:: Download installer
echo Downloading Python...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
curl -L -o "%INSTALLER_FILE%" "%DOWNLOAD_URL%"
:: Install Python
echo Installing Python...


@@ -14,7 +14,7 @@ where python
:: install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install pytest numpy
pip install pytest numpy protobuf expecttest hypothesis
:: find file name for pytorch wheel
for /f "delims=" %%f in ('dir /b "%PYTORCH_FINAL_PACKAGE_DIR%" ^| findstr "torch-"') do set "TORCH_WHEEL_FILENAME=%PYTORCH_FINAL_PACKAGE_DIR%\%%f"


@@ -1,8 +1,6 @@
@echo off
setlocal
set "ORIG_PATH=%PATH%"
if "%PACKAGE_TYPE%" == "wheel" goto wheel
if "%PACKAGE_TYPE%" == "libtorch" goto libtorch
@@ -10,21 +8,7 @@ echo "unknown package type"
exit /b 1
:wheel
echo "install wheel package"
echo Running pip install...
pip install -q --pre numpy protobuf
echo Error level after pip install: %ERRORLEVEL%
if errorlevel 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"
if errorlevel 1 exit /b 1
goto smoke_test
:smoke_test
python -c "import torch"
if ERRORLEVEL 1 exit /b 1
call %PYTORCH_ROOT%\.ci\pytorch\windows\arm64\bootstrap_tests.bat
echo Running python rnn_smoke_win_arm64.py...
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke_win_arm64.py
@@ -39,10 +23,12 @@ goto end
:libtorch
echo "install and test libtorch"
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do tar -xf "%%i" -C tmp
if not exist tmp mkdir tmp
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do C:\Windows\System32\tar.exe -xf "%%i" -C tmp
if ERRORLEVEL 1 exit /b 1
pushd tmp\libtorch
pushd tmp
set VC_VERSION_LOWER=14
set VC_VERSION_UPPER=36
@@ -60,6 +46,4 @@ if ERRORLEVEL 1 exit /b 1
.\simple-torch-test.exe
if ERRORLEVEL 1 exit /b 1
:end
set "PATH=%ORIG_PATH%"
popd
:end


@@ -71,11 +71,20 @@ if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install --pre numpy==2.1.2
if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.11" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.10" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf networkx
if errorlevel 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do %PYTHON_EXEC% -m pip install "%%i"
if "%PYTORCH_BUILD_VERSION:dev=%" NEQ "%PYTORCH_BUILD_VERSION%" (
set "CHANNEL=nightly"
) else (
set "CHANNEL=test"
)
set "EXTRA_INDEX= "
if "%CUDA_VERSION%" == "xpu" set "EXTRA_INDEX=--index-url https://download.pytorch.org/whl/%CHANNEL%/xpu"
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do %PYTHON_EXEC% -m pip install "%%i" %EXTRA_INDEX%
if errorlevel 1 exit /b 1
goto smoke_test
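A sketch of the channel logic above: if the build version carries a `dev` suffix, the XPU wheel dependencies come from the nightly index, otherwise from the test index (the version string below is hypothetical):

```python
# Dev builds pull XPU wheel deps from the nightly index, others from test.
build_version = "2.7.0.dev20250403"  # hypothetical PYTORCH_BUILD_VERSION
channel = "nightly" if "dev" in build_version else "test"
print(f"--index-url https://download.pytorch.org/whl/{channel}/xpu")
```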


@@ -47,9 +47,9 @@ set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.0] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/efc86abd-cb77-452e-a03f-a741895b8ece/intel-deep-learning-essentials-2025.0.0.336_offline.exe
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d6d6c17-ca2d-4735-9331-99447e4a1280/intel-deep-learning-essentials-2025.0.1.28_offline.exe
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product
set XPU_BUNDLE_VERSION=2025.0.0+335
set XPU_BUNDLE_VERSION=2025.0.1+20
set XPU_BUNDLE_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_EXTRA_URL=NULL


@@ -31,9 +31,9 @@ fi
export DOCKER_IMAGE=${DOCKER_IMAGE:-}
if [[ -z "$DOCKER_IMAGE" ]]; then
if [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux:cpu"
export DOCKER_IMAGE="pytorch/manylinux2_28:cpu"
else
export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"
export DOCKER_IMAGE="pytorch/manylinux2_28-builder:${DESIRED_CUDA:2}"
fi
fi
@@ -74,6 +74,12 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.8 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == cu128 ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
@@ -98,11 +104,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+git${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+git${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
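These exports assemble a PEP 508 requirement string from the pinned triton version plus an environment-marker constraint; a minimal sketch, using the 3.3.0 pin shown earlier in this changeset:

```python
# Compose "triton==<version>; <marker>" as the scripts above do.
triton_version = "3.3.0"  # the triton_version.txt pin shown earlier
constraint = "platform_system == 'Linux'"  # cu128 keeps only the OS marker
print(f"triton=={triton_version}; {constraint}")
```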


@@ -0,0 +1,22 @@
#!/bin/bash
set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
export USE_SCCACHE=1
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
echo "Free space on filesystem before build:"
df -h
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
pytorch/.ci/pytorch/windows/arm64/build_libtorch.bat
elif [[ "$PACKAGE_TYPE" == 'wheel' ]]; then
pytorch/.ci/pytorch/windows/arm64/build_pytorch.bat
fi
echo "Free space on filesystem after build:"
df -h


@@ -0,0 +1,6 @@
#!/bin/bash
set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
pytorch/.ci/pytorch/windows/arm64/smoke_test.bat


@@ -13,6 +13,7 @@ export VC_YEAR=2022
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export USE_SCCACHE=0
export XPU_VERSION=2025.0
export XPU_ENABLE_KINETO=1
fi
echo "Free space on filesystem before build:"


@@ -12,6 +12,7 @@ bugprone-*,
-bugprone-macro-parentheses,
-bugprone-lambda-function-name,
-bugprone-reserved-identifier,
-bugprone-return-const-ref-from-parameter,
-bugprone-swapped-arguments,
clang-analyzer-core.*,
clang-analyzer-cplusplus.*,
@@ -24,6 +25,7 @@ cppcoreguidelines-*,
-cppcoreguidelines-avoid-non-const-global-variables,
-cppcoreguidelines-interfaces-global-init,
-cppcoreguidelines-macro-usage,
-cppcoreguidelines-macro-to-enum,
-cppcoreguidelines-owning-memory,
-cppcoreguidelines-pro-bounds-array-to-pointer-decay,
-cppcoreguidelines-pro-bounds-constant-array-index,
@@ -46,6 +48,7 @@ misc-*,
-misc-no-recursion,
-misc-non-private-member-variables-in-classes,
-misc-unused-using-decls,
-misc-use-internal-linkage,
modernize-*,
-modernize-macro-to-enum,
-modernize-return-braced-init-list,
@@ -55,6 +58,7 @@ modernize-*,
-modernize-use-trailing-return-type,
-modernize-use-nodiscard,
performance-*,
-performance-enum-size,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include


@@ -38,6 +38,7 @@ per-file-ignores =
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403
test/dynamo/test_higher_order_ops.py: B950
test/dynamo/test_error_messages.py: B950
torch/testing/_internal/dynamo_test_failures.py: B950
# TOR901 is only for test, we want to ignore it for everything else.
# It's not easy to configure this without affecting other per-file-ignores,


@@ -1,5 +1,7 @@
self-hosted-runner:
labels:
# GitHub hosted runner that actionlint doesn't recognize because actionlint version (1.6.21) is too old
- ubuntu-24.04
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
@@ -10,7 +12,6 @@ self-hosted-runner:
- linux.9xlarge.ephemeral
- am2.linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.arm64.2xlarge
@@ -42,6 +43,8 @@ self-hosted-runner:
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral
- windows.g5.4xlarge.nvidia.gpu
# Windows ARM64 runners
- windows-11-arm64
# Organization-wide AMD hosted runners
- linux.rocm.gpu
- linux.rocm.gpu.2


@@ -40,6 +40,11 @@ runs:
fi
mkdir "${GITHUB_WORKSPACE}"
# Use all available CPUs for fetching
cd "${GITHUB_WORKSPACE}"
git config --global fetch.parallel 0
git config --global submodule.fetchJobs 0
- name: Checkout PyTorch
uses: actions/checkout@v4
with:


@@ -1 +1 @@
f084f34bbb743fada85f66b0ed8041387565e69c
c670ad81fda266b6598aeeef434583eb98197ae8


@@ -1 +1 @@
b2b890e962f5fb6f481e5da2eb4a43bb990d0f1b
r2.7

.github/labeler.yml

@@ -98,7 +98,7 @@
- test/distributed/**
- torch/testing/_internal/distributed/**
"module: distributed_checkpoint":
"release notes: distributed (checkpoint)":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**


@@ -334,6 +334,7 @@
- XiaobingSuper
- jgong5
- mingfeima
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint
@@ -366,6 +367,7 @@
- jgong5
- vfdev-5
- leslie-fang-intel
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint
@@ -379,6 +381,7 @@
approved_by:
- leslie-fang-intel
- jgong5
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint


@@ -7,6 +7,7 @@ ciflow_push_tags:
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
@@ -16,6 +17,7 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390
- ciflow/slow
- ciflow/trunk


@@ -5,7 +5,7 @@
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.35.42
jinja2==3.1.5
jinja2==3.1.6
lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84


@@ -123,7 +123,7 @@ def main() -> None:
parser = ArgumentParser("Build Triton binaries")
parser.add_argument("--release", action="store_true")
parser.add_argument(
"--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]
"--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu", "aarch64"]
)
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)


@@ -39,9 +39,9 @@ SUPPORTED_PERIODICAL_MODES: dict[str, Callable[[Optional[str]], bool]] = {
}
# The link to the published list of disabled jobs
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json"
DISABLED_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json?versionId=n.FT07XR3dLMwOLBwmRNquyYSeGk8Het"
# and unstable jobs
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json"
UNSTABLE_JOBS_URL = "https://ossci-metrics.s3.amazonaws.com/unstable-jobs.json?versionId=.Ox7WAXa21I1PVqadHyPfhMRPhl0aCnD"
# Some constants used to handle disabled and unstable jobs
JOB_NAME_SEP = "/"
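Note that both URLs above now carry an S3 `versionId` query parameter, which pins CI to one immutable snapshot of the JSON even if the object is later overwritten. A sketch of how such a pinned URL is consumed (network access assumed):

```python
import json
from urllib.request import urlopen

# The versionId pins one immutable S3 object version of the JSON file.
URL = (
    "https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json"
    "?versionId=n.FT07XR3dLMwOLBwmRNquyYSeGk8Het"
)
disabled_jobs = json.load(urlopen(URL))  # same content on every fetch
print(len(disabled_jobs))
```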


@@ -16,16 +16,15 @@ from typing import Optional
# NOTE: Also update the CUDA sources in tools/nightly.py when changing this list
CUDA_ARCHES = ["11.8", "12.4", "12.6", "12.8"]
CUDA_ARCHES = ["11.8", "12.6", "12.8"]
CUDA_STABLE = "12.6"
CUDA_ARCHES_FULL_VERSION = {
"11.8": "11.8.0",
"12.4": "12.4.1",
"12.6": "12.6.3",
"12.8": "12.8.0",
}
CUDA_ARCHES_CUDNN_VERSION = {
"11.8": "9",
"12.4": "9",
"12.6": "9",
"12.8": "9",
}
@@ -41,7 +40,7 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["12.6-aarch64", "12.8-aarch64"]
CUDA_AARCH64_ARCHES = ["12.8-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
@@ -58,21 +57,6 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
"nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.6": (
"nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
@@ -84,7 +68,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
@@ -100,19 +84,23 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"xpu": (
"intel-cmplr-lib-rt==2025.0.2 | "
"intel-cmplr-lib-ur==2025.0.2 | "
"intel-cmplr-lic-rt==2025.0.2 | "
"intel-sycl-rt==2025.0.2 | "
"intel-cmplr-lib-rt==2025.0.4; platform_system == 'Linux' | "
"intel-cmplr-lib-ur==2025.0.4; platform_system == 'Linux' | "
"intel-cmplr-lic-rt==2025.0.4; platform_system == 'Linux' | "
"intel-sycl-rt==2025.0.4; platform_system == 'Linux' | "
"intel-cmplr-lib-rt==2025.0.5; platform_system == 'Windows' | "
"intel-cmplr-lib-ur==2025.0.5; platform_system == 'Windows' | "
"intel-cmplr-lic-rt==2025.0.5; platform_system == 'Windows' | "
"intel-sycl-rt==2025.0.5; platform_system == 'Windows' | "
"tcmlib==1.2.0 | "
"umf==0.9.1 | "
"intel-pti==0.10.0"
"intel-pti==0.10.1"
),
}
@@ -246,14 +234,8 @@ def generate_libtorch_matrix(
if os == "linux":
arches += CUDA_ARCHES
arches += ROCM_ARCHES
# skip CUDA 12.8 builds for libtorch
if "12.8" in arches:
arches.remove("12.8")
elif os == "windows":
arches += CUDA_ARCHES
# skip CUDA 12.8 builds on Windows
if "12.8" in arches:
arches.remove("12.8")
if libtorch_variants is None:
libtorch_variants = [
"shared-with-deps",
@@ -281,11 +263,15 @@ def generate_libtorch_matrix(
gpu_arch_type, gpu_arch_version
),
"libtorch_variant": libtorch_variant,
"libtorch_config": abi_version if os == "windows" else "",
"devtoolset": abi_version if os != "windows" else "",
"libtorch_config": abi_version
if os in ("windows", "windows-arm64")
else "",
"devtoolset": abi_version
if os not in ("windows", "windows-arm64")
else "",
"container_image": (
LIBTORCH_CONTAINER_IMAGES[(arch_version, abi_version)]
if os != "windows"
if os not in ("windows", "windows-arm64")
else ""
),
"package_type": "libtorch",
@@ -318,9 +304,6 @@ def generate_wheels_matrix(
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES + XPU_ARCHES
# skip CUDA 12.8 builds on Windows until available
if "12.8" in arches:
arches.remove("12.8")
elif os == "linux-aarch64":
# Separate new if as the CPU type is different and
# uses different build/test scripts
@@ -349,7 +332,7 @@
continue
if use_split_build and (
arch_version not in ["12.6", "12.4", "11.8", "cpu"] or os != "linux"
arch_version not in ["12.6", "12.8", "11.8", "cpu"] or os != "linux"
):
raise RuntimeError(
"Split build is only supported on linux with cuda 12*, 11.8, and cpu.\n"
@@ -360,26 +343,27 @@
# cuda linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.8", "12.6", "12.4", "11.8"]
arch_version in ["12.8", "12.6", "11.8"]
and os == "linux"
or arch_version in CUDA_AARCH64_ARCHES
):
desired_cuda = translate_desired_cuda(gpu_arch_type, gpu_arch_version)
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"desired_cuda": desired_cuda,
"use_split_build": "True" if use_split_build else "False",
"devtoolset": "cxx11-abi",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]
if os != "linux-aarch64"
else ""
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[
f"{desired_cuda[2:4]}.{desired_cuda[4:]}" # for cuda-aarch64: cu126 -> 12.6
]
if os == "linux-aarch64"
else PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version]
),
"build_name": (
f"{package_type}-py{python_version}-{gpu_arch_type}"
@@ -389,8 +373,8 @@
), # include special case for aarch64 build, remove the -aarch64 postfix
}
)
# Special build building to use on Colab. Python 3.11 for 12.4 CUDA
if python_version == "3.11" and arch_version == "12.4":
# Special build building to use on Colab. Python 3.11 for 12.6 CUDA
if python_version == "3.11" and arch_version == CUDA_STABLE:
ret.append(
{
"python_version": python_version,
@@ -433,7 +417,7 @@
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["xpu"]
if gpu_arch_type == "xpu"
else PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.4"]
else PYTORCH_EXTRA_INSTALL_REQUIREMENTS[CUDA_STABLE]
if os != "linux"
else ""
),
@@ -445,5 +429,4 @@
validate_nccl_dep_consistency("12.8")
validate_nccl_dep_consistency("12.6")
validate_nccl_dep_consistency("12.4")
validate_nccl_dep_consistency("11.8")


@@ -96,6 +96,7 @@ class BinaryBuildWorkflow:
class OperatingSystem:
LINUX = "linux"
WINDOWS = "windows"
WINDOWS_ARM64 = "windows-arm64"
MACOS = "macos"
MACOS_ARM64 = "macos-arm64"
LINUX_AARCH64 = "linux-aarch64"
@@ -151,7 +152,7 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.4", "12.6", "12.8"],
arches=["11.8", "12.6", "12.8"],
python_versions=["3.9"],
),
branches="main",
@@ -261,6 +262,52 @@ WINDOWS_BINARY_SMOKE_WORKFLOWS = [
),
]
WINDOWS_ARM64_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS_ARM64,
package_type="wheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.WINDOWS_ARM64,
arches=["cpu"],
python_versions=["3.12"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS_ARM64,
package_type="libtorch",
abi_version=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS_ARM64,
generate_binary_build_matrix.RELEASE,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS_ARM64,
package_type="libtorch",
abi_version=generate_binary_build_matrix.DEBUG,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS_ARM64,
generate_binary_build_matrix.DEBUG,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_LIBTORCH},
isolated_workflow=True,
),
),
]
MACOS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
@@ -355,6 +402,10 @@ def main() -> None:
jinja_env.get_template("windows_binary_build_workflow.yml.j2"),
WINDOWS_BINARY_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("windows_arm64_binary_build_workflow.yml.j2"),
WINDOWS_ARM64_BINARY_BUILD_WORKFLOWS,
),
(
jinja_env.get_template("macos_binary_build_workflow.yml.j2"),
MACOS_BINARY_BUILD_WORKFLOWS,

.github/scripts/get_ci_variable.py (new executable file)

@@ -0,0 +1,30 @@
#!/usr/bin/env python3
"""Helper script - Return CI variables such as stable cuda, min python version, etc."""
import argparse
import sys
def main(args: list[str]) -> None:
import generate_binary_build_matrix
parser = argparse.ArgumentParser()
parser.add_argument(
"--cuda-stable-version",
action="store_true",
help="get cuda stable version",
)
parser.add_argument(
"--min-python-version",
action="store_true",
help="get min supported python version",
)
options = parser.parse_args(args)
if options.cuda_stable_version:
return print(generate_binary_build_matrix.CUDA_STABLE)
if options.min_python_version:
return print(generate_binary_build_matrix.FULL_PYTHON_VERSIONS[0])
if __name__ == "__main__":
main(sys.argv[1:])


@@ -57,10 +57,10 @@ def gh_fetch_url_and_headers(
print(
f"""{url}
Rate limit exceeded:
Used: {err.headers['X-RateLimit-Used']}
Limit: {err.headers['X-RateLimit-Limit']}
Remaining: {err.headers['X-RateLimit-Remaining']}
Resets at: {err.headers['x-RateLimit-Reset']}"""
Used: {err.headers["X-RateLimit-Used"]}
Limit: {err.headers["X-RateLimit-Limit"]}
Remaining: {err.headers["X-RateLimit-Remaining"]}
Resets at: {err.headers["x-RateLimit-Reset"]}"""
)
else:
print(f"Error fetching {url} {err}")


@ -63,9 +63,9 @@ def gh_get_labels(org: str, repo: str) -> list[str]:
update_labels(labels, info)
last_page = get_last_page_num_from_header(header)
assert (
last_page > 0
), "Error reading header info to determine total number of pages of labels"
assert last_page > 0, (
"Error reading header info to determine total number of pages of labels"
)
for page_number in range(2, last_page + 1): # skip page 1
_, info = request_for_labels(prefix + f"&page={page_number}")
update_labels(labels, info)


@ -33,7 +33,7 @@ class PRIdentifier(str):
__slots__ = ()
def __new__(cls, value: str) -> "PRIdentifier":
md5 = hashlib.md5(value.encode("utf-8")).hexdigest()
md5 = hashlib.md5(value.encode("utf-8"), usedforsecurity=False).hexdigest()
return super().__new__(cls, md5)
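The `usedforsecurity=False` keyword (available since Python 3.9) declares that MD5 is used only as a stable identifier, not for cryptography, which keeps the call working on FIPS-restricted OpenSSL builds where a bare `hashlib.md5()` can raise `ValueError`. A standalone sketch of the same pattern:

```python
import hashlib

def pr_key(value: str) -> str:
    # Stable, non-cryptographic key; usedforsecurity=False avoids a
    # ValueError on FIPS-enabled interpreters (Python 3.9+ only).
    return hashlib.md5(value.encode("utf-8"), usedforsecurity=False).hexdigest()

print(pr_key("pytorch/pytorch"))  # hypothetical input
```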


@ -485,7 +485,7 @@ def get_check_run_name_prefix(workflow_run: Any) -> str:
if workflow_run is None:
return ""
else:
return f'{workflow_run["workflow"]["name"]} / '
return f"{workflow_run['workflow']['name']} / "
def is_passing_status(status: Optional[str]) -> bool:
@ -545,7 +545,7 @@ def add_workflow_conclusions(
if not isinstance(checkrun_node, dict):
warn(f"Expected dictionary, but got {type(checkrun_node)}")
continue
checkrun_name = f'{get_check_run_name_prefix(workflow_run)}{checkrun_node["name"]}'
checkrun_name = f"{get_check_run_name_prefix(workflow_run)}{checkrun_node['name']}"
existing_checkrun = workflow_obj.jobs.get(checkrun_name)
if existing_checkrun is None or not is_passing_status(
existing_checkrun.status
@ -1224,9 +1224,17 @@ class GitHubPR:
if not self.is_ghstack_pr():
msg = self.gen_commit_message()
pr_branch_name = f"__pull-request-{self.pr_num}__init__"
repo.fetch(f"pull/{self.pr_num}/head", pr_branch_name)
repo.fetch(self.last_commit()["oid"], pr_branch_name)
repo._run_git("merge", "--squash", pr_branch_name)
repo._run_git("commit", f'--author="{self.get_author()}"', "-m", msg)
# Did the PR change since we started the merge?
pulled_sha = repo.show_ref(pr_branch_name)
latest_pr_status = GitHubPR(self.org, self.project, self.pr_num)
if pulled_sha != latest_pr_status.last_commit()["oid"]:
raise RuntimeError(
"PR has been updated since CI checks last passed. Please rerun the merge command."
)
return []
else:
return self.merge_ghstack_into(
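The net effect of this hunk is a time-of-check/time-of-use guard: instead of fetching the movable `pull/N/head` ref, the merge now fetches the exact commit CI validated, then re-reads the PR to confirm the head has not moved in the meantime. Schematically, with hypothetical values and the names from the diff:

```python
# Hedged sketch of the new guard; the sha is a made-up placeholder.
validated_oid = "0123456789abcdef0123456789abcdef01234567"
repo.fetch(validated_oid, pr_branch_name)     # pin to the validated commit
fresh = GitHubPR(org, project, pr_num)        # re-read GitHub state
if repo.show_ref(pr_branch_name) != fresh.last_commit()["oid"]:
    raise RuntimeError("PR changed after checks passed; rerun the merge.")
```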
@ -1507,6 +1515,36 @@ def checks_to_markdown_bullets(
]
def post_starting_merge_comment(
repo: GitRepo,
pr: GitHubPR,
explainer: TryMergeExplainer,
dry_run: bool,
ignore_current_checks_info: Optional[
list[tuple[str, Optional[str], Optional[int]]]
] = None,
) -> None:
"""Post the initial merge starting message on the PR. Also post a short
message on all PRs in the stack."""
gh_post_pr_comment(
pr.org,
pr.project,
pr.pr_num,
explainer.get_merge_message(ignore_current_checks_info),
dry_run=dry_run,
)
if pr.is_ghstack_pr():
for additional_prs, _ in get_ghstack_prs(repo, pr):
if additional_prs.pr_num != pr.pr_num:
gh_post_pr_comment(
additional_prs.org,
additional_prs.project,
additional_prs.pr_num,
f"Starting merge as part of PR stack under #{pr.pr_num}",
dry_run=dry_run,
)
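Both call sites in `merge()` below collapse onto this helper. A hypothetical invocation, with the `ignore_current_checks_info` shape taken from the signature above:

```python
# Hypothetical failing-check tuples: (name, url, job_id), matching
# list[tuple[str, Optional[str], Optional[int]]].
failing = [("pull / linux-jammy / test", "https://example.com/job/1", 12345)]
post_starting_merge_comment(repo, pr, explainer, dry_run=True,
                            ignore_current_checks_info=failing)
```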
def manually_close_merged_pr(
pr: GitHubPR,
additional_merged_prs: list[GitHubPR],
@ -2130,13 +2168,7 @@ def merge(
check_for_sev(pr.org, pr.project, skip_mandatory_checks)
if skip_mandatory_checks:
gh_post_pr_comment(
pr.org,
pr.project,
pr.pr_num,
explainer.get_merge_message(),
dry_run=dry_run,
)
post_starting_merge_comment(repo, pr, explainer, dry_run)
return pr.merge_into(
repo,
dry_run=dry_run,
@ -2159,12 +2191,12 @@ def merge(
)
ignore_current_checks_info = failing
gh_post_pr_comment(
pr.org,
pr.project,
pr.pr_num,
explainer.get_merge_message(ignore_current_checks_info),
dry_run=dry_run,
post_starting_merge_comment(
repo,
pr,
explainer,
dry_run,
ignore_current_checks_info=ignore_current_checks_info,
)
start_time = time.time()


@ -79,7 +79,7 @@ class TryMergeExplainer:
(
"<details><summary>Advanced Debugging</summary>",
"Check the merge workflow status ",
f"<a href=\"{os.getenv('GH_RUN_URL')}\">here</a>",
f'<a href="{os.getenv("GH_RUN_URL")}">here</a>',
"</details>",
)
)


@ -0,0 +1,18 @@
@echo on
set PYTHON_PREFIX=%PY_VERS:.=%
set PYTHON_PREFIX=py%PYTHON_PREFIX:;=;py%
call .ci/pytorch/win-test-helpers/installation-helpers/activate_miniconda3.bat
:: Create a new conda environment
if "%PY_VERS%" == "3.13t" (
call conda create -n %PYTHON_PREFIX% -y -c=conda-forge python-freethreading python=3.13
) else (
call conda create -n %PYTHON_PREFIX% -y -c=conda-forge python=%PY_VERS%
)
:: Fix cmake version for issue https://github.com/pytorch/pytorch/issues/150480
call conda run -n %PYTHON_PREFIX% pip install wheel pybind11 certifi cython cmake==3.31.6 setuptools==72.1.0 ninja
dir "%VC_INSTALL_PATH%"
call "%VC_INSTALL_PATH%\VC\Auxiliary\Build\vcvarsall.bat" x64
call conda run -n %PYTHON_PREFIX% python .github/scripts/build_triton_wheel.py --device=%BUILD_DEVICE% %RELEASE%
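For readers less familiar with cmd's `%VAR:a=b%` substitution syntax, the two `set` lines derive a conda env name from the version string, e.g. `3.13t` becomes `py313t`. A Python rendering of the same computation:

```python
def conda_env_name(py_vers: str) -> str:
    # %PY_VERS:.=% drops the dots; the second substitution prefixes each
    # ';'-separated entry with 'py', and the leading 'py' covers the first.
    no_dots = py_vers.replace(".", "")
    return "py" + no_dots.replace(";", ";py")

assert conda_env_name("3.13t") == "py313t"
assert conda_env_name("3.9;3.10") == "py39;py310"
```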


@ -0,0 +1,35 @@
#Requires -RunAsAdministrator
# Enable long paths on Windows
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
$VC_VERSION_major = [int] ${env:VC_VERSION}.split(".")[0]
$VC_DOWNLOAD_LINK = "https://aka.ms/vs/$VC_VERSION_major/release/vs_BuildTools.exe"
$VC_INSTALL_ARGS = @("--nocache","--quiet","--norestart","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreBuildTools",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.Component.Windows11SDK.22621")
echo "Downloading Visual Studio installer from $VC_DOWNLOAD_LINK."
curl.exe --retry 3 -kL $VC_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS ${env:VC_YEAR} Version ${env:VC_VERSION} installer failed"
exit 1
}
$InstallationPath = ${env:VC_INSTALL_PATH}
$VC_INSTALL_ARGS = "--installPath `"$InstallationPath`"" + " " + $VC_INSTALL_ARGS
echo "Installing Visual Studio version ${env:VC_VERSION} in $InstallationPath."
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VC_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS ${env:VC_YEAR} installer exited with code $exitCode, which should be one of [0, 3010]."
exit 1
}
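Exit code 3010 is the standard Windows installer status ERROR_SUCCESS_REBOOT_REQUIRED, i.e. a successful install that wants a reboot, which is why it is allowed alongside 0. The same check as a Python sketch (hypothetical paths and arguments):

```python
import subprocess

OK_CODES = {0, 3010}  # 3010 = ERROR_SUCCESS_REBOOT_REQUIRED
proc = subprocess.run(["vs_installer.exe", "--quiet", "--norestart", "--wait"])
if proc.returncode not in OK_CODES:
    raise SystemExit(f"VS installer exited with unexpected code {proc.returncode}")
```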


@ -4,6 +4,7 @@
{%- set download_artifact_action = "actions/download-artifact@v4.1.7" -%}
{%- set timeout_minutes = 240 -%}
{%- set timeout_minutes_windows_binary = 300 -%}
{%- macro concurrency(build_environment) -%}
concurrency:
@ -31,7 +32,7 @@ concurrency:
{%- macro setup_ec2_windows() -%}
!{{ display_ec2_information() }}
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}


@ -53,7 +53,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -111,7 +111,10 @@ jobs:
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
{%- elif config["gpu_arch_type"] == "cuda" %}
{%- elif config["gpu_arch_type"] == "cuda" and config["gpu_arch_version"] == "12.8" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 build needs sm_70+ runner
{%- elif config["gpu_arch_type"] == "cuda" and config["gpu_arch_version"] != "12.8"%}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
{%- else %}
@ -144,9 +147,9 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary
@ -165,12 +168,12 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ runner.temp }}/artifacts/"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: !{{ config["container_image"] }}
- name: Test Pytorch binary


@ -76,7 +76,7 @@ jobs:
elif [ -d "/Applications/Xcode_13.3.1.app" ]; then
echo "DEVELOPER_DIR=/Applications/Xcode_13.3.1.app/Contents/Developer" >> "${GITHUB_ENV}"
fi
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
run: |
# shellcheck disable=SC1091


@ -0,0 +1,197 @@
{% import 'common.yml.j2' as common %}
{% import 'upload.yml.j2' as upload %}
{%- block name -%}
# Template is at: .github/templates/windows_arm64_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: !{{ build_environment }}
{%- endblock %}
{%- macro set_runner_specific_vars() -%}
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: cmd
run: |
echo BINARY_ENV_FILE=%RUNNER_TEMP%/env>> %GITHUB_ENV%
echo PYTORCH_FINAL_PACKAGE_DIR=%RUNNER_TEMP%/artifacts>> %GITHUB_ENV%
echo WIN_PACKAGE_WORK_DIR=%RUNNER_TEMP%>> %GITHUB_ENV%
{%- endmacro %}
on:
push:
branches:
- !{{ branches }}
{%- if branches == "nightly" %}
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
{%- endif %}
{%- for label in ciflow_config.labels | sort %}
{%- if loop.first and branches != "nightly" %}
tags:
{%- endif %}
- '!{{ label }}/*'
{%- endfor %}
workflow_dispatch:
env:
BUILD_ENVIRONMENT: !{{ build_environment }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
PYTORCH_ROOT: /pytorch
DOWNLOADS_DIR: c:\temp\downloads
DEPENDENCIES_DIR: c:\temp\dependencies
ENABLE_APL: 1
ENABLE_OPENBLAS: 0
MSVC_VERSION : 14.42
AWS_DEFAULT_REGION: us-east-1
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "windows-11-arm64"
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
steps:
!{{ set_runner_specific_vars() }}
- name: Bootstrap folders
shell: cmd
run: |
mkdir "%NIGHTLIES_PYTORCH_ROOT%"
mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch - recursive
uses: actions/checkout@v4
with:
path: "pytorch"
submodules: recursive
- name: Bootstrap Python
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap APL
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_apl.bat"
- name: Bootstrap Rust
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
- name: Bootstrap sccache
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_sccache.bat"
- name: Bootstrap Libuv
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_libuv.bat"
- name: Populate binary env
shell: bash
run: |
"pytorch/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"pytorch/.circleci/scripts/binary_windows_arm64_build.sh"
- uses: !{{ common.upload_artifact_action }}
if: always()
with:
name: !{{ config["build_name"] }}
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- !{{ config["build_name"] }}-build
- get-label-type
runs-on: "windows-11-arm64"
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}
steps:
!{{ set_runner_specific_vars() }}
- uses: !{{ common.download_artifact_action }}
name: Download Build Artifacts
with:
name: !{{ config["build_name"] }}
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
- name: Bootstrap Git
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_git.bat"
- name: Remove Pytorch folder
shell: cmd
run: |
rmdir /s /q "pytorch"
- name: Git checkout PyTorch
uses: actions/checkout@v4
with:
path: "pytorch"
submodules: recursive
- name: Bootstrap APL
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_apl.bat"
- name: Bootstrap Python
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
- name: Bootstrap Build Tools
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_buildtools.bat"
- name: Bootstrap Rust
shell: cmd
run: |
"pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
- name: Populate binary env
shell: bash
run: |
"pytorch/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"pytorch/.circleci/scripts/binary_windows_arm64_test.sh"
{%- if branches == "nightly" %}
!{{ upload.upload_binaries(config, True) }}
{%- endif %}
{%- endfor %}
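Note the template relies on the generator's customized Jinja delimiters (`!{{ ... }}` for expressions), so GitHub's native `${{ ... }}` syntax passes through Jinja untouched. A minimal sketch of loading it outside `generate_ci_workflows.py`, with the Environment settings assumed to mirror the generator's:

```python
from jinja2 import Environment, FileSystemLoader

# Assumption: '!{{' / '}}' match the delimiters generate_ci_workflows.py uses.
env = Environment(
    loader=FileSystemLoader(".github/templates"),
    variable_start_string="!{{",
    variable_end_string="}}",
)
tmpl = env.get_template("windows_arm64_binary_build_workflow.yml.j2")
# tmpl.render(...) would take build_environment, branches, ciflow_config and
# build_configs from the WINDOWS_ARM64_BINARY_BUILD_WORKFLOWS entries above.
```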


@ -55,7 +55,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -71,7 +71,7 @@ jobs:
{%- else %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
timeout-minutes: !{{ common.timeout_minutes_windows_binary }}
!{{ upload.binary_env(config, True) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
@ -79,7 +79,7 @@ jobs:
steps:
!{{ common.setup_ec2_windows() }}
!{{ set_runner_specific_vars() }}
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |
@ -107,10 +107,14 @@ jobs:
{%- else %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge.nonephemeral"
{%- endif %}
{%- else %}
{%- if branches == "nightly" %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
{%- else %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes_windows_binary }}
!{{ upload.binary_env(config, True) }}
steps:
!{{ common.setup_ec2_windows() }}
@ -120,7 +124,7 @@ jobs:
with:
name: !{{ config["build_name"] }}
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="pytorch", checkout_pr_head=False) }}
- name: Populate binary env
shell: bash
run: |


@ -47,7 +47,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
fetch-depth: 1
submodules: false
@ -69,25 +69,25 @@ jobs:
runs-on: ${{ matrix.runner }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -97,7 +97,7 @@ jobs:
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.7
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
@ -209,5 +209,5 @@ jobs:
file-suffix: bazel-${{ github.job }}_${{ steps.get-job-id.outputs.job-id }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always()


@ -18,7 +18,7 @@ on:
description: prefix for runner label
runs_on:
required: false
default: linux.12xlarge.ephemeral
default: linux.12xlarge.memory.ephemeral
type: string
description: Hardware to run this "build" job on, linux.12xlarge or linux.arm64.2xlarge.
timeout-minutes:
@ -150,13 +150,13 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -186,7 +186,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
@ -211,7 +210,7 @@ jobs:
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -267,7 +266,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'

View File

@ -133,14 +133,14 @@ jobs:
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
if: inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
continue-on-error: true
with:
github-secret: ${{ secrets.github-token }}
# Setup the environment
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: ${{ inputs.build_environment == 'linux-aarch64-binary-manywheel' || inputs.build_environment == 'linux-s390x-binary-manywheel' }}
@ -163,7 +163,6 @@ jobs:
- name: Checkout PyTorch to pytorch dir
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
show-progress: false
path: pytorch
@ -194,12 +193,12 @@ jobs:
path: "${{ runner.temp }}/artifacts/"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.7
if: ${{ inputs.GPU_ARCH_TYPE == 'cuda' && steps.filter.outputs.is-test-matrix-empty == 'False' }}
- name: Pull Docker image
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ inputs.DOCKER_IMAGE }}
@ -209,7 +208,7 @@ jobs:
- name: Teardown Linux
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
- name: Chown workspace
if: always() && inputs.build_environment != 'linux-s390x-binary-manywheel'


@ -66,7 +66,6 @@ on:
jobs:
upload:
runs-on: ubuntu-22.04
environment: ${{ (github.event_name == 'push' && (github.event.ref == 'refs/heads/nightly' || startsWith(github.event.ref, 'refs/tags/v'))) && 'conda-aws-upload' || '' }}
container:
image: continuumio/miniconda3:4.12.0
env:
@ -91,7 +90,7 @@ jobs:
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true


@ -84,7 +84,7 @@ jobs:
name: build-docs-${{ matrix.docs_type }}-${{ inputs.push }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -95,7 +95,7 @@ jobs:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Setup Linux
uses: ./.github/actions/setup-linux
@ -110,12 +110,12 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -222,5 +222,5 @@ jobs:
s3-prefix: pytorch/pytorch/${{ github.event.pull_request.number }}/functorchdocs
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always()


@ -69,13 +69,11 @@ on:
required: false
type: string
default: ""
use_split_build:
max-jobs:
description: |
[Experimental] Build a libtorch only wheel and build pytorch such that
are built from the libtorch wheel.
Overwrite the number of jobs to use for the build
required: false
type: boolean
default: false
type: string
secrets:
HUGGING_FACE_HUB_TOKEN:
@ -108,7 +106,7 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -118,7 +116,7 @@ jobs:
# checkout because when we run this action we don't *have* a local
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true
@ -136,7 +134,7 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image-name: ${{ inputs.docker-image-name }}
@ -152,7 +150,7 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -210,7 +208,7 @@ jobs:
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
MAX_JOBS_OVERRIDE: ${{ inputs.max-jobs }}
run: |
START_TIME=$(date +%s)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
@ -230,6 +228,12 @@ jobs:
DOCKER_SHELL_CMD=
fi
if [[ ${MAX_JOBS_OVERRIDE} == "" ]]; then
MAX_JOBS="$(nproc --ignore=2)"
else
MAX_JOBS="${MAX_JOBS_OVERRIDE}"
fi
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap
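The new `max-jobs` input threads through as `MAX_JOBS_OVERRIDE`: when set it wins outright, otherwise the old `nproc --ignore=2` default (all cores minus two, floored at one) still applies. The same precedence rule in Python form, for illustration:

```python
import os

override = os.environ.get("MAX_JOBS_OVERRIDE", "")
# Mirrors `nproc --ignore=2`: leave two cores for the runner itself.
max_jobs = int(override) if override else max((os.cpu_count() or 1) - 2, 1)
```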
@ -241,7 +245,8 @@ jobs:
# shellcheck disable=SC2086
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e MAX_JOBS=${MAX_JOBS} \
-e MAX_JOBS_OVERRIDE \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
@ -282,7 +287,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -290,34 +295,15 @@ jobs:
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts for s390x
uses: actions/upload-artifact@v4
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Store PyTorch Build Artifacts for s390x for split build
uses: actions/upload-artifact@v4
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Upload sccache stats
if: steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
uses: ./.github/actions/upload-sccache-stats
@ -326,7 +312,7 @@ jobs:
build-time: ${{ steps.build.outputs.build_time }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: Cleanup docker


@ -80,7 +80,7 @@ jobs:
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
if: ${{ !contains(matrix.runner, 'gcp.a100') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -89,7 +89,7 @@ jobs:
docker exec -it $(docker container ps --format '{{.ID}}') bash
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true
@ -107,7 +107,7 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image-name: ${{ inputs.docker-image }}
@ -123,7 +123,7 @@ jobs:
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -135,7 +135,7 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
uses: pytorch/test-infra/.github/actions/setup-nvidia@release/2.7
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
@ -371,7 +371,7 @@ jobs:
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@release/2.7
with:
benchmark-results-dir: test/test-reports
dry-run: false
@ -428,7 +428,7 @@ jobs:
workflow_attempt: ${{github.run_attempt}}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with


@ -71,11 +71,11 @@ jobs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
steps:
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.7
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Set xcode version
env:
@ -87,7 +87,7 @@ jobs:
- name: Setup miniconda
if: inputs.environment-file == ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.7
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -97,7 +97,7 @@ jobs:
# environment even though the arch is x86-64
- name: Setup miniconda using the provided environment file
if: inputs.environment-file != ''
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.7
with:
python-version: ${{ inputs.python-version }}
environment-file: ${{ inputs.environment-file }}
@ -207,4 +207,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.7


@ -41,7 +41,7 @@ jobs:
reenabled-issues: ${{ steps.filter.outputs.reenabled-issues }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
@ -82,7 +82,7 @@ jobs:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.7
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -170,4 +170,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.7


@ -82,11 +82,11 @@ jobs:
done
- name: Clean up disk space before running MacOS workflow
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.7
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Start monitoring script
id: monitor-script
@ -109,7 +109,7 @@ jobs:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
uses: pytorch/test-infra/.github/actions/setup-miniconda@release/2.7
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
@ -224,7 +224,7 @@ jobs:
file-suffix: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@release/2.7
with:
benchmark-results-dir: test/test-reports
dry-run: false
@ -234,4 +234,4 @@ jobs:
- name: Clean up disk space
if: always()
continue-on-error: true
uses: pytorch/test-infra/.github/actions/check-disk-space@main
uses: pytorch/test-infra/.github/actions/check-disk-space@release/2.7


@ -70,7 +70,7 @@ jobs:
steps:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true
@ -92,12 +92,12 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -251,6 +251,11 @@ jobs:
# copy test results back to the mounted workspace, needed sudo, resulting permissions were correct
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"
- name: Change permissions (only needed for MI300 runners for now)
if: ${{ always() && steps.test.conclusion && contains(matrix.runner, 'mi300') }}
run: |
docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R 1001:1001 test"
- name: Print remaining test logs
shell: bash
if: always() && steps.test.conclusion
@ -297,7 +302,7 @@ jobs:
aws-region: us-east-1
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@release/2.7
with:
benchmark-results-dir: test/test-reports
dry-run: false


@ -54,7 +54,7 @@ jobs:
PR_NUMBER: ${{ github.event.pull_request.number }}
steps:
# - name: Checkout PyTorch
# uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
# uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
# with:
# fetch-depth: 1
# submodules: true


@ -84,10 +84,10 @@ jobs:
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
uses: pytorch/test-infra/.github/actions/cleanup-runner@release/2.7
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -102,7 +102,7 @@ jobs:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true


@ -66,10 +66,10 @@ jobs:
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
uses: pytorch/test-infra/.github/actions/cleanup-runner@release/2.7
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -85,7 +85,7 @@ jobs:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
no-sudo: true


@ -62,7 +62,7 @@ jobs:
steps:
# [see note: pytorch repo ref]
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Setup XPU
uses: ./.github/actions/setup-xpu
@ -80,12 +80,12 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}


@ -41,12 +41,12 @@ jobs:
CUDA_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: almalinux-builder${{ matrix.cuda_version == 'cpu' && '-' || '-cuda' }}${{matrix.cuda_version}}
docker-build-dir: .ci/docker/almalinux


@ -32,7 +32,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -51,12 +51,12 @@ jobs:
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: libtorch-cxx11-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/libtorch
@ -93,12 +93,12 @@ jobs:
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: libtorch-cxx11-builder-rocm${{matrix.rocm_version}}
docker-build-dir: .ci/docker/libtorch
@ -129,12 +129,12 @@ jobs:
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: libtorch-cxx11-builder-cpu
docker-build-dir: .ci/docker/libtorch


@ -41,7 +41,7 @@ jobs:
GPU_ARCH_TYPE: cpu-s390x
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
no-sudo: true


@ -36,58 +36,13 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
build-docker-cuda:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.8", "12.6", "12.4", "11.8"]
env:
GPU_ARCH_TYPE: cuda
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
steps:
- name: Purge tools folder (free space for build)
run: rm -rf /opt/hostedtoolcache
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
env:
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
DOCKER_ID: ${{ secrets.DOCKER_ID }}
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux-builder:cuda${{matrix.cuda_version}}
# NOTE: manylinux_2_28 are still experimental, see https://github.com/pytorch/pytorch/issues/123649
build-docker-cuda-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -102,12 +57,12 @@ jobs:
- name: Purge tools folder (free space for build)
run: rm -rf /opt/hostedtoolcache
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinux2_28-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
@ -138,7 +93,7 @@ jobs:
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge.ephemeral"
strategy:
matrix:
cuda_version: ["12.8", "12.6"]
cuda_version: ["12.8"]
env:
GPU_ARCH_TYPE: cuda-aarch64
GPU_ARCH_VERSION: ${{ matrix.cuda_version }}
@ -147,7 +102,7 @@ jobs:
uses: actions/checkout@v3
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinuxaarch64-builder-cuda${{matrix.cuda_version}}
docker-build-dir: .ci/docker/manywheel
@ -184,12 +139,12 @@ jobs:
GPU_ARCH_VERSION: ${{ matrix.rocm_version }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinux2_28-builder-rocm${{matrix.rocm_version}}
docker-build-dir: .ci/docker/manywheel
@ -214,42 +169,6 @@ jobs:
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux2_28-builder:rocm${{matrix.rocm_version}}
build-docker-cpu:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral"
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: manylinux-builder-cpu
docker-build-dir: .ci/docker/manywheel
always-rebuild: true
push: true
- name: Authenticate if WITH_PUSH
if: env.WITH_PUSH == 'true'
env:
DOCKER_TOKEN: ${{ secrets.DOCKER_TOKEN }}
DOCKER_ID: ${{ secrets.DOCKER_ID }}
run: |
if [[ "${WITH_PUSH}" == true ]]; then
echo "${DOCKER_TOKEN}" | docker login -u "${DOCKER_ID}" --password-stdin
fi
- name: Build Docker Image
if: env.WITH_PUSH == 'true'
uses: nick-fields/retry@v3.0.0
with:
shell: bash
timeout_minutes: 90
max_attempts: 3
retry_wait_seconds: 90
command: |
.ci/docker/manywheel/build.sh manylinux-builder:cpu
build-docker-cpu-manylinux_2_28:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
needs: get-label-type
@ -258,12 +177,12 @@ jobs:
GPU_ARCH_TYPE: cpu-manylinux_2_28
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinux2_28-builder-cpu
docker-build-dir: .ci/docker/manywheel
@ -296,12 +215,12 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinuxaarch64-builder-cpu-aarch64
docker-build-dir: .ci/docker/manywheel
@ -334,12 +253,12 @@ jobs:
GPU_ARCH_TYPE: cpu-aarch64-2_28
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinux2_28_aarch64-builder-cpu-aarch64
docker-build-dir: .ci/docker/manywheel
@ -375,12 +294,12 @@ jobs:
GPU_ARCH_TYPE: cpu-cxx11-abi
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinuxcxx11-abi-builder-cpu-cxx11-abi
docker-build-dir: .ci/docker/manywheel
@ -413,12 +332,12 @@ jobs:
GPU_ARCH_TYPE: xpu
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
- name: Calculate docker image
if: env.WITH_PUSH == 'false'
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: manylinux2_28-builder-xpu
docker-build-dir: .ci/docker/manywheel


@ -3,7 +3,7 @@ name: Build Triton wheels
on:
push:
branches:
- main
- release/2.7
tags:
# NOTE: Binary build pipelines should only get triggered on release candidate builds
# Release candidate tags look like: v1.11.0-rc1
@ -12,6 +12,8 @@ on:
- .github/workflows/build-triton-wheel.yml
- .github/scripts/build_triton_wheel.py
- .github/ci_commit_pins/triton.txt
- .github/scripts/windows/install_vs2022.ps1
- .github/scripts/windows/build_triton.bat
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
pull_request:
@ -19,6 +21,8 @@ on:
- .github/workflows/build-triton-wheel.yml
- .github/scripts/build_triton_wheel.py
- .github/ci_commit_pins/triton.txt
- .github/scripts/windows/install_vs2022.ps1
- .github/scripts/windows/build_triton.bat
- .ci/docker/ci_commit_pins/triton.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
@ -30,7 +34,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@ -40,37 +44,40 @@ jobs:
build-wheel:
name: "Build Triton Wheel"
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
runs-on: ${{ matrix.runs_on }}
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]
device: ["cuda", "rocm", "xpu"]
docker-image: ["pytorch/manylinux-builder:cpu", "pytorch/manylinux2_28-builder:cpu"]
exclude:
- device: "rocm"
docker-image: "pytorch/manylinux-builder:cpu"
- device: "xpu"
docker-image: "pytorch/manylinux2_28-builder:cpu"
device: ["cuda", "rocm", "xpu", "aarch64"]
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "6.3"
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "cuda"
rocm_version: ""
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "xpu"
rocm_version: ""
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "aarch64"
rocm_version: ""
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.arm64.2xlarge"
timeout-minutes: 40
env:
DOCKER_IMAGE: ${{ matrix.device == 'rocm' && format('pytorch/manylinux2_28-builder:rocm{0}', matrix.rocm_version) || matrix.docker-image }}
DOCKER_IMAGE: ${{ matrix.device == 'rocm' && format('pytorch/manylinux2_28-builder:rocm{0}', matrix.rocm_version) || matrix.device == 'aarch64' && 'pytorch/manylinux2_28_aarch64-builder:cpu-aarch64' || matrix.docker-image }}
PY_VERS: ${{ matrix.py_vers }}
BUILD_DEVICE: ${{ matrix.device }}
PLATFORM: ${{ contains(matrix.docker-image, '2_28') && 'manylinux_2_28_x86_64' || 'manylinux2014_x86_64' }}
PLATFORM: 'manylinux_2_28_x86_64'
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
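The rewritten matrix expands to 6 Python versions times 4 devices, with the `include` entries layering a per-device runner (and, for ROCm, a version) onto each matching combination. Roughly, as a Python sketch:

```python
from itertools import product

py_vers = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t"]
runners = {"cuda": "linux.4xlarge", "rocm": "linux.4xlarge",
           "xpu": "linux.4xlarge", "aarch64": "linux.arm64.2xlarge"}
jobs = [{"py": p, "device": d, "runs_on": r}
        for p, (d, r) in product(py_vers, runners.items())]
print(len(jobs))  # 24 build-wheel jobs; the real runs_on values also carry
                  # the get-label-type prefix resolved at dispatch time
```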
@ -78,7 +85,7 @@ jobs:
uses: ./.github/actions/setup-linux
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ env.DOCKER_IMAGE }}
@ -130,19 +137,22 @@ jobs:
fi
docker exec -t "${container_name}" yum install -y zlib-devel zip
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" -m pip install -U setuptools==67.4.0 pybind11==2.13.1 auditwheel
if [[ ("${{ matrix.device }}" == "cuda" || "${{ matrix.device }}" == "rocm") && "${PLATFORM}" == "manylinux_2_28_x86_64" ]]; then
docker exec -t "${container_name}" "${PYTHON_EXECUTABLE}" -m pip install -U setuptools==67.4.0 pybind11==2.13.1 auditwheel wheel
if [[ ("${{ matrix.device }}" == "cuda" || "${{ matrix.device }}" == "rocm" || "${{ matrix.device }}" == "aarch64" ) ]]; then
# With this install, it gets clang 16.0.6.
docker exec -t "${container_name}" dnf install clang lld -y
WITH_CLANG_LDD="--with-clang-ldd"
fi
if [[ "${BUILD_DEVICE}" == xpu ]]; then
docker exec -t "${container_name}" yum install -y devtoolset-11-gcc-c++
docker exec -t "${container_name}" bash -c "source /opt/rh/devtoolset-11/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
docker exec -t "${container_name}" bash -c "dnf install -y gcc-toolset-13-gcc-c++"
docker exec -t "${container_name}" bash -c "source /opt/rh/gcc-toolset-13/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
else
docker exec -t "${container_name}" bash -c "${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE $WITH_CLANG_LDD"
fi
if [[ "${{ matrix.device }}" == "cuda" ]]; then
if [[ ("${{ matrix.device }}" == "cuda" || "${{ matrix.device }}" == "xpu") ]]; then
docker exec -t "${container_name}" bash -c "auditwheel repair --plat ${PLATFORM} //artifacts/*.whl"
else
docker exec -t "${container_name}" bash -c "mkdir //artifacts/wheelhouse"
@@ -157,18 +167,104 @@ jobs:
path: ${{ runner.temp }}/artifacts/wheelhouse/*
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always()
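
The `strategy.matrix` in the job above expands into one job per `py_vers`/`device` pair, with each `include` entry layering its extra keys (`rocm_version`, `runs_on`) onto every combination whose `device` matches; the two stacked `DOCKER_IMAGE` lines show the old and new forms of the `cond && a || b` expression idiom GitHub Actions uses in place of a ternary. A minimal Python sketch of that expansion, assuming the simplified post-change matrix (runner labels shown without the `label-type` prefix):

```python
from itertools import product

py_vers = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t"]
devices = ["cuda", "rocm", "xpu", "aarch64"]

# Each include entry adds keys to every combination whose `device` matches.
includes = {
    "rocm":    {"rocm_version": "6.3", "runs_on": "linux.4xlarge"},
    "cuda":    {"rocm_version": "",    "runs_on": "linux.4xlarge"},
    "xpu":     {"rocm_version": "",    "runs_on": "linux.4xlarge"},
    "aarch64": {"rocm_version": "",    "runs_on": "linux.arm64.2xlarge"},
}

jobs = []
for py, dev in product(py_vers, devices):
    combo = {"py_vers": py, "device": dev,
             "docker-image": "pytorch/manylinux2_28-builder:cpu"}
    combo.update(includes[dev])
    # Mirrors ${{ matrix.device == 'rocm' && format(...) || ... }}:
    # && yields its right side when the left is truthy, || falls through.
    if dev == "rocm":
        combo["DOCKER_IMAGE"] = f"pytorch/manylinux2_28-builder:rocm{combo['rocm_version']}"
    elif dev == "aarch64":
        combo["DOCKER_IMAGE"] = "pytorch/manylinux2_28_aarch64-builder:cpu-aarch64"
    else:
        combo["DOCKER_IMAGE"] = combo["docker-image"]
    jobs.append(combo)

print(len(jobs))  # 6 Python versions x 4 devices = 24 build jobs
```
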
build-wheel-win:
name: "Build Triton Windows Wheel"
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t" ]
device: ["xpu"]
timeout-minutes: 40
env:
PY_VERS: ${{ matrix.py_vers }}
BUILD_DEVICE: ${{ matrix.device }}
VC_INSTALL_PATH: "C:\\MSVC-BuildTools-2022"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: false
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Enable long paths on Windows and install VS2022 17.13.2
env:
VC_YEAR: 2022
VC_VERSION: 17.13.2
shell: bash
working-directory: pytorch
run: |
powershell .github/scripts/windows/install_vs2022.ps1
- name: Build Triton wheel
env:
IS_RELEASE_TAG: ${{ startsWith(github.event.ref, 'refs/tags/v') }}
working-directory: pytorch
shell: bash
run: |
set -x
export RELEASE=""
if [[ "${IS_RELEASE_TAG}" == true ]]; then
export RELEASE="--release"
fi
.github/scripts/windows/build_triton.bat
mkdir -p "${RUNNER_TEMP}/artifacts/"
mv ./*.whl "${RUNNER_TEMP}/artifacts/"
- uses: actions/upload-artifact@v4.4.0
with:
name: pytorch-triton-wheel-${{ matrix.py_vers }}-${{ matrix.device }}
if-no-files-found: error
path: ${{ runner.temp }}/artifacts/*
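
The "Display EC2 information" step in this job queries the instance metadata service with IMDSv2: a PUT first mints a short-lived session token, and every subsequent read presents that token. A rough standard-library Python equivalent of the bash helper (illustrative only; it works only from inside an EC2 instance):

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"

def get_ec2_metadata(category: str, ttl_seconds: int = 30) -> str:
    # IMDSv2 step 1: PUT to mint a session token with a bounded TTL.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()
    # IMDSv2 step 2: GET the metadata category, presenting the token.
    meta_req = urllib.request.Request(
        f"{IMDS}/meta-data/{category}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(meta_req, timeout=2) as resp:
        return resp.read().decode()

for category in ("ami-id", "instance-id", "instance-type"):
    print(f"{category}: {get_ec2_metadata(category)}")
```
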
upload-wheel:
runs-on: ubuntu-22.04
needs: build-wheel
needs:
- build-wheel
- build-wheel-win
permissions:
id-token: write
contents: read
container:
image: continuumio/miniconda3:4.12.0
environment: ${{ (github.event_name == 'push' && (github.event.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v'))) && 'conda-aws-upload' || '' }}
environment: ${{ (github.event_name == 'push' && github.event.ref == 'refs/heads/main') && 'nightly-wheel-upload' || '' }}
steps:
- uses: actions/checkout@v3


@@ -38,7 +38,7 @@ jobs:
runs-on: linux.20_04.4x
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
fetch-depth: 1


@@ -13,7 +13,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
submodules: false
fetch-depth: 1


@@ -19,7 +19,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}


@@ -33,7 +33,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -49,10 +49,13 @@ jobs:
matrix:
runner: [linux.12xlarge]
docker-image-name: [
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9,
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11,
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9,
pytorch-linux-focal-py3.9-clang10,
pytorch-linux-focal-py3.11-clang10,
@@ -96,21 +99,21 @@ jobs:
# [see note: pytorch repo ref]
# deep clone (fetch-depth 0) required for git merge-base
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Build docker image
id: build-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
uses: pytorch/test-infra/.github/actions/calculate-docker-image@release/2.7
with:
docker-image-name: ${{ matrix.docker-image-name }}
always-rebuild: true
push: true
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
docker-image: ${{ steps.build-docker-image.outputs.docker-image }}
@@ -142,5 +145,5 @@ jobs:
if: always()
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always()


@@ -37,7 +37,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -52,7 +52,7 @@ jobs:
matrix: ${{ steps.generate-matrix.outputs.matrix }}
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
uses: pytorch/pytorch/.github/actions/checkout-pytorch@release/2.7
with:
fetch-depth: 1
submodules: true
@@ -82,7 +82,7 @@ jobs:
CUDNN_VERSION: ${{ matrix.cudnn_version }}
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
uses: pytorch/test-infra/.github/actions/setup-ssh@release/2.7
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
# [see note: pytorch repo ref]
@@ -103,20 +103,24 @@ jobs:
password: ${{ secrets.GHCR_PAT }}
# Setup multi-arch image builds
- name: Set up QEMU
uses: docker/setup-qemu-action@v2
uses: docker/setup-qemu-action@v3
env:
QEMU_BINARY_PATH: ${{ runner.temp }}/bin
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v2
uses: docker/setup-buildx-action@v3
with:
version: v0.10.0
version: latest
driver-opts: image=moby/buildkit:v0.19.0
- name: Setup job specific variables
run: |
set -eou pipefail
# To get QEMU binaries in our PATH
echo "${RUNNER_TEMP}/bin" >> "${GITHUB_PATH}"
# Generate PyTorch version to use
echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py --no-build-suffix)" >> "${GITHUB_ENV}"
{
echo "PYTORCH_VERSION=$(python3 .github/scripts/generate_pytorch_version.py --no-build-suffix)";
echo "STABLE_CUDA_VERSION=$(python3 .github/scripts/get_ci_variable.py --stable-cuda-version)"
} >> "${GITHUB_ENV}"
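
The brace group above batches several `KEY=VALUE` lines into one append to the file named by `$GITHUB_ENV`; GitHub Actions reads that file when the step finishes and exports each line as an environment variable for later steps. A hedged Python sketch of the same file protocol (the values here are placeholders, not what the real scripts compute):

```python
import os

def set_github_env(**pairs: str) -> None:
    # Every KEY=VALUE line appended to $GITHUB_ENV becomes an env var
    # visible to all subsequent steps of the job.
    with open(os.environ["GITHUB_ENV"], "a", encoding="utf-8") as env_file:
        for key, value in pairs.items():
            env_file.write(f"{key}={value}\n")

set_github_env(
    PYTORCH_VERSION="2.7.0",      # placeholder; really from generate_pytorch_version.py
    STABLE_CUDA_VERSION="12.6",   # placeholder; really from get_ci_variable.py
)
```
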
- name: Setup test specific variables
if: ${{ startsWith(github.event.ref, 'refs/tags/v') }}
run: |
@@ -153,19 +157,19 @@ jobs:
docker push ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}"
# Please note, here we need to pin a specific version of CUDA for the latest label
if [[ ${CUDA_VERSION_SHORT} == "12.4" ]]; then
if [[ ${CUDA_VERSION_SHORT} == "${STABLE_CUDA_VERSION}" ]]; then
docker tag ghcr.io/pytorch/pytorch-nightly:"${PYTORCH_NIGHTLY_COMMIT}${CUDA_SUFFIX}" \
ghcr.io/pytorch/pytorch-nightly:latest
docker push ghcr.io/pytorch/pytorch-nightly:latest
fi
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
uses: pytorch/test-infra/.github/actions/teardown-linux@release/2.7
if: always()
validate:
needs: build
uses: pytorch/test-infra/.github/workflows/validate-docker-images.yml@main
uses: pytorch/test-infra/.github/workflows/validate-docker-images.yml@release/2.7
with:
channel: nightly
channel: test
ref: main
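
The push step earlier in this file now tags a nightly image as `latest` only when its CUDA version matches `STABLE_CUDA_VERSION` (previously a hard-coded `12.4`). A short sketch of that gating, with placeholder arguments:

```python
import subprocess

def push_nightly(commit: str, cuda_suffix: str, cuda_short: str, stable_cuda: str) -> None:
    image = f"ghcr.io/pytorch/pytorch-nightly:{commit}{cuda_suffix}"
    subprocess.run(["docker", "push", image], check=True)
    # Only the build for the stable CUDA version is also published as :latest.
    if cuda_short == stable_cuda:
        latest = "ghcr.io/pytorch/pytorch-nightly:latest"
        subprocess.run(["docker", "tag", image, latest], check=True)
        subprocess.run(["docker", "push", latest], check=True)

# e.g. push_nightly("abc123", "-cuda12.6", "12.6", stable_cuda="12.6")
```
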


@@ -38,7 +38,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -55,7 +55,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -64,7 +64,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
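
`PYTORCH_EXTRA_INSTALL_REQUIREMENTS` packs several PEP 508 requirement strings into one value, separated by `|`; each carries an environment marker so the CUDA wheels are only installed on x86_64 Linux. A sketch of splitting and evaluating such a value with the `packaging` library (the value below is a two-entry excerpt of the full list):

```python
from packaging.requirements import Requirement

extra = (
    "nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64'"
    " | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64'"
)

for raw in extra.split("|"):
    req = Requirement(raw.strip())
    # Marker.evaluate() checks the requirement against the running platform.
    applies = req.marker is None or req.marker.evaluate()
    print(f"{req.name}{req.specifier}: {'install' if applies else 'skip'}")
```
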
manywheel-py3_9-cpu-aarch64-test: # Testing
@@ -80,7 +80,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -104,7 +104,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -113,53 +113,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_9-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_9-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -172,7 +125,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -181,6 +134,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -198,7 +152,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -218,7 +172,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
@@ -227,7 +181,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cpu-aarch64-test: # Testing
@@ -243,7 +197,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
@@ -267,7 +221,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
@@ -276,53 +230,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -335,7 +242,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
@@ -344,6 +251,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -361,7 +269,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
@@ -381,7 +289,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
@@ -390,7 +298,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cpu-aarch64-test: # Testing
@@ -406,7 +314,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
@@ -430,7 +338,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
@@ -439,53 +347,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -498,7 +359,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
@@ -507,6 +368,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -524,7 +386,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
@@ -544,7 +406,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
@@ -553,7 +415,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cpu-aarch64-test: # Testing
@@ -569,7 +431,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
@@ -593,7 +455,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
@@ -602,53 +464,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -661,7 +476,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
@@ -670,6 +485,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -687,7 +503,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
@@ -707,7 +523,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
@@ -716,7 +532,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cpu-aarch64-test: # Testing
@@ -732,7 +548,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
@@ -756,7 +572,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
@@ -765,53 +581,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -824,7 +593,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
@@ -833,6 +602,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -850,7 +620,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13"
@@ -870,7 +640,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
@@ -879,7 +649,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-aarch64-test: # Testing
@@ -895,7 +665,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
@@ -919,7 +689,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu-aarch64
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-main
DOCKER_IMAGE: pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
@@ -928,53 +698,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_6-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_6-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -987,7 +710,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
@@ -996,6 +719,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1013,7 +737,7 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8-aarch64
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-main
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
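
Each Python version in this workflow repeats the same build/test/upload job triple with only a few fields changing, which is why the diff is so regular: the `generated-*` binary workflows are produced by a generator script in the repo (`.github/scripts/generate_ci_workflows.py`) rather than edited by hand. A toy sketch of that kind of generation (field set heavily simplified):

```python
PY_VERSIONS = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.13t"]
PHASES = ["build", "test", "upload"]

def job_name(py: str, device: str) -> str:
    # "3.13t" -> "manywheel-py3_13t-cpu-aarch64"
    return f"manywheel-py{py.replace('.', '_')}-{device}"

def emit_jobs(device: str, docker_image: str) -> list[dict]:
    jobs = []
    for py in PY_VERSIONS:
        for phase in PHASES:
            jobs.append({
                "name": f"{job_name(py, device)}-{phase}",
                "DESIRED_PYTHON": py,
                "DOCKER_IMAGE": docker_image,
            })
    return jobs

jobs = emit_jobs("cpu-aarch64", "pytorch/manylinux2_28_aarch64-builder:cpu-aarch64-2.7")
print(len(jobs))  # 6 versions x 3 phases = 18 jobs
```
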


@@ -33,7 +33,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -50,7 +50,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -71,7 +71,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cpu-shared-with-deps-cxx11-abi


@@ -38,7 +38,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
-uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
+uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -55,7 +55,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -76,7 +76,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cpu-shared-with-deps-cxx11-abi
@@ -98,7 +98,7 @@ jobs:
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cpu-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cpu-shared-with-deps-cxx11-abi
@@ -118,7 +118,7 @@ jobs:
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -140,7 +140,7 @@ jobs:
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda11_8-shared-with-deps-cxx11-abi
@@ -163,7 +163,7 @@ jobs:
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda11.8-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda11_8-shared-with-deps-cxx11-abi
@@ -171,71 +171,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
-libtorch-cuda12_4-shared-with-deps-cxx11-abi-build:
-if: ${{ github.repository_owner == 'pytorch' }}
-uses: ./.github/workflows/_binary-build-linux.yml
-needs: get-label-type
-with:
-PYTORCH_ROOT: /pytorch
-PACKAGE_TYPE: libtorch
-# TODO: This is a legacy variable that we eventually want to get rid of in
-# favor of GPU_ARCH_VERSION
-DESIRED_CUDA: cu124
-GPU_ARCH_VERSION: 12.4
-GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
-LIBTORCH_VARIANT: shared-with-deps
-DESIRED_DEVTOOLSET: cxx11-abi
-runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
-build_environment: linux-binary-libtorch-cxx11-abi
-secrets:
-github-token: ${{ secrets.GITHUB_TOKEN }}
-libtorch-cuda12_4-shared-with-deps-cxx11-abi-test: # Testing
-if: ${{ github.repository_owner == 'pytorch' }}
-needs:
-- libtorch-cuda12_4-shared-with-deps-cxx11-abi-build
-- get-label-type
-uses: ./.github/workflows/_binary-test-linux.yml
-with:
-PYTORCH_ROOT: /pytorch
-PACKAGE_TYPE: libtorch
-# TODO: This is a legacy variable that we eventually want to get rid of in
-# favor of GPU_ARCH_VERSION
-DESIRED_CUDA: cu124
-GPU_ARCH_VERSION: 12.4
-GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
-LIBTORCH_VARIANT: shared-with-deps
-DESIRED_DEVTOOLSET: cxx11-abi
-build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
-build_environment: linux-binary-libtorch-cxx11-abi
-runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-runs_on: linux.4xlarge.nvidia.gpu
-secrets:
-github-token: ${{ secrets.GITHUB_TOKEN }}
-libtorch-cuda12_4-shared-with-deps-cxx11-abi-upload: # Uploading
-if: ${{ github.repository_owner == 'pytorch' }}
-permissions:
-id-token: write
-contents: read
-needs: libtorch-cuda12_4-shared-with-deps-cxx11-abi-test
-with:
-PYTORCH_ROOT: /pytorch
-PACKAGE_TYPE: libtorch
-# TODO: This is a legacy variable that we eventually want to get rid of in
-# favor of GPU_ARCH_VERSION
-DESIRED_CUDA: cu124
-GPU_ARCH_VERSION: 12.4
-GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.4-main
-LIBTORCH_VARIANT: shared-with-deps
-DESIRED_DEVTOOLSET: cxx11-abi
-build_name: libtorch-cuda12_4-shared-with-deps-cxx11-abi
-secrets:
-github-token: ${{ secrets.GITHUB_TOKEN }}
-uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_6-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -248,7 +183,7 @@ jobs:
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -270,7 +205,7 @@ jobs:
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda12_6-shared-with-deps-cxx11-abi
@@ -293,7 +228,7 @@ jobs:
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.6-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-cuda12_6-shared-with-deps-cxx11-abi
@@ -301,6 +236,71 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
+libtorch-cuda12_8-shared-with-deps-cxx11-abi-build:
+if: ${{ github.repository_owner == 'pytorch' }}
+uses: ./.github/workflows/_binary-build-linux.yml
+needs: get-label-type
+with:
+PYTORCH_ROOT: /pytorch
+PACKAGE_TYPE: libtorch
+# TODO: This is a legacy variable that we eventually want to get rid of in
+# favor of GPU_ARCH_VERSION
+DESIRED_CUDA: cu128
+GPU_ARCH_VERSION: 12.8
+GPU_ARCH_TYPE: cuda
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.8-2.7
+LIBTORCH_VARIANT: shared-with-deps
+DESIRED_DEVTOOLSET: cxx11-abi
+runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
+build_name: libtorch-cuda12_8-shared-with-deps-cxx11-abi
+build_environment: linux-binary-libtorch-cxx11-abi
+secrets:
+github-token: ${{ secrets.GITHUB_TOKEN }}
+libtorch-cuda12_8-shared-with-deps-cxx11-abi-test: # Testing
+if: ${{ github.repository_owner == 'pytorch' }}
+needs:
+- libtorch-cuda12_8-shared-with-deps-cxx11-abi-build
+- get-label-type
+uses: ./.github/workflows/_binary-test-linux.yml
+with:
+PYTORCH_ROOT: /pytorch
+PACKAGE_TYPE: libtorch
+# TODO: This is a legacy variable that we eventually want to get rid of in
+# favor of GPU_ARCH_VERSION
+DESIRED_CUDA: cu128
+GPU_ARCH_VERSION: 12.8
+GPU_ARCH_TYPE: cuda
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.8-2.7
+LIBTORCH_VARIANT: shared-with-deps
+DESIRED_DEVTOOLSET: cxx11-abi
+build_name: libtorch-cuda12_8-shared-with-deps-cxx11-abi
+build_environment: linux-binary-libtorch-cxx11-abi
+runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
+runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 build needs sm_70+ runner
+secrets:
+github-token: ${{ secrets.GITHUB_TOKEN }}
+libtorch-cuda12_8-shared-with-deps-cxx11-abi-upload: # Uploading
+if: ${{ github.repository_owner == 'pytorch' }}
+permissions:
+id-token: write
+contents: read
+needs: libtorch-cuda12_8-shared-with-deps-cxx11-abi-test
+with:
+PYTORCH_ROOT: /pytorch
+PACKAGE_TYPE: libtorch
+# TODO: This is a legacy variable that we eventually want to get rid of in
+# favor of GPU_ARCH_VERSION
+DESIRED_CUDA: cu128
+GPU_ARCH_VERSION: 12.8
+GPU_ARCH_TYPE: cuda
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:cuda12.8-2.7
+LIBTORCH_VARIANT: shared-with-deps
+DESIRED_DEVTOOLSET: cxx11-abi
+build_name: libtorch-cuda12_8-shared-with-deps-cxx11-abi
+secrets:
+github-token: ${{ secrets.GITHUB_TOKEN }}
+uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_2_4-shared-with-deps-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -313,7 +313,7 @@ jobs:
DESIRED_CUDA: rocm6.2.4
GPU_ARCH_VERSION: 6.2.4
GPU_ARCH_TYPE: rocm
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -337,7 +337,7 @@ jobs:
GPU_ARCH_VERSION: 6.2.4
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
@@ -351,7 +351,6 @@ jobs:
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
-ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
@@ -364,9 +363,9 @@ jobs:
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
-uses: pytorch/test-infra/.github/actions/pull-docker-image@main
+uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
-docker-image: pytorch/libtorch-cxx11-builder:rocm6.2.4-main
+docker-image: pytorch/libtorch-cxx11-builder:rocm6.2.4-2.7
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
@@ -385,7 +384,7 @@ jobs:
DESIRED_CUDA: rocm6.2.4
GPU_ARCH_VERSION: 6.2.4
GPU_ARCH_TYPE: rocm
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.2.4-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_2_4-shared-with-deps-cxx11-abi
@@ -405,7 +404,7 @@ jobs:
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: 6.3
GPU_ARCH_TYPE: rocm
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
@@ -429,7 +428,7 @@ jobs:
GPU_ARCH_VERSION: 6.3
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
steps:
@@ -443,7 +442,6 @@ jobs:
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
-ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
@@ -456,9 +454,9 @@ jobs:
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: Pull Docker image
-uses: pytorch/test-infra/.github/actions/pull-docker-image@main
+uses: pytorch/test-infra/.github/actions/pull-docker-image@release/2.7
with:
-docker-image: pytorch/libtorch-cxx11-builder:rocm6.3-main
+docker-image: pytorch/libtorch-cxx11-builder:rocm6.3-2.7
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
- name: Teardown ROCm
@@ -477,7 +475,7 @@ jobs:
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: 6.3
GPU_ARCH_TYPE: rocm
-DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-main
+DOCKER_IMAGE: pytorch/libtorch-cxx11-builder:rocm6.3-2.7
LIBTORCH_VARIANT: shared-with-deps
DESIRED_DEVTOOLSET: cxx11-abi
build_name: libtorch-rocm6_3-shared-with-deps-cxx11-abi


@@ -33,7 +33,7 @@ jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
-uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
+uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.7
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
@@ -51,7 +51,7 @@ jobs:
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda11.8-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda11.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -75,7 +75,7 @@ jobs:
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda11.8-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda11.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -86,53 +86,6 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
-manywheel-py3_9-cuda12_4-build:
-if: ${{ github.repository_owner == 'pytorch' }}
-uses: ./.github/workflows/_binary-build-linux.yml
-needs: get-label-type
-with:
-PYTORCH_ROOT: /pytorch
-PACKAGE_TYPE: manywheel
-# TODO: This is a legacy variable that we eventually want to get rid of in
-# favor of GPU_ARCH_VERSION
-DESIRED_CUDA: cu124
-GPU_ARCH_VERSION: 12.4
-GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.4-main
-DESIRED_DEVTOOLSET: cxx11-abi
-use_split_build: False
-DESIRED_PYTHON: "3.9"
-runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-build_name: manywheel-py3_9-cuda12_4
-build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
-secrets:
-github-token: ${{ secrets.GITHUB_TOKEN }}
-manywheel-py3_9-cuda12_4-test: # Testing
-if: ${{ github.repository_owner == 'pytorch' }}
-needs:
-- manywheel-py3_9-cuda12_4-build
-- get-label-type
-uses: ./.github/workflows/_binary-test-linux.yml
-with:
-PYTORCH_ROOT: /pytorch
-PACKAGE_TYPE: manywheel
-# TODO: This is a legacy variable that we eventually want to get rid of in
-# favor of GPU_ARCH_VERSION
-DESIRED_CUDA: cu124
-GPU_ARCH_VERSION: 12.4
-GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.4-main
-DESIRED_DEVTOOLSET: cxx11-abi
-use_split_build: False
-DESIRED_PYTHON: "3.9"
-build_name: manywheel-py3_9-cuda12_4
-build_environment: linux-binary-manywheel
-runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-runs_on: linux.4xlarge.nvidia.gpu
-secrets:
-github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -145,14 +98,14 @@ jobs:
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.6-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.6-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.5.1.17; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@@ -169,7 +122,7 @@ jobs:
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.6-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.6-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
@@ -192,14 +145,14 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.8-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
-PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.25.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
+PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.57; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.7.1.26; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.3.14; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.41; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.2.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.7.53; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.6.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.26.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.55; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.61; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.0.11; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@@ -216,13 +169,13 @@ jobs:
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8
GPU_ARCH_TYPE: cuda
-DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.8-main
+DOCKER_IMAGE: pytorch/manylinux2_28-builder:cuda12.8-2.7
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-runs_on: linux.4xlarge.nvidia.gpu
+runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 build needs sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
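
The `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` values in these manywheel jobs are `|`-separated lists of pinned NVIDIA wheels, each guarded by a PEP 508 environment marker so that the CUDA dependencies are only installed on x86_64 Linux hosts. A minimal sketch of how one such entry is evaluated at install time, assuming the `packaging` library (pip vendors an equivalent); the requirement string is copied from the entries above, and splitting the list on `|` is assumed to happen in the build scripts that consume this variable:

```python
# Minimal sketch (not PyTorch CI code): how one entry of
# PYTORCH_EXTRA_INSTALL_REQUIREMENTS is gated by its PEP 508 marker.
# Requires: pip install packaging
from packaging.requirements import Requirement

# One entry copied from the workflow above.
req = Requirement(
    "nvidia-nccl-cu12==2.26.2; "
    "platform_system == 'Linux' and platform_machine == 'x86_64'"
)

# Evaluate against the running interpreter's environment...
print(req.name, req.specifier, req.marker.evaluate())

# ...or against an explicit environment, e.g. an aarch64 host,
# where an installer would skip this requirement entirely.
print(req.marker.evaluate({
    "platform_system": "Linux",
    "platform_machine": "aarch64",
}))  # -> False
```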

Some files were not shown because too many files have changed in this diff.