Compare commits


6258 Commits

Author SHA1 Message Date
8522972133 torch.backends.mkldnn.flags() CM should not warn (#150416)
`torch.backends.mkldnn.flags()` CM should not warn (#150358)

By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829
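
A minimal sketch of the behavior being guarded, assuming a build that contains this fix (the warning-escalation strategy is illustrative):

```python
# Hedged regression sketch: entering the context manager should not emit any warning.
import warnings

import torch

with warnings.catch_warnings():
    warnings.simplefilter("error")  # escalate any warning into an error
    with torch.backends.mkldnn.flags(enabled=True):
        torch.ones(2, 2) + torch.ones(2, 2)
print("no warning raised")
```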

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman

(cherry picked from commit 6470b373c16017f5cb8f1aa4060bb60632b18160)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-04-01 08:53:27 -07:00
c4b98c8364 [Build] Fix XPU builds inside venv (#150301)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)

 - Fix the build error if users build torch xpu through a Python virtual environment. This was because torch-xpu-ops uses `${PYTHON_EXECUTABLE}` to get the Python path; however, `${PYTHON_EXECUTABLE}` is the system Python path, while the PyTorch root CMake uses `Python_EXECUTABLE` ([Here](420a9be743/tools/setup_helpers/cmake.py (L310))) https://github.com/intel/torch-xpu-ops/issues/1461
 - code diff (026b2c8c7c..3ee2bd2f13)
   - base commit: 026b2c8c7c92a7b2cec5d26334006e3423251cc6
   - new commit: 3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5

(cherry picked from commit f74d5d576aedf053b7574f3eb06d12417d80625a)

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2025-04-01 08:22:00 -07:00
d10ffd76db [Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#150395)
[Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863)

We found that `pip install cmake` and `conda install cmake` behave differently.
The reason is that the pip-installed cmake doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` for alignment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet

Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
(cherry picked from commit ce52674b7651921630019de62323ee0bfd69516d)

Co-authored-by: Stonepia <tong.su@intel.com>
2025-04-01 10:56:57 -04:00
53a13e553d Enabling xpu in OffsetBasedRNGTracker. (#150389)
Enabling xpu in OffsetBasedRNGTracker. (#148360)

Otherwise, torch.distributed breaks on XPU devices.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
(cherry picked from commit f0e1a0838c1245a8763d1c67318b23940a3e9246)

Co-authored-by: _githubsgi <zozoxoxo897@gmail.com>
2025-04-01 10:54:51 -04:00
5745d6a770 [ROCm] cmake 4 workaround for hiprtc (#150361)
[ROCm] cmake 4 workaround for hiprtc (#150324)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324
Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet

(cherry picked from commit 423e4a4568958845da52808e50d1cdd2ba7fa48d)

Co-authored-by: Faa Diallo <Faa.Diallo@amd.com>
2025-04-01 10:44:03 -04:00
60ddcd803e Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590) (#150352)
Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590)"

This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.
2025-03-31 15:25:34 -07:00
f2b3b5c453 [MPS] Fix dot/mm for conj_tensors (#150237)
[MPS] Fix dot/mm for conj_tensors (#150157)

- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter the `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as a separate PR)
- Preserve the conj property when gathering the views, which fixes the `cov` operator

Fixes https://github.com/pytorch/pytorch/issues/148156
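
Illustrative sketch of the fixed behavior (assumes an MPS-capable machine; values are arbitrary):

```python
import torch

a = torch.randn(8, dtype=torch.cfloat, device="mps")
b = torch.randn(8, dtype=torch.cfloat, device="mps")
# The conjugated view should now produce the same result as the CPU reference
# instead of silently ignoring the conj bit.
print(torch.dot(a.conj(), b))
print(torch.dot(a.conj().cpu(), b.cpu()))
```
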
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci

(cherry picked from commit 7c65911b11fc1cc7d93045f4cf923058e8a27782)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-03-31 16:13:11 -04:00
71fa7def26 Fix #149806 : Fix path lookup in _preload_cuda_deps (#150068)
Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808)

@pytorchbot label "bug"

Fixes #149806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808
Approved by: https://github.com/jansel

(cherry picked from commit 68b327341c748c869fdd7cb51cd05ab8ad6caaac)

Co-authored-by: Divain <fegnouche@hotmail.fr>
2025-03-31 13:07:56 -07:00
1a6c192dc4 Use schema as source of truth + support ones_like/empty_like (#149775)
Use schema as source of truth + support ones_like/empty_like (#149052)

This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important because IValue types are overloaded and can ambiguously convert incorrectly. For example, a MemoryFormat looks like an int and would get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD

(cherry picked from commit 988827cdfb6d5946049cac7141a5ca04f2177c0a)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-31 11:03:26 -07:00
e691e92297 Update Doc for Intel XPU Profiling (#150272)
Update Doc for Intel XPU Profiling (#134515)

Updated the two pages below for Intel XPU:
https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html
https://pytorch.org/docs/stable/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515
Approved by: https://github.com/dvrogozh, https://github.com/malfet

(cherry picked from commit 7aacbab0b32596a3c334dca5d488e4620b79bb5e)

Co-authored-by: Louie Tsai <louie.tsai@intel.com>
2025-03-31 09:42:08 -07:00
2b73f403c7 Pin cmake to 3.31.2 for windows conda install (#150223)
Pin cmake to 3.31.2 for windows conda install (#150185)

Trying to fix nightly failures.
The CMake 4.0 update (https://pypi.org/project/cmake/4.0.0/) broke nightly builds.
You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build
and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=
This fix is for Windows builds; Linux and macOS were already fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi

(cherry picked from commit b0901d62ae2c2e909f91401eacebf3731df20cbe)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-28 14:28:41 -07:00
697cd9bbb1 [inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149993)
[inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973)

Fixes #148938

Context:

In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.

But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the kernel used the TMA descriptor.

This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).

Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg

(cherry picked from commit a8d0c5c92818186119d4a94d98999acc3f549a7e)

Co-authored-by: David Berard <dberard@fb.com>
2025-03-28 16:40:47 -04:00
64ca70f83c Pin cmake==3.31.6 (#150193)
Pin cmake==3.31.6 (#150158)

I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on pypi and our builds are failing with it

Example:
aa70d62041 (39555975425-box)

I guess we have to go change all the cmake_minimum_required to >=3.5?

Backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least the manywheel binary builds got past where they were originally failing

Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet

(cherry picked from commit 0ece461ccafe5649d2d0f058ff5477765fd56499)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-03-28 09:09:29 -07:00
1b84fd1503 Enable fast path for qlinear (static/dynamic) and qadd for AArch64 though ACL directly. (#149435)
* Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)

This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the GEMM kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context lengths of 2^3 up to 2^9) of **~50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          type=int,
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (`u8s8u8`)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activation is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          type=int,
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)  # use the passed-in model, not the global

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 08a644a4c4a0f74cf3277e85e265a44a192079c5)

* Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)

This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
The relative performance improvement is ~15x with OMP_NUM_THREADS=1 and ~5.4x with OMP_NUM_THREADS=32.
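
For illustration, a minimal eager-mode quantized add that exercises this path might look like the following sketch (shapes, scales, and zero-points are arbitrary):

```python
import torch

a = torch.quantize_per_tensor(torch.randn(64, 64), scale=0.05, zero_point=0, dtype=torch.qint8)
b = torch.quantize_per_tensor(torch.randn(64, 64), scale=0.05, zero_point=0, dtype=torch.qint8)
out = torch.ops.quantized.add(a, b, 0.1, 0)  # output scale and zero_point
print(out.dtype)  # torch.qint8
```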

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585

(cherry picked from commit 6c2db8fab047b8a1d671c3c8dfbdd4c478c6d2e3)

* [Build] Guard per-op headers in ACLUtils.cpp (#149417)

To fix internal build failures, where per-op headers are not generated.
We really should have lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb

(cherry picked from commit 5db3a4ac88ad9a3062a9f64dc64741b820208a91)

---------

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-03-28 07:51:16 -07:00
6b27e11a5b [Release-only] Pin intel-oneapi-dnnl to 2025.0.1-6 (#150132)
[CI] Fix the XPU CI build environment
2025-03-27 16:07:33 -07:00
18a926f547 update release 2.7 xla pin (#150126)
* update release 2.7 xla pin

* fix

* fix
2025-03-27 18:29:03 -04:00
ecd434bea9 Revert "Parallelize sort" (#150128)
Revert "Parallelize sort (#149765)"

This reverts commit 8d2186cd7952336d4f8b3f73648a5c0714a832b9 as it causes an inductor test regression; see 5bed3fafc7/1
2025-03-27 12:04:15 -07:00
5bed3fafc7 [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149432)
[ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)

Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
(cherry picked from commit c0566e0dbf42f633624adb02015742509edcb444)

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-26 17:18:20 -07:00
9b4f085526 [MPS] fix attention enable_gqa crash on mps (#150067)
[MPS] fix attention enable_gqa crash on mps (#149147)

Fixes #149132
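
A minimal sketch of the grouped-query attention call that previously crashed (assumes an MPS device; shapes are arbitrary, with 8 query heads sharing 2 KV heads):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 16, 32, device="mps")
k = torch.randn(1, 2, 16, 32, device="mps")
v = torch.randn(1, 2, 16, 32, device="mps")
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```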

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet

(cherry picked from commit dd6e9df3d00851c44fb76341f3113fe9223dcfca)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-03-26 17:17:11 -07:00
d29e4c81d9 update aotinductor doc for XPU support (#149935)
update aotinductor doc for XPU support (#149299)

As titled. Since the AOTInductor feature works on Intel GPU starting from 2.7, add the related content to its doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299
Approved by: https://github.com/guangyey, https://github.com/desertfire

(cherry picked from commit 4ea580568a27e281b96d26d9380c786c2e2116e6)

Co-authored-by: Jing Xu <jing.xu@intel.com>
2025-03-26 18:42:56 -05:00
8d2186cd79 Parallelize sort (#149765)
Parallelize sort (#149505)

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007

(cherry picked from commit 842d51500be144d53f4d046d31169e8f46c063f6)

Co-authored-by: Annop Wongwathanarat <annop.wongwathanarat@arm.com>
2025-03-26 16:47:46 -05:00
b04d8358d9 ci/docker: use NCCL 2.26.2-1 (#149874)
ci/docker: use NCCL 2.26.2-1 (#149778)

Related to #149153

This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip.

Test plan:

After merging rerun nightly linux jobs and validate that nccl version matches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
(cherry picked from commit ddc0fe903f3043246103d71b60a4fff0aeeef9e8)

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 16:36:35 -04:00
d80afc07f0 [cherry-pick] Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540 (#149624)
Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the nccl CUDA aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: https://github.com/pytorch/pytorch/pull/149351
TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds

3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
2025-03-26 13:33:54 -07:00
84210a82ef [cherry-pick] nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351) (#149546)
* nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)

Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet

* fixed_regenerations

---------

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 13:32:51 -07:00
4268b2f40a [MPSInductor] Move threadfence at the right location (#150037)
[MPSInductor] Move threadfence at the right location (#149437)

Not sure how it worked in the past, but the fence should be before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969, which removed an unnecessary barrier before calling the `threadgroup_reduce` functions.
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before this change it produced gibberish; now it works fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci

(cherry picked from commit 61a64c20c402e61027dad4a9e7a192ec0971d1d6)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-26 11:17:09 -07:00
12a6d2a0b8 Add triton as dependency to CUDA aarch64 build (#149945)
Add triton as dependency to CUDA aarch64 build (#149584)

Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705
Hence add the proper constraint to the CUDA 12.8 aarch64 build.

Please note we want to still use:
```platform_system == 'Linux' and platform_machine == 'x86_64'```
for all other builds.

These are prototype binaries used only by the CUDA 12.8 Linux aarch64 build, which we would like to serve from download.pytorch.org.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 9b1127437e6ccf0c55a87607d9f551cc6424ca67)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-26 07:28:27 -07:00
464432ec47 Automate stable CUDA update and linter using min Python version (#149981)
Automate stable CUDA update and linter using min Python version (#148912)

1. Fixes: https://github.com/pytorch/pytorch/issues/145571 . CUDA stable is the same CUDA version that is published to PyPI; it is also used to set the Metadata section in the rest of the wheel scripts and to tag the Docker releases with the latest tag.
2. Updates min python version used in linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet

(cherry picked from commit 29fd875bc125582f29abbdf5559d3941899680be)

Co-authored-by: atalman <atalman@fb.com>
2025-03-26 07:03:35 -07:00
1f612dafb5 Removing doc references to PRE_CXX11_ABI. (#149952)
Removing doc references to PRE_CXX11_ABI. (#149756)

Fixes #149550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756
Approved by: https://github.com/svekars, https://github.com/atalman

(cherry picked from commit 43ee67e8dc6827fbb7d12a5950ddf6a5c80771dc)

Co-authored-by: Alanna Burke <burkealanna@meta.com>
2025-03-25 14:04:12 -07:00
f63def6ac7 [XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149714)
[XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511
Approved by: https://github.com/EikanWang

(cherry picked from commit ee6a0291653cd507d59bda6bcf5d848099a804d1)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-25 16:04:51 -04:00
3a8e623a9b op should NOT be static in aoti_torch_call_dispatcher (#149644)
op should NOT be static in aoti_torch_call_dispatcher (#149208)

aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet

(cherry picked from commit 740ce0fa5f8c7e9e51422b614f8187ab93a60b8b)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-25 16:03:19 -04:00
bf727425a0 Symintify transpose_ (#149632)
Symintify transpose_ (#149057)

Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi

(cherry picked from commit 8d7c430e84f4ad439ebdc81f9ab496a3665033a4)

Co-authored-by: angelayi <yiangela7@gmail.com>
2025-03-25 15:59:42 -04:00
8c7dbc939f [cherry-pick] [CI] Don't clean workspace when fetching repo (#147994) (#149129)
Revert "[CI] Don't clean workspace when fetching repo (#147994)"

This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.
2025-03-25 15:25:19 -04:00
644fdbad95 [Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149631)
[Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473)

# Motivation
This PR fixes a bug that wrongly decided the zero-point mask setting. Specifically, the zero-point was always deemed non-zero because the scale was used for the judgement. Fortunately, the bug only affects performance; accuracy is not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473
Approved by: https://github.com/EikanWang, https://github.com/guangyey

(cherry picked from commit d67c1a027e61bd68908bc4c8e5275a983521366c)

Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
2025-03-25 10:07:25 -05:00
fb027c5692 [cherry-pick] Update ExecuTorch pin update (#149539) (#149630)
Update ExecuTorch pin update (#149539)

Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636

Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817

Test Plan:

Monitor  linux-jammy-py3-clang12-executorch test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539
Approved by: https://github.com/larryliu0820

(cherry picked from commit bc86b6c55a4f7e07548a92fe7c9b52ad2c88af35)
2025-03-25 10:03:16 -05:00
3b87bd8b82 Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#149878)
Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)

**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang promote it correctly, which resolves compatibility issues with ARMv8 while still working correctly for SVE as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet

(cherry picked from commit 09f7f62cfebb0067b93d227c13fe9a94b51af762)

Co-authored-by: maajidkhann <maajidkhan.n@fujitsu.com>
2025-03-24 13:57:16 -07:00
89b098a677 Add release branch push triggers to inductor-rocm-mi300.yml (#149871)
Add release branch push triggers to inductor-rocm-mi300.yml (#149672)

In similar vein as https://github.com/pytorch/pytorch/pull/149517

When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily

(cherry picked from commit 1eab841185cb2d68a11e2e0604fd96d110778960)

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-03-24 13:24:10 -07:00
4cc4302b32 Do not depend on numpy during the import (#149731)
Do not depend on numpy during the import (#149683)

But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere

(cherry picked from commit 68dfd44e50f59c53698a24985039a27351862963)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-21 11:13:52 -07:00
c632e4fdb8 [ONNX] Expose verification utilities (#149375)
* [ONNX] Expose verification utilities (#148603)

Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms

(cherry picked from commit ebabd0efdddd91e11364e42227b746c419a39be4)

* [ONNX] Update types in VerificationInfo (#149377)

torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-03-20 08:14:38 -07:00
b23bfae9f7 Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031) (#149386)
**Summary**
The previous implementation of the shim did not align with the design and was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the MKLDNN backend files and re-enables the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-20 08:08:09 -07:00
1b8f496f87 Pin auditwheel to 6.2.0 (#149525) 2025-03-19 16:13:43 -07:00
c236b602ff Add release branch push triggers to rocm-mi300.yml (#149526) 2025-03-19 16:12:59 -07:00
6926f30654 BC fix for AOTIModelPackageLoader() constructor defaults (#149214)
BC fix for AOTIModelPackageLoader() constructor defaults (#149082)

The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire

(cherry picked from commit 5e1b715dda813d8c545378291261b565649df8e5)

Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
2025-03-17 17:16:54 -05:00
483980d7f3 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149242)
[AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)

The missing model_container_runner_xpu.cpp causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel

(cherry picked from commit 9ad6265d044075d1ceb27cf0f2af7495e586003c)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-17 17:15:39 -05:00
7173a73cf4 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149183)
[regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)

PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when a tensor is first created and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.

With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures that the device is initialized before we pin a tensor.

Fixes #149032
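
A minimal sketch of the scenario described above (assumes a CUDA build; illustrative only):

```python
import torch

t = torch.randn(16)   # plain CPU tensor, created before any device work
p = t.pin_memory()    # pinning now triggers the device's lazy initialization
print(p.is_pinned())  # previously reported False when pinned this early
```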

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD

(cherry picked from commit 420a9be743f8dd5d6296a32a1351c1baced12f1f)

Co-authored-by: Bartlomiej Stemborowski <bstemborowskix@habana.ai>
2025-03-17 15:37:24 -04:00
7bab7354df [cherry-pick] Revert #148823 - Make dynamism code robust to NotImplementedException (#149160)
Revert "Make dynamism code robust to NotImplementedException (#148823)"

This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverting from RC since it was reverted from the main branch
2025-03-14 10:50:24 -05:00
b1940b5867 Remove runtime dependency on packaging (#149125)
Remove runtime dependency on packaging (#149092)

Looks like after https://github.com/pytorch/pytorch/pull/148924
we are seeing this error in the nightly tests:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence, remove the runtime dependency on packaging since it may not be installed by default.
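
A hypothetical sketch of the kind of dependency-free version comparison this implies (the helper name and logic are illustrative, not the actual diff):

```python
def _version_tuple(v: str) -> tuple:
    # "4.0.0+gitabc" -> (4, 0, 0); good enough for a simple >= comparison
    return tuple(int(p) for p in v.split("+")[0].split(".") if p.isdigit())

assert _version_tuple("3.2.0") >= _version_tuple("3.1.1")
assert _version_tuple("4.0.0+gitabc") > _version_tuple("3.31.6")
```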

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98

(cherry picked from commit 65d19a5699afbb0b123b6b264188f5610b925c5e)

Co-authored-by: atalman <atalman@fb.com>
2025-03-13 12:28:50 -04:00
abebbd5113 [Profiler][HPU] Fix incorrect availabilities for HPU (#149115)
[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)

Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD

(cherry picked from commit 75c8b7d9725af59d8348379a4165d1252c4ac208)

Co-authored-by: wdziurdz <witold.dziurdz@intel.com>
2025-03-13 08:09:45 -07:00
cdd7a2c72b [RELEASE ONLY CHANGES] Apply release only changes to release 2.7 (#149056)
* [RELEASE ONLY CHANGES] Apply release only changes to release 2.7

* fix_lint_workflow

* docker_release

* fix_check_binary
2025-03-12 15:44:15 -04:00
d94ea2647c [inductor] Fix profiler tests with latest Triton (#149059)
[inductor] Fix profiler tests with latest Triton (#149025)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang

(cherry picked from commit 488c4480f97e9d537905b0fbcb7236f88dca47f7)

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-03-12 15:21:51 -04:00
924a247fbb [MPS] Enable angle and atan2 for torch.long (#149017)
This check was added by https://github.com/pytorch/pytorch/pull/85817, which introduced no unit tests, and its content seems to be totally unrelated to the title/subject of that PR. Anyway, right now it seems to work fine on macOS 13+.
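
A quick sketch of the newly enabled ops (assumes an MPS device; illustrative only):

```python
import torch

x = torch.arange(-3, 4, device="mps")      # torch.long tensor
print(torch.angle(x))                       # previously rejected for int64 on MPS
print(torch.atan2(x, torch.ones_like(x)))   # integer inputs promote to float
```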

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017
Approved by: https://github.com/dcci
2025-03-12 04:48:52 +00:00
7b78a2c415 [MPSInductor] Fix argmin/argmax long reductions (#149021)
By adding an additional indexes array for aggregates and populating it when performing partial reductions.

And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004, #149020
2025-03-12 04:39:29 +00:00
758522d56a [MPSInductor][EZ] Fix argmin/max signatures (#149020)
threadgroup_argmin used to return the input type, which is wrong; it should have returned `int` or `long`.

Change the signatures of both threadgroup_argmin and threadgroup_argmax to return int; as the group size is small, there is no need to carry over large integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004
2025-03-12 04:39:29 +00:00
fe22db9cc3 [MPSInductor] Fix min/max reductions over large dims (#149004)
Simple followup after sum/prod

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975
2025-03-12 04:39:19 +00:00
2a7e997b3f test/dynamo/test_utils: Fix one broken test on different python versions (#148987)
We correctly handled different Python versions in the explicit ir_nodes test, but
didn't handle them in the dynamo_timed test. Just explicitly delete the fields
there so the dynamo_timed test passes on all Python versions.

(I noticed it breaking on 3.13).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
2025-03-12 02:11:08 +00:00
e40a9e602b Add the max_autotune tests in the periodic jobs. (#143560)
To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-12 01:47:46 +00:00
60576419a2 Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
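
A minimal sketch of the failure mode (the model below is illustrative, not taken from the PR):

```python
import torch

class M(torch.nn.Module):
    @property
    def maybe_unset(self):
        raise NotImplementedError  # prod models often gate optional attributes like this

    def forward(self, x):
        return x + 1

compiled = torch.compile(M())
print(compiled(torch.randn(4)))  # dynamism probing should tolerate the raising property
```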

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-12 01:01:57 +00:00
46f096bba6 Explicitly set use-ephemeral runners for windows nightly cpu test jobs (#149001)
This PR migrated Windows builds to ephemeral runners: https://github.com/pytorch/pytorch/pull/134463; however, it missed test jobs.

Explicitly set use-ephemeral runners for Windows nightly CPU tests.
Please note we should already be using ephemeral runners for these after: https://github.com/pytorch/test-infra/pull/6377 (recently migrated).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149001
Approved by: https://github.com/malfet
2025-03-11 23:51:39 +00:00
5b60749e9e [cudagraph] add log for skip reasons (#148797)
Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph

Test Plan: {F1975906635}

Differential Revision: D70820791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797
Approved by: https://github.com/masnesral
2025-03-11 23:31:48 +00:00
98a2d905bf [MPSInductor] Fix large prod and sum reductions (#148975)
After this change, if the reduction dimension is larger than `max_threadgroup_size`, a `for` loop is emitted from `codegen_iteration_ranges_entry` and wrapped up in `codegen_body()`.
I.e., after this change the following command
```
% TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86-
```
will emit the following shader
```metal
#include <c10/metal/random.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <c10/metal/reduction_utils.h>
kernel void generated_kernel(
    device float* out_ptr1,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[1024];
    tmp_acc_0[r0_index] = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        if (r0_0 >= 2047) break;
        auto tmp0 = in_ptr0[2*r0_0];
        auto tmp2 = in_ptr0[1 + 2*r0_0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp3 = 0.4;
        auto tmp4 = tmp2 + tmp3;
        auto tmp5 = metal::precise::cos(tmp4);
        auto tmp6 = tmp1 + tmp5;
        tmp_acc_0[r0_index] += tmp6;
    }
    auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024);
    auto tmp8 = 3.14;
    auto tmp9 = tmp7 - tmp8;
    out_ptr1[0] = static_cast<float>(tmp9);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #148969
2025-03-11 22:46:41 +00:00
2dcdb4ba78 [ez] include config as part of __all__ in torch.compiler (#148978)
Right now we are susceptible to a race condition where, if torch.compiler.config is not implicitly imported via dynamo/builder.py, we will throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.

Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
>>> torch.compiler.config.dynamic_sources =
"L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```

Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
2025-03-11 21:58:38 +00:00
a6459afb0e [dynamic shapes] add backed_size_oblivious option (#148696)
Adds the option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for backed size symbols, opting into size-oblivious semantics for them.

Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols, when we make a unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)

May look towards turning this on globally for export.
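
A short usage sketch (the flag name comes from the description above; the export call is illustrative):

```python
import torch
import torch.fx.experimental._config as fx_config

fx_config.backed_size_oblivious = True  # backed size symbols get [0, inf] ranges

batch = torch.export.Dim("batch")
ep = torch.export.export(
    torch.nn.Linear(4, 4),
    (torch.randn(8, 4),),
    dynamic_shapes=({0: batch},),
)
print(ep)
```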

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
2025-03-11 21:52:34 +00:00
53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides initial cutlass implementation of grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be multiple of 16, that's cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
b98af95401 Fix DCP link (#148974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974
Approved by: https://github.com/svekars
2025-03-11 21:26:37 +00:00
6119ffc711 [ROCm][TunableOp] Fix TunableOp BLAS logging for online tuning case. (#148979)
In a previous PR https://github.com/pytorch/pytorch/pull/147034, there was a bad merge at the last minute.
BLAS logging works for offline tuning, but does not currently work for online tuning.

This PR fixes BLAS logging for online tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148979
Approved by: https://github.com/jeffdaily
2025-03-11 21:20:04 +00:00
e5fef8a08e [CI] Don't clean workspace when fetching repo (#147994)
Tested on 874c5dc4c98cc63a06bfc900d03683b02f110d7c
Also tested on https://github.com/pytorch/pytorch/actions/runs/13798178199/job/38594767529?pr=148995#step:4:12

Don't remove the workspace when fetching. The checkout action performs `git clean -ffdx` to remove untracked files and files in .gitignore.

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-11 21:10:56 +00:00
72d9f88ef2 [release] Move triton pin to latest triton release/3.3.x (#148971)
This branch contains latest AMD cherry-picks:
https://github.com/triton-lang/triton/pull/6171
https://github.com/triton-lang/triton/pull/6165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148971
Approved by: https://github.com/danzimm
2025-03-11 21:10:42 +00:00
e6ef0620cc Add shim.h C API to call dispatcher on our own aten ops (#148832)
This PR still needs testing through some cpp extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #148124
2025-03-11 21:02:04 +00:00
cf19efd3d9 Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506)
Summary:
**Codegen**

- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in the proxy executor, so we do not need to declare torchbind args in cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.

**Serialization**
Add support for torchbind object in serialization

- For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, *args, **kwargs)`.
- it.TorchBindObject inputs are serialized to `as_custom_obj` Argument.

**Packaging**

Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.

The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.

The torchbind objects are stored in data/constants/ folder in pt2 archive.
The filenames of torchbind objects follow the format `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`, e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.

Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.

This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.

**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69490718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
2025-03-11 20:55:18 +00:00
f69e58e8e8 [CI] Update crossvit_9_240 as pass (#148989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
b54cf1a281 Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)"
This reverts commit 73c8068cf889829fb811fc75baac03163c9a42ee.

Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](73c8068cf8) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))
2025-03-11 20:50:23 +00:00
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7,
and it will likely remain the same in 2.8, so it's time to remove the underscore
from its name.
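
A minimal usage sketch, assuming an MPS build (the kernel body and its name are illustrative):

```python
import torch

lib = torch.mps.compile_shader("""
kernel void fill_iota(device float* out [[buffer(0)]],
                      uint idx [[thread_position_in_grid]]) {
    out[idx] = float(idx);
}
""")
x = torch.zeros(8, device="mps")
lib.fill_iota(x)  # compiled kernels are exposed as callables on the returned library
print(x)
```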

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
abcec55532 gracefully handle tokenize.TokenError in funcname parser. Adds support for non-Python source (#148737)
This change allows defining Python functions in non-Python source and having them compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for the case of non-Python source code too.

Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)

(defn f {:decorators [torch/compile]} [x]
  (* (F/relu x) x))

(f (-> (torch/randn 100)
       (.cuda)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
2025-03-11 19:49:28 +00:00
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
5b8da17681 [cutlass backend] Add addmm and bmm tests for AOTI (#148929)
Needs to do:
1. Expand addmm tests to cover all 4 shapes
2. Add dynamic shape support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148929
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-03-11 19:38:24 +00:00
7b2ecb80eb [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148928)
Differential Revision: D70908557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148928
Approved by: https://github.com/angelayi
2025-03-11 19:36:30 +00:00
61f9b50e09 [ROCm] Fix TORCH_CHECK for hdim 512 support added in AOTriton 0.9b (#148967)
Fixes #148850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148967
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-03-11 19:21:10 +00:00
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t)` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
4d10da731b [ROCm] CK Memory-Efficient Attention (attention bias support) (#147778)
Implements CK as the backend for memory efficient attention with a couple caveats:

- Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")`
- Does NOT support Nested Tensors

Using the mem_eff path allows us to use attention bias with a CK sdpa backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778
Approved by: https://github.com/houseroad
2025-03-11 19:02:59 +00:00
a1cb67b69e [ROCm] Improve backwards indexing when stride is not one (#147630)
Improve backwards indexing when stride is not one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147630
Approved by: https://github.com/jeffdaily
2025-03-11 19:02:48 +00:00
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
fb53e9e514 Add __context/cause/suppress_context/traceback__ to Exception (#146499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
57ee821a41 fix dynamo ide (#148849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849
Approved by: https://github.com/bobrenjc93
2025-03-11 18:43:30 +00:00
883fb78c7e Update jinja2 version in requirements-gha-cache.txt
As the previous version is vulnerable to CVE-2025-27516.
This closes the Dependabot report.
2025-03-11 11:42:38 -07:00
5ee9dbc0a1 Bump jinja2 from 3.1.5 to 3.1.6 in /.ci/docker (#148812)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-11 11:39:55 -07:00
a5f6b24d87 Remove outdated skipIfRocmVersionLessThan decorations (#148941)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941
Approved by: https://github.com/jeffdaily
2025-03-11 18:37:40 +00:00
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460). A minimal user-side sketch of this contract is shown after the list.
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
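A minimal, hedged sketch of the user-side `work.wait()` contract referenced in point 4. A single-process gloo group is used only so the snippet is self-contained; the change above concerns `ProcessGroupNCCL`:

```python
import torch
import torch.distributed as dist

# With async_op=True the caller keeps the tensor alive and calls work.wait()
# before reading the result (sketch only; gloo stands in for NCCL here).
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1
)
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()  # completion point; after this, t holds the reduced value
dist.destroy_process_group()
```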

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
b366f33606 [MPSInductor] Prep for multistage reductions (#148969)
----

- Move reduction variable initialization from `loads` to  `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations

Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
2025-03-11 18:35:23 +00:00
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
e0d4c43ad1 Add env for disabling meta reference on functionalization. (#148822)
Fix: https://github.com/pytorch/xla/issues/8755

This PR introduces `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE`
environment variable. Setting this variable makes it so the
functionalization kernels won't run the meta reference, which is used to
propagate expected sizes and strides.

Currently, PyTorch/XLA doesn't actually propagate the correct strides
to its tensors. It was also shown that calling these meta functions may
incur significant overhead.

Running the provided minimal reproducer (see issue), we see a speedup
close to 4.3x:

- Baseline: 0.0747s
- `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s
- `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s

In summary, this PR:

- Creates the `disable_meta_reference()` function, which checks whether
  the environment variable is set
- Modifies codegen for functionalization kernels, adding the call to
  `disable_meta_reference()` function to the appropriate conditions
- Creates a new bash function for running `lazy/test_ts_opinfo.py` with
  the environment variable set
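A hedged sketch of how one might exercise the new switch; the only name taken from this change is the environment variable itself, and it is assumed here that setting it before importing torch is sufficient:

```python
import os

# Assumption: the variable is read by the functionalization kernels at runtime,
# so exporting it before importing torch (as done for the repro in the issue)
# is enough to disable the meta reference.
os.environ["TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE"] = "1"

import torch  # noqa: E402
# ... run the PyTorch/XLA workload from the linked issue here ...
```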
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822
Approved by: https://github.com/bdhirsh
2025-03-11 16:13:35 +00:00
09029010e5 [inductor] Fix create_specialize_impl error in latest Triton (#148933)
```py
$ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1
WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir
    specialization = _get_specialization(ordered_args.values())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization
    specialize_impl = triton.runtime.jit.create_specialize_impl()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933
Approved by: https://github.com/yanboliang, https://github.com/davidberard98
2025-03-11 15:54:47 +00:00
16560d4e8f Revert "Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)"
This reverts commit 0fa0a740958ffc474843ceb1d19ee43c4bff4c09.

Reverted https://github.com/pytorch/pytorch/pull/148875 on behalf of https://github.com/ZainRizvi due to That torch.version failure you got in CI was a legitimate failure and is now breaking trunk. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13778023702/job/38534207536) [HUD commit link](0fa0a74095) ([comment](https://github.com/pytorch/pytorch/pull/148875#issuecomment-2714757288))
2025-03-11 15:27:25 +00:00
3945954741 Bump triton pin. Add aarch64 triton build (#148705)
1. Bumps pin for triton to release/3.3.x branch
2. Bump pin for triton-xpu
3. Remove ROCm xfail tests
4. Add aarch64 triton build:
	* Depends on: https://github.com/pytorch/pytorch/pull/148768
	* Fixes: https://github.com/pytorch/pytorch/issues/130558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148705
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/EikanWang
2025-03-11 15:12:21 +00:00
c983e1124c Revert "[WIP] Initial implementation of Grouped Gemm API (#148531)"
This reverts commit ff29791ed8f815bdbca1a5606de046380baca69d.

Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))
2025-03-11 14:40:58 +00:00
f1787ee0f7 [dynamo] Remove L scoping for recompilation messages (#148917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917
Approved by: https://github.com/williamwen42
2025-03-11 14:26:26 +00:00
992838e702 [dynamo][guards] Do not ID_MATCH on numpy tensors (#148923)
Might help with https://github.com/pytorch/pytorch/issues/148535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923
Approved by: https://github.com/jansel
2025-03-11 14:20:26 +00:00
ee21ccc816 Skip ao_sparsity TestComposability for missing FBGEMM (#144146)
Those tests (from test_ao_sparsity) require FBGEMM, which may not be available, so add the skip decorator.
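A minimal sketch of the guard pattern (not the exact decorator used in the test suite; the class and test names below are illustrative), checking the quantization engines compiled into the build:

```python
import unittest
import torch

# Skip when the FBGEMM quantization engine is not compiled into this build.
HAS_FBGEMM = "fbgemm" in torch.backends.quantized.supported_engines

@unittest.skipIf(not HAS_FBGEMM, "FBGEMM is not available in this build")
class TestComposabilitySketch(unittest.TestCase):
    def test_uses_fbgemm_engine(self):
        torch.backends.quantized.engine = "fbgemm"
        self.assertEqual(torch.backends.quantized.engine, "fbgemm")
```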

Fixes #87364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144146
Approved by: https://github.com/jerryzh168, https://github.com/jcaip
2025-03-11 13:02:18 +00:00
da4bb72a71 Backout D70075331 (#148824)
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed after with error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"

So we revert D70075331 as a workaround now.

Test Plan: The model could be lowered and published successfully. e.g. 702869739_16

Differential Revision: D70823254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
9ad64ce795 [triton 3.3] Forward-fix mm template selection logic (#148924)
Follow-up from https://github.com/pytorch/pytorch/pull/148662.

The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second, AMD-specific template only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3".
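The two equivalent forms of that predicate, written out explicitly (function names are illustrative, not the ones in the source):

```python
# Illustrative only: the predicate described above and its De Morgan negation.
def use_amd_specific_template(is_hip: bool, triton_version: tuple) -> bool:
    # second template: only on hip AND triton < 3.3
    return is_hip and triton_version < (3, 3)

def use_first_template(is_hip: bool, triton_version: tuple) -> bool:
    # first template: NOT hip OR triton >= 3.3
    return (not is_hip) or triton_version >= (3, 3)

# The two choices are exact complements for every combination.
assert all(
    use_first_template(h, v) != use_amd_specific_template(h, v)
    for h in (True, False)
    for v in ((3, 2), (3, 3))
)
```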

Tested locally to verify that it fixes the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison
2025-03-11 09:05:44 +00:00
2bcc3acb90 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve the inputs' dtypes.
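A hedged illustration of the dtype behavior being preserved (eager plus torch.compile on CPU; the exact codegen details are internal to Inductor):

```python
import torch

def div_mod(x, y):
    return x / y, x % y

a = torch.randn(8, dtype=torch.bfloat16)
b = torch.randn(8, dtype=torch.bfloat16).abs() + 1  # avoid division by zero

out = torch.compile(div_mod)(a, b)
# The low-precision input dtype is preserved even if the kernel upcasts internally.
assert all(t.dtype == torch.bfloat16 for t in out)
```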

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2025-03-11 08:02:30 +00:00
41e4728f74 update types on dynamo configs (#146873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873
Approved by: https://github.com/williamwen42
2025-03-11 05:33:48 +00:00
1fcc4bc109 Don't look at TESTING_ONLY in fuzzer (#146870)
Lots of configs aren't meant to be set because they're testing only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870
Approved by: https://github.com/masnesral
2025-03-11 05:32:25 +00:00
bed92a8523 [Windows][Inductor UT] Fix for tempfile.NamedTemporaryFile(delete=True) not working on Windows. (#148632)
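For context, on Windows a file created with `NamedTemporaryFile(delete=True)` cannot be reopened by name while it is still open. The usual portable workaround looks roughly like the sketch below (a general pattern, not necessarily the exact fix used in this PR):

```python
import os
import tempfile

# Create with an explicit path so the file can be reopened by name on Windows,
# then remove it manually.
fd, path = tempfile.mkstemp(suffix=".py")
try:
    with os.fdopen(fd, "w") as f:
        f.write("print('hello')\n")
    with open(path) as f:  # reopening by name also works on Windows
        print(f.read())
finally:
    os.remove(path)
```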
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632
Approved by: https://github.com/jansel
2025-03-11 05:05:15 +00:00
ecfbfe1603 [AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907)
Summary: shim.h is only meant for generic tensor util shim functions. We should switch to use the auto fallback generation, but it will need some extra care on the op schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907
Approved by: https://github.com/janeyx99
2025-03-11 04:41:05 +00:00
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
ff29791ed8 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm api as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 02:41:09 +00:00
621dadd4ca partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097)
Fixes https://github.com/pytorch/pytorch/issues/144095

open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:

(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.

I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?

Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] == y.shape[0] - 1
    ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
2025-03-11 02:11:57 +00:00
8c45d44abb Skip distributed subprocess test internally as they don't work (#148909)
Follow up from https://github.com/pytorch/pytorch/pull/146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148909
Approved by: https://github.com/janeyx99
2025-03-11 02:07:45 +00:00
457ff9b7ae [reland][ca] side-effect free inital trace: compiled_args (#148376)
This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc.
Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter.

Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
9fddbf3417 Update the comment (#148726)
Differential Revision: D70747931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726
Approved by: https://github.com/yf225
2025-03-11 01:19:14 +00:00
0fa0a74095 Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)
Fix `FIXME` in `test_torch.py` by moving test-cases to `test_indexing.py`

```python
# FIXME: move to test indexing
# FIXME: move to indexing test suite
```

- Move tests in `test/test_torch.py` to `test_indexing.py`
- Remove `FIXME` comments

## TestResult

```bash
pytest test/test_torch.py -k TestTorchDeviceType -vv
pytest test/test_indexing.py  -k TestIndexing -vv
```

![image](https://github.com/user-attachments/assets/49a80985-e74a-4da6-a063-476e87e6aa8a)

![image](https://github.com/user-attachments/assets/77afa936-5dba-480c-b293-eb1f7bc74420)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148875
Approved by: https://github.com/soulitzer
2025-03-11 01:01:59 +00:00
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
295f2ed4d1 Fix "invalid application of 'sizeof' to an incomplete type" (#148854)
Fixes with C++23 and constexpr std::unique_ptr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854
Approved by: https://github.com/Skylion007
2025-03-11 00:40:00 +00:00
a6e71dbc88 Enable ASAN on inductor CUDA tests (#148749)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148749
Approved by: https://github.com/jansel
2025-03-10 23:53:40 +00:00
b215841ebb [MM] Add sm carveout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using
the following to verify; I still need to figure out how to do proper guarding.

This does do the correct thing if we compile with the sm carveout already set, but since we don't guard on it just yet we don't recompile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
5bbca7d328 [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097)
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) to be provided, and clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored as they are interpreted as unrecognized compiler options.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
2025-03-10 22:47:15 +00:00
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
12a95390ae [Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784)
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copies over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior for existing classes does not change

Test Plan: CI

Differential Revision: D70799942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
2025-03-10 22:22:16 +00:00
fcb633fafa Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892)
Importable https://github.com/pytorch/pytorch/pull/148836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892
Approved by: https://github.com/albanD
2025-03-10 22:22:10 +00:00
98b3f1db9f [Flex Attention] support num_heads > 1 in block_mask (#148857)
Previously, flex decoding errored when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.

This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block masks with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as the flex attention kernel.

When using GQA (i.e., Hq != Hkv), we support block masks with num_heads = 1. When num_heads = Hq, we fall back to the flex attention kernel, so users don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.

Why fall back? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA would require supporting different kv num blocks for different query heads in the same thread block, leading to lots of redundant work. So it is better to use the main flex_attention kernel, where each query head is handled by a separate block.
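A minimal sketch of the now-supported case, a block mask whose head dimension equals the number of query heads (CUDA device assumed; in practice this would be run under torch.compile):

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, Hq, S, D = 1, 8, 1024, 64  # no GQA here: Hq == Hkv

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# num_heads in the block mask equals the number of query heads,
# the case this change enables.
block_mask = create_block_mask(causal, B, Hq, S, S, device="cuda")

q, k, v = (torch.randn(B, Hq, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
```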

Fixes #148527
Fixes #147267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
2025-03-10 22:02:50 +00:00
6ef15c7f46 [pytorch] Update flexattention bwd config generation (#148600)
Summary: Currently the `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, returned at the end.

Test Plan: CI. Existing tests.

Differential Revision: D70649316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
2025-03-10 22:00:56 +00:00
8701b302cc setuptools pinning (#148879)
Fixes #148877

---

On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version, and it is causing an issue in `pytorch` with the following error:

```
AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'?
```

Last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/)

Currently it is affecting the Windows ARM64 nightly build; however, it might soon also affect Windows x64 builds. (The conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools).)

Locally, both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2).

---

This PR pins the `setuptools` version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879
Approved by: https://github.com/seemethere
2025-03-10 21:29:32 +00:00
c652772af7 [aarch64] install ninja for docker to build triton on arm (#148768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148768
Approved by: https://github.com/atalman, https://github.com/Skylion007

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 21:28:53 +00:00
b706044cca [ROCm][Windows] Enable hipblaslt for Windows (#148563)
This PR adds the hipblaslt library as one of the Windows dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563
Approved by: https://github.com/jeffdaily
2025-03-10 21:07:16 +00:00
2a1eeaeed8 Remove 12.4 x86 builds and 12.6 sbsa builds from nightly (#148895)
https://github.com/pytorch/pytorch/issues/145570

redo https://github.com/pytorch/pytorch/pull/148625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148895
Approved by: https://github.com/atalman
2025-03-10 20:55:09 +00:00
4a2173d9a0 [cutlass backend][ez] Incorporate AOTI dynamic shape test into main test of MM (#148786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148786
Approved by: https://github.com/jingsh
2025-03-10 20:35:10 +00:00
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
ed969d1236 [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825)
Summary:
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825
Approved by: https://github.com/fduwjj, https://github.com/mori360
2025-03-10 20:04:36 +00:00
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
2e4874e48d Update RELEASE.md with latest changes to release process and release 2.7 information (#148888)
1. Update for Release 2.7 compatibility matrix
2. Remove mention of builder project, the scripts for release management were migrated to test-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148888
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-03-10 19:20:27 +00:00
6b0fd741d1 dynamo: Count number of opcodes processed (#147149)
This gives us a decent proxy for how big of a graph we functionally had to parse.

Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
2025-03-10 19:20:09 +00:00
3129faf8be Optimize shard_dim_alltoall to use alltoall_single (#148868)
As titled: previously shard_dim_alltoall used `all_to_all`, which could incur lots of copies if the tensor becomes non-contiguous during splits, and `all_to_all` itself also incurs copies.

This PR uses alltoall_single instead, so that we can minimize tensor copies.

tested on all the shard dim change tests and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
2025-03-10 18:38:12 +00:00
ed7e964f2b codecache.py: use str.format rather than % formatting (#148691)
Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691
Approved by: https://github.com/desertfire
2025-03-10 18:33:58 +00:00
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
00cabd4235 [Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733)
For timeout reasons, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
And without the UTs, we can't ensure Windows inductor quality.

The Intel team will do some local testing for Windows inductor, but we still need to add a switch to turn on the full Windows inductor UTs.

The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setting this environment variable, we can turn on all Windows inductor UTs; it will not affect PyTorch CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-03-10 18:25:29 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.
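A hedged sketch of the supported pattern, assuming the build exposes `torch.float8_e8m0fnu` (the dtype referenced by this change): no arithmetic on the fp8 values, only bitcasts/views, which can be treated as uint8 under the hood.

```python
import torch

def roundtrip(x):
    # bitcast to float8_e8m0fnu and back; no arithmetic is performed on the fp8 values
    return x.view(torch.float8_e8m0fnu).view(torch.uint8)

x = torch.randint(0, 256, (16,), dtype=torch.uint8)
assert torch.equal(torch.compile(roundtrip)(x), x)
```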

The one question I'm not sure of is whether or not we need to explicitly disable triton template fusion, since it would fuse in these dtypes as uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
203dd18c5c Bump Clang-tidy to 19.1.4 (#148648)
Because Clang-tidy 19 has more powerful clang-analyzer checks to detect subtle bugs. New checks such as misc-use-internal-linkage can help identify potential static variables or functions, thus reducing binary sizes.

Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648
Approved by: https://github.com/Skylion007
2025-03-10 17:32:30 +00:00
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
2ec9aceaeb Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)"
This reverts commit 3680e666d8ceaa43069555f821d1e8a5de01d5ab.

Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))
2025-03-10 16:29:40 +00:00
9dbc2527dc Disable some SVE autovec (#148489)
Summary: autovec miscompiles on patterns of the type:
```cpp
for (const auto i : c10::irange())
```
Same issue as described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 and addressed by https://github.com/pytorch/pytorch/pull/137795 for gcc, but not clang

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

Differential Revision: D70422723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148489
Approved by: https://github.com/malfet
2025-03-10 16:25:00 +00:00
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
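For reference, the "N function calls (M primitive calls) in S seconds" lines above are the shape of cProfile output; a hedged sketch of such a run over the quoted one-liner:

```python
import cProfile
import functools
import operator

import torch.fx as fx

# Profile the microbenchmark quoted above; cProfile prints the
# "N function calls (M primitive calls) in S seconds" summary.
cProfile.run(
    "fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))",
    sort="cumulative",
)
```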

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
b8b1b364c9 Fix invalid format string in libfmt calls (#148855)
Wrap shaderSource inside fmt::runtime because the format string is not a string literal and can't pass libfmt's compile time check in C++23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148855
Approved by: https://github.com/Skylion007
2025-03-10 14:47:52 +00:00
a81751d8b7 [CD] Annotate linux/arm64 cuda wheels with consistent nvidia dependencies (#145021)
This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv`

The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml`

`uv` expects consistency:

https://github.com/astral-sh/uv/issues/10693
>Frankly, it's really not ideal that they change their dependencies from wheel to wheel.
>They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷‍♀

https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792
>I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent.

To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions.

I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs.

(The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 14:39:39 +00:00
4fdd076907 [CD] Add triton xpu as dependency of torch xpu windows whl (#148755)
Depends on PR #147637 land

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148755
Approved by: https://github.com/atalman
2025-03-10 14:04:30 +00:00
31625b08b8 Add ccode for FloorDiv (#148727)
Summary: Add ccode for FloorDiv

Test Plan: CIs

Differential Revision: D70749021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727
Approved by: https://github.com/bobrenjc93
2025-03-10 14:00:18 +00:00
2068235c0a Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788)
After https://github.com/pytorch/pytorch/pull/148612
This model has become flaky.

Tracking this regression in an issue: https://github.com/pytorch/pytorch/issues/148699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-03-10 13:40:41 +00:00
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for the accelerator check to ensure it does not poison fork
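The Python-level analogue of the built/available distinction (this PR adds the C++ hook APIs; the existing calls below are shown only to illustrate the semantics):

```python
import torch

# "Built": was this binary compiled with CUDA support? Cheap, does not touch the driver.
print(torch.backends.cuda.is_built())

# "Available": is there a usable CUDA device at runtime? May initialize the driver,
# which is why the fork-safe check above matters.
print(torch.cuda.is_available())
```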

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
098494e9cb [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-10 13:14:05 +00:00
59f14d19ae Implement gradient for the residuals of torch.linalg.lstsq (#148526)
Fixes #147543.

I have written some tests in python using `gradcheck`. Please advise where I should put these tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
ea86b8d315 Fix redistribution cost for all-reduce (#148761)
This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce.

Thanks @lw for the helpful discussions on getting this PR out!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin
2025-03-10 12:13:11 +00:00
526524b489 Update slow tests (#148873)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148873
Approved by: https://github.com/pytorchbot
2025-03-10 11:46:30 +00:00
74da76f67c [ROCm][Windows] Fix ROCm/HIP version header (#148560)
On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves this problem by utilising `<hip/hip_version.h>` from the HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560
Approved by: https://github.com/jeffdaily
2025-03-10 11:28:13 +00:00
00199acdb8 [inductor][triton] Block ptr analysis fix assert on matched index expression (#148446)
If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index which can lead to an assertion failure when the matched index is compared with the original index. For example the below assertion fails, despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply removing the replacements when the expressions are tested to be equal - the latter option is implemented in this PR.

```
       torch._inductor.exc.InductorError: AssertionError:
E       Invalid match!
E       Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E       Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```
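A hedged sympy sketch of why the two indices above are equal once the precomputed-size replacement ps2 = 3*ps0 is applied (the ModularIndexing term, identical on both sides, is omitted):

```python
import sympy

ps0, ps2, yindex = sympy.symbols("ps0 ps2 yindex", integer=True, nonnegative=True)

index = 3 * ps0 * sympy.floor(yindex / 3)
matched = ps2 * sympy.floor(yindex / 3)

# After substituting the replacement ps2 -> 3*ps0, the expressions coincide.
assert sympy.simplify(matched.subs(ps2, 3 * ps0) - index) == 0
```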

This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
2025-03-10 05:26:55 +00:00
3680e666d8 Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.

Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
2025-03-10 03:23:48 +00:00
7ae0ce6360 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:36 +00:00
b47d81682d [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:24 +00:00
aac230a511 [MPS] Fix Wreorder-init-list (#148839)
Fixes the following warning:
```
 warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list]
    662 |       return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839
Approved by: https://github.com/Skylion007
2025-03-09 23:45:46 +00:00
b95889042c [MPS] Introduce strides unary op (#148468)
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra testing is needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
275a7c5dbb Revert "Add a stable TORCH_LIBRARY to C shim (#148124)"
This reverts commit 327e07ac1dc3351bb5f0ad436760b83590c400aa.

Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))
2025-03-09 20:44:56 +00:00
19a39a7a06 Revert "[dynamo] allow global import from collections import deque in user code (#148676)"
This reverts commit 685fb377131cc684633dc5471e77038988db53f6.

Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see f1444f006c/1(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))
2025-03-09 20:42:03 +00:00
f1444f006c [caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833)
Latest LLVM introduced two changes related to the `Triple` usage that causes build failures when building pytorch.

## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of the string: 979c275097

## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
2025-03-09 18:58:36 +00:00
9a1a2e1516 Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
a8e3d1984a [Inductor UT][XPU] Skip test case test_cat_max_autotune_triton for known issue. (#148734)
The mm triton templates/configs have not been tuned for XPU; we observe that the epilogue fusion cannot speed things up on XPU because of register spills. So XPU fails on the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip after #146568 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734
Approved by: https://github.com/jansel
2025-03-09 15:09:43 +00:00
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
327e07ac1d Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-03-09 10:07:25 +00:00
685fb37713 [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-09 09:35:29 +00:00
6566d67bd3 [dynamo] show stack above dynamo in graph break user tracebacks (#148401)
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).

Also collapses resume frames together (use config.verbose to see full stack trace - for developers).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-03-09 07:37:38 +00:00
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
85fe576ee3 [set_linter] allow x in {...} (#148422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148422
Approved by: https://github.com/Skylion007
2025-03-09 06:43:11 +00:00
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode would look different -- collective kernels would show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
5245304f1e Update decompositions_for_jvp.py (#148821)
Small typo that caught my eye.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821
Approved by: https://github.com/Skylion007
2025-03-08 19:08:42 +00:00
148eb735ee Change nvcc arch flags for sm100 (#148774)
### Summary
- Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012

### Test plan
- Verified building from source w/ B200s is successful
- Verified B200 tensorcores are still being utilized properly via benchmarking script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774
Approved by: https://github.com/Skylion007
2025-03-08 19:05:53 +00:00
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL which will handle abort correctly. Currently `torchft` would have to call internal methods `_abort` on the PGNCCL object directly but with this change we can now just call `.abort()` and have it work for any PG implementation.
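
For illustration, a minimal sketch of what that call pattern could look like once this lands; the single-rank `gloo` setup and the internal `_get_default_group()` helper are placeholders for whatever group handle the caller (e.g. torchft) already holds, and `shutdown()`/`abort()` availability is per this PR.

```
import torch
import torch.distributed as dist

# Single-process setup purely so the sketch runs; real usage is multi-rank.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)
pg = dist.distributed_c10d._get_default_group()  # stand-in for the group the caller holds

try:
    dist.all_reduce(torch.zeros(4))  # issue collectives as usual
finally:
    pg.shutdown()                    # graceful teardown; a default no-op per this PR
    # pg.abort()                     # hard teardown, e.g. after a detected failure
```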

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
9841f0ddcf Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566)
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.

It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.

For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`.
Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
2025-03-08 18:00:49 +00:00
439782960c Fix typos in SpectralOps.cpp (#148818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148818
Approved by: https://github.com/Skylion007
2025-03-08 17:34:59 +00:00
eqy
849cc058ee [CUDA][TF32] Account for tf32 in test_efficient_conv_bn_eval (#148802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148802
Approved by: https://github.com/Skylion007
2025-03-08 16:17:04 +00:00
c3b05c4a27 [triton 3.3] support both specialize_impl and create_specialize_impl (#148806)
After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize_impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806
Approved by: https://github.com/drisspg
2025-03-08 09:31:52 +00:00
118c9e501a [ONNX] Remove inaccurate test comment (#148813)
Remove the comment that says jit trace strategy doesn't support dynamic shapes as dict because it does support it (which is what the test is testing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148813
Approved by: https://github.com/cyyever, https://github.com/titaiwangms
2025-03-08 08:55:56 +00:00
3745da18f4 [AOTI] Switch to local cpp compile for fbcode (#148592)
Summary: as title, otherwise we cannot find lamdhip64

Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431

Differential Revision: D70637798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592
Approved by: https://github.com/hl475
2025-03-08 08:38:26 +00:00
666508eb17 [aot cache][ca] remove restriction on caching ca's aot inference graph (#148491)
but still can't cache CA's aot inference graph yet: the CA functional ops aren't serializable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148381
2025-03-08 06:08:26 +00:00
c16cd25cf5 [ca] remove compiled_autograd_tracing (#148381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381
Approved by: https://github.com/jansel
2025-03-08 06:08:26 +00:00
5f1c79ba2b [CD] Enable triton xpu windows build (#147637)
Depends on #147727, which introduces Triton XPU Windows support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147637
Approved by: https://github.com/atalman
2025-03-08 05:28:46 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
75179fd6e6 [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148781)
Differential Revision: D70575053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148781
Approved by: https://github.com/SherlockNoMad
2025-03-08 04:43:32 +00:00
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid down-casting, and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating.
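
A minimal eager-mode sketch (not the ATen kernel) of the casting pattern this moves to: keep all intermediate math, including the weight multiply, in fp32 and down-cast once at the end.

```
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    x32 = x.float()                                      # single upcast: fp16 -> fp32
    rms = torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + eps)
    out32 = x32 * rms * weight.float()                   # weight multiply stays in fp32
    return out32.to(x.dtype)                             # single downcast: fp32 -> fp16

x = torch.randn(8, 64, dtype=torch.float16)
w = torch.ones(64, dtype=torch.float16)
print(rms_norm_ref(x, w).dtype)  # torch.float16
```
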
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
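
A hedged usage sketch of that flow; the package path, input shape, and warmup recipe are placeholders, and the keyword spelling of the flag is taken from the description above rather than the released API.

```
import torch
from torch._inductor.package import load_package  # path per this PR's description

compiled = load_package("model.pt2", run_single_threaded=True)  # file name is a placeholder
x = torch.randn(8, 16, device="cuda")

# Warm up on a side stream, then capture (standard CUDA graph capture recipe).
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    for _ in range(3):
        compiled(x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = compiled(x)

g.replay()  # re-runs the captured AOTI model; `y` holds the latest outputs
```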

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
9f170d9d13 [Triton 3.3] Remove ROCm specific mm gemm template (#148662)
Fixes: https://github.com/pytorch/pytorch/issues/147121
Since triton 3.3.x fixes the problem

Needs to be handled in a non-BC-breaking way, so we will make this change conditional on the Triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-03-08 01:24:40 +00:00
a89e7c2da9 [Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785
Approved by: https://github.com/davidberard98
2025-03-08 01:09:28 +00:00
179b7a0abc Do not crash when compiling quantized LORA models (#148435)
Fixes #148072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel
2025-03-08 00:02:08 +00:00
24085db082 Don't clear feedback_saver_fns after cache clear (#148723)
Summary:
Since feedback_saver_fns are used for logging, it doesn't make sense to clear them; clearing them resulted in weird behavior in user code where disabling caches caused logging code to break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-03-07 23:43:59 +00:00
d96c85558a [ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627)
Use torch export to get dynamic shapes for JIT converted graph. I just realized we can retrace a converted jit graph with `torch.export` and produce dynamic shapes using `torch.export`.

- **Prior:** The exporter will produce a **static graph silently** even when dynamic_shapes are provided.
- **Proposed:** When `dynamic_shapes` is provided and when the strategy is able to handle it, it will succeed.

## Why are we still keeping the JIT strategy?

It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. Sometimes also useful when there are JIT scripted modules in the nn module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
2025-03-07 23:41:50 +00:00
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which will help with model quantization via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
187d5c0eb1 [logging] Log cudagraphify timings to dynamo_timed (#143220)
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz

`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
2025-03-07 23:07:13 +00:00
f2dfe2d99c [Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619)
Fixes issue https://github.com/pytorch/pytorch/issues/133228

Enabled split_scan support for ROCm builds.

Must be handled in a non-BC-breaking way, so this functionality is enabled conditionally on the Triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: David Berard <davidberard98@gmail.com>
2025-03-07 23:06:21 +00:00
0f852641c2 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit d35a4ddae2345e639001bfee58a0932e96597f2d.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))
2025-03-07 22:42:13 +00:00
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.
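
For reference, a hedged sketch of the kind of pattern that produces this fusion (shapes and boundary values are placeholders, not the original repro):

```
import torch

boundaries = torch.linspace(-1.0, 1.0, steps=16, device="cuda")

@torch.compile
def mm_then_bucketize(a, b):
    # torch.bucketize on the matmul output is what Inductor may fuse as an epilogue
    return torch.bucketize(a @ b, boundaries)

a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")
print(mm_then_bucketize(a, b).shape)  # torch.Size([64, 64])
```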

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
7b79e17275 [BE] Move cuda12.6 builds to gcc11 (#148740)
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

Which accidentally fixes undefined symbol reference errors, namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
Which happens because `libmagma.a` that was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135 ) contains symbols which are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-03-07 21:21:12 +00:00
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
bb94b65da7 Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 2fb654676f6291f6e27c6bab2761f170516598dd.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))
2025-03-07 20:58:28 +00:00
d8dc700e25 Delete duplicate entry from docker-builds.yml (#148782)
Regression introduced by merge conflict of https://github.com/pytorch/pytorch/pull/148612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148782
Approved by: https://github.com/atalman
2025-03-07 20:55:46 +00:00
99da439d10 Revert "Remove Cuda 12.4 from nightly Binaries (#148625)"
This reverts commit 1239176fe717839ca5612ac03a4806051225f381.

Reverted https://github.com/pytorch/pytorch/pull/148625 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/148625#issuecomment-2707415005))
2025-03-07 20:47:45 +00:00
6602e632cd Suppress build warnings when gcc-11 is used (#148763)
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
that will suppress the following (when building against ancient llvm-9)
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
 1581 |     return Insert(new LoadInst(Ty, Ptr), Name);
      |                   ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```

A reasonable follow-up would probably be to disable NNC testing altogether, as the project has been in maintenance mode for a while now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
2025-03-07 20:43:35 +00:00
d36391307f [ONNX] Handle error in verification interpreter (#148730)
Use a simple try/catch to handle ONNX Runtime errors in the verification interpreter when they happen. One example is that ORT will sometimes produce a list of None for some nodes. I am not sure how that happens yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
2025-03-07 20:24:49 +00:00
aebd2e411f [pytree][easy] lock global registry containers properly for thread-safety (#148750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750
Approved by: https://github.com/StrongerXi
2025-03-07 20:04:52 +00:00
6b44a91a62 use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557)
We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif

Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557
Approved by: https://github.com/eellison
2025-03-07 19:17:25 +00:00
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
1239176fe7 Remove Cuda 12.4 from nightly Binaries (#148625)
https://github.com/pytorch/pytorch/issues/145570

removes cuda 12.4 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148625
Approved by: https://github.com/atalman
2025-03-07 18:56:04 +00:00
61c4074df7 Add Windows Arm64 Nightly Builds (#139760)
This PR creates 3 new workflows for the Windows Arm64 target. The workflows and outputs can be reviewed at the following links:
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760
Approved by: https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-03-07 18:53:56 +00:00
cyy
e839e4f5bd Fix Wc++98-compat-extra-semi (#148757)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148757
Approved by: https://github.com/Skylion007
2025-03-07 18:49:12 +00:00
0a7ccee1e0 [ROCm][Windows] Disable Composable Kernels and Triton for Windows builds (#147334)
Currently, Composable Kernels and Triton aren't available on Windows. This PR ensures that the files relating to this dependency are not included during the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147334
Approved by: https://github.com/jeffdaily
2025-03-07 18:40:49 +00:00
eqy
18c6e00c7b [CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in ProcessGroupNCCL.cpp (#148594)
Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594
Approved by: https://github.com/kwen2501
2025-03-07 18:39:02 +00:00
9769618d35 [CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock

It seems many inductor YAML files were added after the initial change was prepared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
da923afdc7 [MPS][BE] Align bitshift behavior with CPU (#148719)
By casting the argument to output type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719
Approved by: https://github.com/Skylion007
ghstack dependencies: #148685, #148686
2025-03-07 18:28:14 +00:00
f84710aef4 [MPS] Fix scalar to tensors bitshifts (#148686)
By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`

Add regression tests

Please note that for some undefined values MPS and CPU behaviors are different, for example:
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127,  63,  31,  15,   7,   3,   1],
       device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127,  63,  31,  15,   7,   3,   1,   0,   0,   0,   0],
       dtype=torch.uint8)
```
This is because on CPU the scalar is cast to the output dtype before the operation is performed, but on MPS this happens after the op is done

Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
2025-03-07 18:28:14 +00:00
cyy
116c1e42c5 Re-enable tests (#148732)
No UBSAN failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148732
Approved by: https://github.com/Skylion007
2025-03-07 18:11:57 +00:00
8059ead823 [ROCm] Incorporate ROCm triton specific tuning parameters (#148437)
Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above.

A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437
Approved by: https://github.com/eellison, https://github.com/jansel
2025-03-07 18:09:47 +00:00
a3b77d434a Subprocess compile (attempt 2) (#148635)
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
2025-03-07 17:50:14 +00:00
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache`, removing pyd files raises a permission error
on Windows because they are still in use by the process.
So we clear the references to the loaded pyd library objects and unload them
from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
d05694807d [XPU][Inductor] Update Intel triton for release 2.7. (#147727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147727
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
ghstack dependencies: #148534, #148538
2025-03-07 17:19:59 +00:00
136b8165d1 [DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577)
Summary: Save Plan Caching: Fix the missing all_plans update in the cache.

Test Plan:
```
buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test
```

https://www.internalfb.com/intern/testinfra/testrun/17451448626323264

Reviewed By: MeetVadakkanchery

Differential Revision: D70229019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577
Approved by: https://github.com/MeetVadakkanchery
2025-03-07 17:00:59 +00:00
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
d54b2b7fa7 [BE] Delete split builds (#148739)
They have been disabled since Oct 2024; perhaps it is time to remove them from the workflows.

See https://github.com/pytorch/pytorch/issues/138750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148739
Approved by: https://github.com/atalman
2025-03-07 15:10:50 +00:00
372ad7b181 Enable FSDP2 on HPU device (#148667)
The motivation of this PR is to enable FSDP2 collectives for HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667
Approved by: https://github.com/wconstab
2025-03-07 14:33:43 +00:00
81847d08cf [Intel GPU][quant] Refine zero-point memory creation (#148640)
# Motivation
This PR skips zero-point GPU memory creation when zero-point=0, as it would not be used by the oneDNN library. This could help save the overhead of 1~3 H2D copies per QLinear/QConv kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148640
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-03-07 13:49:19 +00:00
f80aad62fa Improve Pareto frontier plot for AutoAC (#148678)
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points), plus I add the rank to the filename and store the plot in a user-defined directory.
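
A generic sketch (not the AutoAC code) of the recursive-bisection idea: find every jump of a piecewise-constant, monotone function on an interval down to a fixed resolution; the example function below is made up.

```
def find_thresholds(f, lo: float, hi: float, resolution: float = 1e-3) -> list[float]:
    if f(lo) == f(hi):            # no jump inside this interval (f is monotone, piecewise constant)
        return []
    if hi - lo <= resolution:     # narrow enough: report one threshold here
        return [hi]
    mid = (lo + hi) / 2.0
    return find_thresholds(f, lo, mid, resolution) + find_thresholds(f, mid, hi, resolution)

# Toy example: "memory" as a step function of the budget, with jumps near 0.25 and 0.7.
mem = lambda budget: int(budget >= 0.25) + int(budget >= 0.7)
print(find_thresholds(mem, 0.0, 1.0))  # approximately [0.25, 0.7]
```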

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
2025-03-07 13:22:29 +00:00
d4d7d813fa Update CURL url for manywheel images (#148343)
It looks like it was moved on the site it was downloaded from.
Switch to official site while updating URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148343
Approved by: https://github.com/dr4gon01, https://github.com/janeyx99, https://github.com/atalman, https://github.com/seemethere
2025-03-07 11:41:12 +00:00
6cf360be04 fix lost input mutations with export_tracepoint (#148709)
Preserving module call signatures in the presence of input mutation caused incorrect results. The root cause turned out to be that export tracepoints would unwrap/wrap functional args, which would lose mutation info on those args.

Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709
Approved by: https://github.com/angelayi
2025-03-07 09:36:18 +00:00
bb84a23c22 [ROCm] [TunableOp] Enable logging of BLAS parameters (#147034)
This PR supports a logging feature that is being requested.
```
PYTORCH_TUNABLEOP_BLAS_LOG=1
```
Enables the logging of BLAS parameters with either offline or online (in-situ) tuning.

The BLAS parameters are written to the CSV file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034
Approved by: https://github.com/jeffdaily
2025-03-07 09:32:59 +00:00
243b47e2ec [Intel GPU] Fix SDPA dummy LSE output to match meta function (#148652)
To fix XPU patched UTs including
```bash
pytest -vs third_party/torch-xpu-ops/test/xpu/test_meta_xpu.py::TestMetaXPU::test_dispatch_symbolic_meta_outplace_nn_functional_scaled_dot_product_attention_xpu_bfloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148652
Approved by: https://github.com/EikanWang
2025-03-07 08:36:18 +00:00
416ea1c71c Code Clean: Remove unnecessary code (#148735)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735
Approved by: https://github.com/jingsh, https://github.com/cyyever
2025-03-07 08:15:37 +00:00
4075646bd8 Use oneDNN v3.7.1 for Intel GPU (#148403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403
Approved by: https://github.com/EikanWang

Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-03-07 08:03:49 +00:00
cyy
3d854ea9bd Remove deprecated std::aligned_storage_t (#148660)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148660
Approved by: https://github.com/swolchok
2025-03-07 07:29:42 +00:00
3f069e7679 [mm_logs] enhance the printing for overview info (#148716)
Summary:
Previously the dynamo counters did not print the counts information automatically.

Explicitly added a log message, printed after lowering, with overview info for Inductor aten mms.

it will look like:

The name is in the format `{aten_op_name}_{m}_{n}_{k}`:
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```

 {F1975874802}

Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D70739912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
2025-03-07 05:23:49 +00:00
5f392ae560 Throws error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
2025-03-07 05:22:03 +00:00
c0f1557285 [FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715)
We got asked a few times about FSDP2's equivalent of no_sync. Highlight
set_requires_gradient_sync as the equivalent in the docstring.
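
A hedged sketch of that equivalence for gradient accumulation; the `fully_shard` import path and the single-rank setup are assumptions made so the snippet is self-contained (a real run is multi-rank under torchrun).

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 entry point; import path is an assumption

# Single-rank init purely so the sketch runs.
dist.init_process_group("nccl", init_method="tcp://127.0.0.1:29501", rank=0, world_size=1)
torch.cuda.set_device(0)

model = nn.Linear(16, 16).cuda()
fully_shard(model)
optim = torch.optim.Adam(model.parameters())

accum_steps = 4
for step in range(8):
    last_microbatch = (step + 1) % accum_steps == 0
    # FSDP2 counterpart of FSDP1's `no_sync()`: skip gradient reduction on
    # intermediate microbatches, re-enable it on the last one.
    model.set_requires_gradient_sync(last_microbatch)
    loss = model(torch.randn(4, 16, device="cuda")).sum()
    loss.backward()
    if last_microbatch:
        optim.step()
        optim.zero_grad()

dist.destroy_process_group()
```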

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715
Approved by: https://github.com/mori360
2025-03-07 04:34:46 +00:00
fe4b88f6aa [HPU] Add hpu to fused kernels supported devices (#148666)
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with HPU backend.

Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:

RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
2025-03-07 04:28:33 +00:00
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
cdb4fd0d29 Update win-vs2022-cuda12.1-py3 -> win-vs2022-cuda12.6-py3 (#148717)
Should have been migrated long ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148717
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-03-07 03:21:29 +00:00
389b496062 [XPU] Add test/kernel.errors.txt to .gitignore. (#148538)
The Intel GPU user mode driver may generate kernel.errors.txt files in
the current working directory in certain scenarios. These files include diagnostic
information but do not necessarily indicate an issue with the
application. This is a known issue and will be fixed in a newer version of the driver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #148534
2025-03-07 03:12:50 +00:00
9c9b05bc4f Expose functions used in custom backend in torch_python dll (#148213)
Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but will slow down dispatching a lot --- drop inline keyword and move implementation to .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
2025-03-07 02:34:37 +00:00
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch module buffers will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced.

In addition to skipping constants, we also need to skip buffers that are not found in the user-provided delta weights list.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
00cd6c07b9 [Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522)
# Motivation&Details
This PR fixes a bug that blocked quantized grouped convolution before. The bug was caused by the fact that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT.

# UT
` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`

# Runtime exemplification
```
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785
```
The verbose output shows that we successfully hit the quantized convolution, where the weight is in `abcde` format (grouped conv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
2025-03-07 01:57:45 +00:00
127bd5a02d Add sparsity (#148513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513
Approved by: https://github.com/danielvegamyhre
2025-03-07 01:47:52 +00:00
b4430c3a6d [Intel GPU][pt2e]: Collapse 3D input to 2D for matmul in qlinear_pointwise_binary fusion (#148423)
# Motivation
During the `qlinear_pointwise_binary` lowering pass, dim collapsing only occurs when the post-op is `add`. It is the responsibility of the C++ kernels to handle the dimensions for the post-op `sum`.

# Details
This PR explicitly reshapes the input from 3D to 2D in the op `qlinear_pointwise_binary`. Besides, we refactor the implementation of `qlinear_pointwise_binary.tensor` to call `qlinear_pointwise_binary` to remove duplicated code.
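
For illustration, a generic eager-mode sketch (plain PyTorch, not the oneDNN integration) of collapsing a 3D activation to 2D around the matmul and restoring the leading dims afterwards:

```
import torch

def linear_3d(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    batch, seq, in_features = x.shape
    x2d = x.reshape(batch * seq, in_features)    # collapse 3D -> 2D
    out2d = x2d @ weight.t() + bias               # 2D matmul
    return out2d.reshape(batch, seq, -1)          # restore the 3D shape

x = torch.randn(2, 5, 8)
w = torch.randn(4, 8)
b = torch.randn(4)
print(linear_3d(x, w, b).shape)  # torch.Size([2, 5, 4])
```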

# UT testing
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-03-07 01:47:33 +00:00
c8cd8f68bd [dynamo] Properly account for non-list instances in list comparison (#148470)
As title; this patch also removes an unused `list_compare` method.

Fixes #148179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470
Approved by: https://github.com/anijain2305
2025-03-07 01:29:30 +00:00
a7fe685be8 Add cpp wrapper skip to cudagraph logs (#148700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700
Approved by: https://github.com/jbschlosser
2025-03-07 01:02:40 +00:00
e3087f6d76 [ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706)
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.

I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
2025-03-07 00:40:54 +00:00
cyy
50eb4f3990 Enable UBSAN test (#147511)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147511
Approved by: https://github.com/colesbury
2025-03-07 00:35:32 +00:00
33a285379a [codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
2025-03-07 00:33:39 +00:00
a0bc6d81bb [CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 00:15:04 +00:00
e2a0296e80 [dtensor] add CuDNN SDPA op support to DTensor (#148537)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537
Approved by: https://github.com/drisspg, https://github.com/fegin
2025-03-06 23:44:40 +00:00
3960f97832 Documents torch.cuda.MemPool API (#148374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-03-06 23:18:43 +00:00
ed9c8a5d13 ROCm: Disable torch check for Multiplication of two Float8_e5m2 matrices (#148228)
ROCm supports Multiplication of two Float8_e5m2 matrices.
Hence disabling the torch check for ROCm.
Test command (on ROCm h/w supporting fp8)
python test/test_matmul_cuda.py TestFP8MatmulCudaCUDA.test_float8_basics_cuda -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148228
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-03-06 22:12:45 +00:00
e6800bda7f [Test][Linalg][CUDA] Increase niter in test_svd_lowrank_cuda_float64 (#145930)
A recent PR #143049 attempted to increase tolerances to make test passable. However, we are still seeing errors like:
```
Traceback (most recent call last):
  File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
    run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
  File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
    self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
  File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing `niter` parameter actually decreases numerical differences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
2025-03-06 22:10:53 +00:00
75d29443e7 [Docs] update bucketize documentaion (#148400)
Fixes #144504

Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400
Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD
2025-03-06 22:07:52 +00:00
2fb654676f [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:26 +00:00
d35a4ddae2 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:19 +00:00
5a5ac98918 [aarch64] add libcufile for cu126 and cu128 (#148465)
Seeing the following with the arm cu128 nightly:
```
  File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
```
Related to https://github.com/pytorch/pytorch/pull/148137; we need to copy the dependency for the arm build as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148465
Approved by: https://github.com/atalman, https://github.com/abhilash1910
2025-03-06 21:39:43 +00:00
3d62e81a1e [DCP] fix dcp gather_object/scatter_object_list (#147675)
gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675
Approved by: https://github.com/MeetVadakkanchery
2025-03-06 21:20:38 +00:00
1d7fc0c681 [dynamo] Remove dead code path around functools.partial objects (#148683)
This removes the code paths added in #98120, which have since been
superseded by #108846.

More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016)
easier to reason about, i.e., no need to worry about `dict` types, which
was only needed for #98120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683
Approved by: https://github.com/yanboliang
2025-03-06 21:20:04 +00:00
262411e48b [inductor] online softmax (#127011)
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
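
For intuition, a small eager-mode sketch of the online reduction (a running max `m` and running sum `s`, rescaled whenever the max changes); the real implementation is the generated Triton kernel linked above, not this Python loop.

```
import torch

def online_softmax(row: torch.Tensor) -> torch.Tensor:
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    for x in row:                                   # single pass computes max and sum together
        m_new = torch.maximum(m, x)
        s = s * torch.exp(m - m_new) + torch.exp(x - m_new)
        m = m_new
    return torch.exp(row - m) / s                   # final normalization

row = torch.randn(1024)
torch.testing.assert_close(online_softmax(row), torch.softmax(row, dim=0))
```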

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af406239881acfeb4f9e4f62373faca8b.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
1add61c242 Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069)
Fixes #147913

- replace `unimplemented` in `codegen.py`
- remove unused import `unimplemented`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-03-06 20:42:37 +00:00
edd640a95a [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190)
Often makes the code more readable, more efficient, and adds support for infinite iterables.
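
A small illustration of the pattern being applied (names are illustrative):

```python
import itertools

groups = [[1, 2], [3], [4, 5, 6]]

flat_nested = [x for g in groups for x in g]              # nested comprehension
flat_chain = list(itertools.chain.from_iterable(groups))  # lazy, and works for infinite iterables of iterables
assert flat_nested == flat_chain == [1, 2, 3, 4, 5, 6]
```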

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190
Approved by: https://github.com/jansel, https://github.com/malfet
2025-03-06 20:37:06 +00:00
65dbc3b454 [BE][MPS] Remove redundant handle_tensor_scalar_binary_op (#148685)
After https://github.com/pytorch/pytorch/pull/143934 `mtl_setBuffer` can handle scalar tensors correctly, so no need to have a specialized function here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148685
Approved by: https://github.com/dcci
2025-03-06 19:24:46 +00:00
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change the hardswish gradient computation condition to match the [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) documentation (see the sketch after this list)
- Enable CUDA for the `test_hardswish_grad_corner` test
- Add a test case for value=-3
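
A minimal sketch of the corner case being exercised; the expected boundary values follow the documented definition that this PR aligns the gradient with, so they are not asserted here:

```python
import torch
import torch.nn.functional as F

# hardswish(x) = x * relu6(x + 3) / 6, so the interesting gradient points are the
# boundaries x = -3 and x = 3, plus the interior where d/dx = (2x + 3) / 6.
x = torch.tensor([-3.0, 0.0, 3.0], requires_grad=True)
F.hardswish(x).sum().backward()
print(x.grad)  # the interior value at x = 0 is 0.5; boundary values follow the docs
```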

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, analogous to `namedtuple` for named tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple classes as namedtuples. Before this PR, only classes directly created by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of namedtuple classes as well.

Resolves #75982. New tests are included in this PR.

- #75982
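
A hedged sketch of the behavior described above (assuming the helpers are exposed from `torch.utils._pytree`; the exact import path may differ):

```python
import collections
import time

from torch.utils._pytree import is_namedtuple, is_structseq  # assumed location

Point = collections.namedtuple("Point", ["x", "y"])

class NamedPoint(Point):  # a subclass of a namedtuple class
    pass

print(is_namedtuple(NamedPoint(1, 2)))  # True after this change (subclasses now count)
print(is_structseq(time.localtime()))   # time.struct_time is a PyStructSequence instance
```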

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
b85ae06bed Update CPU tolerance for f16 triplet margin loss (#147742)
Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64, this PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 18:09:43 +00:00
d10bacd4ce [AOTI][dashboard] Skip torchbench models not supported by export (#148359)
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi, https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
d91a634edf [c10d] Make getDefaultBackend more fault tolerant (#148596)
This is a forward fix for #135338.
It hits error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
    if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```

When users call `init_process_group()` without specifying a backend, the default backend is not set, or is set to `undefined`; hence the error above, triggered by the `_has_hooks()` call.

The fix wraps `getDefaultBackend` with a try-catch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
2025-03-06 18:07:43 +00:00
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which allows models to be quantized via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
097b0d372a [pytree] fix previously failed dynamo tests (#148669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148669
Approved by: https://github.com/zou3519
2025-03-06 17:59:29 +00:00
28b68b46bc Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 4aeca28137dcee74b5fcd0c0636d0ee1f113d5fb.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))
2025-03-06 17:45:49 +00:00
3cde4c3069 [BE] Remove onlyCPU decorator from test_local_scalar_dense (#148559)
Follow-up to https://github.com/pytorch/pytorch/pull/145717; not sure why the author thought those tests should be limited to one architecture.
Also fixed similar crashes for CUDA and MPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148559
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/seemethere
2025-03-06 17:43:02 +00:00
841451af9f Revert "[Inductor] Avoid tensor slice overflow for large step (#147433)"
This reverts commit 1d7397a2d04a4d636559f41511a20f7dadbe5777.

Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))
2025-03-06 17:33:08 +00:00
679e7d257e [mm_logs] follow up to add count info based on shape for inductor aten.mms (#148623)
Summary:
as title.

When `TORCH_LOGS="+inductor"` is enabled, you get logs at the end such as

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), **(('aten_addmm', (16, 6, 16)), 1)**, ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []

Test Plan: follow up to add proper logging test.

Differential Revision: D70665104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
2025-03-06 16:20:04 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
5fb0f45d3b [triton 3.3] test_triton_kernel_constants fix (#148626)
Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505

TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead.

This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass).
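
For reference, a minimal sketch of the two styles (the constant name is illustrative; running it requires Triton):

```python
import triton.language as tl

# Old style, deprecated by triton-lang/triton#5961: annotate the global with tl.constexpr.
CONSTANT_OLD: tl.constexpr = tl.float16

# New style: wrap the value in tl.constexpr() instead.
CONSTANT_NEW = tl.constexpr(tl.float16)
```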

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626
Approved by: https://github.com/FindHao
2025-03-06 14:18:21 +00:00
d5184901c4 Make torch.serialization.skip_data work with torch.load (#148018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787, #147788
2025-03-06 12:04:46 +00:00
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
209977e6e5 Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787
Approved by: https://github.com/albanD
ghstack dependencies: #147786
2025-03-06 12:04:39 +00:00
bdcc1b579b Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786)
This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786
Approved by: https://github.com/albanD
2025-03-06 12:04:32 +00:00
79aa17489c [dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570
Approved by: https://github.com/williamwen42
ghstack dependencies: #148454
2025-03-06 07:46:31 +00:00
e7bc1d1791 [ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617)
Prior to this PR, if the exporting path runs run_decomposition(), the report still shows the exported_program from before decomposition, which makes it harder for users to check the exported program that is actually used to translate to the ONNX graph.

The following example is what we see before this PR:

```
# PyTorch ONNX Conversion Report

```
 Obtain model graph with `torch.export.export(..., strict=False)`
 Obtain model graph with `torch.export.export(..., strict=True)`
 Obtain model graph with `torch.jit.trace`
 Decompose operators for ONNX compatibility
 Translate the graph into ONNX
 Run `onnx.checker` on the ONNX model
 Execute the model with ONNX Runtime
 Validate model output accuracy
```

## Error messages

```pytb

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
    _handle_call_function_node_with_lowering(

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
    raise _errors.DispatchError(

torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
    onnx_program = _exported_program_to_onnx_program(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
    values = _translate_fx_graph(
             ^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
    raise _errors.ConversionError(

torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.

```

## Exported program

```python
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[3, 4]"):
             # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
            to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64);  x = None

             # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
            slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807);  to = None
            slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2);  slice_1 = None
            return (slice_2,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}

```

## Analysis

PyTorch ONNX Conversion Analysis

## Model Information

The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```

Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`

Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`

The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1

Of the call_function nodes, the counts of operators used are:

- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1

## ONNX Conversion Information

The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.

Errors grouped by operator:

- `aten.to.dtype`:     No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`:     No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`

## Decomposition comparison

Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`

Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
2025-03-06 07:03:45 +00:00
ae6bb58483 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit ad49cfc9f0a8a4d8881b3734edd8c33a087c8b97.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](ad49cfc9f0) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))
2025-03-06 06:59:39 +00:00
4dc956a1d8 [Inductor][Triton] Fix test_autotune_inplace_kernel to work with newer Triton version (#148595)
For the new Triton version 3.3, constexpr values are included as part of the signature. Update the failing test to reflect this change; additional context in https://github.com/pytorch/pytorch/pull/145051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148595
Approved by: https://github.com/davidberard98
2025-03-06 05:37:08 +00:00
1fac47702e [Break XPU][Inductor UT] Generalize device-bias code introduced by #146866. (#148534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534
Approved by: https://github.com/nandesuka
2025-03-06 04:39:50 +00:00
f057206fca [ONNX] Support complex comparison when verify=True (#148619)
Previously, the comparison of complex numbers was not supported when `verify=True`.

NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
2025-03-06 04:38:43 +00:00
8b65d522e1 refactor delayed compile to use code context (#148530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530
Approved by: https://github.com/williamwen42
ghstack dependencies: #148509
2025-03-06 04:02:30 +00:00
ad49cfc9f0 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check whether config name filtering still works.
Tested; it works.

2. Do we get C++ compile errors?
Yes; potentially we need to filter them out manually.

Here we get this:
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 03:42:55 +00:00
02e1580e39 [MPS] fix crash for mse loss with 0 numel inputs (#148608)
Fixes #148589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608
Approved by: https://github.com/malfet
2025-03-06 03:32:34 +00:00
8728d4b815 Clear triton kernels after parent make_launcher (#148604)
Before, we were clearing the cache only after the inductor compile. But inductor may not **always** compile, e.g. on an AOTAutogradCache hit.

So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856

Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604
Approved by: https://github.com/masnesral
2025-03-06 03:28:38 +00:00
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
43e1284c96 Fix empty matrix handling of addmv in inductor (#143792)
This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience. Details of this PR are below.

Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce:

```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])
origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)  # tensor([2.])
print(optimized_out)  # tensor([])
```

According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me.

Following the CPU implementation of this API: e97b97af56/aten/src/ATen/native/Blas.cpp (L62)

I added an additional branch to handle the empty-matrix case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 02:09:27 +00:00
38b3375a81 [MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565)
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.

Test Plan: CI

Reviewed By: mortzur

Differential Revision: D70072064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
2025-03-06 02:07:18 +00:00
32715a2311 [inductor][ck] add kBatch_sweep to config.rocm (#148223)
Summary:
# Why

enable testing, and let users specify a set of kBatches to try rather than relying on our hand-written heuristic

# What

add rocm.kBatch_sweep as a list of kBatches to try out. These will generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime

Test Plan: n/a

Reviewed By: chenyang78

Differential Revision: D70226055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
2025-03-06 01:14:33 +00:00
63fbc738dc [Easy/Profiler] Add last entry to truncated values (#148576)
Summary: Since the ranks of a PG are usually in a consecutive range, it is useful to print the last values when truncating metadata.

Test Plan:
Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D70637461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576
Approved by: https://github.com/davidberard98
2025-03-06 01:14:15 +00:00
23441492f6 [scan] Refactoring of input checking and dynamo invocation (#142125)
This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125
Approved by: https://github.com/ydwu4
2025-03-06 01:06:54 +00:00
6cc3e69103 [inductor] use eager stride for custom op if no tags (#148367)
Fix https://github.com/pytorch/pytorch/issues/148356

This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags.

A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-03-06 00:58:00 +00:00
703176e538 [ROCm] Fix sort for non-standard bool (#147459)
When converting from uint8 to bool using the `view` op, we get bools that store 0 for false and a non-zero value for true. However, such bools have undefined behavior; we only read the last bit as 0 or 1 to decide false or true.

In this fix, we convert the bools to uint8, which maps false to 0 and any non-zero value to 1. Essentially, this converts non-standard bools to standard bools and fixes the sort op for them.
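
A small repro sketch of how such a "non-standard" bool can be produced (the values are illustrative):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # the affected path is the ROCm sort
u8 = torch.tensor([0, 2, 1, 4, 0], dtype=torch.uint8, device=device)
b = u8.view(torch.bool)      # True values are backed by 2 and 4 rather than 1
print(torch.sort(b).values)  # previously could produce a wrong order for such inputs on ROCm
```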

Fixes #139972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-06 00:23:02 +00:00
690fc2c876 Add aot_eager_then_compile stance (#148509)
Sometimes the `eager_then_compile` stance isn't enough: some models are so close to the memory limit that going to eager will OOM, since we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile`, which avoids the expensive Inductor compile but still runs aot_eager to get the memory-reduction benefits on the first invocation.
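
A hedged usage sketch, assuming the new stance is selected the same way as the existing ones via `torch.compiler.set_stance`:

```python
import torch

@torch.compile
def f(x):
    return torch.nn.functional.relu(x) * 2

# First invocation runs through aot_eager (cheap, but still gets activation-checkpointing
# memory savings); subsequent invocations use the fully compiled artifact.
torch.compiler.set_stance("aot_eager_then_compile")
print(f(torch.randn(4)))
```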

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509
Approved by: https://github.com/williamwen42
2025-03-05 23:23:45 +00:00
d6d670ab4d [AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587)
In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits).

Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587
Approved by: https://github.com/desertfire
2025-03-05 22:47:46 +00:00
897fd9b514 Revert "Subprocess compile (#146134)"
This reverts commit 07f876e9602ec6881df2360ab4817e129b563b7c.

Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see e1dee4ccb3/3 ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))
2025-03-05 22:41:19 +00:00
e1dee4ccb3 [ONNX] Assert capture strategy in tests (#148348)
Previously the strategy used for obtaining the exported program was not asserted. This led to silent errors if torch.export broke something and a fallback strategy was used. This change adds a `_capture_strategy` field to ONNXProgram and enables unit tests to assert the strategy used, preventing silent fallbacks.

Fixes #147674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 22:31:54 +00:00
5ccd659c0e Fix decomp for linspace (#147997)
In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization.

Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997
Approved by: https://github.com/zou3519
2025-03-05 22:10:08 +00:00
9e755a1c03 [ROCm] add gfx12 to nightly wheels (#148562)
Adds gfx1200 and gfx1201 to PYTORCH_ROCM_ARCH for wheels and libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148562
Approved by: https://github.com/jeffdaily
2025-03-05 21:56:22 +00:00
2a639ce1d7 Add new hf storage class to torch.distributed package (#148361)
Summary:
title - Add the new HF storage class to the torch.distributed package so that it can be imported by customers.
The HF storage reader/writer were added as DCP storage components so that DCP load and save can directly interact with the Hugging Face format and storage.

Test Plan: ensure signals pass

Differential Revision: D70495399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
2025-03-05 21:52:06 +00:00
10354e146f Re-enable test_torchinductor:test_buffer_batch_norm (#148573)
Summary: Per https://github.com/pytorch/pytorch/issues/128198 seems like this is working now
Fixes #128198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148573
Approved by: https://github.com/StrongerXi
2025-03-05 21:51:24 +00:00
87bd3471ff [c10d] Move record param for init to the right place (#148571)
The place where we record the init params does not look correct; move it to the beginning of comm init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148571
Approved by: https://github.com/kwen2501
2025-03-05 21:43:30 +00:00
ad9a10aff0 [dynamo] Make nonstrict_trace work with some pytree.register_constant-ed instances (#148007)
As the title says, this enables a `nonstrict_trace`-ed function to take in objects
whose type has been `pytree.register_constant`-ed, as long as the object
existed outside the `torch.compile` region. This also forces Dynamo to
emit an `EQUALS_MATCH` guard on the object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007
Approved by: https://github.com/zou3519
ghstack dependencies: #148385
2025-03-05 21:28:26 +00:00
a10f577ee0 [dynamo] Account for function id reuse in relevant Dynamo decorators (#148385)
This fixes a recent series of flaky failures from the `nonstrict_trace` unit
tests: #148166, #148056, #148055, #148054, #148034, #148033, #148032, #148031.

For now we don't need to worry about the other decorators because they
are either meant for builtin/numpy functions (which should never
deallocate in practice), or used for polyfills which keeps the function
object in `get_torch_obj_rule_map()`.

Fixes #147777.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148385
Approved by: https://github.com/zou3519
2025-03-05 21:28:26 +00:00
4aeca28137 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out if we want lda or ldb, which points to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever, since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-05 21:26:22 +00:00
ed9624ee60 [export] Fix AttrProxy slicing (#148507)
Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1159599265750221/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148507
Approved by: https://github.com/zhxchen17
2025-03-05 21:03:15 +00:00
dd6ec8706e [BE] Relax sympy dependency to 1.13.3 or newer (#148575)
Fixes https://github.com/pytorch/pytorch/issues/145225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148575
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-03-05 20:51:16 +00:00
9efa9c73f6 [Dynamo] Replace unimplemented with unimplemented_v2 for variables/distributed (#148500)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148500
Approved by: https://github.com/williamwen42
2025-03-05 20:41:43 +00:00
98458e5c81 Add a docstring to build.sh (#144566)
Add a little blurb to explain what build.sh is doing.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-03-05 15:26:37 -05:00
c6a05df174 [ONNX] Use onnxscript apis for 2.7 (#148453)
Use onnxscript apis for 2.7.

Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 20:10:00 +00:00
c9edd37ffb Revert "[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)"
This reverts commit 9eef457c0241f87097a2ca7625f9961e31f3adcd.

Reverted https://github.com/pytorch/pytorch/pull/148377 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/13683650448/job/38261818684) [HUD commit link](9eef457c02) probably landrace ([comment](https://github.com/pytorch/pytorch/pull/148377#issuecomment-2701903810))
2025-03-05 19:45:16 +00:00
c5d92edd5a [dynamo] WeakRefVar reconstruct (#148083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148083
Approved by: https://github.com/anijain2305
2025-03-05 19:34:17 +00:00
50e827b3df [ONNX] Create VerificationInterpreter (#148396)
An fx interpreter for comparing ONNX values with pytorch ones.

```py
import torch
from torch.onnx._internal.exporter._verification import VerificationInterpreter

class Model(torch.nn.Module):
    def forward(self, query, key, value):
        res = torch.nn.functional.scaled_dot_product_attention(
            query, key, value
        )
        rest = res.transpose(0, 1)
        return rest.view(8, 32, 128 * 64)

model = Model()

query = torch.rand(32, 8, 128, 64, dtype=torch.float16)
key = torch.rand(32, 8, 128, 64, dtype=torch.float16)
value = torch.rand(32, 8, 128, 64, dtype=torch.float16)

onnx_program = torch.onnx.export(model, (query, key, value), dynamo=True)
interpreter = VerificationInterpreter(onnx_program)
interpreter.run(query, key, value)
for info in interpreter.verification_infos:
    print(info)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148396
Approved by: https://github.com/titaiwangms
2025-03-05 19:18:52 +00:00
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
9eef457c02 [dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with unit test. This should allow Context Parallel and Tensor Parallel to use cudnn SDPA.

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
2025-03-05 19:09:52 +00:00
9dd46a9233 Deprecate sm70 for cuda 12.8 binary (#147607)
follow up for https://github.com/pytorch/pytorch/pull/146265/files, dropping sm_70 as well, since "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release."

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147607
Approved by: https://github.com/atalman
2025-03-05 18:54:17 +00:00
3f4311d589 [CD] Upgrade xpu runtime pypi packages version and enable windows kineto again (#148319)
Fixes https://github.com/pytorch/pytorch/issues/145155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148319
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-03-05 18:39:55 +00:00
9db9593bba Add some more meta kernels (#147862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862
Approved by: https://github.com/zou3519
2025-03-05 18:33:00 +00:00
e555c4d8ae Fix bug in AOTI lowering (#148364)
Fixes: https://github.com/pytorch/pytorch/issues/148370

Differential Revision: [D70514480](https://our.internmc.facebook.com/intern/diff/D70514480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148364
Approved by: https://github.com/desertfire
2025-03-05 18:27:15 +00:00
38479e495e Add note to get start xpu (#148168)
Installing PyTorch from binaries automatically installs the runtime packages of Intel® Deep Learning Essentials. If oneAPI from a standalone installation of Intel® Deep Learning Essentials is then activated as well, an environment conflict arises. Therefore, add a note reminding users to avoid this situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148168
Approved by: https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-05 18:11:14 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
70c5edb697 [ROCm] fix CK compile for gfx1200 (#148496)
gfx1200 causes the CK-based GEMM to fail to compile because CK is choosing an incorrect FP8 interpretation.  CK assumes FP8 interpretation is static and chosen prior to compilation.  This PR is a work-around that makes the selection dynamic during hipclang compilation passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496
Approved by: https://github.com/jeffdaily
2025-03-05 16:11:03 +00:00
864b75dd50 [MPS] Fix unary_kernel_strided logic (#148512)
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[  0.0000,   1.4142,   2.0000,   2.4495],
        [ 80.0000,  82.0000,  84.0000,  86.0000],
        [ 96.0000,  98.0000, 100.0000, 102.0000],
        [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
        [4.0000, 4.2426, 4.4721, 4.6904],
        [5.6569, 5.8310, 6.0000, 6.1644],
        [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
Having the same strides for the input and output tensors is not enough to avoid copies; one also needs to make sure that they are dense in storage (a transposed tensor would be dense, but, say, selecting only the odd or even columns would not be).

Add regression test to prevent those from happening again

Also, no need to check that sizes match, luckily it is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)

Revived the needs_copy logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 is landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
2025-03-05 15:57:54 +00:00
8274da9312 [c10d][PGNCCL] Fix capturability of isend and irecv (#148462)
This PR fixes the inability to capture `isend`/`irecv` ops in `async` mode.

<details>
<summary>The repro code</summary>

```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()

    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
    print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>

Fails with an error stating that work handle is of a NoneType:
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/repro.py", line 54, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/repro.py", line 51, in main
[rank1]:     run(local_rank)
[rank1]:   File "/workspace/repro.py", line 38, in run
[rank1]:     y = test_func(x, rank)
[rank1]:         ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/repro.py", line 22, in test_func
[rank1]:     a.wait()
[rank1]:     ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
2025-03-05 15:49:53 +00:00
19a6cf35f6 add input shape check for _local_scalar_dense (#145717)
Fix https://github.com/pytorch/pytorch/issues/145066.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145717
Approved by: https://github.com/malfet
2025-03-05 15:24:08 +00:00
96afa8a2bb [TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318)
Due to the introduction of new CUDA versions, the branching has become more complicated. This PR simplifies the branching in `test_cusparselt_backend` in order to avoid checking each and every CUDA version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318
Approved by: https://github.com/jcaip
2025-03-05 10:17:00 +00:00
0ef2e938d0 [ROCm] [TunableOp] Track top solutions during tuning process (#147243)
For each set of GEMM parameters that are evaluated by Tunableop, keep track of the top 5 solutions. Print the top 5 solutions when `PYTORCH_TUNABLEOP_VERBOSE=2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147243
Approved by: https://github.com/jeffdaily
2025-03-05 09:35:02 +00:00
6c3492b491 [ROCm] Enable mi300-specific workflows to be triggered on PRs (#147904)
This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label.

* inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300.
* rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)) on many PRs.
* inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904
Approved by: https://github.com/huydhn
2025-03-05 06:00:37 +00:00
2295efa1b3 Fix only logging ir_post_fusion with torch_compile_debug enabled (#148499)
Because we were invoking the logs through `V.debug`, they were not emitted if TORCH_COMPILE_DEBUG was not set. This is because there is some magic in the debug [getattr](d789c22712/torch/_inductor/debug.py (L468-L480)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148499
Approved by: https://github.com/shunting314
2025-03-05 05:35:09 +00:00
fb1b7ec173 Remove deprecated method and attribute in LRScheduler (#147301)
Following the [#99270 suggestion](https://github.com/pytorch/pytorch/issues/99270#issuecomment-1511656408), remove the deprecated method `LRScheduler.print_lr`.
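
A migration sketch for code that still called the removed helper:

```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10)

# Instead of the removed sched.print_lr(...), query the learning rate directly:
print(sched.get_last_lr())  # e.g. [0.1]
```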

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147301
Approved by: https://github.com/janeyx99
2025-03-05 05:30:19 +00:00
df7e43e5d4 [AOTI] Fix aot_inductor_package test errors (#148279)
Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory.

Differential Revision: D70454149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279
Approved by: https://github.com/zoranzhao
2025-03-05 05:22:48 +00:00
b020d166f2 stage 1 of deprecating silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-03-05 05:15:59 +00:00
913356fb41 Fix recent regression in evaluate_expr that effect cache lookups (#147836)
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we initially thought was due to calling id() on the symnode.

I dug deeper and found that although adding that argument does not affect the results of evaluate_expr, it messes up the cache
lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup. I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
ed8ec0cb98 [cutlass backend][BE] Fix two small things in cutlass backend standalone debugger (#148493)
Differential Revision: [D70583777](https://our.internmc.facebook.com/intern/diff/D70583777/)

Two really small things:
* The bits in BlockFillRandomUniform would round floats to ints
* When bias exists, the order of args is C, A, B, D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148493
Approved by: https://github.com/chenyang78
2025-03-05 04:01:36 +00:00
e0ea593974 [CD] Upgrade Windows xpu support package to 2025.0.1 for binary compression (#148313)
The binary compression feature can reduce the size of the Torch XPU Windows wheel packages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148313
Approved by: https://github.com/atalman
2025-03-05 03:00:27 +00:00
1673bc7610 [mm_logs][ez] dump tuned mm info at lowering stage (#148363)
Summary:
As title. It would be beneficial for judging e2e perf improvements.

An easy first step: dump mm info at the lowering stage.

e.g.

```
fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1])
```

Next step:

Dump overview info at `post_grad_graph` stage such as
overall count of `aten.mm` in the graph & visualize to a table structure.

Test Plan: by looking very hard in aot inductor bmm and mm UTs.

Differential Revision: D70507880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363
Approved by: https://github.com/henrylhtsang
2025-03-05 02:21:27 +00:00
edc3ca577e [Profiler] Add profiler activity for HPU devices (#148182)
Fixes #148181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148182
Approved by: https://github.com/sraikund16
2025-03-05 01:37:48 +00:00
3985ce0b88 [dynamo] rename test_graph_break_messages -> test_error_messages (#148220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148220
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #148205
2025-03-05 01:16:53 +00:00
b28cbe5db3 [dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205
Approved by: https://github.com/zou3519
2025-03-05 01:16:53 +00:00
2927a64357 [inductor][cpu] Fix error with FlexibleLayout weights in BMM (#148188)
Fixes #148074

When node A is reshaped (is a `ReinterpretView`) and node B has a `FlexibleLayout`, then the layout of node B *may* be changed during the `kernel.select(options["W"], 0, self.b_index)` call, which could cause the assertion in `kernel.select` to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148188
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-03-05 01:05:05 +00:00
713a504a82 [dynamo][guards] Fix mem leak caused by refcount increment (#148480)
Should help [internalfb.com/sevmanager/view/491701](https://www.internalfb.com/sevmanager/view/491701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148480
Approved by: https://github.com/xmfan, https://github.com/StrongerXi, https://github.com/williamwen42, https://github.com/zou3519
2025-03-05 01:04:08 +00:00
b5873292c6 Add overload names to profiler trace (#143114)
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
             [call] op=[aten::add.Tensor], key=[AutogradCPU]
               [redispatch] op=[aten::add.Tensor], key=[Undefined]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                   [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::add.out], key=[CPU]
```

In this case, aten::add.out is a child of aten::add.Tensor, however the current profiler trace provides no way to differentiate aten op calls.

See the added unit test for a more detailed example.
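
A hedged sketch of how the recorded names can be inspected (exact event naming may vary by build):

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.add(torch.ones(2), torch.ones(2))

# With this change, aten events can carry their overload name (e.g. "aten::add.Tensor"),
# making it possible to tell apart redispatched overloads such as aten::add.out.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```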

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
2025-03-05 01:00:29 +00:00
cf5e3f3cea Add cutlass kernel for rowwise scaled mm on sm100 (#148421)
### Important
- Previous PR in stack https://github.com/pytorch/pytorch/pull/148274
- Despite the changes between sm90 vs sm100 being fairly minimal, I created a separate kernel since we'll be making various arch specific perf optimizations to the sm100 kernel next.
- This kernel has not been optimized yet. However, initial perf testing shows numbers which indicates the tensorcores are being utilized as expected (not just CUDA cores).

### Summary of changes
- This PR adds a new cutlass kernel for rowwise GEMM on sm100.
- sm100 kernel is based on sm90 kernel, with the following changes:
  - Use new arch tag `cutlass::arch::Sm100`
  - Do not use [large tile](4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors)
- SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/

### Next steps
- Arch specific performance optimization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421
Approved by: https://github.com/drisspg
2025-03-05 00:46:01 +00:00
a907b6abae [compiled_autograd] workaround windows compilation issue (#148454)
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
2025-03-05 00:18:20 +00:00
e02a2ca07a Fix dist.init_process_group on windows (#148266)
Fix https://github.com/pytorch/pytorch/issues/139990

We don't build libuv on windows so anything that creates `TCPStore` which includes `init_process_group()` will fail, which is a bad experience. We should just default to `USE_LIBUV=0` for windows. There were a decent amount of hits for this [error on google ](https://www.google.com/search?q=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&sca_esv=921f59ac5f8bd98a&sxsrf=AHTn8zpG3PxdKoomFHkclOc451rBhoc3jw%3A1740854890873&source=hp&ei=albDZ5GHM-uIptQP4NTikQw&iflsig=ACkRmUkAAAAAZ8Nkei9H-aB2IBCk3pUOK3yFl5xBLZUt&ved=0ahUKEwiR5P7qxemLAxVrhIkEHWCqOMIQ4dUDCBg&uact=5&oq=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&gs_lp=Egdnd3Mtd2l6IkN1c2VfbGlidXYgd2FzIHJlcXVlc3RlZCBidXQgUHlUb3JjaCB3YXMgYnVpbGQgd2l0aG91dCBsaWJ1diBzdXBwb3J0SABQAFgAcAB4AJABAJgBAKABAKoBALgBA8gBAPgBAvgBAZgCAKACAJgDAJIHAKAHAA&sclient=gws-wiz) and https://github.com/pytorch/pytorch/issues/139579, so I figured we should add a more helpful message as well.
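
For older builds that hit this, a hedged workaround sketch (the env var must be set before the store is created; this PR makes it the default on Windows):

```python
import os

os.environ["USE_LIBUV"] = "0"            # opt out of the libuv TCPStore backend
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

import torch.distributed as dist

dist.init_process_group("gloo", rank=0, world_size=1)
dist.destroy_process_group()
```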

We don't have CI for windows and our support is just best effort, so I just tested these changes on my windows machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148266
Approved by: https://github.com/d4l3k
2025-03-05 00:07:56 +00:00
84b58bd63e Enable FSDP tests on XPU device (#147518)
**Motivation:**

Enable FSDP tests on XPU device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518
Approved by: https://github.com/weifengpy
2025-03-04 23:49:37 +00:00
c98c3af421 Add a couple config options to compiler bisector (#148450)
These are commonly a source of bugs/divergence (through bad interactions, etc.).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148450
Approved by: https://github.com/shunting314
2025-03-04 23:23:21 +00:00
0c0a4baddd [MPS] unary kernels - avoid copying tensors if they have same stride (#148350)
I was a bit concerned when I saw in #148272 that the metal unary kernel had 0.02x the performance of what we had with MPS Graphs for sqrt on non-contiguous tensors. This change makes it so that copying is only done if the input and output tensors do not have the same strides. So if an out tensor is not provided, we don't copy at all (don't call contiguous) and dispatch the kernel as-is. After making this change, the script listed at the end of the above PR has the same execution time as the non-transposed one.

Times for reference (on a transposed NxN matrix):

| N     | time_old           | time_new           |
|-------|--------------------|--------------------|
| 100   | 0.0002241021       | 0.0001548659       |
| 1000  | 0.0005934822       | 0.0002150342       |
| 10000 | 0.3242016407       | 0.0045755033       |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
2025-03-04 23:20:26 +00:00
ade4af8c95 [MPS][BE] Fix c10:🤘:sinc implementation (#148471)
Restrict the scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make the complex implementation a template for `is_complex_v` types.
This makes the eager kernel implementation for both real and complex types a trivial call to the template.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471
Approved by: https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448, #148449
2025-03-04 23:14:03 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
d789c22712 Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469)
The github provided ubuntu-20.04 gha runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101) so upgrade workflows using them to the latest runner 24.04

They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115
```
[do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885)
This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101
```

Should we be using ubuntu-latest instead?

I attempted to upgrade actionlint to 1.7.7, but locally in test-infra it seems to add a lot of new checks, and on test-infra's CI I seem to have uploaded the wrong executable or something, so it failed. I'll try again later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-04 22:29:04 +00:00
5f47b7e268 [ROCm][TunableOp] Unit test for offline tuning of GEMM with bias (#148371)
One more unit test for the offline version of TunableOp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148371
Approved by: https://github.com/jeffdaily
2025-03-04 22:24:27 +00:00
842ffea445 [MPS][BE] Towards strided unary ops support (#148449)
Add generic functor kernels and rewrite all existing implementations as functors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448
2025-03-04 22:22:39 +00:00
70d0e1b96a Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 22:09:50 +00:00
c677f3251f [export] don't use unbacked_renamings in export (#147574)
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).

For de/serialization, we don't store unbacked bindings; we just rerun the pass.

Involved a refactor of compute_unbacked_bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
2025-03-04 21:43:49 +00:00
84961a0c17 ci: Add workflow dispatch for commit hash update (#148486)
Maybe this should also be split into its own workflow instead of
piggybacking off of nightly?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148486
Approved by: https://github.com/clee2000
ghstack dependencies: #148466, #148472
2025-03-04 21:26:23 +00:00
d290186ed3 ci: Add triton to update hash workflow (#148472)
Adds triton to our auto-update workflows so that PRs can be
automatically made and the triton team can follow up to fix any issues
that may arise.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148472
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #148466
2025-03-04 21:26:23 +00:00
9be8f74156 ci: Consolidate commit hash updates into a matrix (#148466)
Consolidates all of our commit hash update jobs into a single matrix to
make it easier to add more jobs later on.

Side note: How do I even test if this works?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148466
Approved by: https://github.com/Camyll, https://github.com/clee2000, https://github.com/atalman
2025-03-04 21:26:13 +00:00
d1abde11ec [dynamo] Support passing arguments to DeviceMesh.get_group (#147741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147741
Approved by: https://github.com/StrongerXi
2025-03-04 21:19:47 +00:00
f30776c37a [BE] Upgrade to mypy 1.14 (#145966)
Upgrade mypy version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966
Approved by: https://github.com/Skylion007
2025-03-04 20:58:26 +00:00
60205b0eb2 [export] Fix logging so that it doesn't result in max recursion error (#148231)
Test Plan:
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id=487493491 --test_suite ads_all --mode test_full_model

Produces https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp2wsjQH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D70416613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148231
Approved by: https://github.com/yiming0416
2025-03-04 20:47:25 +00:00
e4c558be1d [scan] Corrections for scan (#146110)
This PR resolves some minor issues with the scan HOP and unifies the handling of the additional_inputs in the same way as for associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146110
Approved by: https://github.com/ydwu4
2025-03-04 20:29:08 +00:00
439395c0ae [MPS] add slogdet and logdet implementations to mps (#148287)
Low-hanging fruit: all the ops these need are already implemented, so just adding them to native functions adds the functionality on MPS. Probably the next op I should add is lu_solve, seeing how many ops need it for the grad calculation.
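
A minimal usage sketch of the newly enabled ops (assuming an MPS-capable build):

```python
import torch

a = torch.randn(3, 3, device="mps")
sign, logabsdet = torch.linalg.slogdet(a)
print(sign.item(), logabsdet.item())

spd = a @ a.T + 3 * torch.eye(3, device="mps")  # SPD matrix, so logdet is finite
print(torch.logdet(spd).item())
```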

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
2025-03-04 19:49:23 +00:00
92beda54c8 Revert "[fx] Move map_aggregate to C++ (#148243)"
This reverts commit edaff88f69f069d517b72ea23fd5eb04702eb0b5.

Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
17d003fe75 Revert "[fx] Move Node._update_args_kwargs to C++ (#148260)"
This reverts commit 0135f57f4aaeaba8d720f551eab6dca6fcede8cd.

Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
97b9e68bc6 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)"
This reverts commit 29c2de9ae16f1673f3f44363243294d403e53d37.

Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
6fb18ff685 Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303)"
This reverts commit a3d69e6e1a530ae2b91cd549ea26aac51ffc7566.

Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
63778cb8a0 Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)"
This reverts commit e3e45d90d8578083da8b51a3b1d911e9a4523e5b.

Reverted https://github.com/pytorch/pytorch/pull/147019 on behalf of https://github.com/clee2000 due to broke inductor test inductor/test_max_autotune.py::TestMaxAutotune::test_cat_max_autotune_extern [GH job link](https://github.com/pytorch/pytorch/actions/runs/13653495421/job/38171259603) [HUD commit link](e3e45d90d8) on inductor workflow and rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/147019#issuecomment-2698677222))
2025-03-04 19:20:15 +00:00
9d196edb7d Revert "Bump onnxscript to 0.2.2 in CI (#148388)"
This reverts commit 7ab6749ec7db32e0b3cdfd19db087f15dd0bebe2.

Reverted https://github.com/pytorch/pytorch/pull/148388 on behalf of https://github.com/clee2000 due to broke libtorch debug build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/13646179239/job/38152039312) [HUD commit link](7ab6749ec7) ([comment](https://github.com/pytorch/pytorch/pull/148388#issuecomment-2698665495))
2025-03-04 19:16:34 +00:00
c219c5ca38 Fix code descriptions in the test package. (#148145)
The parameter and function descriptions had some errors; this makes them correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145
Approved by: https://github.com/janeyx99
2025-03-04 19:14:41 +00:00
e8900fbe4f [MPS] Add some useful utils (#148448)
Like `is_complex_v`, `is_scalar_integral_v`, `result_of`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148448
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399
2025-03-04 19:09:17 +00:00
f859722f70 [dtensor] refactor sharding prop to handle cross mesh computation (#147869)
as titled, this PR moves the same mesh check from the sharding propagation level to each individual operator level.

This is to allow more flexibility for each individual operator to check whether it can be run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run the `for_each` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes are the same in a "zipped" way.

This should also fix https://github.com/pytorch/pytorch/issues/134212

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
2025-03-04 18:30:44 +00:00
eea54a55f6 ci: Switch manywheel build.sh to just use dev (#148310)
To avoid annoying error message like:

> fatal: no tag exactly matches 'a6520c85bd85875b09f2c68e51622699d7d07595'

These were popping up when GITHUB_REF is not set so let's just assume
that if someone is building without directly setting GITHUB_REF then
they're probably doing a dev build.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148310
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-03-04 18:27:44 +00:00
611b0e9bc4 Revert "[fx] Optimizations for node name generation (#148288)"
This reverts commit 5eb0337cfd5e7c2cdf4a2d4829609e391467270f.

Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
ed9055c303 Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)"
This reverts commit 8531d247ba411993f9a10686d70514f6945f9960.

Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
67937be673 [BE] Move sinc kernels to the same OP family (#148399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148399
Approved by: https://github.com/dcci
ghstack dependencies: #148398
2025-03-04 15:49:20 +00:00
7fcbaff206 [BE] Remove stale arg for complex ops (#148398)
No need to pass DTYPE0 and DTYPE1 if only one DTYPE is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148398
Approved by: https://github.com/dcci
2025-03-04 14:35:43 +00:00
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR is to upgrade submodule oneDNN to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
f339e41a38 [inductor][triton] Fix average pool nd for int64 dtype (#146061)
The eager mode implementation of average pool nd returns an integer tensor if the input is also an integer tensor. This should also be preserved in inductor.
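
A sketch of the behavior being preserved (shapes are illustrative, not the opinfo test itself):

```python
import torch
import torch.nn.functional as F

def fn(x):
    return F.avg_pool2d(x, kernel_size=2)

x = torch.randint(0, 10, (1, 1, 8, 8), dtype=torch.int64)
eager = fn(x)
compiled = torch.compile(fn)(x)
assert eager.dtype == compiled.dtype == torch.int64  # inductor should match eager's dtype
```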

Fixes pytest -k test_comprehensive_nn_functional_avg_pool2d_cpu_int64 error: Triton compilation failed: triton_poi_fused_avg_pool2d_0

See WIP https://github.com/pytorch/pytorch/pull/145865#issuecomment-26200289890 to potentially enable such tests as they aren't enabled yet.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146061
Approved by: https://github.com/eellison
2025-03-04 13:53:50 +00:00
fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU util to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.
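
For illustration, a hedged sketch of the async save entry point this daemon serves (assumes an initialized process group; the process-based path is selected internally and not shown here):

```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(4, 4)  # stand-in for a distributed model
state_dict = {"model": model.state_dict()}
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt_step_100")
# ... training continues while the checkpoint is written in the background ...
future.result()  # block before issuing the next save
```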

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
16d07988fc add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so this adds a `supports_coalescing` property in `c10d::Backend` to determine whether a backend supports coalescing (see the sketch below).
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.
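
A sketch of the coalesced P2P pattern gated by the new property (assumes a process group with at least two ranks is already initialized):

```python
import torch
import torch.distributed as dist

rank = dist.get_rank()
peer = (rank + 1) % dist.get_world_size()
send_buf = torch.ones(4) * rank
recv_buf = torch.empty(4)
ops = [dist.P2POp(dist.isend, send_buf, peer),
       dist.P2POp(dist.irecv, recv_buf, peer)]
for req in dist.batch_isend_irecv(ops):  # coalesced only if the backend supports it
    req.wait()
```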

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-03-04 12:37:06 +00:00
e3e45d90d8 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)
Modified  TorchInductor’s autotuning flow so that each `best_config` JSON file also includes the Triton “base32” (or base64) cache key.

**Motivation**

Debugging & Analysis: With this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help with advanced performance tuning or kernel-level debugging.

Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`: they can get the cubin/hsaco from the Triton cache, and with the code provided in this PR they can easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019
Approved by: https://github.com/davidberard98
2025-03-04 12:16:38 +00:00
f1cce0951b Create unique test report files for distributed tests (#148325)
The distributed tests are executed once for each backend and for each init method.
`$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files.
The same needs to be done for the init method.

Move the setting of the variable into `test_distributed` and incorporate the init method into the name.

Useful for e.g. https://github.com/pytorch/pytorch/issues/126523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325
Approved by: https://github.com/clee2000
2025-03-04 10:45:33 +00:00
0b0d28accd Optimize param prepend class reference torch.nn.Module (#148304)
Fixes #147696

## Changes

Change `prepend` description  `torch.nn.modules.Module` to `torch.nn.Module`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/054f54b7-9487-4505-a926-3e17a84bd2f9)

### After

![image](https://github.com/user-attachments/assets/1d2a5708-62d1-428e-b136-bcaa35e5e6da)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148304
Approved by: https://github.com/Skylion007
2025-03-04 08:46:14 +00:00
da2688f624 Introduce delayed compile via eager_then_compile stance (#147983)
Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic).

Another benefit of this is that most users no longer need to annotate their inputs with mark_dynamic and mark_unbacked calls, since we can derive the dynamism from the very first call. Additionally, we get dynamic ints out of the box in this new regime.

This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance.
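
A minimal sketch of how the new stance is meant to be used (API name taken from this PR's description):

```python
import torch

torch.compiler.set_stance("eager_then_compile")

@torch.compile
def f(x):
    return (x + 1).sum()

f(torch.randn(4))  # first call runs eagerly; example inputs are recorded
f(torch.randn(8))  # second call compiles, deriving dynamism from the observed shapes
```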

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983
Approved by: https://github.com/williamwen42
2025-03-04 07:46:31 +00:00
e0f0db0105 updates to benchmarks (#144831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831
Approved by: https://github.com/danielvegamyhre
2025-03-04 06:21:12 +00:00
ac99fc7e57 Updates to build rowwise scaled mm kernel on SM10.0a (#148274)
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.

**NOTE**: performance optimization will be done in separate follow up PRs

## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA tookit 12.8
    - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
    - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below

**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.
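
For reference, a rough sketch of the row-wise scaled FP8 matmul this kernel backs (shapes and dtypes are illustrative; layout and scale requirements may differ by PyTorch version):

```python
import torch

M, K, N = 128, 256, 64
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major B
scale_a = torch.rand(M, 1, device="cuda", dtype=torch.float32)    # per-row scales for A
scale_b = torch.rand(1, N, device="cuda", dtype=torch.float32)    # per-column scales for B
out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([128, 64])
```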

## Test plan
- `python test/distributed/tensor/test_matrix_ops.py  -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
2025-03-04 05:23:41 +00:00
7ab6749ec7 Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 04:21:58 +00:00
d54cab78e1 [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70472663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393
Approved by: https://github.com/Skylion007
2025-03-04 04:20:04 +00:00
70410f93f2 doc/xpu: align description of SyclExtension with CPP/CUDA (#147988)
This commit just aligns the description of the `py_limited_api` feature in SyclExtension with CPP/CUDA. We missed this change when doing SyclExtension due to parallel work on the changes. For CPP/CUDA the change was done in 515e55e6927ad5f57ec222d7779712630341acf3.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-03-04 04:17:36 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since Torch requires CUDA>=11, some checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
cyy
b7832f0339 Enable ASAN in CUDA tests (#147812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147812
Approved by: https://github.com/janeyx99
2025-03-04 02:50:39 +00:00
8531d247ba [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303, #148288
2025-03-04 02:42:23 +00:00
5eb0337cfd [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303
2025-03-04 02:42:23 +00:00
a3d69e6e1a Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243, #148260, #148261
2025-03-04 02:42:23 +00:00
17518007b2 [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 12.759539298713207 |  2.7271360370796174  |         NA          |
|        triton         | 10.573655366897583 |  1.8661278090439737  | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 |  0.5315794269554317  | -14.698873781600327 |
|  cutlass_lvl_default  | 13.09632882475853  |  0.5520401500398293  | 2.6395116481931873  |
|   cutlass_lvl_1111    | 11.05172373354435  |  0.569593315012753   | -13.384617776451302 |
|   cutlass_lvl_2222    | 11.371277272701263 |  133.58984916994814  | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.472318813204765 |  1.5445372510002926  |         NA          |
|        triton         | 10.568295605480671 |  16.583424195996486  | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  |  5.830657540936954   | -27.764770809729562 |
|  cutlass_lvl_default  | 12.742593884468079 |  28.994930602959357  | -11.951954286402668 |
|   cutlass_lvl_1111    | 11.522261425852776 |  79.85037935699802   | -20.38413764531163  |
|   cutlass_lvl_2222    | 10.993581265211105 |  132.86601971101481  | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.700622126460075 |  2.225986961973831   |         NA          |
|        triton         | 29.17378954589367  |  38.571991189033724  |  -4.97329524553989  |
| triton_persistent_tma | 29.642896726727486 |   7.2848734309664    | -3.4452897904663744 |
|  cutlass_lvl_default  | 29.514770954847336 |  29.819900761009194  | -3.8626291243482167 |
|   cutlass_lvl_1111    | 29.411429539322853 |  23.82907024596352   |  -4.19923929172139  |
|   cutlass_lvl_2222    | 29.57325428724289  |  134.31008586101234  | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 30.858177691698074 |  1.181898436974734   |         NA         |
|        triton         | 28.630023822188377 |  39.24473957403097   | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 |  5.275042273919098   | -7.181929126210897 |
|  cutlass_lvl_default  | 29.16003204882145  |  29.934022572939284  | -5.503065216107967 |
|   cutlass_lvl_1111    | 28.79570797085762  |  23.948012012057006  | -6.683705504085324 |
|   cutlass_lvl_2222    | 29.02756631374359  |  136.25560767308343  | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1456.143856048584  |  1.020197194069624   |         NA         |
|        triton         | 1708.2737684249878 |  5.766509635956027   | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  |  7.455113030038774   | 1.3969195302177155 |
|  cutlass_lvl_default  | 1583.3594799041748 |  50.408804678940214  | 8.736473620182366  |
|   cutlass_lvl_1111    | 1636.4418268203735 |  82.82403108896688   | 12.381879030898025 |
|   cutlass_lvl_2222    | 1507.5665712356567 |  260.03901409788523  | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1382.230520248413  |  1.2586536260787398  |         NA         |
|        triton         | 1646.9683647155762 |  5.442052865982987   | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 |  6.515797697938979   | 3.016069871556595  |
|  cutlass_lvl_default  | 1500.9030103683472 |  51.36402789200656   |  8.58557877152115  |
|   cutlass_lvl_1111    | 1446.9740390777588 |  30.65435610699933   | 4.683988515729638  |
|   cutlass_lvl_2222    | 1419.661521911621  |  205.1948991640238   | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```

Differential Revision: D70147589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78
2025-03-04 01:45:36 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the OneDNN upstreaming plan, as stated in #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203). The support of SDPA is via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for the OneDNN graph, including those for kernel/compiled graph caching. In addition, a selection of test cases in `test/test_transformers.py` is copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.
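
A minimal sketch of the code path this enables (assumes a PyTorch build with XPU support):

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16) for _ in range(3))
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```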

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
b17f5223a4 Generate AOTI input check by default (#148005)
Summary:
Generate AOTI size and stride input check by default. But the checks are only run if `AOT_INDUCTOR_DEBUG_COMPILE` env variable is set (to avoid slowing down the performance).

Example output:

```cpp
            bool _check_aoti_runtime_check_inputs_env() {
                const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS");
                const static bool result = env_var_value != nullptr && env_var_value[0] != '\0';
                return result;
            }

            AOTI_NOINLINE static void __check_inputs_outputs(
                AtenTensorHandle* input_handles,
                AtenTensorHandle* output_handles) {
                if (!_check_aoti_runtime_check_inputs_env()){
                    return;
                }
//rest of the check
}

```

Test Plan: CI

Differential Revision: D70260490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005
Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh
2025-03-04 00:55:14 +00:00
0bd2caac55 Docker release - pin buildkit to v0.19.0 (#148372)
Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851

Error:
```
#10 73.62 Segmentation fault (core dumped)
#10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
#10 73.85 Segmentation fault (core dumped)
#10 73.85 dpkg: error processing package libc-bin (--configure):
#10 73.85  installed libc-bin package post-installation script subprocess returned error exit status 139
```
Looks like we are hitting: https://github.com/moby/buildkit/issues/5783

Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0

Please note: CUDA 12.8 error is not related to this failure in nightly cpu arm64. Looks like we are trying to install release torch when running on PR. Cuda 12.8 build is not released yet, hence a failure. Will send followup to make sure we are using nightly torch when running on PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372
Approved by: https://github.com/seemethere
2025-03-03 23:55:30 +00:00
d43c6f0033 [invoke_subgraph] Run joint passes on the hop graphs (#139325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139325
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #147559
2025-03-03 23:38:14 +00:00
216a108aaf [ROCm] Add rocm-mi300 and inductor-rocm-mi300 to upload-test-stats.yml (#148365)
We currently run MI300X machines on rocm-mi300 and inductor-rocm-mi300 but we don't have artifacts for the results:
e.g.
6e10471966 (rocm-mi300)
![image](https://github.com/user-attachments/assets/f5588072-b818-4f54-a348-0e6ac7e96829)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148365
Approved by: https://github.com/jeffdaily
2025-03-03 23:22:56 +00:00
586d8df651 Fix condition for CONVERT_NON_VECTORIZED_INIT invocation (#148362)
Yet another regression caused by https://github.com/pytorch/pytorch/pull/146596 that breaks builds if PyTorch is compiled for Android or using NVIDIA GraceHopper systems

Not sure why the author was trying to change the condition to begin with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148362
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #148354
2025-03-03 23:13:37 +00:00
5887a2d8de [BE] Use C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED (#148354)
Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"`
Also limit the scope to just `Vectorized::map`, which has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods.

Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16, as it results in an unbalanced-pop warning for the push that is defined in vec256_16bit_float, which will be included only once:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
  232 | #pragma GCC diagnostic pop
      |                           ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354
Approved by: https://github.com/izaitsevfb
2025-03-03 23:00:47 +00:00
d0b23e661d [cutlass backend] Add main tests for mm, addmm and bmm - step 1 (#148229)
This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}.

There are some parts that are less tested. For example:
* different layout combo
* shapes that are less aligned

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229
Approved by: https://github.com/chenyang78
2025-03-03 22:31:46 +00:00
a41413829c Use release notes label for module: distributed_checkpoint (#148352)
module: distributed_checkpoint is redundant with oncall: distributed checkpointing.

@fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352
Approved by: https://github.com/fegin
2025-03-03 21:33:28 +00:00
e45040b1d3 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator with its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We are binding the process group creator constructs at run time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

And add another entry to the dictionary with the same backend name ( but different device name )
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration, e.g. in APIs like ```is_hccl_available```.
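
A sketch of the registration flow referenced above; `_create_hccl_process_group` is a hypothetical creator function supplied by the device's PyTorch plugin:

```python
import torch.distributed as dist

def _create_hccl_process_group(store, rank, world_size, timeout):
    # hypothetical: the out-of-tree plugin would return its ProcessGroup here
    raise NotImplementedError

dist.Backend.register_backend("hccl", _create_hccl_process_group, devices=["hpu"])
# what an is_hccl_available()-style helper could check after registration:
print("hccl" in dist.Backend.backend_list)
```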

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
2025-03-03 21:32:21 +00:00
52078154f2 Add support for no-op concat with padded output (#146866)
Add support for no-op concat with padded output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866
Approved by: https://github.com/shunting314
2025-03-03 21:10:46 +00:00
07f876e960 Subprocess compile (#146134)
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test, based on the test_torchinductor tests, which runs them with subprocess compiling turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
2025-03-03 21:10:12 +00:00
8f361c808b [dynamo] run-only recursively on recompile limit exceeded (#148021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148021
Approved by: https://github.com/anijain2305
2025-03-03 21:01:08 +00:00
1bbe57336b Replace unimplemented with unimplemented_v2 for dynamo (#148158)
torch/_dynamo/variables/constant.py

https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148158
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-03-03 21:00:17 +00:00
b162b1600b [Inductor] Hot fix after #148011 (#148270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148270
Approved by: https://github.com/davidberard98
2025-03-03 20:18:21 +00:00
d260d4fc55 HSDP custom hook UTs are multi-threaded - can't set device rank (#148099)
HSDP custom hook UTs are multi-threaded and use a single physical GPU. If we set the rank in each thread, then we are referencing the same GPU with multiple ranks, which isn't right. Therefore, this removes the rank setting from these UTs. Now they pass with 1, 2, and 4 GPUs.

Fixes #147767 and #147769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099
Approved by: https://github.com/jeffdaily
2025-03-03 19:48:49 +00:00
302c660298 Consistently use load_torchbind_test_lib in tests (#148082)
The same code is repeated multiple times with slightly different implementations.
Use the existing function for brevity and consistency.

In the function, the code from `test_export` is used, which does a single `load_library` with cleaner conditions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082
Approved by: https://github.com/angelayi
2025-03-03 19:37:28 +00:00
40c2505f16 [logging] Log individual Triton kernel compilation times to dynamo_compile (#147022)
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (or N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times. That would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now.
* Format the list of compile times as a json string before logging.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
2025-03-03 19:32:17 +00:00
aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
a929e11e4f [dynamic shapes][export] ignore when real-tensor fallback fails (#147779)
Summary: uninspired solution to https://github.com/pytorch/pytorch/issues/147402

Test Plan: test_draft_export

Differential Revision: D70132269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147779
Approved by: https://github.com/bobrenjc93
2025-03-03 19:09:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
1c544a9ddd [Inductor-CPP] If all of the activation scale dims are 1, make it a 0D tensor (#147033)
For int8 dynamically quantized activation & int8 quantized weights, add a workaround for an indexing issue in the epilogue creator that expected an empty index (so, was expecting a 0D tensor) when the activation scale was sized [1, 1], by converting it into a 0D tensor.

The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).

The final hidden states tensor that is the activation to the LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, the sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, the Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).
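
A trivial sketch of the shape situation described above:

```python
import torch

scale = torch.tensor([[0.02]])      # activation scale sized [1, 1] during decoding
scale_0d = scale.reshape(())        # the 0D tensor the epilogue creator expects
print(scale.shape, scale_0d.shape)  # torch.Size([1, 1]) torch.Size([])
```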

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-03-03 18:32:27 +00:00
57addfcd58 Significantly speed up save_cache_artifacts (#148227)
While using save_cache_artifacts on internal workloads, we noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning the serialization algorithm.

Essentially, what we want is to be able to call serialize many times without incurring the full from-scratch cost each time.
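
For context, a hedged sketch of the API in question (names per `torch.compiler`; availability depends on the PyTorch version):

```python
import torch

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(8))  # populate the compile caches

result = torch.compiler.save_cache_artifacts()
if result is not None:
    artifact_bytes, cache_info = result
    # on another process/host, pre-warm the caches:
    torch.compiler.load_cache_artifacts(artifact_bytes)
```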

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
2025-03-03 17:28:41 +00:00
3ca1a2564d [BE][MPS] Use copysign for imaginary part of sqrt (#148286)
Also, it's tempting to try to replace `a*a + b*b` with `dot(input[index])`, but for some reason it results in a slightly different output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148286
Approved by: https://github.com/dcci
ghstack dependencies: #148285
2025-03-03 16:03:54 +00:00
84502baaff [MPS] Fix sqrt and other for torch.chalf (#148285)
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
Fail with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```

As sqrt is not implemented for CPU, add explicit test to `test_sqrt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
2025-03-03 16:03:54 +00:00
d57f617844 [Inductor][CPP] Avoid transpose with cpp micro-gemm for FlexAttention (#147069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147069
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #147068
2025-03-03 15:22:11 +00:00
6c089f5da3 ci: move xpu triton build to manylinux 2.28 (#148195)
Follow PR #148129 to remove manylinux builds for triton xpu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148195
Approved by: https://github.com/seemethere
2025-03-03 12:31:08 +00:00
165e33531c [Inductor][CPP] Fix the vec codegen for tanh (#148254)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case, after `tanh` the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
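
A hedged repro sketch of the mismatch described above (shapes are illustrative, not the exact test case):

```python
import torch

def fn(a, b):
    return torch.atan2(torch.tanh(a), b)

a, b = torch.randn(1, 1, 32, 64), torch.randn(1, 1, 32, 64)
eager = fn(a, b)
compiled = torch.compile(fn)(a, b)
print((eager - compiled).abs().max())  # should be ~0 with the Sleef-based tanh
```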

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-03-03 11:46:57 +00:00
118a165ac5 [Inductor][CPP] Add transposed B matrix support for CppMicroGemmFP32Vec (#147068)
* Add transposed B support for CppMicroGemmFP32Vec.
* Add support for cases where N is not divisible by `block_n`.
Expand CppMicroGemmFP32Vec to generate gemm kernel that supports transposed B and N of arbitrary size.
This is the basis for https://github.com/pytorch/pytorch/pull/147069 to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147068
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-03 11:08:23 +00:00
6a3a1f96ce Enable XPU for Inductor MM Triton Kernel Benchmark (#148237)
#147620 enabled `force_shape_pad` for triton kernel benchmark. Intel GPU supports this scenario. Hence, we need to enable the case in this PR. Otherwise, there would be a test case regression for Intel GPU as #147620 has been landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237
Approved by: https://github.com/jansel
2025-03-03 10:09:06 +00:00
b3bb73e11c Separate transpose from memory load/store and add load size support for convert_to_int32 (#147067)
Separate transpose from memory load/store and add load size support for convert_to_int32 to facilitate the expansion for CppMicroGemmFP32Vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147067
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 02:56:16 +00:00
ab81ca5053 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten._weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and is a copy of the ATen implementation of `torch.ops.aten._weight_int4pack_mm_for_cpu` with minor changes.

Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 00:56:29 +00:00
608377d341 Revert "[import][inductor] Simplify grid handling (#147583)"
This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e.

Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))
2025-03-03 00:49:32 +00:00
29c2de9ae1 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-02 22:42:31 +00:00
0135f57f4a [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
edaff88f69 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
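
For reference, a sketch of how the call counts above can be reproduced with `cProfile` (the exact harness used may differ):

```python
import cProfile
import functools
import operator

import torch.fx as fx

cProfile.run(
    "fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))",
    sort="cumulative",
)
```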

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
94afb165d9 Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478)"
This reverts commit dae3fbfe9720e83e7e81d41430fb5067221bbed7.

Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see dae3fbfe97 ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))
2025-03-02 21:22:04 +00:00
1106eb0212 [BE] Fix extra semicolon warning (#148284)
Introduced by https://github.com/pytorch/pytorch/pull/146596

I.e. while building locally my log was littered with
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16);
      |                                          ^
2 warnings generated.
[230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284
Approved by: https://github.com/Skylion007
2025-03-02 19:06:46 +00:00
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
95d81d21a6 [MPS] Speedup interpolation (#148277)
First of all, the perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true (due to a bug in the benchmark script that did not call `torch.mps.synchronize` at the end), but the results are still slightly better than MPS, probably due to the launch overhead.

And while measuring performance correctly, I noticed that a lot of time is spent on 64-bit integer division of thread_index to get spatial coordinates. Simply downcasting the divisor to a 32-bit integer (matching the 32-bit thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script
```python
import torch
import time
import subprocess
import itertools

def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)

    # define kwargs
    kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf}

    # Skip for unimplemented flavors
    if antialias and mode == "bicubic" and device == "mps":
       return None, "Skip"
    elif antialias and dtype != torch.float32:
       if device == "cpu":
           return None, "Skip"
       outputs_match = None
    else:
        # Check output
        y = torch.nn.functional.interpolate(x, **kwargs)
        z = torch.nn.functional.interpolate(x.cpu(), **kwargs)
        outputs_match = torch.allclose(y.cpu(), z)
        if not outputs_match:
           atol = (y.cpu() - z).abs().max()
           rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
           print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, **kwargs)
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
for mode,antialias in itertools.product(["bilinear", "bicubic"], [False, True]):
    outputs_match_list = []
    average_time_list = []
    for device in ["mps", "cpu"]:
      for dtype in [torch.float32, torch.float16, torch.bfloat16]:
          outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias)
          outputs_match_list.append(str(outputs_match))
          average_time_list.append(average_time)

    print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:")
    print("-"*40)
    print("Device            :                MPS        |               CPU")
    print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
    print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
    print(f"Average Time (us) :", "  |".join(average_time_list))
```

Before
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  292.0  | 264.7  | 267.9  | 289.1  | 230.9  | 309.1
atol=1.430511474609375e-06 rtol=0.11363636702299118

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  698.3  | 684.2  | 683.8  | 851.0  |Skip  |Skip
atol=2.086162567138672e-06 rtol=0.019750799983739853

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

After
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  119.9  |  98.9  |  98.6  | 289.8  | 231.9  | 308.5
atol=1.430511474609375e-06 rtol=0.05681818351149559

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  541.9  | 531.1  | 531.0  | 846.8  |Skip  |Skip
atol=2.0265579223632812e-06 rtol=0.008604463189840317

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

TODO:
 - Figure out if these ops make more sense as 3D jobs, with the n and c channels dispatched as one more dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277
Approved by: https://github.com/Skylion007
2025-03-02 17:13:52 +00:00
cyy
9aa897b992 Remove unnecessary tensor clone (#148159)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159
Approved by: https://github.com/Skylion007
2025-03-02 16:21:39 +00:00
1d7397a2d0 [Inductor] Avoid tensor slice overflow for large step (#147433)
Fixes #147071

Currently, if step is a value very close to INT64_MAX, the calculation of the slice output length will overflow. This PR fixes that problem and thus fixes #147071.
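A hedged repro sketch of the failure mode (the exact step value and tensor shape here are assumptions, not taken from the original issue):
```python
import torch

# Slicing with a step close to INT64_MAX: the slice output length used to be
# computed with an overflowing expression in Inductor.
def f(x):
    return x[:: (2**63 - 2)]

compiled = torch.compile(f)
print(compiled(torch.arange(8)))  # expected to match eager: tensor([0])
```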

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147433
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-02 16:07:15 +00:00
9c506aa8a6 [aotinductor] add option to disable runtime assertions (#146462)
A recent user experience is like this:
* User runs AOTI lowering, it's successful.
* They take AOTI model and run it with some sample inputs. Everything runs well
* Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests.
* They see that some of the requests fail. The logs show them this:
* AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ...
  * Error: u45 >= 2
* To the untrained eye, "AOTInductorModel run failed" is all they see. But, the true reason is Error: u45 >= 2

However, the assertion isn't always correct.
* In fact, u45 can actually be 0.
* So, why did AOTI say u45 ≥ 2? It's a two-piece combo:
* With 0/1 Specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value-range of [2, inf]
* In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45).
* Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion.

So, the motivation for this PR is to add an option to disable the runtime assertions if you run into a situation like this. Alternatively, another way to avoid it is to call `mark_unbacked()` on all the dynamic dims.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-03-02 09:14:58 +00:00
26358fa2d8 Add AppendingByteSerializer class (#148226)
This PR adds a new util class that enables efficient appending of sequential byte data with custom serialization and deserialization.
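Conceptually, the appending side behaves roughly like the sketch below (an illustration only; the real class name, methods, and signatures added in this PR may differ, and the deserialization half is omitted):
```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")

class AppendingByteSerializerSketch(Generic[T]):
    """Accumulates serialized items; joining into one buffer is deferred."""

    def __init__(self, serialize_fn: Callable[[T], bytes]) -> None:
        self._serialize_fn = serialize_fn
        self._chunks: List[bytes] = []

    def append(self, item: T) -> None:
        self._chunks.append(self._serialize_fn(item))

    def to_bytes(self) -> bytes:
        return b"".join(self._chunks)

# usage
s = AppendingByteSerializerSketch(lambda text: text.encode("utf-8"))
s.append("hello")
s.append("world")
assert s.to_bytes() == b"helloworld"
```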
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148226
Approved by: https://github.com/aorenste
2025-03-02 08:20:58 +00:00
b59776d857 [import][inductor] Simplify grid handling (#147583)
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Note the attached diff contains some minor fbcode-only changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-03-02 07:31:07 +00:00
dae3fbfe97 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator having its own device /dispatch key ```hpu``` .
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The Out-of-tree backends are registered calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list :
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We are binding the process group creator constructs at run-time so if there are other distributed backend with the same device name they can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

And add another entry to the dictionary with the same backend name ( but different device name )
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration  eg: APIs like ```is_hccl_available```
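For illustration, a minimal sketch of the out-of-tree registration call referenced above (the creator function is a placeholder, not the real Gaudi integration code, and whether re-registering the name succeeds depends on the c10d version):
```python
import torch.distributed as dist

def _hccl_pg_creator(store, rank, world_size, timeout):
    # Placeholder: the real out-of-tree package returns its ProcessGroup here.
    raise NotImplementedError("provided by the out-of-tree hccl package")

# Registering the backend for the "hpu" device adds "hccl" to
# dist.Backend.backend_list, which APIs like is_hccl_available() can consult.
dist.Backend.register_backend("hccl", _hccl_pg_creator, devices=["hpu"])
print("hccl" in dist.Backend.backend_list)
```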

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang, https://github.com/guangyey
2025-03-02 05:13:48 +00:00
6e10471966 [ci] disable cudagraph for tts_angular on dashboard (#148221)
tts_angular with cudagraph is flaky. Its speedup varies from .05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular would wrongly bump the cudagraph speedup. So this PR only disables cudagraph for tts_angular instead of skipping tts_angular.

[Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
de7af81f18 [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor from _before_ the reshape op, but referencing the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape it had before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-02 03:25:28 +00:00
ce2f680e00 [fr] Added protection against missing stack frames in fr (#148203)
Summary: We have had quite a few failures due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf

Test Plan:

Reviewed By: fduwjj

Differential Revision: D70358287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203
Approved by: https://github.com/fduwjj
2025-03-02 01:03:49 +00:00
19de523de6 [MPS] metal unary kernel for sqrt (#148272)
Issue #148219 highlighted the high dispatch times of ops which ran via MPS Graph on smaller tensors. This PR rewrites sqrt as a Metal kernel to mitigate that issue.

## Speedups:

Matrix size means NxN matrix here.
![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81)

Code to generate the timings (requires building torch both with and without this change):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3

def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
2025-03-02 00:45:45 +00:00
1a6883759d Fix macro for bit_cast in c10/util/bit_cast.h - one line change (#148265)
Fixes #148263.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148265
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-01 20:55:31 +00:00
1919e0de9a Revert "stage 1 of depreate silent fallback of tuning gemm (#147798)"
This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8.

Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))
2025-03-01 20:04:23 +00:00
82603fd7d2 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk, https://github.com/wdvr
2025-03-01 19:57:54 +00:00
5301710b15 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69945678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-01 19:46:13 +00:00
0ff2e6a85a Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102)
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
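A rough sketch of the idea (names and structure here are assumptions, not the actual AOTI code): compute the adjusted indices locally for each call instead of mutating the shared metadata.
```python
def localized_equal_to_1(call_args, equal_to_1):
    # Indices of arguments that survive after dropping None values.
    kept = [i for i, arg in enumerate(call_args) if arg is not None]
    remap = {old: new for new, old in enumerate(kept)}
    # Remap equal_to_1 indices for this call only; the original list is untouched.
    return [remap[i] for i in equal_to_1 if i in remap]

# usage: the third arg (value 1) keeps its equal_to_1 status at its new
# position after the None argument is dropped.
print(localized_equal_to_1(["buf0", None, 1], [2]))  # -> [1]
```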

Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```

Differential Revision: D69998314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-03-01 18:38:33 +00:00
2b86309da3 separate f16 vectorized class from bf16 (#146596)
Separating the f16 vectorized class into a different file from the bf16 vectorized class in order to be able to add a new bf16 SVE vectorized class in https://github.com/pytorch/pytorch/pull/143666. This is required as we would need to exclude the current bf16 class in order to use the sve bf16 class but still include the current f16 vectorized class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146596
Approved by: https://github.com/malfet
2025-03-01 18:22:32 +00:00
8e004865dd Revert "introduce dynamism library (#147981)"
This reverts commit 1c1bf410ecdeac8d240e15bf8c33c0f00fab0673.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/malfet due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2692351017))
2025-03-01 18:16:52 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
d23051f29b [Inductor] Support parallel reduction for GroupNorm (#144020)
Summary:
Support parallel reduction for GroupNorm by optimizing the parallelization heuristics: when the range of the first inner loop is much larger than the range of all outer loops, change the starting depth of parallelization to the first inner loop.
I tested the Inductor benchmark with this PR on CPU. One torchbench model (pytorch_CycleGAN_and_pix2pix) achieved ~45% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) achieved ~2% performance improvement.

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

x = torch.randn(2, 64, 168, 168).to(memory_format=torch.channels_last)
m = GN(2, 64).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    out = compiled_m(x)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(56448L));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        #pragma omp single
        {
            {
                #pragma GCC ivdep
                for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                    {
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                        {
                            {
                                if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                                {
                                    auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp1 = static_cast<float>(903168.0);
                                    auto tmp2 = tmp0 / tmp1;
                                    auto tmp3 = static_cast<float>(1e-05);
                                    auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                                    auto tmp5 = 1 / std::sqrt(tmp4);
                                    auto tmp7 = at::vec::Vectorized<float>(tmp5);
                                    auto tmp8 = tmp7 * tmp6;
                                    auto tmp10 = decltype(tmp9)(-tmp9);
                                    auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                    auto tmp12 = tmp11 * tmp8;
                                    auto tmp14 = tmp12 + tmp13;
                                    tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                    tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                }
                            }
                        }
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    Welford<float> tmp_acc0_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_arr[i] = Welford<float>();
                    }
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        masked_tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    #pragma omp parallel num_threads(56)
                    {
                        int tid = omp_get_thread_num();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(1008L));
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        Welford<float> tmp_acc0_local = Welford<float>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        #pragma omp for
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec_local = welford_combine(tmp_acc0_vec_local, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
                        tmp_acc0_arr[tid] = tmp_acc0_local;
                        masked_tmp_acc0_vec_arr[tid] = masked_tmp_acc0_vec_local;
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp_acc0_vec_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, masked_tmp_acc0_vec_arr[tid]);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp1 = static_cast<float>(903168.0);
                            auto tmp2 = tmp0 / tmp1;
                            auto tmp3 = static_cast<float>(1e-05);
                            auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                            auto tmp5 = 1 / std::sqrt(tmp4);
                            auto tmp7 = at::vec::Vectorized<float>(tmp5);
                            auto tmp8 = tmp7 * tmp6;
                            auto tmp10 = decltype(tmp9)(-tmp9);
                            auto tmp11 = at::vec::Vectorized<float>(tmp10);
                            auto tmp12 = tmp11 * tmp8;
                            auto tmp14 = tmp12 + tmp13;
                            tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                            tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                        }
                    }
                }
            }
        }
    }
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144020
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 17:11:50 +00:00
cyy
8bf3920279 Remove unneeded Clang-tidy suppression (#148246)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148246
Approved by: https://github.com/Skylion007
2025-03-01 16:51:54 +00:00
1c1bf410ec introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 14:57:06 +00:00
3a0c9f7f9d [MPS] Fix SDPA crash (#148239)
If the operation is invoked with a mask twice, it will crash, as the mask expansion logic was implemented inside the cache creation block, which is executed only once for all shapes.

Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239
Approved by: https://github.com/dcci
2025-03-01 13:06:51 +00:00
735d7b1af6 [EZ][BE] Increase tolerances for interpolate op (#148224)
Not sure why the tolerances were set like that; this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation.
But if I were to make a guess, it's likely due to the inaccuracy of the bilinear op, which has since been replaced by a shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148154, #148187, #148211
2025-03-01 13:03:59 +00:00
762724f3d0 [Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178)
This PR generalizes the device-bias code introduced by #147038 and aligns the behavior between XPU and CUDA for the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178
Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #148155
2025-03-01 10:59:55 +00:00
ab78bf5c66 [Break XPU][Inductor UT] Avoid custom op registration conflicts in test_auto_functionalize.py. (#148155)
Fix #148148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148155
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-01 10:59:55 +00:00
2f1b8e0fe2 [DTensor][Test] Add a test to demonstrate current dtensor view behavior if redistribution happens (#148015)
This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015
Approved by: https://github.com/XilunWu
2025-03-01 10:24:40 +00:00
191c9bd013 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit b8efebe57d05a87be5b0f304218d2af7bb2bf6c6.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/davidberard98 due to looks like another lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2692042859))
2025-03-01 07:43:58 +00:00
fe3b9e3764 [Inductor] optimize the heuristics of outer loop fusion (#147523)
Summary:
Optimize the heuristics of outer loop fusion: when the range of the first inner loop is much larger than the range of all outer loops, do not fuse the outer loops and fall back to standard codegen.
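A minimal sketch of the described heuristic (the ratio threshold is an assumption; the actual Inductor condition may differ):
```python
def fuse_outer_loops(outer_ranges, first_inner_range, ratio=32):
    outer_total = 1
    for r in outer_ranges:
        outer_total *= r
    # If the first inner loop dominates the outer loops, skip outer-loop
    # fusion and fall back to standard codegen.
    return first_inner_range <= ratio * outer_total

print(fuse_outer_loops([2, 2], 28224))  # -> False: fall back to standard codegen
```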

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147523
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 06:50:04 +00:00
b8efebe57d [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor from _before_ the reshape op, but referencing the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape it had before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-01 06:38:39 +00:00
fd16311e7f [inductor][subgraph] Plumbing to get ShapeAsConstantBuffer from subgraph to main graph output (#147559)
I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559
Approved by: https://github.com/eellison
2025-03-01 06:17:11 +00:00
c87097e74a [triton 3.3] Fix inductor/test_profiler.py test (#148230)
test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230
Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO
2025-03-01 04:27:49 +00:00
9377a32cd1 [Inductor][NFC] Remove unused functions from compile_tasks.py (#147564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147564
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-03-01 03:44:43 +00:00
baf1c8fcdc Revert "introduce dynamism library (#147981)"
This reverts commit 6eff6b28e4d09cbf632f79502a8e317bf5b53c34.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2691906065))
2025-03-01 03:43:01 +00:00
493cd97af5 add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134)
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
2025-03-01 02:59:48 +00:00
6eff6b28e4 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 02:49:16 +00:00
08434df1f2 [MPS] fix empty place holder error for smooth l1 loss (#148133)
Fixes #123171

And parametrizes the tests for it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-01 02:32:45 +00:00
02c5f21541 [Inductor] fix AOTInductorTestABICompatibleGpu.test_triton_kernel_weird_param_order with new Triton (#148011)
In this case, the parameters have already been filtered [here](201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`.

For example:
```python
(Pdb) triton_meta["signature"]
{'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'}
(Pdb) call_args
['arg0_1', 'arg0_1', '256L', 'buf0']
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011
Approved by: https://github.com/davidberard98
2025-03-01 01:21:20 +00:00
338ed67a1e [inductor] Implement max_pool2d_with_indices as a reduction for large window sizes (#147876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147876
Approved by: https://github.com/eellison
2025-03-01 01:07:01 +00:00
230a3b0f83 Add cuda 11.8 guard for cufile preload (#148184)
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8

Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
2025-03-01 01:01:04 +00:00
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
5d297f7a34 [MPS][BE] Combine two upsample_kernel_out_template into one (#148211)
- First, by no longer inverting sizes and strides, i.e. passing them as is but reading them in reverse order in the shader, as the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels, and the 3rd and 4th for spatial coordinates
- Pass `scales` as float2 even in the linear case
The above allows one to combine the two flavors of `upsample_kernel_out_template` into one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211
Approved by: https://github.com/dcci
ghstack dependencies: #148154, #148187
2025-03-01 00:39:26 +00:00
clr
83fb974b5d scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906)
It's not fully clear why these are not being created, but you can definitely reproduce this in code. `__name__` is fun, since there appears to be no way to explicitly set it on the pybind11 layer or C++ layer. I've set this in the Python wrapper code (which works correctly). But let me know if people feel strongly and want us to explicitly cast to Python within the cpp functions and set it there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906
Approved by: https://github.com/jansel
ghstack dependencies: #147894
2025-02-28 23:25:47 +00:00
1ae7cc41ca Define __all__ for torch.utils.tensorboard (#147550)
Fixes the issue:

```python
import torch.utils.tensorboard
torch.utils.tensorboard.FileWriter  # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.RecordWriter  # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.SummaryWriter  # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard"
```
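The fix amounts to declaring the re-exported names, roughly as sketched below (the exact list in the PR may include more entries):
```python
# torch/utils/tensorboard/__init__.py (sketch)
__all__ = ["FileWriter", "RecordWriter", "SummaryWriter"]
```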

See the [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html) for the documented public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550
Approved by: https://github.com/albanD
2025-02-28 23:06:11 +00:00
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
83ec7cdcd4 Fix recompile reason logging (#148200)
for the following test case

```
        @torch.compile(dynamic=False, backend=cnts)
        def fn(x, y, z):
            return x * y * z[0]

        fn(1, torch.randn(1), {0: torch.randn(1)})
        fn(2, torch.randn(2), {0: torch.randn(2)})
        fn(3, torch.randn(3), {0: torch.randn(3)})
        fn(4, torch.randn(4), {0: torch.randn(4)})
        fn(5, torch.randn(5), {0: torch.randn(5)})
```

previously we would log

```
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
```

but after this change we now log

```
0/0: L['x'] == 1
0/1: L['x'] == 2
0/2: L['x'] == 3
0/3: L['x'] == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148200
Approved by: https://github.com/xmfan
2025-02-28 22:33:37 +00:00
40b3e4a358 [dynamo] expose code execution strategy to python (#148020)
@anijain2305 this can be used to mark a code object to be skipped/run-only (recursively) while tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148020
Approved by: https://github.com/jansel
2025-02-28 21:59:12 +00:00
e74fdbe6d0 [inductor] ignore block ptr advancements for removed buffers (#148087)
Follow up to https://github.com/pytorch/pytorch/pull/147193. Some buffers are removed only when the kernel context is exited so defer the lines instead.

Added `use_block_ptr` as a parameter to test case that fails if run with block ptrs enabled.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148087
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-28 21:31:15 +00:00
d174562487 [MPS][BE][EZ] Aggregate macros (#148187)
Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` use common `INSTANTIATE_UPSAMPLE2D`
Then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`

I.e. functionally it's a no-op, but achieves the same with fewer lines of code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187
Approved by: https://github.com/Skylion007
ghstack dependencies: #148154
2025-02-28 21:30:00 +00:00
4995e058bf [user-triton] handle inline_asm_case (#148043)
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True, since it indicates the operation doesn't mutate the input value.

Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel

```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok

----------------------------------------------------------------------
Ran 2 tests in 0.706s

OK
```

Differential Revision: D69878591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
2025-02-28 20:52:51 +00:00
6f91720e1c [inductor][ck] manual kBatch heuristic (#148118)
Summary:
# Why

Leverage the kBatch parameter for large split-K shapes in CK to get better-than-ATEN performance

# What

replace default kBatch = 1 with a manual heuristic

- if K > 16 * max(M, N)
- derive kBatch from k_per_block, K, and the number of SMs on the chip
- clamp to an upper bound of 128 and a lower bound of 1

This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN

This is of course subject to change and improvement
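A hedged sketch of the heuristic described above (the exact formula and the variable names `k_per_block` and `num_sms` are assumptions, not the actual CK template code):
```python
def pick_kbatch(M, N, K, k_per_block, num_sms):
    if K <= 16 * max(M, N):
        return 1  # small-K case: keep the default
    # Spread the K dimension over the available SMs, in units of k_per_block.
    kbatch = K // (k_per_block * num_sms)
    return max(1, min(128, kbatch))

# usage with the test-plan shape M, N, K = 2048, 2048, 524288
print(pick_kbatch(2048, 2048, 524288, k_per_block=64, num_sms=304))
```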

Test Plan:
with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288`

```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

```
AUTOTUNE mm(2048x524288, 524288x2048)
  rocm_ck_gemm_template_49 10.4972 ms 100.0%
  rocm_ck_gemm_template_8 10.6132 ms 98.9%
  rocm_ck_gemm_template_9 10.6907 ms 98.2%
[...]
  mm 18.9880 ms 55.3%
```

Reviewed By: ColinPeppler

Differential Revision: D70224591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118
Approved by: https://github.com/ColinPeppler
2025-02-28 20:36:16 +00:00
48c55a66ec [ROCm] Move ROCm unstable MI300 jobs back to stable (#146675)
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves rocm unstable MI300 back to stable. The change to unstable was introduced through this [PR](https://github.com/pytorch/pytorch/pull/145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue which has been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-28 20:34:27 +00:00
6778084531 [inductor][cutlass] Environment variables for allow/denylist (#148161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148161
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-28 20:33:10 +00:00
5a1954eb93 [Inductor-CPU] Fix broken int8 WoQ GEMM AMX implementation in main (#147895)
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`

The issue remained undetected because, in case of a templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf` and falls back to the op against which it was being benchmarked, so UTs didn't fail even on machines that support AMX ISA.

`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.

@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-02-28 20:20:45 +00:00
clr
e0e516c554 Don't crash when we call __qualname__ on torch._C.ScriptFunction (#147894)
We've root caused this to ScriptFunction correctly throwing AttributeError
when missing attributes are accessed. This PR will fix the crashes that are
showing up. I'm going to stack a second PR to fix torch._C.ScriptFunction just
being a very badly behaved Python object (which should also fix this).
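
A minimal illustration of the object in question (illustrative only; it does not assert what `__qualname__` returns after the fix):

```python
import torch

@torch.jit.script
def f(x: torch.Tensor) -> torch.Tensor:
    return x + 1

# f is a torch._C.ScriptFunction; looking up attributes on it used to crash
# the process instead of raising AttributeError for missing ones
print(isinstance(f, torch.jit.ScriptFunction))
print(getattr(f, "__qualname__", "<no __qualname__>"))
```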

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894
Approved by: https://github.com/jansel
2025-02-28 20:15:38 +00:00
297c00264e stage 1 of deprecate silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-02-28 19:51:55 +00:00
ebc3f27bf4 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit 6e037ac41c095dfdb37fdd4b36bf8ec2ebf84bf1.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/wdvr due to lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2691421540))
2025-02-28 19:44:54 +00:00
42aeb5d259 Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners (#145504)
E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145504
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-02-28 19:16:28 +00:00
6e037ac41c [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node is referencing the tensor _before_ the reshape op, but referencing the `A_scale` node _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal, to reshape the reciprocal output back to the original shape from before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-02-28 18:51:42 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses existing locking mechanism, whenever possible, to gather statistics "for free". Only deviation from that is on the "slow path" where we incur CUDA calls anyway, so taking a short lock is not going to hurt the performance much, especially in the steady state where most allocations will come from cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
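
As a rough usage sketch (the accessor name below is an assumption modeled on the CUDA caching allocator's `torch.cuda.memory_stats()`, not a confirmed API from this PR):

```python
import torch

# a pinned allocation routed through the CachingHostAllocator
buf = torch.empty(1 << 20, dtype=torch.uint8, pin_memory=True)

stats = torch.cuda.host_memory_stats()  # assumed accessor name
for key, value in sorted(stats.items()):
    print(key, value)
```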

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
982d7ba3ef [while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019)
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on cpu while others are on gpu, e.g. a loop_idx kept on cpu saves copies to the gpu device. This PR relaxes the constraint and only checks that the carry and the input at the same position have the same device.
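
A sketch of the now-allowed usage (the import path and exact calling convention are assumptions; treat this as illustrative rather than this PR's test):

```python
import torch
from torch._higher_order_ops import while_loop  # assumed import path

def cond_fn(i, x):
    return i < 10          # loop condition evaluated on the cpu-resident index

def body_fn(i, x):
    return i + 1, x + 1.0  # index stays on cpu, payload stays on gpu

@torch.compile(backend="inductor")
def run(i, x):
    return while_loop(cond_fn, body_fn, (i, x))

i0 = torch.tensor(0)                 # carried loop index kept on cpu
x0 = torch.zeros(16, device="cuda")  # carried payload on gpu
i_out, x_out = run(i0, x0)
```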

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-28 18:27:03 +00:00
2d2f60bdda [cond] support mismatched output in inductor (#147567)
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in the wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for others (e.g. while_loop) as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
2025-02-28 18:26:48 +00:00
d765077004 [cutlass backend] Sort the list of ops for better repro (#148047)
Differential Revision: [D70298051](https://our.internmc.facebook.com/intern/diff/D70298051/)

This only affects anything if `cutlass_max_profiling_configs` is used. I believe cutlass_max_profiling_configs is more of a testing config.

Problem is when we get the configs from cutlass_library, the ops can come in different orders.

The motivation is to make reproducing small issues easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148047
Approved by: https://github.com/chenyang78, https://github.com/coconutruben
2025-02-28 18:04:10 +00:00
790ec756ee [cutlass backend] Check if len(timings) == len(choices) before skipping precompile (#148050)
Differential Revision: [D70298908](https://our.internmc.facebook.com/intern/diff/D70298908/)

Mostly from @coconutruben's observation. Right now, we skip precompilation if we find **some** timings. That sounds like a bug. Most of the time it is fine, since we don't change the number of configs and triton compilation doesn't take too long. But it is devastating for the cutlass backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148050
Approved by: https://github.com/coconutruben
2025-02-28 17:58:58 +00:00
e5e31050d3 [MPS] Implement linear1d as shader (#148154)
And get rid of the MPS call, as for some reason the implementation via the MPSGraph
API call is 100x+ slower than the Metal shader, at least according to
the following benchmark
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    torch.mps.synchronize()  # wait for queued MPS work before reading the end time
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS        |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16  ")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results after the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    2.5  |   2.1  |   2.2  | 161.4  | 115.0  | 161.1
```
And before the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  354.0  | 336.0  | 332.4  | 145.5  | 114.7  | 148.3
```

Fixes https://github.com/pytorch/pytorch/issues/144245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154
Approved by: https://github.com/dcci
2025-02-28 16:47:42 +00:00
b5cd4ac950 [torchgen] Add support for schema with namespace (#148038)
Fixes https://github.com/pytorch/executorch/issues/8711

In ExecuTorch when we try to parse the following schema:

```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:

```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it as an inplace operator.

I understand we are not supposed to pass in namespace "aten::" into
`FunctionSchema.parse()` but unfortunately ExecuTorch requires this
feature to work.

This PR adds a new `namespace` attribute to `BaseOperatorName` and makes
sure the rest of the stack works as before if a schema without a namespace
is passed in.
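
For illustration, a sketch of the intended behavior after this change (the `namespace` attribute is per the description above; the exact value printed is not asserted here):

```python
from torchgen.model import FunctionSchema

schema = FunctionSchema.parse(
    "aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor"
)
# schema.name is an OperatorName and schema.name.name a BaseOperatorName,
# which now carries the parsed namespace instead of tripping the
# inplace/dunder detection
print(schema.name.name.namespace)
```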
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
2025-02-28 16:41:50 +00:00
e593288859 ci: Remove manylinux builds for triton, except for XPU (#148129)
We're dropping regular old manylinux so let's drop it here too

Relates to #123649

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148129
Approved by: https://github.com/Camyll, https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #148126
2025-02-28 16:23:18 +00:00
4708cfdbd9 Support whitelist of dynamic sources (#147979)
This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic.

NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (e.g. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979
Approved by: https://github.com/Mingming-Ding
2025-02-28 15:43:14 +00:00
0a948f705b [Dynamo] Fix AssertionError when dynamo traces torch.functional.xxx() functions (#148075)
Fixes #147840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075
Approved by: https://github.com/yanboliang
2025-02-28 15:09:11 +00:00
1db3c58fab Remove manylinux 2014 artifacts (#148135)
1. Switch Magma build to Manylinux 2.28 base
2. Use manylinux 2.28 as default in populate_binary_env.sh
3. Remove manylinux 2014 docker builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148135
Approved by: https://github.com/malfet
2025-02-28 13:43:14 +00:00
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
34d726011f [dynamo] update data-dependent branching graph break messages (#147912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147912
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147494, #147872
2025-02-28 12:30:06 +00:00
4106aa33eb [dtensor][fix] fix _scaled_dot_product_flash_attention sharding (#148125)
### Summary
https://github.com/pytorch/pytorch/pull/146372/ changed the op signature of `_scaled_dot_product_flash_attention` and as a consequence DTensor needs to change its sharding defined at 40ad5e01df/torch/distributed/tensor/_ops/_matrix_ops.py (L232)

### Test
`pytest test/distributed/tensor/test_attention.py`

### Follow-up
It's still unclear why the CP unit tests were not run over the original PR which is BC-breaking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148125
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-02-28 09:26:43 +00:00
af720cd5a7 [Intel GPU] Decouple Intel GPU oneDNN from other backends (#147926)
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so as to quickly adopt the optimizations and bug fixes provided by the oneDNN team.

This PR does not change the behaviors of other backends like Intel CPU, ARM. They can keep using the stable version contained in `third_party/ideep`.

# Detail

At compilation time, we `git clone` oneDNN from the URL `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This is supported by the CMake `ExternalProject_Add` command.
Following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source files are located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`

# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```

Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```

We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang

Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
2025-02-28 07:42:06 +00:00
3a58a04898 Build a storage reader/writer to write checkpoints in HF format (#148089)
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py, so that HFStorage components don't get imported internally and, in turn, no fsspec import happens. I did the removal from __init__.py in D70286926 to fix the failing tests, but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere.

Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D70324090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
2025-02-28 07:38:10 +00:00
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
4e160d5fd9 [triton 3.3] Fix aoti cpp wrapper remaining 5 issue. (following #148051) (#148117)
Summary:
Fix the following 5 on a100:

- test_foreach_cpp_wrapper_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_dynamic_shapes_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_dynamic_shapes_gpu_wrapper

Test Plan:
oss :

```
TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCH_LOGS="+inductor, output_code" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CPLUS_INCLUDE_PATH=/usr/local/cuda-12.6/include:$CPLUS_INCLUDE_PATH python test/inductor/test_gpu_cpp_wrapper.py -k test_foreach_cpp_wrapper_cuda_gpu_wrapper
```

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148117
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-02-28 06:56:30 +00:00
ea12fc8a9f Revert D70262395 (#148164)
Summary:

This reverts #147804 due to internal revert.

---
This diff reverts D70262395

Reviewed By: RossMcKenzie

Differential Revision: D70318024

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164
Approved by: https://github.com/xmfan
2025-02-28 06:39:48 +00:00
baba7beed2 [dynamo] add context manager debug information to graph breaks (#147872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147872
Approved by: https://github.com/zou3519
ghstack dependencies: #147494
2025-02-28 06:23:28 +00:00
4caeede799 [dynamo] more better error messages [3/N] (#147494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147494
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-28 06:23:28 +00:00
bc362cc15a Move expanded dim require_exact_stride handling to api from sdpa lowering (#148101)
See issue: https://github.com/pytorch/pytorch/issues/147156#issue-2852362217.

Original tests from https://github.com/pytorch/pytorch/pull/146054 should cover these changes, and I tested that the perf on https://github.com/pytorch/pytorch/issues/145760 remains fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148101
Approved by: https://github.com/zou3519
2025-02-28 06:02:18 +00:00
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
3b4b23ab0b [BE][Ez]: Remove extra copy in dtensor parallel loss (#148096)
Remove an extra copy of the input to `_log_softmax` when there is a dtype and memory format change. Fuse the copies instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148096
Approved by: https://github.com/jansel, https://github.com/wconstab
2025-02-28 05:42:32 +00:00
9b7130b8db Clean temporary directory at exit (#147813)
Issue: A temporary directory is created in [pytorch/torch/distributed/nn/jit/instantiator.py](https://github.com/arthurlw/pytorch/blob/clean-temp-directory-at-exit/torch/distributed/nn/jit/instantiator.py) but is never cleaned up, leading to a ResourceWarning on program exit.

Solution: Registered an `atexit` handler to properly clean up the temporary directory when the program exits.

Fixes #147744

**Line 23 in [0a49f8f](0a49f8fd3d)**
```python
23  atexit.register(_TEMP_DIR.cleanup)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147813
Approved by: https://github.com/H-Huang
2025-02-28 04:12:23 +00:00
760921a7d8 [MPS] Add inductor support for the entr() operator. (#148128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148128
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-28 03:33:22 +00:00
eb9c127341 [dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824)
Earlier, with the inline flag, we were lifting id-guarded tensors to the inputs of the Fx graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse the compilation units across many instances of the nn-module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.

This PR installs the parameters back into the graph. The benefit is the removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has a large number of optimizer parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-02-28 03:22:11 +00:00
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
edc5bf91d2 [Intel GPU] Add synchronize() in torch.utils.benchmark (#147835)
When following https://pytorch.org/tutorials/recipes/recipes/benchmark.html on XPU, I noticed that the device is not synchronized in the benchmark. This PR tries to fix this and align the behavior with CUDA.
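
For reference, the kind of measurement from that recipe where the missing synchronize matters (standard `torch.utils.benchmark` usage; the xpu tensor is illustrative):

```python
import torch
from torch.utils import benchmark

x = torch.randn(1024, 1024, device="xpu")

t = benchmark.Timer(stmt="x @ x", globals={"x": x})
# without a device synchronize inside the timer, asynchronously launched xpu
# kernels would make these numbers look implausibly fast
print(t.timeit(100))
```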
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147835
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-02-28 02:58:17 +00:00
0edb2da4a4 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-28 02:30:04 +00:00
30375cb326 Fix minor typo in python_nccl (#148088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088
Approved by: https://github.com/Skylion007
2025-02-28 00:47:09 +00:00
481a57bc37 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, in which case the fwd/backward states get out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.
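
A minimal sketch of the user-level pattern this targets (illustrative only, not this PR's test): a checkpointed region containing an RNG op, compiled with cudagraphs.

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # RNG op inside the checkpointed region: its state must match between the
    # forward run and the recompute that happens during backward
    return torch.nn.functional.dropout(torch.sin(x), p=0.5, training=True)

def model(x):
    return checkpoint(block, x, use_reentrant=False).sum()

compiled = torch.compile(model, mode="reduce-overhead")  # cudagraphs path
x = torch.randn(4, 4, device="cuda", requires_grad=True)
compiled(x).backward()
```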

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked, the bwd rng states would not be equal to the same states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations, and the current position of the bwd rng states, and to update when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, bc the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But, I'd prefer to not have a rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme like taking the seed from graph position, or etc etc. Not sure. Let me know thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-28 00:47:03 +00:00
c6d1038aaa only print GraphModule during fx.Interpreter errors if valid (#148090)
Came up in https://www.internalfb.com/diff/D69057074?dst_version_fbid=970771615000938&transaction_fbid=1723357345264461 - we need to make sure the GraphModule is valid before calling `print_readable` on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148090
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #147749
2025-02-28 00:44:27 +00:00
5a14ff8ace Add cufile to list of libraries to preload (#148137)
Fixes: https://github.com/pytorch/pytorch/issues/148120

Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
2025-02-28 00:35:47 +00:00
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
2978771c9d [CI] test upload: better check for if job is rerun disabled tests (#148027)
Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing. This makes the script think that there are fewer retries than there actually are and that the job is not a rerun disabled tests job. Instead, query for the job name to see if it contains rerun disabled tests, and fall back to counting the number of retries if querying fails.

Alternate options: relax the check for the number of tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027
Approved by: https://github.com/huydhn
2025-02-28 00:04:33 +00:00
fc78192b1d ci: Only run CI specific things when in CI (#148126)
This was blocking me from running this locally so don't run it like this

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148126
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/atalman
2025-02-27 23:27:57 +00:00
f4235310e8 [BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978)
`empty_like` includes a memory_format arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862
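
For reference, the pattern this enables (real `torch.empty_like` API; the tensors are illustrative):

```python
import torch

src = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
# allocate directly in the desired layout instead of allocating contiguous
# memory and then copying into channels_last
out = torch.empty_like(src, memory_format=torch.channels_last)
```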

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978
Approved by: https://github.com/jansel
2025-02-27 23:16:44 +00:00
915b9c80ab [export] Sync aoti schema to schema.py (#148017)
Summary: Synchronizing internal AOTI schema to OSS schema.py

Test Plan: CI

Differential Revision: D70271151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148017
Approved by: https://github.com/yiming0416
2025-02-27 21:46:11 +00:00
871b3909fc ci: Remove manylinux 2014 remnants (#148028)
These are the only remaining references I could find to manylinux2014,
we should probably look to remove these a bit quicker since it made it
difficult to know which Dockerfiles were important in
.ci/docker/manywheel/

> [!TIP]
> I checked if we were using these by running
> `rg 2014 .github/`

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148028
Approved by: https://github.com/wdvr, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:37:00 +00:00
10ffd94216 Reference the commit explicitly (#148026)
Reference the commit tested by CI explicitly, and fail the merge if the PR was updated.

Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148026
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:06:34 +00:00
783d83c5d8 [PT2] Port fuse_split_getitem_squeeze to PT2 pre_grad passes (#148059)
Summary: put it as an add_pass option

Reviewed By: frank-wei

Differential Revision: D68909559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148059
Approved by: https://github.com/frank-wei
2025-02-27 21:03:51 +00:00
d48eb58d1d [BE][CI] bump ruff to 0.9.8 (#145606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145606
Approved by: https://github.com/malfet
ghstack dependencies: #144546
2025-02-27 21:01:10 +00:00
644d84d594 Revert "optimize the decomposition of aten.native_group_norm (#144733)"
This reverts commit b533bb4b133c36767270bd8a24f11d5c37f8dd5c.

Reverted https://github.com/pytorch/pytorch/pull/144733 on behalf of https://github.com/desertfire due to Cause TIMM pass rate regression on H100, see https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2020%20Feb%202025%2020%3A53%3A55%20GMT&stopTime=Thu%2C%2027%20Feb%202025%2020%3A53%3A55%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=main&lCommit=4216478250e08e950fdd090fc23a1b270c520cc4&rBranch=main&rCommit=4986f0f52eb871cdb91b8124ee162cfe622b8688 ([comment](https://github.com/pytorch/pytorch/pull/144733#issuecomment-2689092714))
2025-02-27 20:57:25 +00:00
1845e7d1f5 Use nightly-wheel-upload env for triton wheel publishing (#148108)
Required for publishing triton builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148108
Approved by: https://github.com/malfet
2025-02-27 20:47:40 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
f0d00421cf [inductor][ck] kBatch filtering with gen_ops (#148004)
Summary:
# Why

not all choices of kBatch are valid and will lead to a runtime error (when CK checks the validity of the args)

c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)

# What

- move kBatch inside the gen_ops to have more control over it, and be able to filter it
- expand filtering based on the cpp logic
- refactor the padding checks to be more readable

Test Plan:
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

with

kBatch = 128: some filtering
kBatch = 1: no filtering
kBatch = 1738: all options filtered out

Reviewed By: henrylhtsang

Differential Revision: D70211442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004
Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent
2025-02-27 20:13:58 +00:00
ce805a5ba5 [BE/metal] Rename REGISTER_I0_I1 to REGISTER_SPECIAL. (#148036)
Now that it's used for other ops as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148036
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-27 17:56:26 +00:00
9a1f720a72 Validate inputs to _nested_view_from_buffer to prevent overflows (#147356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356
Approved by: https://github.com/albanD, https://github.com/jbschlosser
ghstack dependencies: #147352, #147354
2025-02-27 15:48:58 +00:00
536bce5a04 Make Tensor.set_ validate storage_offset when sizes/strides are unchanged (#147354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147354
Approved by: https://github.com/albanD
ghstack dependencies: #147352
2025-02-27 15:48:58 +00:00
e64441915f Fix overflow in checkInBoundsForStorage (#147352)
Use `computeStorageNbytes` (which checks for overflows) to include the computation re the storage_offset

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147352
Approved by: https://github.com/albanD
2025-02-27 15:48:50 +00:00
6ccbff1450 [Inductor] Fix inductor/test_kernel_benchmark.py for new Triton; do not duplicate parameters in _dump_launch_params (#147746)
The problem is that the new Triton uses the following code branch, which does not filter out call parameters that may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)

Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499):

```bash
Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b
dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module>
    compiled_module_main('None', benchmark_compiled_module)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main
    wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module
    return print_performance(fn, times=times, repeat=repeat)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp>
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed
    result = model(*example_inputs)
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda>
    fn = lambda: call([arg0_1, arg1_1])
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call
    triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run
    bound_args, specialization, options = binder(*args, **kwargs)
TypeError: dynamic_func() got multiple values for argument 'XBLOCK'
```

Reproduce:
`python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps`

Triton: c4a79a1960
Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732

@davidberard98 @etaf please take a look
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746
Approved by: https://github.com/jansel
2025-02-27 14:40:22 +00:00
2c35af4def [Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969)
XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`.
f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)

 So, `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is useless for XPU. `XPU_MKLDNN_INCLUDE` is good enough. Meanwhile, it may mess up the included files if the version of XPU oneDNN differs from other backends.

* __->__ #147969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969
Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman
2025-02-27 14:22:17 +00:00
71ee17baa1 Smoke Test skip cuda.gds on windows (#148060)
Follow up after : https://github.com/pytorch/pytorch/pull/147120
Cufile was enabled only on Linux: https://pypi.org/project/nvidia-cufile-cu12/#files
Fixes validation workflow failures: https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37896578837

```
 File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 105, in __init__
    raise RuntimeError("GdsFile is not supported on this platform.")
RuntimeError: GdsFile is not supported on this platform.
Exception ignored in: <function GdsFile.__del__ at 0x000001772B5003A0>
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 113, in __del__
    if self.handle is not None:
AttributeError: 'GdsFile' object has no attribute 'handle'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148060
Approved by: https://github.com/mikaylagawarecki
2025-02-27 14:00:49 +00:00
7ae0e0b2ea [aotd] Log torch._functorch.config in tlparse (#147883)
Adding torch._functorch.config to tlparse for better debuggability.
E.g. https://github.com/pytorch/pytorch/pull/147638 happened only with `torch._functorch.config.view_replay_for_aliased_outputs=False`, which is True by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147883
Approved by: https://github.com/bdhirsh, https://github.com/jamesjwu
2025-02-27 11:22:45 +00:00
c5bf9aaf1c Log graph breaks (#146537)
Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537
Approved by: https://github.com/c00w
2025-02-27 11:06:33 +00:00
0489a349e7 Skip the logging if the pass cannot be pickled (#148053)
Summary:
Skip the logging for vllm at this moment; we can add some pickle logic later.

The log is only for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148053
Approved by: https://github.com/chenyang78
2025-02-27 10:54:34 +00:00
26f19539ad [triton 3.3] cpp_wrapper: add a global_scratch arg (#148051)
Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
2025-02-27 10:13:57 +00:00
91e7c7945c [Intel GPU] Avoid unnecessary copy when the dst of Matmul is non-contiguous (#144759)
We should not always call contiguous on the dst of matmul. We have already removed the copy of the matmul input in https://github.com/pytorch/pytorch/pull/143784

I also fixed an accuracy issue by using the onednn sum post op instead of binary add in the inplace case, to avoid a UT failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144759
Approved by: https://github.com/EikanWang
2025-02-27 08:04:34 +00:00
8ee84aa703 [ONNX] Fix missed None type support in dynamic shapes string cases (#148025)
In `_any_str_or_dim_in_dynamic_shapes`, we strictly guard the `dynamic_shapes` to make sure the flattened shapes are valid. But the code missed that None could appear in the shapes.

NOTE: Found in benchmarking with Olive.
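
An illustrative spec of the mixed case the guard missed (the tiny model is made up; the exporter call follows the standard dynamo-based `torch.onnx.export` signature):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# a string names a dynamic dim, None marks a static dim
dynamic_shapes = {"x": {0: "batch", 1: None}}
torch.onnx.export(
    M(), (torch.randn(2, 3),), "m.onnx",
    dynamo=True, dynamic_shapes=dynamic_shapes,
)
```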

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148025
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-02-27 07:57:47 +00:00
fd43c36aa9 [ca] side-effect free initial trace: RAII PyCompilerInterface (#147891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147891
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796, #147804
2025-02-27 07:17:30 +00:00
9017becf1d Add unique kernel name support for user defined triton kernel (#147587)
Summary:
Add unique_user_kernel_names, which mimics what unique_kernel_names does, but for user-defined Triton kernels.
This does rewrite the copied kernel src and modifies non-Inductor-generated code, so we split it out from unique_kernel_names, where we have more control over all namings and generations.
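
A usage sketch, assuming the new flag sits alongside `unique_kernel_names` in `torch._inductor.config` (the exact location is an assumption):

```python
import torch._inductor.config as inductor_config

inductor_config.unique_kernel_names = True        # existing: Inductor-generated kernels
inductor_config.unique_user_kernel_names = True   # this PR: user-defined Triton kernels
```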

Test Plan: Only used for debug purpose

Differential Revision: D69966608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587
Approved by: https://github.com/desertfire
2025-02-27 06:00:50 +00:00
c622796cde Revert "Build a storage reader/writer to write checkpoints in HF format (#147622)"
This reverts commit 6a658d983e84f7bcb8e67328b00661ec49db78c5.

Reverted https://github.com/pytorch/pytorch/pull/147622 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147622#issuecomment-2686932514))
2025-02-27 05:14:28 +00:00
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
b6fe28ff02 [Inductor] Graph Partition (#147038)
This PR implements inductor graph partition. Previously, 1 dynamo graph was mapped to 1 inductor graph, which was further mapped to 1 call function. In this PR, we allow 1 dynamo graph to map to multiple inductor graphs and multiple `graph_partition` functions in the generated code. This allows applying different further optimizations to different `graph_partition`s.

Design Doc: [link](https://docs.google.com/document/d/1qPgOfy25l7SIYnrQrvU-TO1mdHMslCwv_SLmeXID6tM/edit?usp=sharing)
Example: [Generated code before and after this diff](https://www.internalfb.com/intern/diffing/?paste_number=1737334601)

In the follow-up PR, we will extend the work to cudagraph, which allows applying cudagraph to parts of the generated code (#125864).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147038
Approved by: https://github.com/eellison
2025-02-27 04:50:43 +00:00
e0b93082f1 Remove HuggingFace reader and writer from __init__.py (#148030)
Summary: This is causing HFStorageReader/Writer to be imported, which imports fsspec; but dependencies don't have fsspec, which is causing failing builds.

Differential Revision: D70286926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
2025-02-27 04:50:14 +00:00
8cb8722979 [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and the associated block ptr that would be created for `op0` in isolation are no longer needed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-27 03:37:33 +00:00
17358ce778 Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878)"
This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f.

Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))
2025-02-27 03:36:16 +00:00
9d3636283b [Inductor] Use generic GPU device in test_preserves_strides (#148006)
#147861 added a new test tagged for the generic GPU but uses the cuda GPU type for creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install, presumably it will work for the current pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006
Approved by: https://github.com/eellison, https://github.com/etaf
2025-02-27 02:52:51 +00:00
07b7b3ed4e torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-27 02:44:39 +00:00
84c89a4527 [cutlass backend] cache_clear algorithm select cache on fresh inductor cache (#147590)
Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/)

AlgorithmSelectorCache is a cache. The expectation is that when we force disable cache + clear inductor caches, it would be cleared. However, that is not the case.

The reason why this is a problem can be seen by following this repro:
What we will see is
```
SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices
SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices
```

The root cause is that precompiling is skipped (because the selector cache still holds results), but autotuning isn't skipped, since we force disable caches.

repro:
```
import logging
import os

os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor"

import torch

import torch._inductor.config
from torch._inductor.utils import clear_inductor_caches

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.autotune_num_choices_displayed = None
torch._inductor.config.max_autotune_gemm_backends = "CUTLASS"
torch._inductor.config.autotune_fallback_to_aten = False
torch._inductor.config.cuda.cutlass_instantiation_level = "0001"

def main():
    M, N, K = 2048, 2048, 2048
    dtype = torch.bfloat16
    A = torch.randn(M, K, device="cuda", dtype=dtype)
    B = torch.randn(K, N, device="cuda", dtype=dtype)
    for _ in range(2):
        torch._dynamo.reset()
        clear_inductor_caches()
        compiled_model = torch.compile(torch.mm, fullgraph=True)
        _ = compiled_model(A, B)

    print("done")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590
Approved by: https://github.com/eellison, https://github.com/chenyang78
2025-02-27 02:30:49 +00:00
97ebccaa91 Add _fft_r2c as core ATen (#147998)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147998
Approved by: https://github.com/tugsbayasgalan
2025-02-27 02:29:59 +00:00
ad0c879e22 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, in which case the fwd/backward states get out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then, naively, when bwd1 is invoked the bwd rng states would not be equal to the states that were observed in fwd1. I added handling of this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update them when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer to not have an rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply this to non-cudagraph runs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value, which doesn't seem great. I could use some other initialization scheme, like taking the seed from the graph position. Not sure; let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint.
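
As a rough usage sketch (not taken from this PR; the function and shapes are made up), this is the kind of program the change targets: RNG ops inside a non-reentrant checkpointed region, compiled with cudagraphs via `mode="reduce-overhead"`:

```python
import torch
from torch.utils.checkpoint import checkpoint

def gn(x):
    # dropout inside the checkpointed region is recomputed during backward,
    # which is where CUDA-graph-safe RNG handling matters
    return torch.nn.functional.dropout(torch.sin(x), p=0.5)

def fn(x):
    return checkpoint(gn, x, use_reentrant=False).sum()

x = torch.randn(4, 4, device="cuda", requires_grad=True)
compiled = torch.compile(fn, mode="reduce-overhead")  # enables cudagraphs
compiled(x).backward()
```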

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-27 02:08:29 +00:00
784902983e Remove +PTX from cuda 12.6 builds (#148000)
Similar to: https://github.com/pytorch/pytorch/pull/141142

Ahead of the release 2.7
I see following validation failure: https://github.com/pytorch/test-infra/actions/runs/13552433445/job/37879041739?pr=6339
```
RuntimeError: Binary size of torch-2.7.0.dev20250226+cu126-cp310-cp310-manylinux_2_28_x86_64.whl 1076.45 MB exceeds the threshold 750 MB
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148000
Approved by: https://github.com/clee2000, https://github.com/ngimel, https://github.com/tinglvv
2025-02-27 02:02:11 +00:00
20ce67cd06 Update hw requirement for FP64 on "Getting Started on Intel GPU" (#147802)
Fixes #147731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147802
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:54:19 +00:00
9ca871f32b Remove binaries/benchmark_args.h (#147920)
It's not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147920
Approved by: https://github.com/Skylion007
2025-02-27 01:16:28 +00:00
ea5d40db73 Address source code building command for Intel GPU support (#143476)
As the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143476
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Xu Han <xu.han@outlook.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:07:40 +00:00
f104ef1248 [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: D70141808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-02-27 00:35:12 +00:00
f98cd84b04 cpp_wrapper: use largeTensorTest for test memory checks (#146991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991
Approved by: https://github.com/desertfire
2025-02-27 00:30:21 +00:00
723f3a9eab torch.utils._content_store: fix error in hash_storage on XPU (#147785)
See https://github.com/pytorch/pytorch/actions/runs/13508573465/job/37745227468 for an example error. This is triggering after the merge of #147541, which enabled Dynamo compilation on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147785
Approved by: https://github.com/jansel
2025-02-26 23:57:59 +00:00
915eb012e1 Revert "[dynamo] add sourceless builder for types.MethodType (#147880)"
This reverts commit 08f4c1a2332921e57c782c80a66b2adc9cdc0575.

Reverted https://github.com/pytorch/pytorch/pull/147880 on behalf of https://github.com/wdvr due to failing trunk tests ([comment](https://github.com/pytorch/pytorch/pull/147880#issuecomment-2686436432))
2025-02-26 23:29:58 +00:00
84e60eece8 [ROCm] [TunableOp] Unit tests for scaled GEMM and GEMM with bias (#147890)
Two more unit tests for TunableOp:
- Scaled GEMM
- GEMM with bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147890
Approved by: https://github.com/jeffdaily
2025-02-26 22:41:24 +00:00
b13ad1a193 [ROCm][TunableOp] Remove extra transpose characters in hipBLASLt signature. (#147900)
Clean up the TunableOp hipBLASLt signature by removing extra transpose characters.

Tested manually; no new regressions found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147900
Approved by: https://github.com/jeffdaily
2025-02-26 22:28:00 +00:00
7e7d05bf85 Revert "[do not merge yet] update grammar (#147996)"
This reverts commit 6e129a697f86425d0682ed30ffc9b3f8abe00e9e.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686291282))
2025-02-26 22:01:12 +00:00
6e129a697f [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:52:58 +00:00
dc7556f1bd Revert "[do not merge yet] update grammar (#147996)"
This reverts commit a1ee2c3a08c3bf3d83c4e9f352ea179c107edb13.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686266052))
2025-02-26 21:43:06 +00:00
a1ee2c3a08 [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:39:08 +00:00
201666d77d [cutlass backend] turn autotuning logs off by default + rename log to autotuning log (#147922)
things we did:
* turn off autotuning logs by default
* rename autotuning logs from log to autotuning_log, so people are aware that it is a special artifact log.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147922
Approved by: https://github.com/eellison
2025-02-26 21:02:04 +00:00
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
per title

sometimes, it's hard for cmake to find NVTX3 without the cuda include path hint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
6a658d983e Build a storage reader/writer to write checkpoints in HF format (#147622)
Title: we want to write checkpoints in HF format with DCP; this diff enables this for the non-distributed use case.
Copy of [D68444967](https://www.internalfb.com/diff/D68444967) (https://github.com/pytorch/pytorch/pull/146352). That diff got reverted because of lint errors. The lint errors were due to imports of uninstalled libraries. This was on purpose because we don't want to install safetensors and huggingface; this new diff explicitly ignores this lint so that we don't hit the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147622
Approved by: https://github.com/saumishr
2025-02-26 20:47:54 +00:00
7c71ab1d40 [scan] User-facing reverse flag handling (#147886)
This PR removes the reverse flag from the backend implementation and resolves it via `torch.flip` in the frontend.
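
A minimal illustration of the general idea (using `torch.cumsum` as a stand-in for a scan, not the actual scan API): a reverse scan can be expressed by flipping the input, running the forward scan, and flipping the result back.

```python
import torch

x = torch.arange(1.0, 7.0)

fwd = torch.cumsum(x, dim=0)                                     # forward scan
rev = torch.flip(torch.cumsum(torch.flip(x, [0]), dim=0), [0])   # reverse scan via flip
```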

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147886
Approved by: https://github.com/ydwu4
2025-02-26 20:04:57 +00:00
683e083e8d [MPS] Add support for entr() in eager. (#147948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948
Approved by: https://github.com/malfet
2025-02-26 19:55:02 +00:00
eb08ada5d3 [dynamo] Support reads to global/captured tensors in nonstrict_trace-ed function (#147572)
As title. Without this patch we get the following error:

Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite
work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.

So we can't easily enable the `allow_non_fake_inputs` flag temporarily
on the current fake mode when AOTAutograd processes a `flat_apply`
node from Dynamo's `nonstrict_trace` handling. After discussing
with zou3519, I decided to add a global `FakeTensorTLS` that contains a
`allow_non_fake_inputs_override` flag, and patch the `nonstrict_trace`-ed
function to temporarily tweak this flag during its execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
2025-02-26 19:47:39 +00:00
73e963459e [dynamo] Support nonstrict_trace on class method (#147571)
As title, also see
1. new test `test_nonstrict_trace_on_method` for example.
2. newly added comments for why we need special treatment on methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147571
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950
2025-02-26 19:47:39 +00:00
7e0ef2c844 [dynamo] Use the new get_unique_name_wrt helper when applicable (#146950)
This patch removes some duplicated name generation logic in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146950
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367
2025-02-26 19:47:39 +00:00
f46f0e465c [dynamo] Initial support for nonstrict_trace (#146367)
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.

1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)

## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.

The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards).  This patch takes a middle-ground (simple implementation with a
bit of user labor), by
1. asking the user to provide pytree registration for non-proxy-able
   input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
   buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
   `torch._higher_order_ops.flat_apply` (which unflattens the inputs and
   invokes the underlying function).
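
A rough usage sketch based on the description above (the decorated function and shapes are made up, and the exact semantics may differ from this simplified example):

```python
import torch

@torch._dynamo.nonstrict_trace
def blackbox(x):
    # traced non-strictly, similar in spirit to allow_in_graph
    return x.sin() + 1

@torch.compile(fullgraph=True)
def fn(x):
    return blackbox(x) * 2

fn(torch.randn(3))
```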

## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contains `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
2025-02-26 19:47:39 +00:00
bab84f0bd9 [hop] Support more output types for flat_apply (#146714)
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.

The patch also exposes a `func_to_graphable` rather than having the
users calling the lower level `pytree.flatten(ConstantFunction(...))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
2025-02-26 19:47:39 +00:00
8594856651 [aotd] Alias of intermediate unwrap TensorAlias (#147638)
Bug was reported by internal user.

AOTD classified outputs that are aliases of intermediates of the graph in different categories.

...
- output is alias of intermediate which base is already output
- output is alias of intermediate which base is not in output

If we look at the fn:
```
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```

Output 0: a detach view of alias `a`, where `a` is already an output.
Output 1: an alias of intermediate `ix`, so an additional output `ix` will be added internally.

Output 0's base is `TensorAlias(a)` in this case, but it could also be a plain Tensor.
Adding runtime unwrapping solves this problem.

Alternatively, we could track the base of `a.detach()` all the way to `ix`; in that case the base would always be a Tensor, not a TensorAlias.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
2025-02-26 19:42:21 +00:00
30db64bf51 [PT2] Support add/remove passes in pre_grad (#146064)
Summary:
Support the same functionality with acc_tracer disabled: add a new config for pre_grad add/remove passes; at the front end it still uses the same interface.

Some minor updates in the pre_grad passes make sure they run in the desired order; after the added passes, we still run passes like remove_noops at the end.

Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)

Differential Revision: D68909278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
2025-02-26 18:46:43 +00:00
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But although it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow

- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, used when both input and mask are dense tensors of the same size; `broadcast`, used when `mask`'s leading dimensions are expandable into the input tensor; and `strided`, a general-purpose fallback that still computes the position in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.

Performance measured on an M2 Pro through different iterations of the same shader:

| dtype    | MPS       | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| -------- | --------- | --------- | ------------- | -------------- | ------------------ |
| float32  | 2.8 msec  | 41.6 msec | 26.9 msec     | 5 msec         | 2.4 msec           |
| float16  | 1.86 msec | 38.2 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec     | 4.6 msec       | 1.9 msec           |

And benchmark script
```python
import torch

from timeit import default_timer
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    dtype=torch.float32,
) -> Measurement:
    # benchmark out-of-place masked_fill on MPS for a (1, 20, n, n) input
    t = Timer(
        stmt="x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals={"torch": torch},
        language="python", timer=default_timer,
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00
ebf6b9839c [MPS] faster integer batched matmul (#147877)
Followup to #147526
Tiled matmul for bmm as well.

## Speed ups:
![speedups_bmm](https://github.com/user-attachments/assets/02501145-7d64-4bbe-9dcc-994f004b4829)

Script to record times:
```python
import torch
import numpy as np
import time
import csv

batch_sizes = [1, 2, 4, 8]
matrix_sizes = [256, 512, 1024, 2048]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'B': [],
    'mean_time': [],
    'std_time': []
}

for b in batch_sizes:
    for n in matrix_sizes:
        print(f"\nBenchmarking N={n} and B={b}")

        try:
            A_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")
            B_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")

            for _ in range(warmup_runs):
                _, _ = run_int_mm(A_mps, B_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_int_mm(A_mps, B_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['B'].append(b)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}: {e}")
            continue

with open('int_bmm_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['B'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147877
Approved by: https://github.com/Skylion007
2025-02-26 18:37:13 +00:00
cfb293ee02 [inductor] Add logs for precompile and autotuning (#147923)
Differential Revision: D70222645

I want to add more logs around precompile, especially around the reason why sometimes it gets fast returned. See https://github.com/pytorch/pytorch/pull/147590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147923
Approved by: https://github.com/Skylion007
2025-02-26 18:26:07 +00:00
0ea5d1067b ROCm: Remove static specifier for allow_tf32 variable. (#147186)
Since the env variable HIPBLASLT_ALLOW_TF32 can change at runtime, remove the static specifier from the allow_tf32 variable so that it captures the current value of the env variable HIPBLASLT_ALLOW_TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186
Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd
2025-02-26 18:24:02 +00:00
4e4191854b [logs][qol] Print log options alphabetically (#147888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147888
Approved by: https://github.com/jansel
2025-02-26 18:15:39 +00:00
fb566c5aea Fix auto_functionalize x inference_mode (#147925)
Fixes #147924

We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
2025-02-26 18:05:30 +00:00
678435c443 [FlexAttention] Fix IMA bug (#147918)
# Summary
Fixes: https://github.com/pytorch/pytorch/issues/147268

I got this right for the backward and somehow forgot to do the flip in the forward; not sure how this wasn't found earlier.

Testing IMAs is tough in pytest, so I didn't add a test, but I verified on the reproducer:

```py
❯ sanitize python flex/maurice_ima.py --setting 0
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(1.0078, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.7994, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 1
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(2.8297, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.9714, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 2
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(3.2232, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(2.2095, device='cuda:0')
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147918
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-02-26 17:59:05 +00:00
3f7e242c86 [CI] Checkout with more processes (#147652)
The default action doesn't use more processes, possibly because most GitHub-provided runners only have 2 CPUs, but we have more than that, so we might as well use them.

Generally cuts maybe 1 min off of checkout time?

Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
2025-02-26 17:51:28 +00:00
ef61c290e1 [DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)
Resolves https://github.com/pytorch/pytorch/issues/146767.

May also resolve https://github.com/pytorch/pytorch/issues/147584.

### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG which currently requires CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides this, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor.

### Consequence

DTensor RNG initialization is delayed till the first DTensor random ops call or `torch.distributed.tensor.random.manual_seed`.

### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
2025-02-26 17:33:22 +00:00
5ef94ca816 [BE] Do not copy arguments in variadic template (#147977)
By adding the missing `std::forward<Args>(args)...` and declaring the template as passing args by reference.

Noticed while working on creating an `mtl_setBytes` specialization that takes `MPSScalar` as an argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147977
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-26 17:20:16 +00:00
ba9ed856e0 [FlexAttention] Improve error msg for embedding < 16 (#147765)
flex_attention uses tl.dot, which [does not support embedding < 16](https://github.com/triton-lang/triton/issues/2266) for input shapes. This PR adds an explicit error message for users who are prototyping with small tensors.

Fixes #147701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147765
Approved by: https://github.com/drisspg
2025-02-26 17:06:35 +00:00
ac926f81cc [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.
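
For context, a generic illustration (not the Inductor code) of the difference between a value cast and a bitcast; the latter reinterprets bytes and is what the new Triton checks constrain:

```python
import torch

x = torch.tensor([1.0, 2.0], dtype=torch.float32)
value_cast = x.to(torch.int32)   # converts values -> [1, 2]
bit_cast = x.view(torch.int32)   # reinterprets bits -> [1065353216, 1073741824]
```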

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-26 16:56:17 +00:00
fd1220e386 [ca] side-effect free initial trace: compiled_args (#147804)
const methods to prevent accidental mutation. changes mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-26 16:37:27 +00:00
5e3069dde8 [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-26 16:37:27 +00:00
0a2da008f8 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
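
For reference, a generic offloading example built on saved tensor hooks (not part of this PR); the memory savings depend on the unpack hook firing lazily during backward, which is what progressive unpacking in the CA graph enables:

```python
import torch

def pack(t):
    # offload the saved activation to CPU at pack time
    return t.device, t.to("cpu")

def unpack(packed):
    device, t = packed
    # ideally runs only when the corresponding backward node needs the value
    return t.to(device)

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()
y.backward()
```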

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-26 16:37:17 +00:00
08f4c1a233 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-26 15:43:47 +00:00
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
be830c8b1c [Inductor][CPP] fix store mode atomic add (#147961)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147848 and https://github.com/pytorch/pytorch/issues/146390. While addressing these issues, 2 problems were encountered:

- In `CppVecKernel`, when the number of threads is 1 and the mode is `atomic_add`, `store` did not `load/add` before storing. This has been fixed in this PR.

- In `CppTile2DKernel`, `store` did not support `atomic_add` mode. Support for this has been added in this PR.
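
For context on the test plan below: `nn.Fold` with overlapping windows sums into the same output locations, which is the kind of store that exercises the atomic-add path. A minimal check (sizes are illustrative) could look like:

```python
import torch

fold = torch.nn.Fold(output_size=(4, 4), kernel_size=(2, 2))  # stride 1 -> overlapping windows
x = torch.randn(1, 3 * 2 * 2, 9)  # (N, C * prod(kernel_size), L) with L = 3 * 3 blocks
compiled_fold = torch.compile(fold)
torch.testing.assert_close(compiled_fold(x), fold(x))
```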

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_nn_fold
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147961
Approved by: https://github.com/malfet
2025-02-26 14:04:34 +00:00
f522d899fb Add MSVC version condition to "Fix for MSVC problem on Windows Arm64 (#136765)" (#145076)
This PR adds MSVC version guards around the if block presented on f7e36d8d6f9706ee9b9653538c4c8d2ba375a181. This commit was to provide a workaround for the problem reported here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 .
The issue is fixed now and only appears between versions 19.36 and 19.42.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145076
Approved by: https://github.com/malfet, https://github.com/alinpahontu2912

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-02-26 12:08:24 +00:00
60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974 which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This allows eliminating the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
7ffae2c028 Split test_transformers.py (#147441)
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
2025-02-26 11:54:24 +00:00
cf6d1e6824 [dynamo] add generic graph break hints (#147429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147429
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147385
2025-02-26 09:20:28 +00:00
3fd68e4e2f [dynamo] make some more graph break messages readable in English [2/N] (#147385)
This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
2025-02-26 09:20:28 +00:00
7a06bfdd1c [inductor][ck] kBatch parametrized (#147885)
Summary:
# Why

Enable us to set the kBatch parameter, rather than bake it in

Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests)

## Why like this

The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code.

The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, as now multiple choices of kBatch can be used with the exact same `.so` (the shared library does not depend on kBatch, but takes it as a parameter).

# What

- copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent
- pipe through everything from template and kernel
- hard-code it to be kBatch=1 for now (same as before, just now settable)

This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch that don't work
2. set this better for splitK scenarios (hand written heuristic)

Test Plan:
(with minor modifications)

```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```

```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

Differential Revision: D70200008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler
2025-02-26 07:28:19 +00:00
a84db75e1b Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit 12b9674cb603438639298d6c9757ea93e18a7289.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - similar to previous, see below ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2684134336))
2025-02-26 07:17:24 +00:00
4216478250 Fix the benchmark config name from H100 benchmark (#147947)
When using the wrong benchmark configs, the benchmark jobs will be skipped.  The name should have the `_cuda_h100` suffix as used in the test matrix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147947
Approved by: https://github.com/wdvr
2025-02-26 06:40:07 +00:00
4ec6c1d1ec Fix test_halide.py report invocation to re-run failed tests (#147640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147640
Approved by: https://github.com/jansel
2025-02-26 06:32:22 +00:00
acca9b9cb0 Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)"
This reverts commit 0b9da1ae0ad30ef228f132354b875bcaec214ace.

Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))
2025-02-26 05:32:17 +00:00
12b9674cb6 torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-26 05:21:26 +00:00
9ed40af917 [BE][EZ] Delete MacOS-12.3 xfail list (#147905)
As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacOS-13 checks from the test script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905
Approved by: https://github.com/dcci
ghstack dependencies: #147892
2025-02-26 05:08:09 +00:00
a2399c9b44 [BE] Switch index_variable to torch.testing.make_tensor (#147892)
As it was a long-time TODO, and it actually unblocks using this function for MPS devices (which do not support double).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147892
Approved by: https://github.com/dcci
2025-02-26 05:08:09 +00:00
c839fa4dd2 [Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861)
Resubmit of https://github.com/pytorch/pytorch/pull/145448. it lost its changes on rebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861
Approved by: https://github.com/zou3519
2025-02-26 05:05:06 +00:00
ba25e26baa [ROCm] Use IPT=8 for block radix sort (#147657)
Improve performance for shapes that use block radix sort by decreasing the item_per_thread to 8.
This will increase the thread block size leading to higher occupancy.

Co-author: @amd-sushetty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657
Approved by: https://github.com/jeffdaily
2025-02-26 04:22:16 +00:00
f211818bc0 [c10d] Restrict use condition of NCCL mem pool (#147764)
Add check to see if CUDA driver support multicast, as does in Symmetric Memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764
Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang
2025-02-26 03:40:00 +00:00
d3fc583ff0 [cutlass backend] force_disable_caches for test_number_mm_precompiles (#147901)
Summary: Test is flaky right now.

Differential Revision: D70209511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147901
Approved by: https://github.com/ColinPeppler
2025-02-26 03:22:49 +00:00
9ad0ad6497 [MPS] Introduce a shader for entr(). (#147914)
To be used in eager/inductor in order to implement the missing operation.
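
For reference, a plain-PyTorch sketch of the semantics the shader needs to match (the standard definition of entr; this is illustrative, not the shader itself):

```python
import torch

def entr_reference(x: torch.Tensor) -> torch.Tensor:
    # entr(x) = -x * ln(x) for x > 0, 0 at x == 0, -inf for x < 0
    return torch.where(
        x > 0,
        -x * torch.log(x),
        torch.where(x == 0, torch.zeros_like(x), torch.full_like(x, float("-inf"))),
    )

x = torch.tensor([-1.0, 0.0, 0.5, 1.0, 2.0])
torch.testing.assert_close(entr_reference(x), torch.special.entr(x))
```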

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147914
Approved by: https://github.com/malfet
2025-02-26 02:54:44 +00:00
805f7d97f7 [Inductor][Optimus] Fix a corner case in split cat aten pass (#147784)
Summary: We need to further check the inputs of the cat to make sure all of them come from the same split node.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/c875cbdd-5374-46cf-811c-45f91cf6ba3e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10977524161964655
Network: Up: 64KiB  Down: 27KiB  (reSessionID-2e5915cb-4894-48f6-ab1c-3981adb42dab)
Executing actions. Remaining     0/3                                                                         1.5s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:52.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

before
aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

after
aps-recgpt_ig_emb_pt2_comment_out-c03f74e353

Differential Revision: D70132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147784
Approved by: https://github.com/Microve
2025-02-26 02:19:48 +00:00
b533bb4b13 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed in the range [N, C, *],
3. out = out * weight + bias, computed in the range [N, C, *],

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = - mean * rstd * weight + bias, computed in the range [N, C],
3. out = x * new_weight + new_bias, computed in the range [N, C, *] (see the sketch below),
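
A minimal, self-contained sketch of the rewritten decomposition (an illustration of the algebra above, assuming contiguous per-group channels; the reshape/repeat_interleave plumbing here is not the exact Inductor decomposition):

```python
import torch

def group_norm_decomposed(x, weight, bias, num_groups, eps=1e-5):
    # x: [N, C, *], weight/bias: [C]
    N, C = x.shape[:2]
    xg = x.reshape(N, num_groups, -1)
    mean = xg.mean(dim=-1)                                   # [N, G]
    rstd = (xg.var(dim=-1, unbiased=False) + eps).rsqrt()    # [N, G]

    # fold mean/rstd into the affine params in the small [N, C] range
    mean_c = mean.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    rstd_c = rstd.repeat_interleave(C // num_groups, dim=1)  # [N, C]
    new_weight = rstd_c * weight                             # [N, C]
    new_bias = bias - mean_c * rstd_c * weight               # [N, C]

    shape = (N, C) + (1,) * (x.dim() - 2)
    # single elementwise pass over the large [N, C, *] range
    return x * new_weight.reshape(shape) + new_bias.reshape(shape)

x = torch.randn(2, 6, 8, 8)
w, b = torch.randn(6), torch.randn(6)
ref = torch.nn.functional.group_norm(x, 3, w, b)
torch.testing.assert_close(group_norm_decomposed(x, w, b, 3), ref, rtol=1e-4, atol=1e-4)
```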

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) show about 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) show about 2% performance improvement. On A100, no performance gains or regressions were seen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-26 01:42:46 +00:00
12112fd198 Fix bug in FSDP wrapped module with zero argument (#147771)
Fixes https://github.com/pytorch/pytorch/issues/147531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147771
Approved by: https://github.com/awgu
2025-02-26 01:40:53 +00:00
8de6fe8c0b [docs] fix numpy docs reference (#147697)
Fix a link to numpy documentation that has moved and now 404's

I've checked the other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697
Approved by: https://github.com/soulitzer
2025-02-26 01:30:03 +00:00
90e3a3d86d Revert "[ca] trace saved variable unpacking (#147242)"
This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca.

Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))
2025-02-26 00:40:16 +00:00
4d614baa30 Revert "[ca] side-effect free initial trace: GraphTask (#147796)"
This reverts commit 5758743f3c92f9cd9b61bc435602f13dd19c13d7.

Reverted https://github.com/pytorch/pytorch/pull/147796 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147796#issuecomment-2683599896))
2025-02-26 00:36:08 +00:00
143f0f0006 Revert "[ca] side-effect free initial trace: compiled_args (#147804)"
This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67.

Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740))
2025-02-26 00:31:40 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
276dfe8150 [dynamo][cpp-guards] Disable dict-tag optim if the guard_manager has child accessors (#147694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147694
Approved by: https://github.com/isuruf
2025-02-26 00:02:08 +00:00
8e7e5ba182 Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759)
This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls.

The reason why I didn't land that was because in `legacy_load`, the constructor would be called before the storages of indices/values are set, so the tensor would not actually be validated.

Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process since afaict this was the first PR that added support for sparse tensor serialization https://github.com/pytorch/pytorch/pull/27062 and it already uses `_rebuild_sparse_tensor` (which would add the rebuilt tensor to the list to validate), but torch.sparse.FooTensor is allowlisted

This PR adds tensors constructed as such to the list to validate at the end of torch.load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759
Approved by: https://github.com/albanD
2025-02-25 23:51:12 +00:00
c82c1411c6 Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit e34c15a05b027b9da0962c971d448138fcf94926.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2683517851))
2025-02-25 23:28:15 +00:00
0633f63f0d [cutlass backend] try fix standlone runner test (#147811)
Differential Revision: [D70147859](https://our.internmc.facebook.com/intern/diff/D70147859/)

Trying to fix this test one last time, especially when mixed mm is getting removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147811
Approved by: https://github.com/chenyang78
2025-02-25 23:27:02 +00:00
05bc8fe62e Revert "follow up to #147548, fix regression on MI300 (#147878)"
This reverts commit cc444e75d540daff127f0210b7f8965a5c2b8d2a.

Reverted https://github.com/pytorch/pytorch/pull/147878 on behalf of https://github.com/wdvr due to temporary reverting to revert an older one in the stack ([comment](https://github.com/pytorch/pytorch/pull/147878#issuecomment-2683515567))
2025-02-25 23:25:59 +00:00
2df9a8d72d [Inductor][Tests] Update get_divisible_by_16 function in test_torchinductor.py to work correctly with new Triton (#147865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147865
Approved by: https://github.com/davidberard98
2025-02-25 23:14:13 +00:00
1e894d2635 Revert "Add option to limit number of SMs used by matmul kernels (#144974)"
This reverts commit af2d63637ed025789679a17c241e6bb466508a1d.

Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))
2025-02-25 22:46:38 +00:00
cc444e75d5 follow up to #147548, fix regression on MI300 (#147878)
Removing curly braces seemed superficial but broke MI300 rowwise matmul.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147878
Approved by: https://github.com/drisspg
2025-02-25 22:16:28 +00:00
a821d69d92 Fix register constant to be usable in export (#147533)
Differential Revision: [D69939737](https://our.internmc.facebook.com/intern/diff/D69939737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147533
Approved by: https://github.com/zou3519
2025-02-25 21:10:47 +00:00
0d31c621a3 Revert "[inductor][triton] Ignore block ptr advances for removed buffers (#147193)"
This reverts commit 17766b7aad0d9931bb6b3485fcf3d4c7532c3557.

Reverted https://github.com/pytorch/pytorch/pull/147193 on behalf of https://github.com/wdvr due to failing tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147193#issuecomment-2683286358))
2025-02-25 21:04:04 +00:00
6eb3d1e762 [DCP] Cache save plans in default planner (#147343)
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high level approach:
- Create the local plan and cache it.
- In the next iteration, compare the local plan with the cached plan metadata. If nothing changed, do not send that local plan in the collective.
- The global plan step will only create the global plan with the new delta plans, and empty plans for the cached ones.
- The finish plan step will check for empty plans. If a plan is empty, it will grab the cached plan; if not, it will use the new plan provided.

Test Plan: UTs

Differential Revision: D69224491

## How to enable the caching:
DefaultSavePlanner introduces the `enable_plan_caching` flag, which is set to False by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set this to True to enable caching; we should see a significant speedup in subsequent checkpoint save attempts, especially for larger-scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695
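
A rough sketch of how the flag could be used (a single-process save with placeholder paths and model; the exact call pattern may differ in real jobs):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

model = torch.nn.Linear(8, 8)
planner = DefaultSavePlanner(enable_plan_caching=True)  # new flag, off by default

for step in range(3):
    # after the first save, unchanged local plans should be served from the cache
    dcp.save(
        {"model": model.state_dict()},
        storage_writer=dcp.FileSystemWriter(f"/tmp/ckpt_{step}"),
        planner=planner,
    )
```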

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
2025-02-25 20:59:25 +00:00
8d921eb97f export method (#147573)
The `export` API takes an `nn.Module` and traces its `forward` method. However, sometimes it is useful to export different methods of an `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance.

This PR adds a couple of utils in `torch._export.utils` for this workflow.

The `wrap_method` util wraps a method as an `nn.Module` that can then be exported. See the included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state.

These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde.

Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573
Approved by: https://github.com/tugsbayasgalan
2025-02-25 20:58:54 +00:00
687fe64667 Fix crash in -[PTMCoreMLCompiler _compileModel:atPath:] (#147809)
Summary:
We could hit one of those exceptions:
https://github.com/apple/coremltools/blob/main/modelpackage/src/ModelPackage.cpp#L205-L225

And it would make this code path crash.

Test Plan: build.

Differential Revision: D70122378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147809
Approved by: https://github.com/mcr229
2025-02-25 20:56:16 +00:00
ec768d8dc0 [ca] side-effect free initial trace: compiled_args (#147804)
const methods to prevent accidental mutation. changes mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-25 20:38:51 +00:00
5758743f3c [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-25 20:38:51 +00:00
68ddca9449 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking saved activations until the AOT backward's execution.
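
For context, here is a minimal sketch of the kind of pack/unpack (saved tensor) hooks that this change lets CA execute progressively; the offload-to-CPU hooks below are just an illustration, not code from this PR:

```python
import torch

def pack_to_cpu(t):
    # Offload the saved activation to host memory at pack time.
    return t.to("cpu")

def unpack_from_cpu(t):
    # Reload it only when the backward node that needs it actually runs.
    return t.to("cuda")

x = torch.randn(8, 8, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x @ x).relu().sum()
y.backward()  # with this PR, CA can fire the unpack hooks progressively
```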

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-25 20:38:51 +00:00
adf0f4ffd2 [custom op] fix inductor cpp codegen when returning a list of single tensor (#147649)
For a custom op that returns a list of a single tensor with unbacked symint shape:
```python

@torch.library.custom_op(
    "aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={}
)
def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]:
    s = x.sum().to(torch.int64)
    return [torch.randn(s.item())]

@fn_ret_list_of_single_tensor.register_fake
def _(x):
    ctx = torch._custom_op.impl.get_ctx()
    i0 = ctx.new_dynamic_size()
    return [torch.randn(i0)]
```

Before the fix, we have the following error:
```
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14,
                 from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5,
                 from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3,
                 from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39,
                 from /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366:
/usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
 1145 |     constexpr const _Tp&& get(const variant<_Types...>&& __v)
      |                           ^~~
/usr/include/c++/11/variant:1145:27: note:   template argument deduction/substitution failed:
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649
Approved by: https://github.com/angelayi
ghstack dependencies: #147130
2025-02-25 20:28:41 +00:00
824474cb35 [cond] support output sizes mismatch in front end (#147130)
This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130
Approved by: https://github.com/zou3519
2025-02-25 20:28:41 +00:00
de80b6f0d3 Updated test_cuda.py to rerun tests (#147040)
Initially, test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw were skipped in this [commit](d4871750d9).

Pulled ROCm nightly image and verified these two tests run fine locally. Filed this PR to enable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-25 19:58:42 +00:00
361b6c97cd cpp_wrapper: Fixup output code indentation (#147215)
Closes #142165.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147215
Approved by: https://github.com/desertfire
ghstack dependencies: #146109, #146424
2025-02-25 19:50:37 +00:00
7c515b2da4 cpp_wrapper: fix test_torchinductor* tests (#146424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146424
Approved by: https://github.com/desertfire
ghstack dependencies: #146109
2025-02-25 19:50:37 +00:00
46d1422afd cpp_wrapper: fix inductor triton tests (#146109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109
Approved by: https://github.com/desertfire
2025-02-25 19:50:37 +00:00
9740d69e78 [logging] Add toplevel dynamo_compile / tlparse logging for AOTI (#147760)
Summary:
This adds the proper context managers in `compile_fx_aot` such that we get:
1) A toplevel chromium event (i.e., tlparse)
2) A single `dynamo_compile` log entry

Test Plan:
Before:
* Scuba (we only log the dynamo event): https://fburl.com/scuba/dynamo_compile/sandbox/gaqowzrd
* Perfetto trace: https://fburl.com/vol7r6w1

After:
* Scuba (we log the dynamo _and_ compile_fx_aot event): https://fburl.com/scuba/dynamo_compile/sandbox/cx2we8w8
* Perfetto trace (click on the toplevel event to see the additional metadata): https://fburl.com/sziy40r9

Differential Revision: D70113859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147760
Approved by: https://github.com/desertfire
2025-02-25 19:41:39 +00:00
14b9f7f7bc Remove link to search survey (#147751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147751
Approved by: https://github.com/malfet
2025-02-25 19:26:59 +00:00
17766b7aad [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation is no longer needed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 19:14:55 +00:00
ea6938a1f7 Add XuehaiPan to CODEOWNERS for C++ PyTree utilities (#137408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137408
Approved by: https://github.com/zou3519
2025-02-25 18:48:32 +00:00
580f1183b4 Enable ruff rule S324 (#147665)
Fixes #147627

- Add `S324` in `pyproject.toml `
- Running check and clean warnings

```bash
lintrunner --take RUFF --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
6061664266 Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620)
During ROCm runs these tests naturally show that the padding path would be slower for our archs, so pad_mm opts out of padding, which makes those tests fail.

The reasoning is that, per my understanding, those tests don't check IF the operation should be padded in the first place, but HOW it is padded and whether that is done correctly. Moreover, the tests shouldn't really be hardware dependent or carry hardware-specific conditions.

Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620
Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314
2025-02-25 18:06:48 +00:00
651e6aacf9 [ROCm] Remove benign warning about missing amdgpu.ids (#147791)
Fixes #144203.

We build a custom libdrm when preparing our docker image.  We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink.  Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250".  The libdrm warning is noisy, so we are removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791
Approved by: https://github.com/jeffdaily
2025-02-25 17:17:25 +00:00
e5a13410cd Fix the tiny doc descriptions (#147319)
As the title stated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147319
Approved by: https://github.com/zou3519
2025-02-25 17:10:16 +00:00
346bbefa63 [BE] Parameterize TestSDPA in test_mps.py (#147856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856
Approved by: https://github.com/Skylion007
2025-02-25 16:07:24 +00:00
810d2a3dbd [ARM] Fix bug in _ref_test_helper in test_ops and fix failing test on Aarch64 (#146597)
We have a failing unit test on Aarch64

```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

After debugging, I found that the `ex` variable is not being reset to None on each loop iteration inside `_ref_test_helper`. Fixing this highlighted another expectedFailure to re-enable, `nn.functional.hinge_embedding_loss`, which was incorrectly being skipped due to the same problem.

4a545eb85d/test/test_ops.py (L546)
The `ex` variable is not reset after this point before the next loop iteration.
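
A hypothetical sketch of the bug pattern being described (not the actual test code), assuming a loop over reference sample inputs:

```python
ex = None  # BUG: initialized once, outside the loop
for i, sample in enumerate(samples):
    try:
        run_reference_check(sample)   # hypothetical helper
    except Exception as e:
        ex = e                        # record and move on to the next sample
        continue
    if ex is not None:
        # A failure recorded for an *earlier* sample leaks into this
        # iteration, blaming a passing input and masking expectedFailures.
        raise Exception(f"Caused by reference input at index {i}") from ex
```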

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
2025-02-25 14:15:10 +00:00
a695aae89b [MPS] fix attention for >4d tensors (#147545)
Fixes #147443

and adds tests for >4d tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:55:28 +00:00
0b9da1ae0a [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806, #147807
2025-02-25 13:33:12 +00:00
cc1c9826d4 [AOTI][refactor] Fix a typo (#147807)
Summary: defination -> definition

Differential Revision: [D70146182](https://our.internmc.facebook.com/intern/diff/D70146182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147807
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806
2025-02-25 13:33:12 +00:00
7ed0670e21 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806)
Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680

Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806
Approved by: https://github.com/malfet
ghstack dependencies: #147805
2025-02-25 13:33:03 +00:00
2680e835c8 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805)
Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland https://github.com/pytorch/pytorch/pull/147679.

Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805
Approved by: https://github.com/malfet
2025-02-25 13:32:54 +00:00
7e37fb0a4c [MPS] faster integer matmul for mps (#147526)
There is a naive matmul kernel written for MPS which is used when the input types are integer (and in some other cases on older macOS versions). The old version is naive with global memory accesses, which really tanks performance, especially when the matrix is sufficiently large.

This PR optimizes it (there may be further optimizations using simdgroup matrices, which I'll cover in a follow-up since writing that kernel will take more time).

## Performance comparison on M1 Pro:
![performance_comparison](https://github.com/user-attachments/assets/6ea8de5a-8231-4c5b-8dc9-caa79ea6879a)

You can get these numbers by running this script with the old kernel compiled and then with the new kernel compiled (make sure to change the CSV file that each output is written to):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [32, 128, 512, 1024, 2048, 4096]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
        B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_int_mm(A_mps, B_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_int_mm(A_mps, B_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:15:18 +00:00
b63c601614 Update merge rules for oneDNN part (#147615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147615
Approved by: https://github.com/atalman
2025-02-25 11:26:59 +00:00
af2d63637e Add option to limit number of SMs used by matmul kernels (#144974)
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is a NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.
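
A rough sketch of how the carveout could be used from Python; the setter name below is a hypothetical placeholder, since this message doesn't spell out the exact API surface:

```python
import torch

# Hypothetical setter name: leave 8 SMs free for e.g. a background NCCL kernel.
torch._C._set_sm_carveout_experimental(8)

a = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
c = a @ b  # cuBLAS should now target (num_SMs - 8) SMs for this matmul
```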

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
2025-02-25 10:19:19 +00:00
94969d0a40 [inductor][user triton] Handle scf.yield more accurately (#147762)
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.

**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)

```python
        @triton.jit
        def branch_with_multiple_yield_args(
            in_ptr0,
            in_ptr1,
            out_ptr,
            conditional_ptr,
            n_elements,
            BLOCK_SIZE: "tl.constexpr",
        ):
            pid = tl.program_id(axis=0)
            block_start = pid * BLOCK_SIZE
            offsets = block_start + tl.arange(0, BLOCK_SIZE)
            mask = offsets < n_elements
            conditional = tl.load(conditional_ptr)
            if conditional:
                in0 = in_ptr0 + 1
                in1 = in_ptr1 + 1
                out = out_ptr + 1
            else:
                in0 = in_ptr0
                in1 = in_ptr1
                out = out_ptr
            x = tl.load(in0 + offsets, mask=mask)
            y = tl.load(in1 + offsets, mask=mask)
            tl.store(out + offsets, x + y, mask=mask)
```

The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters.  When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.

The if branch gets converted to TTIR like this:

```mlir
    %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
      ...
      scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
    } else {
      scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
    } loc(#loc7)
```

and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).

**This PR** we duplicate the `scf.yield` to add one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.

Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-02-25 08:41:00 +00:00
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
866dc45d3c [Inductor][ROCm][CK] Unhardedcoded kernel shapes for ck_conv_template codegen (#147504)
## [Inductor][ROCm][CK] Parameterize `ck_conv_template` Codegen

### Description
Previously, ROCm CK kernel codegen templates were hardcoded with fixed values for convolution parameters:

- `index_t GroupCount`
- `index_t NBatch`
- `index_t NOutChannels`
- `index_t NInChannels`
- `vector<index_t> FilterSize`
- `vector<index_t> InputSize`
- `vector<index_t> ConvolutionStrides`
- `vector<index_t> Dilations`
- `vector<index_t> LeftPads`
- `vector<index_t> RightPads`

This PR updates `ck_conv_template` to accept these parameters dynamically from Inductor. By doing so, we reduce the number of generated templates, improving flexibility and maintainability.

### Testing
- Verified correctness by running relevant test cases, i.e `test/inductor/test_ck_backend.py`
- Ensured generated kernels reflect the updated parameterization, i.e generated templates in `/tmp/torchinductor_root/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147504
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/tenpercent

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-02-25 07:48:07 +00:00
d73b927662 [DSD] Fixes issue when there is a PG without parameters (#147730)
Fixes https://github.com/pytorch/pytorch/issues/143828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147730
Approved by: https://github.com/mori360
2025-02-25 07:25:38 +00:00
fb73b0c7c5 Revert "use copy2d in h2d/d2h copy when possible (#146256)"
This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0.

Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))
2025-02-25 07:06:38 +00:00
bb7e8fbd66 [CacheBench] Add hf_T5 llama moco to cachebench (#147783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781, #147782
2025-02-25 04:34:45 +00:00
895564d6b6 [CacheBench] Add huggingface (#147782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781
2025-02-25 04:34:45 +00:00
c4fb6ae55d [CacheBench] Separate dynamic into its own option (#147781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780
2025-02-25 04:34:34 +00:00
60d4cbfc06 [CacheBench] Add repeat option so that we can have more accurate cache results (#147780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
ab3b814af3 [CacheBench] Add ciflow/trunk test (#147688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
4b7604ec10 Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup:

- prologue fusion is often unprofitable, even for int8 mm. because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, just not exactly as in the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare against a cuBLAS baseline, so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on H100). This should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-25 04:29:54 +00:00
890213f65f Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)"
This reverts commit 0b52d801d2297ad6c38e631eedfd4dead9360e1b.

Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))
2025-02-25 04:11:13 +00:00
9b06b30468 Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)"
This reverts commit 22fae0d948ac14c72b510fafc2283072d744dff9.

Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))
2025-02-25 04:06:40 +00:00
9478c90e2b [Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430)
Fixes #147208

**Summary**
The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs. This is because the TensorIterator-based implementation does not support multiple elements per byte. Also, `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch. So we add a check here to throw a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`.
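
A small illustration of the new behavior; constructing the sub-byte quantized input via `torch._empty_affine_quantized` is an assumption on my part, not taken from this message:

```python
import torch

# Assumed construction path for a torch.quint4x2 tensor.
x = torch._empty_affine_quantized(
    (4, 4), scale=1.0, zero_point=0, dtype=torch.quint4x2
)

try:
    torch.flip(x, dims=(0,))  # with this change: RuntimeError instead of memory corruption
except RuntimeError as e:
    print(e)
```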

**Test plan**
```
pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2025-02-25 03:47:40 +00:00
20295c017e Fix import of getArtifactLogger for ir_pre_fusion and ir_post_fusion (#147560)
Fixes #147002

There was an issue with the previous PR https://github.com/pytorch/pytorch/pull/147248 that didn't show up in CI,
where a logging import was not complete in torch/_inductor/debug.py before importing it.
This only happened if someone directly imported the file without doing any other imports before.

Also set to off_by_default by request to reduce log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147560
Approved by: https://github.com/Skylion007
2025-02-25 03:36:08 +00:00
e34c15a05b torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.
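
A rough usage sketch, assuming 1x32 blockwise scales along K passed via `scale_a`/`scale_b` in `torch.float8_e8m0fnu`; the shapes and keyword usage here are assumptions rather than details from this message:

```python
import torch

M, K, N = 128, 128, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand

# One e8m0 scale per 32 elements along K (assumed block size).
scale_a = torch.ones(M, K // 32, device="cuda", dtype=torch.float8_e8m0fnu)
scale_b = torch.ones(N, K // 32, device="cuda", dtype=torch.float8_e8m0fnu)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```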

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-25 03:32:22 +00:00
8f728e28dd Enable ASAN in CUDA tests (#147512)
It should work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147512
Approved by: https://github.com/soulitzer
2025-02-25 02:58:39 +00:00
b0fa92042b Fix torch.mean out dtype check (#147188)
**For CPU**:
Type promotion is supported for torch.mean

**For Meta**:
Not supported for torch.mean

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147188
Approved by: https://github.com/albanD
2025-02-25 02:50:03 +00:00
33ff96b3f9 cpp_builder: unbreak clang++ detection (#147775)
Fixes an issue where `_is_gcc` would match on `clang++` due to the string ending with `g++`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147775
Approved by: https://github.com/desertfire
2025-02-25 02:33:01 +00:00
dacdc9782b [Inductor] Add input value checking to randint meta function (#147191)
Fixes #147070

Add value checking for the range to the meta function, similar to what is done in the CUDA/CPU aten op.
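
A minimal sketch of the failure mode this guards against, assuming the new check mirrors the eager error for an empty range:

```python
import torch

def bad():
    # low (5) is not strictly less than high (5): eager already raises, and with
    # this change the meta function used under torch.compile raises as well.
    return torch.randint(5, 5, (3,))

try:
    torch.compile(bad)()
except Exception as e:
    print(type(e).__name__, e)
```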

Test with
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-25 02:18:16 +00:00
c644f4c5fe [Inductor] Fix the decompositions of torch isin (#147519)
**Summary**
Fixed two decomposition issues in `torch.isin`:

- Issue 1: As reported in [#147329](https://github.com/pytorch/pytorch/issues/147329), the current decomposition does not support cases where test_element is a scalar. This is now implemented by referring to the ead970c8d0/aten/src/ATen/native/TensorCompare.cpp (L1004-L1008)

- Issue 2: Found while enabling a unit test with `elements = 1` and `test_elements = torch.tensor([1, 2, 3, 4])`, where Inductor produced different results compared to eager mode. This issue is fixed by referring to ead970c8d0/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L329-L338). A generic repro sketch of both cases follows this list.
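
For reference, a generic repro sketch of the two cases above (not the exact unit test):

```python
import torch

def f(x):
    a = torch.isin(x, 2)                           # scalar test_elements (issue 1)
    b = torch.isin(1, torch.tensor([1, 2, 3, 4]))  # scalar elements (issue 2)
    return a, b

x = torch.tensor([1, 2, 3, 4])
eager = f(x)
compiled = torch.compile(f)(x)
assert all(torch.equal(e, c) for e, c in zip(eager, compiled))
```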

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_isin_tensor_scalar
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147519
Approved by: https://github.com/jgong5, https://github.com/FFFrog, https://github.com/peterbell10
2025-02-25 01:49:44 +00:00
2c8cd41c1f Delete unused conda-aws-upload environment (#147792)
As this environment only contains keys for Anaconda uploads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147792
Approved by: https://github.com/atalman
2025-02-25 01:42:52 +00:00
43074680b5 [ROCm] Add support for gfx1102 arch to wheel builds. (#147761)
[gfx1102 is not officially supported](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) but most ROCm libs have gfx1102 code objects available since ROCm 5.5.  Now that we're using `--offload-compress` we can fit another gfx target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147761
Approved by: https://github.com/jeffdaily
2025-02-25 01:35:52 +00:00
97557b9833 [Inductor] Update set_driver_to_gpu code to avoid backend re-initialization with new Triton (#147621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147621
Approved by: https://github.com/jansel
2025-02-25 00:04:54 +00:00
55bf3ff3a5 [Docs] Add OpDTypes.any_common_cpu_cuda_one (#147605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147605
Approved by: https://github.com/soulitzer
2025-02-24 23:23:43 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
a71d8b7246 Fix ReferenceError: weakly-referenced object no longer exists in cycle detector (#146922)
Summary: weakref.proxy objects will throw errors when they are dead. We just do not bother visualizing them. They are weak, so they aren't relevant to cycles anyway.

Differential Revision: D69270429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146922
Approved by: https://github.com/tianfengfrank, https://github.com/Chillee
2025-02-24 22:27:39 +00:00
22fae0d948 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)
Consolidate cpp compilation action to CppBuilder

Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680
Approved by: https://github.com/yushangdi, https://github.com/angelayi
ghstack dependencies: #147679
2025-02-24 21:45:15 +00:00
0b52d801d2 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)
The option really means to compile a cpp file using its basename instead of its full path.

Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679
Approved by: https://github.com/angelayi
2025-02-24 21:44:33 +00:00
96acb56626 [ROCm] Optimize the stride one indexing backwards kernel (#146420)
This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel across all the lanes in the warp, which means that the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: this optimization ensures that for a large number of duplicates we can skip 32 values at a time to speed up the count.
- for a low number of duplicates, i.e. fewer than `warp-size` duplicates, just perform the tail reduction, which avoids the wasteful parallel reduction across the warp in this case (it would only add zero values).
- for a high number of duplicates, i.e. more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` with a large number of duplicates (i.e. a duplicate spike). For this to happen we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-24 21:19:06 +00:00
89b9c12de8 remove prints from partitioner (#147749)
See c57894cd74..22d8f9a657 (r1968015955)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147749
Approved by: https://github.com/Skylion007, https://github.com/laithsakka
2025-02-24 21:03:45 +00:00
8eb400ef66 [BE] TCPStore: use typed errors for assertions (#147647)
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend  to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
2025-02-24 20:58:10 +00:00
19fd21fe7e [Inductor] Hot fix after #146917 (#147639)
This pull request reverts the changes to `torch/_inductor/ir.py` file that were added in #146917.

Where I tested, the changes came only from `torch/_inductor/codegen/cpp_wrapper_gpu.py`; it turns out that the changes in the `torch/_inductor/ir.py` file are not really needed. So it's my fault: I didn't sync the environments (between several machines) correctly.

@davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639
Approved by: https://github.com/etaf, https://github.com/davidberard98
2025-02-24 20:34:48 +00:00
754fb834db [BE][CI] bump ruff to 0.9.0: string quote styles (#144569)
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting

- Change the outer quotes to double quotes for nested f-strings

```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```

- Change the inner quotes to double quotes for triple f-strings

```diff
  string = """
-     {', '.join(args)}
+     {", ".join(args)}
  """
```

- Join implicitly concatenated strings

```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
52f6d4aa30 [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146509
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-24 19:56:08 +00:00
9605c5063b [ROCm][TunableOp] Speed-up matmul_small_brute_force_tunableop unit test (#147659)
This PR has a UT speed-up and some refactoring of tests.

A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed this matmul_small_brute_force_tunableop for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR *reduces* the execution time by 20+ minutes.

We also move a hipBLASLt version check to a different tunableop UT for simplicity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
2025-02-24 19:44:38 +00:00
69c4f6ff13 [Minor] Fix minor mistake in docstring of replace_pattern (#147611)
Fixes #147610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147611
Approved by: https://github.com/soulitzer
2025-02-24 19:33:44 +00:00
b9b1fd9b93 [Intel GPU] qlinear.pointwise with mixed dtype support (#136753)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qlinear_pointwise` op. With this PR, we allow `qlinear` kernels to output a BF16 Tensor, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_int8_mixed_bf16_xpu \
    -k test_qlinear_relu_int8_mixed_bf16_xpu \
    -k test_qlinear_add_int8_mixed_bf16_xpu
```

# Runtime exemplification
```bash
#qlinear+bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277
```
As shown in the oneDNN verbose output, the attribute `dst_bf16::blocked:ab::f0` demonstrates that we can successfully output a bf16 tensor from the int8 gemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
075b91bef1 [Intel GPU] qconv.pointwise with mixed dtype XPU support (#135465)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qconv_pointwise` op. With this PR, we allow `qconv` kernels to output a BF16 Tensor, rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qconv2d_int8_mixed_bf16_xpu \
    -k test_qconv2d_relu_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \
    -k test_qconv2d_silu_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_relu_int8_mixed_bf16_xpu
```

# Runtime verification
```bash
#qconv + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551
# qconv_silu + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379
# qconv_hardswish + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848
```
The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is computed as bf16 successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135465
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
ffa19b9024 [ROCm][Windows] Fix unrecognized constexpr std::memcpy for HIP-clang (#147316)
Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 STL implementation, the HIP clang compiler on Windows cannot evaluate the following memcpy as one that can be resolved at compile time. To resolve this, `__builtin_memcpy` is used instead, which doesn't have this limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316
Approved by: https://github.com/jeffdaily
2025-02-24 18:28:59 +00:00
900a774781 Revert "[ROCm] Update periodic.yml to use 2GPU runners (#146839)"
This reverts commit b6273d7f4ba4fbb126eb96816287641ca1e4efc6.

Reverted https://github.com/pytorch/pytorch/pull/146839 on behalf of https://github.com/jithunnair-amd due to This change is not needed anymore since our 4-GPU runners are back online and stable so far ([comment](https://github.com/pytorch/pytorch/pull/146839#issuecomment-2679145448))
2025-02-24 17:17:58 +00:00
cde12207a0 [Intel GPU] Add SDPA implementation on XPU with OneDNN (#147612)
Add XPU implementation of OneDNN based SDPA operator. Will be integrated and enabled later.

Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147612
Approved by: https://github.com/EikanWang
2025-02-24 16:12:04 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR is to upgrade submodule oneDNN to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
80d3afc698 [inductor] Improve type annotations in _inductor/pattern_matcher.py (#146626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146626
Approved by: https://github.com/Skylion007
2025-02-24 14:30:35 +00:00
d0f08dc3eb Update slow tests (#147728)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147728
Approved by: https://github.com/pytorchbot
2025-02-24 11:48:19 +00:00
cba14212e6 [FX] micro-optimization map_aggregate(immutable_dict) (#147691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147691
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #147699, #144640
2025-02-24 09:14:08 +00:00
a50af71fb6 [FX] Refactor immutable collections implementation (#144640)
Get rid of dynamic class creation via `type(name, bases, ...)`. Convert it to classic static class definition for better readability and static analysis support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144640
Approved by: https://github.com/jansel
ghstack dependencies: #147699
2025-02-24 09:14:08 +00:00
dc9a03d30c [Window] Fix invalid file path on windows. (#147708)
This PR aims to fix this invalid path on Windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock`
Windows does not allow chars `\ / : * ? " < > |` in a path.

This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because `os.replace` allows the target file to exist on Windows while `os.rename` does not (see the table and the sketch below).
| Function                        | `os.rename()`     | `os.replace()`   |
|---------------------------------|-------------------|------------------|
| Rename a file                   | Yes               | Yes              |
| Move a file                     | Yes               | Yes              |
| Overwrite an existing file      | Error on Windows  | Will overwrite   |
| Overwrite an existing directory | Error on Windows  | Error on Windows |
| Move across disks               |                   |                  |
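
A small illustration of the difference (a generic sketch, not code from this PR):

```python
import os
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "new_state.pkl")
dst = os.path.join(d, "code_state.pkl")
for p in (src, dst):
    with open(p, "w") as f:
        f.write("data")

# os.rename(src, dst) raises FileExistsError on Windows because dst exists;
# os.replace overwrites the destination atomically on all platforms.
os.replace(src, dst)
```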

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708
Approved by: https://github.com/jansel
2025-02-24 08:31:11 +00:00
5b6ad682bc Revert "[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)"
This reverts commit 85ea67983421acc30ccc76f7a159042e75c6ea08.

Reverted https://github.com/pytorch/pytorch/pull/147254 on behalf of https://github.com/jeanschmidt due to introduced reds on main ([comment](https://github.com/pytorch/pytorch/pull/147254#issuecomment-2677700862))
2025-02-24 08:20:16 +00:00
8d618f3da7 [AOTI][XPU] Suppress multi-line comment warning for XPU. (#147710)
This PR aims to suppress the multi-line comment warning in the SYCL header when building the Inductor cpp_wrapper.
```
/intel/oneapi/compiler/2025.0/include/sycl/detail/builtins/builtins.hpp:235:1: warning: multi-line comment [-Wcomment]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147710
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-02-24 07:28:59 +00:00
cee03b7746 [Inductor] Update should_decompose_mm condition for CPU (#147673)
Summary:
Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 16
```
We have a new case where `mat2.shape[1] = 304`, and benchmarking shows that it is beneficial to decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 512
```

Differential Revision: D70033166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673
Approved by: https://github.com/houseroad
2025-02-24 05:51:50 +00:00
8b65dbad13 [MPS/Inductor] Add support for xlog1py. (#147709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147709
Approved by: https://github.com/jansel
2025-02-24 05:28:52 +00:00
baccadb2f1 xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431)
Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131

Previously `torch.xpu.get_arch_list()` was relaxed to work even if no XPU device is available. However, we overlooked the case where PyTorch is not compiled with XPU support; in that case the function throws an exception. This commit adjusts the behavior and makes the function return `[]` even if PyTorch is not compiled with XPU support.
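
A minimal illustration of the new behavior on a build without XPU support:

```python
import torch

# On a PyTorch build compiled without XPU support this now returns []
# instead of raising an exception.
print(torch.xpu.get_arch_list())
```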

CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-02-24 01:35:54 +00:00
7c52ef2424 Add XPU to is_compile_supported to support roi_align op in torchvision (#147541)
Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264.

To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices.

The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
2025-02-24 01:32:36 +00:00
4e934ee5a7 [MPS] Add eager support for xlog1py. (#147687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687
Approved by: https://github.com/malfet
2025-02-24 01:23:59 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
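
A minimal sketch of the single-knob configuration described above; the specific size values are just an example:

```python
import os

# One variable now governs both cuBLAS and cuBLASLt workspace sizes.
# Set it before the first cuBLAS/cuBLASLt call in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch  # imported after setting the variable
```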

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
b5d7aefa57 [BE] add missing overload annotations for tree_map_only (#147699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147699
Approved by: https://github.com/Skylion007
2025-02-23 20:21:07 +00:00
f47573f70d Add super().setUp() to some test cases (#147651)
I saw that their disable issues were getting spammed with comments, meaning these tests were still running in CI despite having a disable issue. The test cases were missing the super().setUp() call, so I added it so that the disable-issue check runs for them.
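
A sketch of the pattern being added (the test class name here is hypothetical):

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class MyExampleTest(TestCase):  # hypothetical test class, for illustration only
    def setUp(self):
        super().setUp()  # lets the shared TestCase machinery skip tests with an open disable issue
        self.value = 1

    def test_something(self):
        self.assertEqual(self.value, 1)

if __name__ == "__main__":
    run_tests()
```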

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651
Approved by: https://github.com/huydhn
2025-02-23 18:21:17 +00:00
f03e7f3801 [MPS] Workaround rng bug for 5D tensors (#147667)
For some reason MPSGraph returns repeated values if the tensor dimension is larger than 4, which can be clearly seen by running the following
```swift
import Metal
import MetalPerformanceShadersGraph

func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) {
  let graph = MPSGraph()
  var dims = Array(repeating: 1, count: ndim)
  dims[0] = nelem
  let shape = dims.map { NSNumber(value: $0) }
  let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil)
  let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer])
}

func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) {
  let buf_data = buf.contents().assumingMemoryBound(to: Float.self)
  print(prefix)
  for i in 0..<nelem {
      print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n")
  }
}

guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
print("Using device \(device.name)")

let nelem = 2
guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4)
printBuf("4D uniform", buf: buf, nelem: nelem)

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5)
printBuf("5D uniform", buf: buf, nelem: nelem)
```

Workaround: flatten the tensor if it's contiguous.
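
A Python-side sketch of the affected case (assumes a machine with MPS available):

```python
import torch

if torch.backends.mps.is_available():
    # 5D tensors hit the MPSGraph issue described above; the workaround samples
    # through a flattened view when the tensor is contiguous.
    x = torch.rand(2, 1, 1, 1, 1, device="mps")
    print(x.flatten())
```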

Fixes https://github.com/pytorch/pytorch/issues/147624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667
Approved by: https://github.com/dcci
2025-02-23 16:52:01 +00:00
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797e1f3e8cc48ce2f45e4f6985132fb19.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
d0adff761e Propagate AttributeError to user code in user_defined.py (#146497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146497
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146496
2025-02-23 01:18:28 +00:00
8c761ac7e3 Handle is/is not (#146496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146496
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-23 01:18:28 +00:00
b084635735 [MPS/inductor] Adjust more tests that depends on non-divisible input sizes (#147681)
Also adjust a comment while I'm at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147681
Approved by: https://github.com/jansel
2025-02-23 00:33:26 +00:00
6a5e3917a7 [MPS] Add inductor support for spherical_bessel_j0. (#147650)
Counterpart to my previous patch that added support for the op in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147650
Approved by: https://github.com/jansel
2025-02-23 00:32:36 +00:00
f9c117f859 [mps/inductor] XFAIL adaptive_avg_pool_with_output_size_0. (#147676)
Non-divisible input sizes are not implemented on the MPS device yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147676
Approved by: https://github.com/malfet
2025-02-22 20:17:33 +00:00
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
85ea679834 [TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)
[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)

Summary:

# context
* more details in the [post](https://fb.workplace.com/groups/1075192433118967/permalink/1587079018596970/)
* disable contextlib with PT2

Test Plan:
* run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+dynamo,+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_ultra_mini training.pipeline_type=pt2 data_loader.dataset.table_ds=[2024-12-02] 2>&1 | tee -a output.log
```
* old tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYYAS3o/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
* new tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUJhCGZ/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Reviewed By: Microve

Differential Revision: D68480678
2025-02-22 18:57:55 +01:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
bea72180ed Revert "[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)"
This reverts commit e7bf490c430ac5a70ccb7ab8e954d3386fd29413.

Reverted https://github.com/pytorch/pytorch/pull/144572 on behalf of https://github.com/jeanschmidt due to Broke internal signals, D69994027, I'll find someone to help get this change merged ([comment](https://github.com/pytorch/pytorch/pull/144572#issuecomment-2676314308))
2025-02-22 17:19:38 +00:00
3409cbd177 Revert "Delete Mixed MM Special Casing (#147151)"
This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52.

Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))
2025-02-22 17:14:32 +00:00
72b4f35cb5 [CI] Reduce the AOT target list to reduce build time (#147601)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147601
Approved by: https://github.com/atalman
2025-02-22 14:43:26 +00:00
3cc3d7e08f Also support non-contiguous activation for torch._weight_int8pack_mm on CPU (#147588)
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, for batch size 2 and above, an assertion is hit because non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.

### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it is contiguous in the last dimension, and remove the assertion that requires contiguous activation.
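
An illustrative sketch of the newly accepted layout (shapes are hypothetical):

```python
import torch

hidden = torch.randn(2, 5, 8, dtype=torch.bfloat16)  # [batch, seq, hidden]
a = hidden[:, -1, :]        # last-token activation, as in an LM head
print(a.is_contiguous())    # False: non-contiguous overall
print(a.stride(-1) == 1)    # True: contiguous in the last dimension,
                            # which is the case this change now supports
```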

### Alternative solutions considered
Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code.

Another aspect of this issue: is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? We would need some data points to analyze that, although I think the performance should be good enough with this patch, since the first cache lines of the rows of A are already explicitly prefetched in the existing code (and this patch also avoids the copy that a `contiguous` call would do).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
2025-02-22 08:29:07 +00:00
e1bf892d90 [DDP] Temporarily disable comm mem (#147663)
For fear that it incurs slightly more memory usage and causes some applications with a tight memory margin to OOM
(because the comm mem pool is a pool separate from the regular pool?)

Differential Revision: [D70026681](https://our.internmc.facebook.com/intern/diff/D70026681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147663
Approved by: https://github.com/d4l3k
2025-02-22 05:55:43 +00:00
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP585 enforcement in RUFF.

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
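
For reference, a small illustrative example of the style UP006 enforces (not taken from the PR itself):

```python
# Before (PEP 484 style, now flagged by UP006):
#   from typing import Dict, List
#   def bucket(xs: List[int]) -> Dict[str, List[int]]: ...

# After (PEP 585 style, using builtin generics):
def bucket(xs: list[int]) -> dict[str, list[int]]:
    out: dict[str, list[int]] = {}
    for x in xs:
        out.setdefault("even" if x % 2 == 0 else "odd", []).append(x)
    return out
```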

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
77d2780657 Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549)
running python test/strobelight/examples/compile_time_profile_example.py
```
strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com
strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i

strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0
strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996
strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running
strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds
strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e
strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #147547
2025-02-22 03:44:53 +00:00
fc095a885c move _strobelight/example to avoid graph breaks (#147547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147547
Approved by: https://github.com/bobrenjc93
2025-02-22 03:44:53 +00:00
fecd3f7ecb [ROCm] change is_hip_clang() to always return True (#147646)
hipify is replacing kernel launches (<<< >>>) with the hipLaunchKernelGGL() macro; this is a regression caused by /opt/rocm/hip/.hipinfo no longer existing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147646
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-02-22 03:26:55 +00:00
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixes all the Inductor UT failures for the XPU backend on Windows that we found on a local machine. (Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
2d433cf1ad [Inductor UT][Windows][XPU] Enable Inductor UT on XPU Windows. (#147347)
This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows.
Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-02-22 02:53:16 +00:00
84fcf1bb11 constexpr all the things in irange.h (#147633)
I got complaints while irangeifying some files in ExecuTorch
that irange could not be used in a constexpr function. This made the
complaints go away.

I added a constexpr function in irange_test that used to fail to build
with `error: variable of non-literal type 'iterator' (aka
'integer_iterator<int, true>') cannot be defined in a constexpr
function before C++23` and now builds fine.

Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633
Approved by: https://github.com/albanD
2025-02-22 01:51:51 +00:00
6e0b09728a [export] Remove report from draft-export output (#147558)
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D69689154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
2025-02-22 00:54:29 +00:00
1c334893dc [CacheBench] Refactor code to prepare for mode benchmarks (#147641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00
5d26b7108f [PP] Remove extra code and docs BE (#147636)
current docs:
<img width="746" alt="image" src="https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636
Approved by: https://github.com/awgu
2025-02-22 00:10:31 +00:00
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
b1a81a4a65 Don't use '-e' when installing Triton (#147228)
Currently the install_triton.sh script uses "pip install -e ." to install Triton.
Using -e is sometimes appropriate for development work but is less appropriate for delivery.
To make matters worse, the behavior of -e varies depending on the version of pip involved.

This PR removes the -e and installs Triton normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-21 23:00:12 +00:00
995b125cdd [CI] Build sm89 with more procs experiment (#147487)
Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge. Currently it's set to 2 because it would OOM, but I'm curious how often people's builds OOM. I can't test this on my own because of caching, so it has to run on pull requests.

This might result in a failing job on many people's PRs, and I'm not sure how to get around it. I named it stable to make it automatically get sorted into the stable group for Dr. CI, but it'll still show up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487
Approved by: https://github.com/huydhn
2025-02-21 22:07:00 +00:00
7c8c82cd64 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 22:05:00 +00:00
698f6f9fae specify only some dimensions in shapes collection (#147534)
Differential Revision: [D69936316](https://our.internmc.facebook.com/intern/diff/D69936316/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147534
Approved by: https://github.com/bobrenjc93
2025-02-21 22:02:42 +00:00
2fb9416e6f [inductor][cpu] Move VNNI weight packing into AMX GEMM kernel for contiguous BMM weights (#146843)
Currently, the bfloat16 microkernel that uses AMX vectorization requires the weights to be in an interleaved VNNI format. For GEMM code, this hasn't been an issue because GEMM currently only supports constant weights, so the VNNI weight packing is done at compile time and saved as a constant tensor in the graph. But for BMM ops, where weights are not required to be constant, the current code does an expensive reshape/VNNI packing for all BMM weights.

This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-21 21:46:00 +00:00
d91be786cb [cutlass backend] clear_on_fresh_inductor_cache when generatings cutlass ops (#147586)
Differential Revision: [D69966732](https://our.internmc.facebook.com/intern/diff/D69966732/)

This is needed if we want to generate cutlass ops with different instantiation level in one session.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147586
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-21 21:28:41 +00:00
ef6b16ea9d Revert "[trymerge] Post initial starting merge comment on stacked PRs (#147028)"
This reverts commit 0295aabf6071c7da62325e6a29e04ed09a3e34ef.

Reverted https://github.com/pytorch/pytorch/pull/147028 on behalf of https://github.com/clee2000 due to I think this broke merge for non ghstack prs ([comment](https://github.com/pytorch/pytorch/pull/147028#issuecomment-2675532017))
2025-02-21 21:02:19 +00:00
05e6f15966 Revert "[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)"
This reverts commit e758d8b4d1632ea765bf8bc8e87b6039ae708b9f.

Reverted https://github.com/pytorch/pytorch/pull/147395 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D69890757 - servicelab_benchmark_pyper_local_runner, @eellison please help the author get this change landed ([comment](https://github.com/pytorch/pytorch/pull/147395#issuecomment-2675521966))
2025-02-21 20:56:40 +00:00
6eb795c9e8 [associative_scan] compile backend change to "eager" (#146973)
This PR fixes some issues with torch export discussed here: https://github.com/pytorch/pytorch/pull/140043#discussion_r1941932960

However, this backend change does still not resolve the failure for specific shapes mentioned here: https://github.com/pytorch/pytorch/issues/137943#issuecomment-2649564994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146973
Approved by: https://github.com/ydwu4
2025-02-21 20:21:41 +00:00
5ed1e23e3a Fix type stubs for SymmetricMemory (#146310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146310
Approved by: https://github.com/yifuwang
2025-02-21 19:59:43 +00:00
fd8ae1aa04 [ROCm] gfx940 and gfx941 cleanup (#147394)
Removing gfx architectures not supported by ROCm.

NOTE: For users wanting to build PyTorch for gfx archs that are *not* supported by the official wheels on download.pytorch.org, you can build PyTorch from source for your desired gfx arch [using the PYTORCH_ROCM_ARCH env var](https://github.com/pytorch/pytorch/blob/main/README.md#amd-rocm-support).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147394
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-21 19:42:12 +00:00
c0ee62573a [Easy][optim] Add LBFGS params optional desc (#147579)
The [LBFGS docs](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS) are missing the `optional` description for params, in contrast with other optimizer docs like [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html).

## Test Result

### Before

![image](https://github.com/user-attachments/assets/34877490-16b4-4c68-bf6c-405bae563352)

### After

![image](https://github.com/user-attachments/assets/7fba94c8-7091-47b8-bdf1-ca7d779a027f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147579
Approved by: https://github.com/janeyx99
2025-02-21 19:38:10 +00:00
b5c3bb6185 Add continuous run for cachebench (#147546)
This PR adds a continuous run for cache bench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147546
Approved by: https://github.com/huydhn
ghstack dependencies: #147537
2025-02-21 19:02:17 +00:00
76ce194b8e For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148)
This PR also fixes BMM, which was silently failing for a while.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148
Approved by: https://github.com/eellison
2025-02-21 18:41:52 +00:00
0295aabf60 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 18:05:05 +00:00
2190ca7f47 Use __qualname__ in add_safe_globals and update Unpickling error raised for Unsupported GLOBAL (#146815)
- Fixes #146814

Change
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__name__
```
to

```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__qualname__
```
for avoiding same key string overwrite.
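
A short illustration of why `__qualname__` avoids the key collision for nested classes:

```python
class OuterA:
    class Inner:
        pass

class OuterB:
    class Inner:
        pass

# __name__ is just "Inner" for both, so the two classes would map to the same
# key string; __qualname__ keeps them distinct.
print(OuterA.Inner.__name__, OuterB.Inner.__name__)          # Inner Inner
print(OuterA.Inner.__qualname__, OuterB.Inner.__qualname__)  # OuterA.Inner OuterB.Inner
```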

A test is also added.
```
python test/test_serialization.py TestSerialization.test_serialization_nested_class
```

- Fixes #146886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815
Approved by: https://github.com/mikaylagawarecki
2025-02-21 18:04:59 +00:00
f4e4cfcb91 [caffe2] Ignore compiler option when building using clang (#147556)
Summary:
Skip adding unrecognized option optimize("-fno-tree-loop-vectorize") when building using clang

This piece of code began to be compiled after armv9a has been set as default compilation profile

Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee aimp_697790515.log

Reviewed By: andrewjcg

Differential Revision: D69947027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556
Approved by: https://github.com/janeyx99
2025-02-21 17:46:04 +00:00
a0c7d96028 [Easy] Add Delimeter To Show Where Allocation Addr Begins (#147461)
Summary: When we print an addr we append an "s" or a "b" to the beginning of it. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up.

Test Plan: CI

Differential Revision: D69828538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461
Approved by: https://github.com/zdevito
2025-02-21 17:19:53 +00:00
784f64bb05 [inductor] triton support port-#5512, update cpp wrapper for gpu (#146917)
In short, this pull request enhances `constexprs` expression filtering.

Note: I tested the changes on the XPU backend.

Part of https://github.com/pytorch/pytorch/issues/144103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146917
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/davidberard98, https://github.com/YUNQIUGUO
2025-02-21 17:10:53 +00:00
6a6de0e09d better error message (#147532)
Differential Revision: [D69939736](https://our.internmc.facebook.com/intern/diff/D69939736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147532
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-02-21 17:08:47 +00:00
a8ce4d1846 Add cachebench (#147537)
This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching.
```
python benchmarks/dynamo/cachebench.py --output="result.json"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537
Approved by: https://github.com/jamesjwu
2025-02-21 17:06:45 +00:00
af1072ffb6 [Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608)
In preparation for enabling OneDNN-based XPU SDPA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-02-21 16:12:30 +00:00
d6bb1d7f0a Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup:

- prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, though not currently in exactly the same form as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare against a cublas baseline, and I found that it makes things slower (35% slower on a not particularly big 1024x1024x1024 mm shape on H100), so this should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-21 16:02:40 +00:00
36c461af95 Support SymmetricMemory's signaling kernels on sm60 and sm70 (#146308)
By leveraging libcudacxx's utilities: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/atomic_ref.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146308
Approved by: https://github.com/yifuwang
2025-02-21 15:29:02 +00:00
7ce4974e50 Fix PEP585 update (#147536)
Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable.  For now revert that change until it can be dealt with on its own.

Test Plan:
failures from D69920347 pass locally
unit tests pass

Reviewed By: oulgen

Differential Revision: D69936518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536
Approved by: https://github.com/jeanschmidt
2025-02-21 14:37:03 +00:00
654f2666d9 Increase memory for linux binary builds (#147542)
Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from a `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`.

This change depends on https://github.com/pytorch/test-infra/pull/6316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-21 14:15:40 +00:00
51748a5d1a Update OpenBLAS to 0.3.29 (#144857)
* Improvements for GEMM to GEMV kernels
 * Improvements for SVE kernels for SGEMV and DGEMV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144857
Approved by: https://github.com/malfet
2025-02-21 10:07:06 +00:00
71d2827eeb Code Refactoring for getting start and stride from global ranks (#147230)
Summary: Code Refactoring for getting start and stride from global ranks, this function can be used in different collective backend.

Differential Revision: D69555405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230
Approved by: https://github.com/kwen2501
2025-02-21 10:02:50 +00:00
e7bf490c43 [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the usage of new RNN descriptor for MIOpen backend that takes into account dropout rate value using dropout descriptor. This fixes associated test_RNN_dropout_state test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 10:01:27 +00:00
cffe7183f1 [cutlass backend] Fix standalone runner test after swizzle became a runtime parameter (#147554)
Differential Revision: [D69945114](https://our.internmc.facebook.com/intern/diff/D69945114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147554
Approved by: https://github.com/mlazos
2025-02-21 09:27:44 +00:00
cyy
b61a556427 Turn onnx functions into static (#147598)
To avoid exposing ONNX symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598
Approved by: https://github.com/justinchuby
2025-02-21 07:40:28 +00:00
3395da7f7c Revert "Build a storage reader/writer to write checkpoints in HF format (#146352)"
This reverts commit c615b8c174c80936d365d19d8b8f4d9ad9a195f3.

Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))
2025-02-21 07:30:52 +00:00
e5da9df421 Revert "Increase memory for linux binary builds (#147542)"
This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830.

Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))
2025-02-21 07:14:57 +00:00
4986f0f52e [PT2]: allow empty dict to pass type check (#147167) (#147480)
Summary:

Seeing errors like the following when testing sigmoid for inline_cvr and perevent_cvr models.
```
terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let empty dict pass type check.

Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
    --loadMode=BenchmarkAB \
    --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
    --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
    --moduleName=${module} \
    --submodToDevice "" \
    --benchmarkDontRebatchSamples=true \
    --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```

Reviewed By: yjhao

Differential Revision: D69871393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
2025-02-21 07:00:46 +00:00
c74b59fc1f [ROCm][TunableOp] resolve the rocBLAS version dynamically (#147363)
Dynamically gets rocBLAS version instead of relying on some preprocessing-time definitions which may be stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147363
Approved by: https://github.com/pruthvistony, https://github.com/naromero77amd, https://github.com/jeffdaily
2025-02-21 06:50:21 +00:00
86ae672b6a Use has_triton_package in _inductor.runtime.hints (#147442)
Fixes #ISSUE_NUMBER
Use existing method for triton check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442
Approved by: https://github.com/Skylion007
2025-02-21 05:52:00 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
a2c3a2c5c4 Support serialization for uintx/intx in weights_only (#147500)
Summary:
Fixing the issue reported by huggingface

Test Plan:
python test/test_serialization.py -k test_serialization_uintx_intx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500
Approved by: https://github.com/mikaylagawarecki
2025-02-21 04:38:44 +00:00
c615b8c174 Build a storage reader/writer to write checkpoints in HF format (#146352)
Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage

N6476188 --> able to save and load tensor in hf format

Differential Revision: D68444967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
2025-02-21 03:31:21 +00:00
fe100c3c5b Add libtorch nightly build for CUDA 12.8 (#146265)
Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error

"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note.

Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support.

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265
Approved by: https://github.com/atalman
2025-02-21 03:04:06 +00:00
ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe: if the store is somehow not up and the envs are set, the TCPStore client connections will fail to connect, so we end up with a slightly different error message, but the success/failure behavior is identical.

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
8a5265cb37 [Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337)
# Motivation
This PR intends to enable the quantized fusion `qlinear+add` on the Intel GPU backend.

At the backend level, we register the op via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are the ones already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for the pattern matching, we largely reuse the existing code from x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_add_xpu
```

# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose log above is collected from the UT. We can see the attribute `attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu`: the post-add and ReLU are successfully fused into the GEMM computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-21 02:09:28 +00:00
8b818ab58f Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353)
Fixes #147303.
Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353
Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch
2025-02-21 01:33:33 +00:00
ac88a6c00d [fx] demote node prepend to self log from warning to debug (#147538)
FIXES https://github.com/pytorch/pytorch/issues/147175

This is harmless, not sure why this is a user warning. Writing reordering graph passes is more concise when we ignore this warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538
Approved by: https://github.com/yanboliang
2025-02-21 01:32:34 +00:00
4b35139a46 [ROCm][TunableOp] Fix TunableOp warmup environment variable. (#147412)
This PR corrects the behavior of the TunableOp warmup variables:
```
PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
```

See the updated comments, which describe how the environment variables are intended to work. Previously, if you only set one of the two environment variables, the warmup iters would always be zero.
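
A minimal sketch of setting both knobs together; the numeric values are arbitrary examples:

```python
import os

# With this fix, setting only one of the two variables no longer forces the
# warmup iteration count to zero.
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS"] = "100"
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS"] = "30"
```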

Manually tested the four possible combinations to make sure things still behave as intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412
Approved by: https://github.com/jeffdaily
2025-02-21 00:29:58 +00:00
fdb1305ace reland "[sigmoid] Test OSS model runner with test_export.py" (#147535)
Summary: There are ~260 tests for all the corner cases of export in test_export.py; utilizing them to test sigmoid in the OSS setting.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid

Differential Revision: D69937387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
2025-02-20 23:45:13 +00:00
87e6e2924e Increase memory for linux binary builds (#147542)
Recently I detected that some linux manywheels builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not detect issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from a `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-20 23:02:45 +00:00
be0df96b50 Fix c++ implementation of strip_function_call (#147436)
#143063 was missing handling a couple UCS cases as well as had some bugs in the way it dealt with errors.

- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
2025-02-20 20:41:21 +00:00
af31640391 [cutlass backend] enable mixed mm test (cutlass2x) for H100 (#147474)
I am okay with not landing this as well. The motivation is to make developing on H100 smoother.

The reason the current test works on A100 but not H100 is an alignment issue, which was caused by arch-specific filtering logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147474
Approved by: https://github.com/alexsamardzic, https://github.com/ColinPeppler
2025-02-20 20:28:44 +00:00
d068141c3b [cutlass backend] add subproc tests (#147173)
I want to separate subproc autotuning from the main tests. And I observed that for addmm, it can work without subproc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147173
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #147169
2025-02-20 20:07:42 +00:00
2565951f8a [cutlass backend] remove triton from most tests and add an integration test (#147169)
Removing aten and triton from the list of backends for the tests that have it. Instead, add a small integration test to make sure autotuning works fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147169
Approved by: https://github.com/ColinPeppler
2025-02-20 20:07:42 +00:00
fb1f7f6a09 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/native/miopen/Conv_miopen.cpp +1 (#147496)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69755123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496
Approved by: https://github.com/Skylion007
2025-02-20 19:00:38 +00:00
6971b77510 [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935)
Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops.
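
A hedged sketch of the device-agnostic usage this enables; the exact signatures are assumed here to mirror the CUDA `Stream` API rather than taken from the PR:

```python
import torch

stream = torch.cpu.Stream()
event = torch.cpu.Event()
stream.record_event(event)  # no-op on CPU after this change
stream.wait_event(event)    # no-op on CPU after this change
```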

Test Plan: CI

Differential Revision: D68833927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935
Approved by: https://github.com/Skylion007
2025-02-20 18:50:55 +00:00
863ac20659 [CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484)
Do not overwrite the return code of a single file when it fails.  This will allow the log to be printed to stdout and the gha logs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484
Approved by: https://github.com/ZainRizvi
2025-02-20 17:51:58 +00:00
83bb921a5a [ROCm] Update meta_registration for efficient attention (#146979)
Fixes a series of failing and skipped unit tests.

For NVIDIA hardware, the logsumexp last dimension is required to be a multiple of 32. This is not the case for ROCm.

A related issue: https://github.com/pytorch/pytorch/issues/146848

The unit tests in question:
```bash
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_6_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_6_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979
Approved by: https://github.com/shunting314
2025-02-20 15:05:13 +00:00
382fbcc1e4 add the torch.float8_e8m0fnu dtype to PyTorch (#147466)
Summary:

Continuing the work from https://github.com/pytorch/pytorch/pull/146427

Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format.  Example of basic functionality:

```python
import torch

# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu)  # RNE rounding
x2 = x1.to(torch.float32)  # 2 ** exponent

# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)

# printing
print(x0)
```

Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works

For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)

Test Plan:

```
pytest test/quantization/core/experimental/test_float8.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
2025-02-20 13:55:42 +00:00
574371d828 Add current cuda device index to FXGraphCache key (#147464)
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.

Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
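
A hypothetical illustration (not taken from the linked issue) of a compiled function whose cache entry depends on the current device even though it takes no tensor inputs:

```python
import torch

@torch.compile
def make_ones():
    # No tensor arguments, so example-input metadata carries no device info;
    # the generated code still targets whichever CUDA device is current.
    return torch.ones(8, device="cuda")
```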

A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.

I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.

Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-02-20 12:38:21 +00:00
ead970c8d0 Revert "Add cifllow/riscv64 label"
This reverts commit 5116b27792d37c38039459c922a466581e219fc2.
(I've pushed to the wrong branch by accident)
2025-02-20 11:55:52 +01:00
5116b27792 Add cifllow/riscv64 label 2025-02-20 11:09:06 +01:00
6beba8dcce Optimize graph.py typing (#147099)
Optimize `graph.py` methods type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099
Approved by: https://github.com/cyyever, https://github.com/aorenste
2025-02-20 09:32:30 +00:00
f9b8121350 Make Inductor scheduler aware of _scaled_mm (#146992)
This is used for example to estimate runtime when doing comms overlap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314
2025-02-20 09:02:31 +00:00
9da250aada type fully_shard so that the return value can be chained with typing enabled (#147489)
This allows for

```
fsdped = fully_shard(model)
fsdped.set_xyz()
```

same applies if `model` is actually a list of modules

Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489
Approved by: https://github.com/Skylion007
ghstack dependencies: #147488
2025-02-20 08:43:16 +00:00
6a72aaadae Fix torch.max optional args dim, keepdim description (#147177)
The [`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max) optional args `dim` and `keepdim` are not described in the document, even though users can omit them.

```python
>>> import torch
>>> a = torch.randn(3,1,3)
>>> a.max()
tensor(1.9145)
>>> a.max(dim=1)
torch.return_types.max(
values=tensor([[ 1.1436, -0.0728,  1.3312],
        [-0.4049,  0.1792, -1.2247],
        [ 0.8767, -0.7888,  1.9145]]),
indices=tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]))

```

## Changes

- Add `optional` description for `dim`, `keepdim`
- Add example of using `dim`, `keepdim`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087)

### After

![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177
Approved by: https://github.com/colesbury
2025-02-20 08:18:09 +00:00
452315c84f Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492)
The exact call is coming from here:

78a94c9114/torch/_inductor/memory.py (L161)

I am not sure why this error is being thrown or which mode/modes might be failing here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492
Approved by: https://github.com/eellison
2025-02-20 08:00:26 +00:00
a000c7e6d2 Add hint message for pack_padded_sequence (#146747)
Fixes #144207

Add a truncation hint message to the docs for [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html)
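A short usage sketch of the behavior the hint describes (sizes here are illustrative, not from the PR): timesteps beyond the largest entry in `lengths` are dropped when packing.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

x = torch.randn(3, 5, 7)                               # batch of 3, padded to length 5
packed = pack_padded_sequence(x, lengths=[4, 2, 1], batch_first=True)
print(packed.data.shape)                               # torch.Size([7, 7]): only 4+2+1 timesteps kept
```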

## Test Result

![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747
Approved by: https://github.com/mikaylagawarecki
2025-02-20 06:27:07 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.
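For context, a minimal before/after of the UP006 rewrite (a generic example, not taken from this PR's diff):

```python
from __future__ import annotations

# Before (PEP 484 typing aliases):
# from typing import Dict, List
# def bucket(xs: List[int]) -> Dict[str, int]: ...

# After (PEP 585 builtin generics, which RUFF UP006 enforces):
def bucket(xs: list[int]) -> dict[str, int]:
    return {str(x): x for x in xs}
```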

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
76ad19a549 [dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425)
This reduces fixed overhead seen in a few internal models.
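For illustration, a generic common-subexpression-elimination example (a sketch of the idea only, not the actual bytecode dynamo emits): repeated attribute chains used to fetch graph args are evaluated once and reused.

```python
class Layer:
    def __init__(self):
        self.weight, self.bias = 1.0, 2.0

class Sub:
    def __init__(self):
        self.layer = Layer()

class Mod:
    def __init__(self):
        self.sub = Sub()

mod = Mod()

# Before: every graph arg repeats the full attribute chain.
g0 = mod.sub.layer.weight
g1 = mod.sub.layer.bias

# After CSE: the shared prefix is computed once and reused.
_tmp = mod.sub.layer
g0, g1 = _tmp.weight, _tmp.bias
```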

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-02-20 05:42:52 +00:00
8f6b9403c1 [audio hash update] update the pinned audio hash (#147423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147423
Approved by: https://github.com/pytorchbot
2025-02-20 05:39:46 +00:00
77aa602871 [torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399)
Summary:
This PR adds a _is_script_object method to differentiate ScriptModule and ScriptObject; the former inherits from ScriptObject in C++, so they both pass the isinstance(obj, torch.ScriptObject) check.

The qualified name of a ScriptObject (i.e. a custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base.
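A hedged sketch of the distinction (the helper name comes from this PR, but the body below is an approximation, not the merged code):

```python
import torch

def _is_script_object(obj) -> bool:
    # Both ScriptModule and custom classes pass isinstance(obj, torch.ScriptObject);
    # only a true custom class has a qualified name under "__torch__.torch.classes".
    return isinstance(obj, torch.ScriptObject) and not isinstance(obj, torch.jit.ScriptModule)
```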

Test Plan: Add new test.

Differential Revision: D69685316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399
Approved by: https://github.com/yushangdi
2025-02-20 04:57:57 +00:00
7185ca8348 [Cutlass] Add test verifying number of precompiles (#147477)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477
Approved by: https://github.com/henrylhtsang
2025-02-20 04:47:57 +00:00
5f5b44f6bf [ROCm] Update inductor-periodic.yml to use the correct label (#147473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147473
Approved by: https://github.com/jeffdaily
2025-02-20 04:44:18 +00:00
0d56b7e665 Support size oblivious max equation (#147344)
Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints.

The basic idea is that max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like, since their value ranges are [2, inf].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344
Approved by: https://github.com/angelayi
2025-02-20 04:33:19 +00:00
0b0da81021 Support static method of torchbind attributes in torch.compile with inductor backend (#146927)
As title.

Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.

Also, this diff only covers the *static* methods of torchbind *attributes*. Cases that are not supported/tested:
- dynamic torchbind objects
- torchbind objects as inputs to the module.

Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.

Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):

```python
async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
2025-02-20 03:33:19 +00:00
de1cb0f351 capture the return value in the contract typing (#147488)
----

* the existing typing makes the return type `Optional[nn.Module]`
* this doesn't seem to be what the decorator actually does as it does
  not alter the original return type
* This PR aims to fix the typing

Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488
Approved by: https://github.com/Skylion007
2025-02-20 03:32:34 +00:00
fea718f062 [BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730)
Our three main users are OK with this, with two of them (foreach_map,
invoke_quant) preferring it like this.

I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.

Test Plan
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
2025-02-20 02:30:36 +00:00
f79b352f5a [Intel GPU] qconv_pointwise.binary XPU support (#135189)
# Motivation
This PR intends to enable the quantized fusions `qconv+add` and `qconv+add+relu` on the Intel GPU backend.

At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. As for pattern matching, we largely reuse the existing code in x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_add_xpu \
   -k test_qconv2d_add_relu_xpu 2>&1
```

# Runtime exemplification
The following is the oneDNN verbose log collected from the UT
```bash
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168
ghstack dependencies: #133307

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-20 02:02:54 +00:00
93316cfe94 Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248)
Fixes #147002

Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG.
Updated tests of these logs as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248
Approved by: https://github.com/eellison
2025-02-20 00:26:17 +00:00
16e202a38e [dynamo] improved graph break messages for some common graph break sites [1/N] (#146525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525
Approved by: https://github.com/jansel
2025-02-20 00:08:13 +00:00
1e94c7aaa4 [draft_export] only clear pending unbacked symbols for overwritten kernels (#147427)
This was wrong, we were doing this in all cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427
Approved by: https://github.com/angelayi
2025-02-20 00:07:54 +00:00
3986c3e4a6 [reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434)
Reland of https://github.com/pytorch/pytorch/pull/146877
incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185

Summary:
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed way less "C++ compile error". This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69825865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434
Approved by: https://github.com/ColinPeppler
2025-02-20 00:07:07 +00:00
a88d7d4268 [util] fetch logical count cpu (#147413)
To match the vCPU count reported by AWS:

after (96), before (48)
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
004d65aeb0 Add type hints to cuda kernel (#147471)
Missed this in a previous PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471
Approved by: https://github.com/eellison
2025-02-19 23:35:10 +00:00
48203bec63 [BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409)
Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but  it has been a while since then. See if CI complains.

Differential Revision: D69573185

See also  https://github.com/pytorch/pytorch/pull/147158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409
Approved by: https://github.com/chenyang78
2025-02-19 23:04:22 +00:00
f63db6255f Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153)
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land.

Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.

In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.

Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).

```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```

Differential Revision: D69625112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
2025-02-19 23:03:29 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we did make it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which a Python Enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the fr analysis script.

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
41ae15faa3 [ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392)
Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests.

https://github.com/pytorch/pytorch/issues/139301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147396
2025-02-19 21:55:12 +00:00
24738768a8 more dist ops in non strict (#147417)
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.

Test Plan: added unit tests

Differential Revision: D69813991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
2025-02-19 21:29:16 +00:00
394676759d ci: Add h100 nightly perf testing (#146868)
This infrastructure has been up for a while so add a workflow to actually run things on it.

> [!IMPORTANT]
> We only have **14** linux.aws.h100 runners so it might be beneficial for us to actually pare this list down.
> Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868
Approved by: https://github.com/eellison, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-02-19 21:13:17 +00:00
8bea08e5bc [BE] Fix tensor stub (#147384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman
2025-02-19 19:47:03 +00:00
e758d8b4d1 [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-19 19:45:01 +00:00
279c7f262e [ONNX] Refactor dispatcher and registry (#147396)
This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301).

The ops from onnx script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list.

After the migration, the hooks for loading ops from onnx script will be removed.

1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts
2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure
3. Update the registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396
Approved by: https://github.com/titaiwangms
2025-02-19 19:38:28 +00:00
4f3c070b25 [inductor] GraphLowering code movement (#147335)
moved these methods under __init__ to be more idiomatic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335
Approved by: https://github.com/eellison
ghstack dependencies: #147331
2025-02-19 19:32:30 +00:00
5a3a50c791 Update Arm Compute Library (ACL) to v25.02 (#147454)
Among other things, this version of ACL fixes the redundant-declaration warning that we're blocked on in #145942, #146620, and #147337, and introduces better scheduling heuristics for GEMMs

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147454
Approved by: https://github.com/malfet
2025-02-19 18:51:08 +00:00
9fee408daa [caffe2] disable warning for unused arguments (#147411)
Summary: Disable warnings on unused command line arguments for ukernels_asm.

Test Plan:
On top of D69602077:
```
$ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac
```

Differential Revision: D69807977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411
Approved by: https://github.com/kimishpatel
2025-02-19 17:54:31 +00:00
5220d402b5 [ROCm] TopK optimizations for AMD GPUs (#146387)
TopK on ROCm performs better on the test suite with the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146387
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-02-19 17:10:59 +00:00
e6c86952c6 Add CUDA 12.8 windows nightly build (#147037)
https://github.com/pytorch/pytorch/issues/145570

The Windows AMI was deployed to prod today; this preps the Windows CUDA 12.8 build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147037
Approved by: https://github.com/atalman
2025-02-19 16:59:32 +00:00
8cbf7d0d6e [Inductor UT][XPU] Skip fft_c2c case since it's not implemented on XPU. (#147351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147351
Approved by: https://github.com/jansel
2025-02-19 16:03:03 +00:00
ed83b0b70b [ddp] decouple python reducer from compilation mode (#147123)
The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we will silently fall back to the C++ reducer + no DDPOptimizer.
I'm changing this behavior to always use the python reducer if the config is specified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
2025-02-19 15:51:40 +00:00
303ad1916f [FlexAttention] Fix weird generate stride call in flex decode (#147435)
# Summary
It seems we had a redundant tuple unpack, which doesn't appear to be supported in the new Triton.

Fixes https://github.com/pytorch/pytorch/issues/147373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435
Approved by: https://github.com/BoyuanFeng
2025-02-19 12:12:27 +00:00
77dbd28535 [Cutlass] Restore search space for swizzle (#147224)
This restores the previous search space. Since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224
Approved by: https://github.com/eellison
ghstack dependencies: #147222, #147223
2025-02-19 09:22:51 +00:00
e9b3ff0570 [Cutlass] Add support for runtime param choices, starting with swizzle (#147223)
This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)) list method and then possible choices can be [looped over similarly to swizzle](933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)). For precompile, we now filter choices by hash to only compile each distinct kernel source once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223
Approved by: https://github.com/Chillee, https://github.com/eellison
ghstack dependencies: #147222
2025-02-19 09:22:51 +00:00
81eb2a78ad [Inductor] Add autotuning artifact logging (#147222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-19 09:22:42 +00:00
655b061ef0 [inductor] Freeze runtime asserts after shape prop but before codegen (#147331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331
Approved by: https://github.com/eellison
2025-02-19 06:29:13 +00:00
454fbd5bbe realize stride symbols in estimate_runtime (#146752)
Unfortunately, I could not create a local repro or unit test.
Fixes https://github.com/pytorch/pytorch/issues/146686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146752
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-02-19 06:02:49 +00:00
2c3680ce38 [apf] Fix input adapter (#147238)
Summary: Add support for inputs that no longer exist in `input_fields` but are not actually used by the original program. In this case, we just give them a dummy input based on the node's metadata.

Test Plan: Verified for S488841

Differential Revision: D69328093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
2025-02-19 04:49:58 +00:00
465930ee81 Revert "[ROCm] ROCm-specific gemm tuning parameters" (#147388)
Summary:
This diff reverts D69573225 / https://github.com/pytorch/pytorch/pull/143286

15% cold compile time regression, see https://fb.workplace.com/groups/1075192433118967/permalink/1608559059782299/

Test Plan: NA

Differential Revision: D69790102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147388
Approved by: https://github.com/yanboliang
2025-02-19 04:47:35 +00:00
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move the nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
bd370c138a fix pt2e block wise quantization unit test (#147406)
Differential Revision: D69806596

https://github.com/pytorch/pytorch/pull/146946 breaks the unit test, because the quant nodes are folded by default now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147406
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-02-19 02:40:27 +00:00
5006932cbc [cutlass backend] forward fix of standalone runner for fbcode (#147158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147158
Approved by: https://github.com/chenyang78
2025-02-19 02:02:10 +00:00
f16d30137c [OSS] Update FileSystem methods to properly handle a string argument (#145751)
Summary: When testing, I tried to pass a string argument to the FileSystem class's methods, which is a valid input, but the cast() that converted the string to a path wasn't working as likely expected, causing all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.
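An illustrative snippet of the underlying issue (a generic example, not the checkpoint code itself): `typing.cast` performs no runtime conversion, so the value stays a string, whereas constructing a `Path` actually converts it.

```python
from pathlib import Path
from typing import cast

p_bad = cast(Path, "/tmp/ckpt")   # still a str at runtime; Path-only methods would fail
p_good = Path("/tmp/ckpt")        # real conversion; .exists(), .mkdir(), ... all work

print(type(p_bad), type(p_good))  # str vs. a concrete Path subclass
```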

Test Plan: N6475361; the methods no longer throw an error with a string arg as they did previously

Differential Revision: D68713937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
2025-02-19 01:50:24 +00:00
953f7834cc [ONNX] Pick up missing types in dynamic shapes renaming (#147407)
Found in `_check_dynamic_shapes` that int and None are valid input types for dynamic_shapes.
This PR adds support for these two types and adds tests to keep the ONNX flattening logic in sync with the one in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407
Approved by: https://github.com/justinchuby
2025-02-19 01:49:53 +00:00
757d7f28d1 [CD] Increase timeout for windows binary builds (#147390)
Mitigates https://github.com/pytorch/pytorch/issues/147376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147390
Approved by: https://github.com/huydhn, https://github.com/jeanschmidt, https://github.com/malfet
2025-02-19 01:15:04 +00:00
959d79f85f [ONNX] Move and improve error reproduction logic in test (#147391)
https://github.com/pytorch/pytorch/issues/139301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147391
Approved by: https://github.com/titaiwangms
2025-02-19 00:00:11 +00:00
babb2dc2af Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 6f7e67c43c13b5675b4ff60cbaa71e5083a22481.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))
2025-02-18 23:58:31 +00:00
525ca80f53 add unbacked strict mode (#147333)
fixes #145775

This is the first step in introducing a "strict" mode where we don't silently specialize and don't silently graph break. At a high level, when we do mark_unbacked(... strict=True), anytime we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value.
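A hedged sketch of how the flag is meant to be used, based only on the description above (the `strict` kwarg comes from this PR; the exact call pattern is an assumption):

```python
import torch

x = torch.randn(8)
torch._dynamo.decorators.mark_unbacked(x, 0, strict=True)

@torch.compile(fullgraph=True)
def f(t):
    # With strict=True, any point where the compiler would specialize t.shape[0]
    # to a single value raises an explicit error instead of silently specializing.
    return t + 1

f(x)
```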

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333
Approved by: https://github.com/laithsakka
2025-02-18 23:33:55 +00:00
5d547d82e6 Add no_data_dependent_graph_break mode (#147342)
This adds a strict mode `TORCHDYNAMO_UNBACKED_STRICT` to prevent graph breaking when we guard on data-dependent expressions. This is a better UX for those who are actively trying to make their model more dynamic, but aren't close enough to full graph to use that flag directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342
Approved by: https://github.com/laithsakka
2025-02-18 23:33:47 +00:00
bae049b439 Update addr doc (#146482)
Fixes https://github.com/pytorch/pytorch/issues/146399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146482
Approved by: https://github.com/janeyx99
2025-02-18 23:25:38 +00:00
ca397d82a6 [Sigmoid] Fix issues with constant folding and fba_ops (#146948)
Summary:
There are 2 issues:

- `skip_folding_node_fn` isn't considered when propagating constant values. So given a skipped node with constant inputs, it outputs a constant and its users can output constant values and then be included in the constant graph. However, the skipped node is not included in the constant graph when extracting it. This issue is fixed by checking for skipped nodes when propagating constant values and making a skipped node output an unknown value (not constant) so that its users cannot output constants.

- The `fba_linear` op can be included in the constant graph, but it is not implemented for CPU, so the constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`.

- A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops).

Reviewed By: StellarrZ

Differential Revision: D68716393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948
Approved by: https://github.com/zhxchen17
2025-02-18 23:17:47 +00:00
c9a15d980f [FSDP2] Simplify shard_placement_fn in test (#146847)
Summary: Found this while checking `shard_placement_fn` for Shampoo shard independent implementation.

Test Plan: OSS CI & tests

Differential Revision: D69412878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146847
Approved by: https://github.com/awgu
2025-02-18 23:01:26 +00:00
c8433c2c6c [BE] correct docs for clock_rate to MHz, fixes #147098 (#147393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393
Approved by: https://github.com/andrewor14
2025-02-18 22:59:58 +00:00
a21a123fd5 Add fqn_modifier at loading_state_dict and unit test (#146557)
In a Fusion model, users might change the state_dict keys via a state_dict_hook.
The load_state_dict APIs here won't call model.state_dict(), so the hooks won't be called to change the keys, causing a mismatch between FQNs and state_dict keys.

This PR suggests users declare how they would change the state_dict key prefix (they can name it; here we call it "fqn_modifiers") by default.
During state_dict loading, we apply the prefix change while resolving the FQN so that keys can be processed the same as if they had gone through the state_dict hook.

For example:
There's a state_dict_hook:

```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```

In DSD after this PR, we skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
2025-02-18 22:54:41 +00:00
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
3f35664ee8 More precise check for shared storage check in inductor/reinplace pass (#147050)
Currently, if two tensors share storage, we have some logic to avoid re-inplacing. Before this PR, two tensors were considered to share storage if they use the same underlying storage, even if they do not overlap. This diff enhances the checks to skip cases where we can easily tell that the tensors do not overlap.
Mitigates https://github.com/pytorch/pytorch/issues/139628 but does not fix the inductor issue in it.
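A small example of the distinction the check now makes (illustrative, not the inductor test): two views can share a storage while their element ranges are provably disjoint.

```python
import torch

x = torch.zeros(10)
a, b = x[:5], x[5:]

# Same underlying storage...
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()
# ...but disjoint element ranges, so writes through `a` never alias `b`.
a.add_(1)
assert b.sum() == 0
```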

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147050
Approved by: https://github.com/zou3519
2025-02-18 21:55:34 +00:00
63e8ad49b8 [dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355)
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
2025-02-18 21:37:12 +00:00
75db0fd8a0 [dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-02-18 21:37:12 +00:00
eb892cd768 [codegen] enable SORT and TUPLE_REDUCTION for AMD Triton (#147340)
Looks like Triton's AMD backend supports multiple inputs already.
Let's enable SORT and TUPLE_REDUCTION for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147340
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/eellison
2025-02-18 21:15:23 +00:00
1b047d5d7a Add link to non_blocking/pinmem tutorial in Tensor.to docstrings (#145651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145651
Approved by: https://github.com/svekars
2025-02-18 20:38:01 +00:00
clr
166419b9c1 dynamo: Don't crash when encountering a object with no __name__ (#147246)
This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test rather than a full repro. Let me know if people feel strongly and want a full reproduction.
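An illustrative sketch of the defensive lookup (generic, not the exact dynamo change):

```python
def safe_name(fn) -> str:
    # Some callables (e.g. badly implemented C functions, per the note above) expose no __name__.
    return getattr(fn, "__name__", type(fn).__name__)

class NoName:
    def __call__(self):
        return 42

print(safe_name(NoName()))   # "NoName" instead of raising AttributeError
print(safe_name(len))        # "len"
```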

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
2025-02-18 20:35:49 +00:00
12v
74682e8595 Fix typo (#147330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147330
Approved by: https://github.com/srinivasreddy, https://github.com/Skylion007
2025-02-18 20:20:34 +00:00
d9b3d76b85 Fix linter warnings (#147386)
https://github.com/pytorch/pytorch/pull/145866 accidentally introduced a warning about const casts and also comparison of unsigned long int with signed long int.

This PR fixes both of those warnings.

Tested by running:

```
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail 
-DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

And I got no warnings or errors. Same with `python setup.py develop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147386
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-02-18 20:03:16 +00:00
302f56a1f2 Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)"
This reverts commit 59b7e52ad8f6146b4364515a7f3e54d6f3edd6da.

Reverted https://github.com/pytorch/pytorch/pull/146845 on behalf of https://github.com/jeanschmidt due to Seems to break a few code dependencies in multiple places ([comment](https://github.com/pytorch/pytorch/pull/146845#issuecomment-2666656834))
2025-02-18 19:01:27 +00:00
57060bebf3 [symbolic shapes] Add replacement for backed symints (#147240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147240
Approved by: https://github.com/pianpwk
ghstack dependencies: #146939
2025-02-18 18:49:51 +00:00
84abeaad5c [export] Log evaluate_expr (#146939)
We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which python's `id` function already does, and then for every expression created, we can find the provenance by tracing back through its arguments ids. This logging only happens when dtrace_structured is enabled, which is only when running draft export.

An example output is as follows:

<img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" />

For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok?  I tried doing the following:
* Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M
* I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M
* Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix?

#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939
Approved by: https://github.com/oulgen
2025-02-18 18:49:51 +00:00
c6b331f7d9 Deprecate skip_code_recursive_on_cache_limit_hit config flag (#136970)
Fixes one of #136862

Deprecate the `skip_code_recursive_on_cache_limit_hit` flag.

Affected logic is in here:
6931c1644a/torch/_dynamo/convert_frame.py (L866-L876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136970
Approved by: https://github.com/williamwen42
2025-02-18 18:48:23 +00:00
6f7e67c43c Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-18 18:44:26 +00:00
dd2a943e14 Fix the AOTI compile failure with ARM CPU for Meta internal (#147204)
Summary: Fix the AOTI compile failure with ARM CPU for Meta internal

Differential Revision: D69642211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147204
Approved by: https://github.com/houseroad
2025-02-18 17:54:34 +00:00
5d675de754 Update ck (#144799)
Updates the CK version and re-implements kernel generation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
a00d2b5144 s390x: add cleanup for cancelled docker image builds (#147110)
When a podman image build is cancelled,
a couple of processes are left behind,
and their existence prevents
proper shutdown of the runner container.

Add a cleanup step at the end of the workflow
using a new option recently introduced in podman:
https://github.com/containers/podman/pull/25102

Example of a job preventing the s390x worker from cleaning up and restarting properly:
https://github.com/pytorch/pytorch/actions/runs/13289159296/job/37105230728
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147110
Approved by: https://github.com/huydhn
2025-02-18 16:26:46 +00:00
6edc419d69 Update torch-xpu-ops commit pin (#147358)
Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](a14d1eaa83), includes:

- SparseCSR XPU support
- Refine build system

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00
0c8028e877 [export] Loosen symint input serialization (#147237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147237
Approved by: https://github.com/avikchaudhuri
2025-02-18 13:03:47 +00:00
b10ba0a46c Unify all sympy versions to avoid conflicts within PyTorch (#147197)
As the title states.

There are some tiny differences between 1.13.1 and 1.13.3:
1.13.1:
2e489cf4b1/sympy/core/numbers.py (L1591)

1.13.3:
b4ce69ad5d/sympy/core/numbers.py (L1591)

**Previous PR:**
https://github.com/pytorch/pytorch/pull/143908

**ISSUE Related:**
https://github.com/pytorch/pytorch/issues/147144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147197
Approved by: https://github.com/malfet
2025-02-18 10:51:43 +00:00
d9cf1debf9 [ROCm][Windows] Fix clang-cl error related to -Wmissing prototypes enabled (#146981)
Some of the Windows files (fused_kernels.cpp, temp_file.h) contain code that fails to compile when this flag is enabled in clang-cl builds.

This PR resolves the issue by ensuring that even if we build with clang-cl, those flags are not included on Windows.

Alternatively, if needed, I can fix the mentioned files to pass under this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-02-18 07:41:12 +00:00
49e8f9c965 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 22fae4c5f94eb43f71a2eebc1904880740cb1d60.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))
2025-02-18 05:11:32 +00:00
59a08138c5 [executorch hash update] update the pinned executorch hash (#147345)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147345
Approved by: https://github.com/pytorchbot
2025-02-18 05:08:06 +00:00
6a2bb629ec Update torch-xpu-ops commit pin (#147302)
Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](b421032c8f), includes:

- Correct int4 weight pack implementation
- Enhance build system: only build one shared library for the user

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302
Approved by: https://github.com/EikanWang
2025-02-18 05:04:15 +00:00
59915b8dec [Intel GPU] qlinear at XPU backend (#133307)
# Motivation
This PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` on Intel GPU.

We register the qlinear ops at the C++ backend via `TORCH_LIBRARY_IMPL`; the ops this PR registers are `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack transposes the weight to fit oneDNN's weight requirement and achieve higher performance.

Also, we remove the limitation in the corresponding annotation method of the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion.

We add the kChar (`torch.int8`) dtype in `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type on the GPU side.

We verified the ops through UTs and e2e model testing (ResNet18, ResNet50).

# UT verification
```
 DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v  \
     -k test_qlinear_xpu \
     -k test_qlinear_relu_xpu \
     -k test_qlinear_gelu_xpu
```

# Runtime exemplification
Here is the oneDNN verbose log collected by running the above UTs
```
//pure int8 gemm
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988
// post-relu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234
// post-gelu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-18 04:02:42 +00:00
bb8c4ecc6d Allow XPU device for validating the arguments to sparse compressed tensor factory functions (#147306)
During Sparse tensor conversion, a validity check is performed. We need to allow XPU to pass this check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147306
Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey
2025-02-18 03:55:54 +00:00
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. A compiler reset gives a clean env for each run.
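A minimal sketch of the pattern (`torch._dynamo.reset` is an existing API; the benchmark loop itself is illustrative):

```python
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

for _ in range(3):
    torch._dynamo.reset()            # drop compiled artifacts so each run starts clean
    compiled = torch.compile(model)
    compiled(x)
```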

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
708428704e patch for block-wise quantization + pt2e (#146946)
Summary: https://github.com/pytorch/pytorch/pull/144492 was reverted due to duplicate kernel registration. This PR will re-introduce the patch

Differential Revision: D69488779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146946
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2025-02-18 01:15:26 +00:00
59b7e52ad8 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-02-17 22:42:16 +00:00
1393f9a76c [ROCm] Update inductor-perf-test-nightly-rocm.yml to use the correct labels & frequency (#147221)
This workflow takes around 75-80 hrs on ROCm, so we're scaling down the frequency to once per week until we get more CI capacity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147221
Approved by: https://github.com/pruthvistony, https://github.com/huydhn
2025-02-17 19:29:27 +00:00
6c0e7463af Fix test_device_memory_allocated (#147311)
Fixes #147310

The tensor created by `torch.ones` allocates memory but is released immediately, so the following assertion fails.
This PR stores it in a temp variable to fix that.
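A hedged sketch of the corrected pattern (the CUDA device and sizes here are assumptions, not the actual test): the point is holding a reference so the allocation is still live when asserted.

```python
import torch

before = torch.cuda.memory_allocated()
tmp = torch.ones(1024, device="cuda")   # keep a reference; otherwise the memory is freed immediately
assert torch.cuda.memory_allocated() > before
del tmp
```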

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147311
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-02-17 19:00:53 +00:00
516133ddb0 Fix arvr macOS buck pytorch builds (#147292)
Summary:
X-link: https://github.com/ctrl-labs/src2/pull/42453

buck arvr macOS builds had a few issues that needed fixing.

Test Plan: build with buck

Differential Revision: D69722372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147292
Approved by: https://github.com/Skylion007
2025-02-17 18:47:24 +00:00
22fae4c5f9 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-17 18:39:10 +00:00
1b29de5c05 Add NEON implementation for 8 bit quantized embedding bag on aarch64 (#147322)
This improves performance by ~5.5x on NeoverseV1 cores using the following benchmarking script:
```
import torch
import torch.nn as nn
import numpy as np
import torch.autograd.profiler as profiler

np.random.seed(0)
torch.manual_seed(0)

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32))
        obs = torch.ao.quantization.PerChannelMinMaxObserver(dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0)
        obs(weights)
        qparams = obs.calculate_qparams()
        qweight = torch.quantize_per_channel(weights, qparams[0], qparams[1], axis=0, dtype=torch.quint8)

        # Defining the EmbeddingBag layer
        self.qembedding_bag = torch.ao.nn.quantized.EmbeddingBag(num_embeddings, embedding_dim, _weight=qweight,
                                                                 mode='sum', include_last_offset=True, dtype=torch.quint8)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result = self.qembedding_bag(input, offsets, per_sample_weights=None)
        return result

num_embeddings = 40000000
embedding_dim = 128

model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

multi_hot = 100
batch_size = 400

input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    _ = model(input_tensor, offsets)

    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for i in range(100):
            _ = model(input_tensor, offsets)
    print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147322
Approved by: https://github.com/malfet
2025-02-17 17:10:47 +00:00
71855a1cad Update slow tests (#147308)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147308
Approved by: https://github.com/pytorchbot
2025-02-17 12:03:40 +00:00
e8b20f6ef3 [MPS][BE] Turn exec_unary_kernel as MetalShaderLibrary method (#147299)
And delete duplicate implementations from SpecialOps and UnaryKernel.
Change input and output arguments order for SpecialOps kernels to match those of UnaryOps

Fixes https://github.com/pytorch/pytorch/issues/146770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147299
Approved by: https://github.com/dcci
ghstack dependencies: #147296, #147297
2025-02-17 08:31:24 +00:00
ae5f7fec82 [Intel GPU] Enable fp64 GEMM (#140677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire
2025-02-17 08:15:55 +00:00
2b30e94fc0 [BE] Make exec_unary_kernel take TensorIterator as argument (#147297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147297
Approved by: https://github.com/dcci
ghstack dependencies: #147296
2025-02-17 07:34:35 +00:00
3d251e6512 [BE] Switch all structured funcs to stubs (#147296)
No need to have separate foobar_out_mps when registering a dispatch to  foobar_stub will do

And this makes `exec_unary_kernel` defined in UnaryKernel.mm and
SpecialOps.mm look very similar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147296
Approved by: https://github.com/dcci
2025-02-17 07:34:34 +00:00
424c1b82e0 [Inductor][CPP] Add the legalize low fp support for index expr (#147298)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147279. The test case produced a low-precision floating-point value using `ops.index_expr`, but the CPP backend did not handle its legalization. This PR adds support for it.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_low_fp_index_expr_issue_147279
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147298
Approved by: https://github.com/jgong5
2025-02-17 07:11:20 +00:00
359165734b [executorch hash update] update the pinned executorch hash (#147294)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147294
Approved by: https://github.com/pytorchbot
2025-02-17 05:03:05 +00:00
ae351d4d0e [Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570)
# Motivation
Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type is beneficial for improving the performance of deep learning workloads during training/inference. This PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels.

# Validation
* UT to test the context variable
`python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set`

* Runtime exemplification
```
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971

```
According to the `fpmath:tf32` field in the verbose output, we can see that the current context-setting utilities successfully trigger tf32 computation in the conv forward/backward_data/backward_weights kernels.
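For reference, a minimal sketch of how the new context variable might be toggled around a convolution (assuming an XPU build where an `xpu` device is available; shapes are arbitrary):

```python
import torch

prev = torch.backends.mkldnn.allow_tf32
torch.backends.mkldnn.allow_tf32 = True  # request fpmath_mode=tf32 in oneDNN convs
try:
    conv = torch.nn.Conv2d(16, 33, 3, stride=2).to("xpu")
    x = torch.randn(20, 16, 50, 100, device="xpu")
    y = conv(x)
finally:
    torch.backends.mkldnn.allow_tf32 = prev  # restore the previous setting
```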

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-02-17 01:46:43 +00:00
198ffbdf11 [MPS] Implement and test round.decimals (#147266)
If inductor can do it, why not eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266
Approved by: https://github.com/Skylion007
ghstack dependencies: #147286
2025-02-16 23:17:13 +00:00
e738f7ba23 [BE]: Enable ruff rule SIM113 (#147290)
Lint rule that tells the user to avoid keeping track of their own counter and to use the builtin `enumerate` when possible.
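A minimal illustration of the pattern the rule rewrites (not taken from the codebase):

```python
# Flagged: manually maintained counter
idx = 0
for name in ["a", "b", "c"]:
    print(idx, name)
    idx += 1

# Preferred: let enumerate track the index
for idx, name in enumerate(["a", "b", "c"]):
    print(idx, name)
```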

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290
Approved by: https://github.com/jansel
2025-02-16 22:41:16 +00:00
a8fa4bcfd2 [StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347)
%getattr_256.1 : int = prim::dtype(%11175)
%to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12)
%lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8)
```

After optimization:
```
%11199 : int = prim::dtype(%int_66.1)
%11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199)
```

It is similar to https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4.

Differential Revision: D69627793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189
Approved by: https://github.com/hanyilou123
2025-02-16 22:16:02 +00:00
5c0c99f658 [MPS][BE] Use stubs for floor/ceil/round/trunc (#147286)
To avoid duplicating the logic that makes those ops no-ops for integral dtypes
(and in preparation for adding `round_decimals`, which calls round_stub if decimals are 0)

Tested for the corner cases by manually invoking `round`, `trunc`, `floor` and `ceil` for int dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147286
Approved by: https://github.com/Skylion007
2025-02-16 17:22:49 +00:00
d27ecf85db xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with the compiler team to introduce such a file extension in the compiler, but they are opposed to this. At the same time, discussion around the sycl file extension and adding sycl language support to tools such as cmake is ongoing; eventually cmake may also introduce a file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which PyTorch's native aten SYCL kernels are compiled (at the moment `pvc,xe-lpg`). This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the desired devices to compile for.
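A minimal usage sketch (the source file names here are hypothetical; only the `.sycl` suffix matters):

```python
import os
import torch.utils.cpp_extension as cpp_ext

# Optionally narrow the AOT device list before building
os.environ.setdefault("TORCH_XPU_ARCH_LIST", "pvc")

# my_kernel.sycl is compiled with icpx, bindings.cpp is handled as before,
# and both are linked into a single extension module.
ext = cpp_ext.load(
    name="my_sycl_ext",
    sources=["my_kernel.sycl", "bindings.cpp"],
    verbose=True,
)
```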

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-16 16:50:59 +00:00
dd5d0ea6bb Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945)"
This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab.

Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see b16ae97ad0/1 ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))
2025-02-16 16:03:42 +00:00
b16ae97ad0 Generalize mixed precision in DDP (#146808)
**Motivation:**

1. Generalize mixed precision in DDP.
2. Enable `SyncBatchNorm` for XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146808
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/wconstab
2025-02-16 11:59:40 +00:00
ee38a32c55 [Dynamo] support isinstance(...) check for type tuple (#146984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146984
Approved by: https://github.com/jansel
2025-02-16 10:41:49 +00:00
607379960b xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for sycl kernels build via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and (new) `class SyclExtension` APIs. Files having `.sycl` extension are considered to have sycl kernels and are compiled with `icpx` (dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. API supports building sycl along with other file types into single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code which I propose to adopt. We did follow up with the compiler team to introduce such a file extension in the compiler, but they are opposed to this. At the same time, discussion around the sycl file extension and adding sycl language support to tools such as cmake is ongoing; eventually cmake may also introduce a file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which PyTorch's native aten SYCL kernels are compiled (at the moment `pvc,xe-lpg`). This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the desired devices to compile for.

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey
2025-02-16 10:16:09 +00:00
ed3b119c40 Skip unsupported types by MPS in test_torchinductor.py (#147211)
- Skip unsupported dtypes in `test_split_cumsum` (and manually skip int64 for MacOS13)
- Adapt `test_cat` to use `torch.half` instead of `torch.double` on MPS
- Skip `test_adaptive_avg_pool1d_argmax` if avgpool is not implemented for all sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147211
Approved by: https://github.com/jansel, https://github.com/Skylion007, https://github.com/dcci
2025-02-16 10:15:53 +00:00
0fb5b224b7 [DCP] Cache save plans: planner helpers and interface updates (#147116)
Summary:
This PR updates the planner interface and introduces the class variables to cache the local and global plans.
Two new helpers are also introduced which will be used to compare if the plans have changed across save attempts and merge the delta plans.

Test Plan: UTs

Differential Revision: D69224488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147116
Approved by: https://github.com/MeetVadakkanchery, https://github.com/huydhn
2025-02-16 07:18:26 +00:00
4bacd13c92 [executorch hash update] update the pinned executorch hash (#147273)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147273
Approved by: https://github.com/pytorchbot
2025-02-16 05:11:33 +00:00
8f20026bcb [Intel GPU] Support SparseCsrXPU codegen (#144722)
Adding a new dispatch key, `SparseCsrXPU`, to enable Intel GPU support for SparseCsr tensors.

Similar PR: https://github.com/pytorch/pytorch/pull/139267
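A hedged sketch of what this enables (requires an XPU build and device; the values are arbitrary):

```python
import torch

crow = torch.tensor([0, 2, 3])
col = torch.tensor([0, 1, 1])
val = torch.tensor([1.0, 2.0, 3.0])
csr = torch.sparse_csr_tensor(crow, col, val, size=(2, 2), device="xpu")
print(csr.layout)  # torch.sparse_csr
```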
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144722
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Kanya-Mo <kanya.mo@intel.com>
2025-02-16 03:16:12 +00:00
1677a31019 [Inductor] Fix 3D tiling with permute (#147249)
This PR adds a test case and tiny fix for 3D tiling. Before this PR, tiling would crash because one of the candidates lacked a `"y"` dimension. Now, when we're calculating 3D tiling candidates, we assume the y size is 1 if it's missing.

The test case implements a 3D permute using block pointers.

```
@triton.jit
def triton_poi_fused_add_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 51
    ynumel = 51
    xnumel = 51
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[None, None, :]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    x2 = xindex
    y1 = yindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[51, 1, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 + tmp0
    tmp4 = tmp2 + tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), tl.broadcast_to(tmp4, [XBLOCK, YBLOCK, ZBLOCK]).to(tl.float32), boundary_check=[0, 1, 2])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147249
Approved by: https://github.com/jansel
2025-02-15 23:28:36 +00:00
44ee9ca593 [inductor] Add type annotations to _inductor/utils.py (#144108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108
Approved by: https://github.com/eellison
2025-02-15 23:13:41 +00:00
4ab967c44d all reduce non strict (#147133)
Summary:
Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow.

Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed.
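The interception mechanism itself can be sketched with a generic `TorchFunctionMode`; the remapping below uses `torch.add`/`torch.sub` purely for illustration, whereas the export code remaps in-place collectives to functional collectives:

```python
import torch
from torch.overrides import TorchFunctionMode

class RemapMode(TorchFunctionMode):
    # illustrative mapping only
    _remap = {torch.add: torch.sub}

    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return self._remap.get(func, func)(*args, **kwargs)

with RemapMode():
    out = torch.add(torch.ones(2), torch.ones(2))  # actually runs torch.sub
print(out)  # tensor([0., 0.])
```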

Test Plan: moved and updated test

Differential Revision: D69607140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133
Approved by: https://github.com/tugsbayasgalan
2025-02-15 19:37:08 +00:00
75a4b73816 utils: Update md5 call to be fips compliant (#147252)
Updates md5 call to be fips compliant according to this issue:
* https://github.com/pytorch/pytorch/issues/147236

Not going to add a conditional here because the minimum Python version
that we support is already 3.9.
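The usual shape of such a fix is the `usedforsecurity=False` keyword (available since Python 3.9); the snippet below is a generic illustration, not the exact call site:

```python
import hashlib

# Tells FIPS-enabled OpenSSL builds that the digest is not used for security
# (e.g. it is just a cache key), so the call is not rejected.
digest = hashlib.md5(b"some cache key material", usedforsecurity=False).hexdigest()
print(digest)
```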

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet
2025-02-15 15:19:08 +00:00
6ca5c22e31 Revert "Enable fp16 linear layers in PyTorch via ACL (#144992)"
This reverts commit 5b37249259ad50d9b4b32a78a5b5178a1eb3d110.

Reverted https://github.com/pytorch/pytorch/pull/144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](https://github.com/pytorch/pytorch/pull/144992#issuecomment-2660902238))
2025-02-15 12:40:59 +00:00
86be5d4421 remove unnecessary xpu availability check when retrieving aot flags (#146966)
As title

Retrieving the XPU AOT flags that the PyTorch binary was compiled against is not the same as running the binary itself, so it doesn't seem necessary to check whether an XPU environment is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/albanD
2025-02-15 09:15:49 +00:00
9e0b3e9b6c [Inductor] Fix Inplace Buffer inner name conflict (#147199)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/146975. When creating the `InplacedBuffer` inner name, we only count the number of unique `InplacedBuffer` or `RemovedArg` objects, so the name may conflict, as reported in this issue:

```
---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2
dict_values([
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])])

---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2
dict_values([
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
])
```

- The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer` objects

- The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer`

These are 2 different `InplacedBuffer` objects, but they end up with the same name `in_out_ptr2`. In this PR, we fix this regression by counting the number of `RemovedArg` objects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199
Approved by: https://github.com/jansel
2025-02-15 08:31:06 +00:00
a30f145101 [inductor] Don't leak pointers to cpp_wrapper with lru_cache (#147233)
Putting lru_cache on methods will keep pointers to the `self` objects
alive forever and leak memory.
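A small standalone illustration of the leak (the class name is a stand-in, not the actual inductor type):

```python
import functools

class Wrapper:
    @functools.lru_cache(maxsize=None)  # the cache key includes self
    def render(self, arg):
        return arg * 2

w = Wrapper()
w.render(21)
del w  # the instance stays alive: the lru_cache entry still references it
```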

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147233
Approved by: https://github.com/yanboliang
2025-02-15 08:25:41 +00:00
9dc702875d [dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217)
Fixes https://github.com/pytorch/pytorch/issues/147162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217
Approved by: https://github.com/williamwen42
2025-02-15 07:59:33 +00:00
cyy
8daa742e8b Remove code for Python < 3.9 (#147181)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181
Approved by: https://github.com/albanD
2025-02-15 06:43:26 +00:00
9919375cf1 [executorch hash update] update the pinned executorch hash (#147241)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147241
Approved by: https://github.com/pytorchbot
2025-02-15 05:02:22 +00:00
cyy
8f291e8c00 Fix clang-tidy warnings in torch/jit (#146963)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963
Approved by: https://github.com/davidberard98
2025-02-15 03:36:59 +00:00
4233a77960 update kineto submodule to include fix for windows build (#147195)
Fixes an issue causing windows builds to fail
https://github.com/pytorch/kineto/pull/1039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195
Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-15 01:53:16 +00:00
c1fcba3648 [Inductor] Fix the lowering of squeeze when input is not contiguous (#146746)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/143498. The issue happens when lowering `select = torch.ops.aten.select.int(cat, 1, 0)`.

For example, when `cat` is contiguous with size [2, 2] and stride [2, 1]:

- for eager, it returns a view of size [2] and stride [2]
- for Inductor lowering, it returns the wrong stride 1 instead of 2
```
TensorBox(
  ReinterpretView(
    StorageBox(
      ConcatKernel(name='buf10', layout=FixedLayout('cpu', torch.int64, size=[u0, 2], stride=[2, 1]), inputs=[ComputedBuffer(name='buf8', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b856449d0>, ranges=[u0, 1])), ComputedBuffer(name='buf9', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b85644790>, ranges=[u0, 1]))])
    ),
    FixedLayout('cpu', torch.int64, size=[u0], stride=[**1**]),
    origins=OrderedSet([select])
  )
)
```

To fix this issue, we give the right stride when lowering `squeeze`.
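The eager behavior that the lowering has to match can be checked directly:

```python
import torch

cat = torch.arange(4).reshape(2, 2)  # contiguous: size (2, 2), stride (2, 1)
view = cat.select(1, 0)              # same op as aten.select.int(cat, 1, 0)
print(view.size(), view.stride())    # torch.Size([2]) (2,): the stride must stay 2
```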

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_unbacked_symints.py -k test_issue_143498
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146746
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/eellison
2025-02-15 01:33:04 +00:00
bf0c89a72f [dynamo] fix error message when logging graph that contains hops (#147227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147227
Approved by: https://github.com/zou3519
2025-02-15 00:53:44 +00:00
933f921b36 [PT][FSDP] support custom all reduce hook across FSDP units (#147114)
This change adds an API `set_all_reduce_hook` to the `FSDPModule` to
support customized all reduce either in native HSDP (2d mesh) setup or custom HSDP (1d FSDP + custom AR across replicas)
* For native HSDP, the original AR would still run as is and this hook allows for additional gradient modification post all reduce.
* For custom HSDP, the original AR will be skipped and all the logic is instead expected to be executed in the hook.

The custom hook is expected to perform operations in place (no return value).

Example basic usage:
```
model = ...
fully_shard(model, mesh=...)
model.set_all_reduce_hook(my_hook)
```

By default, the hook will run in the default all reduce stream post reduce scatter.
When native HSDP is NOT enabled, the custom hook can be specified to run in a custom stream. This custom stream will also be synchronized post reduce scatter similarly. See tests for examples.

Test Plan: CI

Differential Revision: D68255583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147114
Approved by: https://github.com/awgu
2025-02-15 00:38:00 +00:00
a9ae3340ca Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-02-15 00:07:33 +00:00
49727bbc9d Turn on prologue fusion (#147008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147008
Approved by: https://github.com/masnesral
2025-02-14 23:36:21 +00:00
76f57e184a [dynamo] Make SliceVariable a subclass of VariableTracker (#147046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147046
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819, #146995
2025-02-14 23:22:27 +00:00
a5c0dab900 [AOTInductor] Guard RAII_cpuMalloc with macro (#147150)
Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function]

Test Plan: Existing tests

Differential Revision: D69623481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150
Approved by: https://github.com/henrylhtsang
2025-02-14 23:21:35 +00:00
1224765286 [cond] make cond call fake kernel in dynamo (#147045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147045
Approved by: https://github.com/zou3519
ghstack dependencies: #146954
2025-02-14 23:13:15 +00:00
85a82c5bc8 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-14 23:13:14 +00:00
eecee5863e Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
d38db94689 [inductor][refactor] Move _compile_file to cpp_builder (#147202)
Summary: To further consolidate cpp build logic into cpp_builder

Test Plan: CI

Differential Revision: D69595327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202
Approved by: https://github.com/yushangdi
2025-02-14 21:02:30 +00:00
dd86491b35 [cutlass backend][BE] refactor tests to remove duplicate logic (#146743)
Doing many things here:
* remove duplicate hip checking logic
* check for CUDA in setup
* remove CUTLASS_DIR setting. That is not needed when building from source and fbcode anymore
* fix some typing errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146743
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-14 20:50:27 +00:00
6f035d8462 [torch] Make amdsmi cdll hook private (#147207)
Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear

Test Plan: CI

Differential Revision: D69665234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207
Approved by: https://github.com/PaulZhang12
2025-02-14 20:30:48 +00:00
272ead7b5e Make fx.node.map_arg() and .map_aggregate() generic (#146248)
## What's the problem?

The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuples`, `list`s, etc, and return a new collection of the same type.

Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](5d55a6585d/torch/fx/node.py (L48-L58)): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes.

As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements.

## What's the solution?

Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.)
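A small example of what is preserved (the nesting here is arbitrary):

```python
import torch
from torch.fx.node import map_aggregate

args = ([torch.ones(2), 3], {"w": torch.zeros(2)})
out = map_aggregate(args, lambda a: a * 2 if isinstance(a, torch.Tensor) else a)
# out has the same nested structure as args; with the generic signature the
# type checker now also knows that, instead of widening everything to Argument.
```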

## Won't it break everything?

It doesn't break the type checker - one place needed an extra hint.

There have been code breakages; one has been resolved, and there is at least one new one... we'll see!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-02-14 19:25:32 +00:00
58f654b5ad [ONNX] Consolidate constants to a single location (#147166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147166
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147164, #147165
2025-02-14 19:08:19 +00:00
765bc30ab9 [ONNX] Set warning stacklevel so it appears at the torch.onnx call site (#147165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147165
Approved by: https://github.com/Skylion007
ghstack dependencies: #147164
2025-02-14 19:04:43 +00:00
9a1eac6704 [ONNX] Handle number of outputs in builder (#147164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147164
Approved by: https://github.com/titaiwangms
2025-02-14 19:03:51 +00:00
5517eb4398 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 260b21b8bca6edd3e0b89b800d6efa8243f0d122.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to let me resubmit  ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2660053270))
2025-02-14 18:58:18 +00:00
aac5d1a289 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))
2025-02-14 18:31:54 +00:00
20a9938069 try print stacktrace for error (#147061)
Differential Revision: D69573525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147061
Approved by: https://github.com/Skylion007
2025-02-14 18:28:03 +00:00
8b5ee275fb [MPS] Fix cholesky_ex for empty inputs (#147159)
By making sure that `info` is actually initialized if the input is empty (but no need to do anything about `out`, as it's guaranteed to be an empty tensor).

Also move the output resizing logic before the `input.numel()` check.

Fixes https://github.com/pytorch/pytorch/issues/147128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147159
Approved by: https://github.com/albanD
2025-02-14 17:44:08 +00:00
0d16188c06 [CI] Use job name to index into test times json (#147154)
When the test times are generated, the generator doesn't know what the build environment is because it's an environment variable, but when we index into the test times, we (previously) didn't know what the job name is. These are usually the same, but when they differ we end up using the default, which can lead to unbalanced sharding.

I think the job name was added to most of the CI environments at some point without me realizing it, so we can now update this code to use the job name instead, so that generation and indexing match.

also upload stats workflow for mps

Checked that inductor_amx doesn't use default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147154
Approved by: https://github.com/huydhn
2025-02-14 17:06:56 +00:00
e8fbc86de0 Make torch.cuda.gds APIs public (#147120)
Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120
Approved by: https://github.com/albanD
2025-02-14 17:06:50 +00:00
c3853d924f Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor the bulk of mm_common to allow for better device-specific specialisation; e.g. in https://github.com/pytorch/pytorch/pull/143286 we require heavy conditionalisation to get ROCm-specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class; in the future we plan to bring in the flex attention configurations as well, so that all of the tuning config logic for templated triton kernels is handled in this file.
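A schematic of the subclass-per-device idea (the method and config names below are illustrative only, not the actual interface):

```python
class BaseConfigHeuristic:
    def mm_configs(self):
        return [{"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "num_warps": 4}]

class CUDAConfigHeuristic(BaseConfigHeuristic):
    pass

class ROCmConfigHeuristic(BaseConfigHeuristic):
    def mm_configs(self):
        # ROCm-specific tiles live here instead of behind if-branches in mm_common
        return [{"BLOCK_M": 128, "BLOCK_N": 64, "BLOCK_K": 32, "num_warps": 8}]

def pick_heuristic(device_type: str) -> BaseConfigHeuristic:
    table = {"cuda": CUDAConfigHeuristic(), "hip": ROCmConfigHeuristic()}
    return table.get(device_type, BaseConfigHeuristic())
```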

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-14 17:01:06 +00:00
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
059dfe2081 Revert "update kineto submodule (#147015)"
This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927.

Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))
2025-02-14 16:11:08 +00:00
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
cefd9805de Add RAISE_VARARGS 0 (#146493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146493
Approved by: https://github.com/zou3519
ghstack dependencies: #146498, #146492
2025-02-14 13:37:23 +00:00
134723ee1c Add WITH_EXCEPT_START opcode (#146492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146492
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146498
2025-02-14 13:37:23 +00:00
dbb86b78ad Add sys.exc_info and sys.exception (#146498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146498
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-14 13:37:14 +00:00
ea188ac0c7 [export] Add meta for aten.bincount (#147129)
Fixes https://github.com/pytorch/pytorch/issues/147094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147129
Approved by: https://github.com/pianpwk
2025-02-14 10:33:54 +00:00
de26ddfbdc Update torch-xpu-ops commit pin (#146671)
Update the torch-xpu-ops commit to [80c375570e2b6b2989a8610da1871f8a50dfddc7](80c375570e), includes:

- Aten operator coverage improvement
- SYCL kernel optimization
- Nested Tensor OPs support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671
Approved by: https://github.com/EikanWang
2025-02-14 09:30:36 +00:00
bd019c0bb4 [Inductor][CPP] Fix node name for wgt delete (#147056)
**Summary**
This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056
Approved by: https://github.com/jgong5
2025-02-14 06:27:41 +00:00
10bc8f25b2 [MPS][BE] Migrate polar to use functor (#147184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147184
Approved by: https://github.com/dcci
ghstack dependencies: #147182, #147183
2025-02-14 06:25:36 +00:00
278ffd84fc [MPS][BE] Add copysign integral flavors as functor (#147183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147183
Approved by: https://github.com/dcci
ghstack dependencies: #147182
2025-02-14 06:25:36 +00:00
2ef51cfb9d [BE][MPS] Infer results of functor (#147182)
Do not assume that the functor will return the same type as its arguments, but rather dynamically infer it using `decltype` and `::metal::declval`.
This is a no-op that prepares for the migration of `copysign` with integral arguments, which would return a float.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147182
Approved by: https://github.com/dcci
2025-02-14 06:25:27 +00:00
331d5cf560 [inductor] [cpp] Support vectorization for score and mask in FlexAttention CPU (#143638)
## Description
We generate vectorized kernels for score and mask in FlexAttention with this PR.

## Modification
The main change include:
- For the input and output buffers of the mask and score functions, instead of passing scalars, we pass tensors.
- For the mask function, the original function, which works on a scalar, only includes the logic for calculating the mask value. This PR adds the logic of applying the mask to the qk_data tensor into the graph, and then leverages the CPP backend to generate vectorized kernels.
  The original mask graph:
  ```python
  def mask_fn(b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      return mask
  ```
  The converted_mask_graph should be:
  ```python
  def converted_mask_fn(qk_data, b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      qk_data = torch.where(mask, qk_data, torch.full_like(qk_data, -float("inf")))
      return qk_data
  ```

## Benchmark
For q, k, v of shape: `[1, 32, 1024, 128]`, using 40 CPU cores, we observe over 20x speedup compared with the non vectorized version for both `is_causal` = `False` and `True`.

## Test plan
The existing FlexAttention UTs (`test/inductor/test_flex_attention.py`, `test/inductor/test_flex_decoding.py`) can cover the change in this PR.

## Output code

**Code before this PR is in scalar version:**

```cpp
// apply score mod function
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* in_ptr0 = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    accum_t* out_ptr0 = in_ptr0;
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                out_ptr0[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    }
}
// Apply block mask, fill unused with -inf
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* qk_block = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    std::vector<int64_t> temp = {0};
    int64_t* out_ptr1 = temp.data();
    {
        {
            {
                auto tmp0 = static_cast<bool>(true);
                out_ptr1[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    *qk_block = *out_ptr1 != 0
                    ? *qk_block
                    : -std::numeric_limits<accum_t>::infinity();
    }
}
```

**Code after this PR will be vectorized:**
```cpp
accum_t* in_ptr0 = qk_data;

auto in_ptr1 = b_idx.data();
auto in_ptr2 = h_idx.data();
auto in_ptr3 = q_idx.data();
auto in_ptr4 = kv_idx.data();

// apply score mod function
{

    accum_t* out_ptr0 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                    }
                }
            }
        }
    }

}

// Apply block mask, fill unused with -inf
{

    accum_t* out_ptr1 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        auto tmp1 = static_cast<bool>(true);
                        auto tmp2 = -std::numeric_limits<float>::infinity();
                        auto tmp3 = at::vec::VecMask<float,1>::from(tmp1);
                        auto tmp4 = at::vec::Vectorized<float>(tmp2);
                        auto tmp5 = decltype(tmp0)::blendv(tmp4, tmp0, tmp3.template cast<float,1>());
                        tmp5.store(out_ptr1 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        for (int64_t x1_tail = static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))));x1_tail < static_cast<int64_t>(cur_kvSplitSize); x1_tail++)
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)];
                            auto tmp1 = static_cast<bool>(true);
                            auto tmp2 = -std::numeric_limits<float>::infinity();
                            auto tmp3 = tmp1 ? tmp0 : tmp2;
                            out_ptr1[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)] = tmp3;
                        }
                    }
                }
            }
        }
    }

}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143638
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-02-14 05:26:18 +00:00
ce38bfd299 [executorch hash update] update the pinned executorch hash (#147157)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147157
Approved by: https://github.com/pytorchbot
2025-02-14 05:04:17 +00:00
92f669e39c [BE] Use c10::multiply_integers in cholesky_impl (#147163)
That replaces an explicit for loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147163
Approved by: https://github.com/huydhn
2025-02-14 03:59:17 +00:00
2d089a5697 [dynamo] Remove unintended lru_cache (#147147)
I forgot to remove it while adding the frozenset `__contains__` method in this PR
- https://github.com/pytorch/pytorch/pull/146062?fbclid=IwZXh0bgNhZW0CMTEAAR3S_qq8bYxO7pDuHqpr2X-vqkXQrY0KtT14z46bfuRDYikjJBet3uKF2dE_aem_o1c7I4eawKyaEsfiWhnTmw

This was causing a memory leak.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147147
Approved by: https://github.com/williamwen42
2025-02-14 03:55:39 +00:00
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on 3.9, we can use these nice str builtins, which are more readable and more efficient.
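An illustrative before/after (the example string is chosen arbitrarily):

```python
name = "aten::relu"

# Before (the pattern FURB188 flags): manual slicing guarded by startswith
short = name[len("aten::"):] if name.startswith("aten::") else name

# After: the 3.9+ builtin says the same thing directly
short = name.removeprefix("aten::")
print(short)  # relu
```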

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
cyy
d473c212fd Remove code for Python < 3.9 (#147097)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147097
Approved by: https://github.com/albanD
2025-02-14 03:22:49 +00:00
880e176544 [inductor] Fix for pattern file contains 'getitem' fails during impor… (#144980)
…t of the pattern module

  For example, any pattern module that has the following pattern generated fails to import because
  the name `getitem` is undefined.

  native_dropout_default = CallFunction(aten.native_dropout.default, div_Tensor_1, KeywordArg('dropout_p'), True, _users=2)
  getitem = CallFunction(getitem, native_dropout_default, 0)

  This fix resolves the error.

Fixes #144674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144980
Approved by: https://github.com/eellison
2025-02-14 02:30:24 +00:00
0b84311842 [export] Generate printers/parsers for serialization enum values. (#147126)
Summary:
Generate two helper functions for enum classes in generated_serialization_types.h

printEnum: will convert enum values into strings.
parseEnum: will convert strings into enum values.

Test Plan: CI

Differential Revision: D69604850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126
Approved by: https://github.com/yiming0416
2025-02-14 02:14:35 +00:00
05001f0459 Add Structured Tracing for Traced Graph Edge Details for AC Debugging (#146634)
Summary:
Updating the structured trace infrastructure so that we are able to output to Zoomer and have an E2E solution.

Context Doc: https://docs.google.com/document/d/1T6omIBEWVhbOiwDLSLffgQwjxiT2rQv8QvvQwXkw4fY/edit?usp=sharing

Test Plan:
### Testing Structured Log + tlparse locally

Command:
```
TORCH_TRACE=/data/users/basilwong/fbsource/fbcode/log_torch_trace buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```

Torch Trace Logs (local then sent to paste): P1686419449
```
cat log_torch_trace/dedicated_log_torch_trace_rank_0_2lg012xo.log | pastry
P1686419449
```

tlparse output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

tlparse graph edge details output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/9_0_0/joint_graph_information_397.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D61557220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146634
Approved by: https://github.com/jansel, https://github.com/Yuzhen11
2025-02-14 02:04:26 +00:00
486fc12d7e torch: Log a unified waitcounter for torch.compile and triton.autotune (#146723)
Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up.

Test Plan: Waitcounter only change, relying on existing tests.

Differential Revision: D69215401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723
Approved by: https://github.com/davidberard98
2025-02-14 02:04:13 +00:00
f0bdc27f74 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
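A hedged sketch of the call (mirroring the common CUDA usage; the exact CPU constraints and whether the emulated fallback is taken depend on the platform):

```python
import torch

a = torch.randn(16, 32).to(torch.float8_e4m3fn)
b = torch.randn(64, 32).to(torch.float8_e4m3fn).t()  # column-major second operand
scale_a = torch.tensor(1.0)
scale_b = torch.tensor(1.0)
out = torch._scaled_mm(a, b, scale_a, scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([16, 64])
```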

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-14 02:03:53 +00:00
c5a9e4a6a0 [Inductor][CPP] Fix a CPP GEMM Template output data type issue (#146958)
**Summary**
Issue found when fixing https://github.com/pytorch/ao/issues/1662. An FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used a slice of the global buffer `Y` as the output of the micro GEMM instead of a local buffer. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix ensures the use of a local buffer in such cases.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958
Approved by: https://github.com/jgong5
2025-02-14 01:40:08 +00:00
d3524ecdd6 [Break XPU] Align meta calculation for fft_r2c with _fft_r2c_mkl (#146763)
Fix #146761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146763
Approved by: https://github.com/jansel
ghstack dependencies: #146762, #145248, #146880
2025-02-14 01:39:18 +00:00
ade5af9430 [XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880)
Fix #146879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #146762, #145248
2025-02-14 01:39:18 +00:00
9befdf565a [Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_indutor.py (#145248)
Fix #145247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #146762
2025-02-14 01:39:11 +00:00
972e927134 [Break XPU][Inductor UT] Fix XPU Inductor UT failures introduced from community. (#146762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146762
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
2025-02-14 01:38:50 +00:00
6419076db9 [torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117)
Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of rocm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi.

Test Plan:
CI
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024')  RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce
```

Differential Revision: D69597647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117
Approved by: https://github.com/malfet
2025-02-14 01:11:59 +00:00
20a369aa3a [Intel GPU] Avoid copy when the input of Matmul is broadcasted (#143784)
Avoid a copy when the input of Matmul is 3D and broadcasted on the batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcasted into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 improves from 42 ms to 13.7 ms with torch.compile and from 57.5 ms to 32 ms in eager mode.
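The shape pattern in question looks like this (an XPU build is assumed for the device placement; the optimization itself is transparent to the user):

```python
import torch

a = torch.randn(128, 64, 32, device="xpu", dtype=torch.float16)  # batched src
b = torch.randn(1, 32, 16, device="xpu", dtype=torch.float16)    # batch dim is 1
out = torch.matmul(a, b)  # (128, 64, 16); b's batch dim is broadcast without a copy
```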

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143784
Approved by: https://github.com/EikanWang
2025-02-14 00:48:07 +00:00
057bcd3a45 [ca] eliminate duplicate getitem graph nodes for shape inputs (#146875)
should reuse existing proxies instead of creating new ones

before: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpL7hmHe/0_-_-_0/compiled_autograd_graph_3.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0];  getitem_3 = None
        getitem_4 = sizes[1];  getitem_4 = None
        getitem_5 = sizes[2];  getitem_5 = None
        getitem_6 = sizes[3];  getitem_6 = None
        getitem_7 = sizes[4];  getitem_7 = None
        getitem_8 = sizes[5];  getitem_8 = None
        getitem_9 = sizes[6];  getitem_9 = None
        getitem_10 = sizes[7];  getitem_10 = None
        getitem_11 = sizes[8];  getitem_11 = None
        getitem_12 = sizes[9];  getitem_12 = None
        getitem_13 = sizes[10];  getitem_13 = None
        getitem_14 = sizes[11];  getitem_14 = None
        getitem_15 = sizes[12];  getitem_15 = None
        getitem_16 = sizes[13];  getitem_16 = None
        getitem_17 = sizes[14];  getitem_17 = None
        getitem_18 = sizes[15];  getitem_18 = None
        getitem_19 = sizes[0]
        getitem_20 = sizes[1]
        getitem_21 = sizes[2]
        getitem_22 = sizes[3]
        getitem_23 = sizes[4]
        getitem_24 = sizes[5]
        getitem_25 = sizes[6]
        getitem_26 = sizes[7]
        getitem_27 = sizes[8]
        getitem_28 = sizes[9]
        getitem_29 = sizes[10]
        getitem_30 = sizes[11]
        getitem_31 = sizes[12]
        getitem_32 = sizes[13]
        getitem_33 = sizes[14]
        getitem_34 = sizes[15];  sizes = None
```

after: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpCo5T6B/0_-_-_0/compiled_autograd_graph_1.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0]
        getitem_4 = sizes[1]
        getitem_5 = sizes[2]
        getitem_6 = sizes[3]
        getitem_7 = sizes[4]
        getitem_8 = sizes[5]
        getitem_9 = sizes[6]
        getitem_10 = sizes[7]
        getitem_11 = sizes[8]
        getitem_12 = sizes[9]
        getitem_13 = sizes[10]
        getitem_14 = sizes[11]
        getitem_15 = sizes[12]
        getitem_16 = sizes[13]
        getitem_17 = sizes[14]
        getitem_18 = sizes[15];  sizes = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146875
Approved by: https://github.com/jansel
ghstack dependencies: #146720, #146735
2025-02-13 21:41:33 +00:00
76dacd5fc7 [ca] log graph before reodering passes (#146735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146735
Approved by: https://github.com/jansel
ghstack dependencies: #146720
2025-02-13 21:41:33 +00:00
cdbf677cdd Remove outdated comment in ATen/mkl/Sparse.h about lack of Windows support (#147125)
Fixes #147124.

* #102604 added support for Intel oneMKL Sparse BLAS APIs so there was an outdated comment left around in the codebase that can now be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147125
Approved by: https://github.com/janeyx99
2025-02-13 21:34:05 +00:00
1f41ceb713 [BE][Ez]: Enable ruff rule banning print in assert (#146615)
Enables a few ruff rules
* Ban print statements within asserts (likely bugs; see the sketch below)
* ~Use string for Decimal literal to prevent loss of precision~
* ~Do not use default args for __post__init__ in dataclasses, they likely were meant to go into the factory method, the __init__, or somewhere else. The default values are useless here.~

Wait until ruff upgrade for the last 2
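
For context, a small illustration (not from the PR) of why a `print` call inside an `assert` is usually a bug:

```python
# Hypothetical examples, not taken from the PyTorch codebase.

# (1) Asserting on the return value of print(): print() returns None,
#     so the assertion always fails, no matter what was printed.
#
#     assert print("checking invariant")   # raises AssertionError unconditionally

# (2) Using print() as the assertion message: when the check fails, the
#     message expression is evaluated, the text is printed as a side effect,
#     and the AssertionError itself carries None instead of the message.
x = -1
try:
    assert x >= 0, print(f"x must be non-negative, got {x}")
except AssertionError as e:
    print("assertion args:", e.args)  # prints: assertion args: (None,)
```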

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615
Approved by: https://github.com/jansel
2025-02-13 21:14:00 +00:00
5469e5c556 [export] Minor fix to locals (#146955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146955
Approved by: https://github.com/bobrenjc93
2025-02-13 20:29:15 +00:00
7b4efb492b [inductor][refactor] Make _compile_file only used for fbcode (#147106)
Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder.

Test Plan: CI

Differential Revision: D69592025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106
Approved by: https://github.com/yushangdi
2025-02-13 20:22:31 +00:00
2d3db4509a fix pt2e block wise quantization test (#147035)
Differential Revision: D69559217

https://github.com/pytorch/pytorch/pull/145941 breaks the unit test added for prepare pt2e + block wise quantization. Fixing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147035
Approved by: https://github.com/andrewor14
2025-02-13 19:44:56 +00:00
b0553cee6b [Utilization] post-test-process workflow (#145310)
# Overview
Add reusable workflow to trigger the post-test right after each test job is complete.

Cousion with pr to setup the runner permissions:
Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files
add to lix fleet:https://github.com/pytorch/ci-infra/pull/322/files

Currently I turn on the debug flag for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310
Approved by: https://github.com/huydhn
2025-02-13 18:51:19 +00:00
260b21b8bc [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" failures. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-13 18:36:16 +00:00
92d448ff62 Add self to CODEOWNERS for fx/proxy.py; warn against adding new node arg types (#147031)
Not sure if there's a better way

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147031
Approved by: https://github.com/StrongerXi
ghstack dependencies: #147016, #147012, #147013
2025-02-13 18:21:21 +00:00
9a883007a2 Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)"
This reverts commit c7515da7b00de40942c83dc5856b6daec727e280.

Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))
2025-02-13 18:04:26 +00:00
65e8862b9a Revert "[cond] make cond re-dispatch in proxy mode (#146954)"
This reverts commit 2ce6de2415fb6592dd4447ebea334fd12b8c31ea.

Reverted https://github.com/pytorch/pytorch/pull/146954 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert it to cleanly revert 140979 ([comment](https://github.com/pytorch/pytorch/pull/146954#issuecomment-2657357742))
2025-02-13 18:02:33 +00:00
1f8ff6812d [Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836)
Fixes: https://github.com/pytorch/pytorch/issues/146740

Description:
1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases.

Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836
Approved by: https://github.com/malfet
2025-02-13 17:49:26 +00:00
447a142de2 support input mutations on tangents in compile (#141131)
Fixes https://github.com/pytorch/pytorch/issues/141111. We previously supported mutations on saved activations that happened in the backward. This PR extends the support to tangents

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141131
Approved by: https://github.com/zou3519
2025-02-13 17:48:56 +00:00
7077d0ac8c [DCP] Introduce modules metadata in the storage_meta (#146654)
Summary: Introduce the list of modules in the storage_meta which is shared between the planner and the storage writer. We will use it to let the storage writer know about the modules in the state dict and create module directories in the checkpoint.

Test Plan: UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146654
Approved by: https://github.com/MeetVadakkanchery
2025-02-13 17:44:30 +00:00
938209fb6f Revert "Use 2022 as default VC_YEAR for windows builds (#147053)"
This reverts commit 858bc0cea50614d1e190e6991d974ddb0f53fc88.

Reverted https://github.com/pytorch/pytorch/pull/147053 on behalf of https://github.com/atalman due to Broke windows tests ([comment](https://github.com/pytorch/pytorch/pull/147053#issuecomment-2657239501))
2025-02-13 17:09:37 +00:00
683178fabc [cuda] fix printing of num_gpus (#146838)
Previously, on machines with fewer than 8 gpus, the device==7 case would trigger the assert inside getDeviceProperties and print `num_gpus=BEL`, which is ASCII for 7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146838
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-02-13 15:23:35 +00:00
020232ec9f [Submodule]: Update KleidiAI submodule to v1.3.0 (#146480)
Change-Id: I687255982c72ee7daca438a15b718f07298963cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-02-13 15:23:04 +00:00
df776d64f7 chore: fix typos in error messages in FSDP (#146805)
Fixes two small typos in FSDP error messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-02-13 15:22:13 +00:00
345f556628 Fix DispatchStub.cpp compilation for gcc 14 (#146512)
Otherwise I get the following error:

```bash

.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: error: no matching function for call to ‘find(std::array<c10::DeviceType, 7>::const_iterator, std::array<c10::DeviceType, 7>::const_iterator, const c10::DeviceType&)’
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/14/bits/locale_facets.h:48,
                 from /usr/include/c++/14/bits/basic_ios.h:37,
                 from /usr/include/c++/14/ios:46,
                 from /usr/include/c++/14/ostream:40,
                 from .../intel-xpu-backend-for-triton/pytorch/c10/core/DeviceType.h:13,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.h:3,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:2:
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)’
  435 |     find(istreambuf_iterator<_CharT> __first,
      |     ^~~~
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note:   template argument deduction/substitution failed:
.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: note:   mismatched types ‘std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >’ and ‘const std::array<c10::DeviceType, 7>::value_type*’ {aka ‘const c10::DeviceType*’}
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146512
Approved by: https://github.com/Skylion007
2025-02-13 15:21:59 +00:00
7c3b2a29ec [subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146897
Approved by: https://github.com/bdhirsh
2025-02-13 15:21:19 +00:00
e2479d7809 Update slow tests (#146822)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146822
Approved by: https://github.com/pytorchbot
2025-02-13 15:20:58 +00:00
aeabbffe15 Disable test with dynamo for schema gen (#146865)
Fixes https://github.com/pytorch/pytorch/issues/141202.

1. We skip the schema gen tests under dynamo. https://github.com/pytorch/pytorch/issues/141202 fails in a weird way: it claims the node is an integer even though we have isinstance checks [here](https://github.com/pytorch/pytorch/blob/main/torch/_library/utils.py#L234-L241). This is probably dynamo messing with the stacks, and checking fx.Node isn't really what dynamo is designed for.
2. We move some legitimate cond tests out of schema gen and back into the control flow tests. Also rename _test_export to a lengthier name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146865
Approved by: https://github.com/zou3519
2025-02-13 15:20:52 +00:00
67c4c39b4f [docs] Minor fixes to export and aoti docs (#144513)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144513
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-02-13 15:19:35 +00:00
d1997b610f update kineto submodule (#147015)
Fix https://github.com/pytorch/kineto/issues/1032
See https://github.com/pytorch/kineto/pull/1035 for testplan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015
Approved by: https://github.com/sraikund16, https://github.com/Skylion007
2025-02-13 15:13:18 +00:00
8d94eb1e3b [BE]: Make OrderedSet reversible (#146904)
It's rather trivial to make OrderedSet reversible, so let's do it and unlock that additional functionality for downstream users.
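
A minimal sketch (assuming a dict-backed design, not the actual `OrderedSet` implementation) of why making an ordered set reversible is trivial:

```python
class OrderedSetSketch:
    """Illustrative dict-backed ordered set; dict keys preserve insertion order."""

    def __init__(self, items=()):
        self._dict = dict.fromkeys(items)

    def add(self, item):
        self._dict[item] = None

    def __iter__(self):
        return iter(self._dict)

    def __reversed__(self):
        # dicts support reversed() since Python 3.8, so this is all it takes.
        return reversed(self._dict)


s = OrderedSetSketch([3, 1, 2])
assert list(reversed(s)) == [2, 1, 3]
```
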
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146904
Approved by: https://github.com/eellison
2025-02-13 15:11:48 +00:00
858bc0cea5 Use 2022 as default VC_YEAR for windows builds (#147053)
New Windows AMI does not have Visual Studio 2019. Hence use 2022 as default.
See: https://github.com/pytorch/test-infra/pull/6226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147053
Approved by: https://github.com/huydhn
2025-02-13 14:37:55 +00:00
f95bdf5e6c Make GetCPUAllocatorMaybePinned to be Device-Agnostic (#146687)
----

- Keep cuda first to preserve BC
- Remove cuda first if it is possible to have only one accelerator at a time in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146687
Approved by: https://github.com/ngimel
2025-02-13 13:09:48 +00:00
e21181642f [AOTInductor] Align behavior between CPU and GPU (#145459)
Summary:
(1) Make sure CPU and GPU don't have different implementations and behavior when called from the same path and API. The only difference between CPU and GPU after this PR should be the running hardware.
(2) This PR fixes the issue of memory access when it==constants_map.end()
(3) This PR resolves T179437596

Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu

Differential Revision: D68540744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459
Approved by: https://github.com/desertfire, https://github.com/hl475
2025-02-13 09:50:18 +00:00
ca3aabc8e6 [Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a lowering pass for `torch.ops.aten._weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planned to be enabled for this op in subsequent PRs.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4
python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #145245
2025-02-13 08:40:12 +00:00
17d3a69c32 [Intel GPU] fix memory leak in deconv backward (#144385)
Fixes #143807

We need to manage the onednn scratchpad in pytorch, otherwise onednn will always allocate scratchpad memory during primitive execution, which causes a memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144385
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-02-13 07:41:34 +00:00
43496e9b90 [NJT] fix flop counter for SDPA & test (#147032)
Fixes 3 issues:
1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed.
2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added)
3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor.

Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032
Approved by: https://github.com/jbschlosser
2025-02-13 07:14:58 +00:00
b9a22b3f37 bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623)
This PR addresses an issue in the MPS backend for `_scaled_dot_product_attention_math_mps`: a 3d input like (num_heads, seq_len, query_dim) cannot be automatically treated as (1, num_heads, seq_len, query_dim) the way it is inferred on cpu or cuda. This is fixed by adding a util function to ensure a 4d shape.

The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835: in [transformers qwen2_vl](1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)), 3d q/k/v were passed into the sdpa function, which led to an error.

Considering consistency, since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same intuition across all platforms.
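
A minimal sketch of the kind of normalization the fix performs (the helper name and exact placement are assumptions, not the actual implementation):

```python
import torch

def _ensure_4d(t: torch.Tensor) -> tuple[torch.Tensor, bool]:
    # Treat a 3d (num_heads, seq_len, head_dim) tensor as
    # (1, num_heads, seq_len, head_dim), matching what cpu/cuda infer.
    if t.dim() == 3:
        return t.unsqueeze(0), True
    return t, False

q = torch.randn(16, 16, 80)
q4, was_3d = _ensure_4d(q)
print(q4.shape)  # torch.Size([1, 16, 16, 80])
```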

---
reproduce code:
```
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1

q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"),
    k.to("cpu"),
    v.to("cpu"),
    attention_mask.to("cpu"),
    dropout_p=0.0
)

if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"),
        k.to("mps"),
        v.to("mps"),
        attention_mask.to("mps"),
        dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```

error outputs:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

hardware and envs:
```
torch               2.6.0
apple m3 max
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-13 07:00:51 +00:00
17a808557c [MPS] cholesky ex version (#146799)
PR #145701 didn't include an experimental version of cholesky. This PR adds that version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799
Approved by: https://github.com/malfet
2025-02-13 07:00:21 +00:00
4879f8f919 [TP] Add warning when module is distributed twice (#147006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006
Approved by: https://github.com/XilunWu
2025-02-13 06:49:17 +00:00
3e4172d985 [BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985)
This submodule update just fixes a bunch of miscellaneous issues: ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985
Approved by: https://github.com/drisspg
2025-02-13 06:47:11 +00:00
aa20b4b6cf Friendly handle mem_get_info's runtime error message (#146899)
# Motivation
Friendly handle the runtime error message if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899
Approved by: https://github.com/EikanWang
2025-02-13 06:26:19 +00:00
66fb10fc53 [BE][OpInfo] Introduce generic dtypesIf (#146905)
Use `__setattr__` and `__getattribute__` to wrap the existing `dtypesIfXYZ` attributes into a generic mechanism, which will allow their subsequent incremental elimination.

Also, type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality it's converted to set in post init.
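
A rough sketch (names and structure are assumptions, not the real `OpInfo`) of how `__getattribute__` can route the legacy `dtypesIfXYZ` attributes through one generic per-device mapping:

```python
class OpInfoSketch:
    def __init__(self, dtypes, dtypes_if=None):
        self.dtypes = set(dtypes)
        # Generic per-device mapping, e.g. {"CUDA": {...}, "XPU": {...}}.
        self._dtypes_if = dict(dtypes_if or {})

    def __getattribute__(self, name):
        if name.startswith("dtypesIf"):
            per_device = object.__getattribute__(self, "_dtypes_if")
            device = name[len("dtypesIf"):]
            # Fall back to the generic dtypes if nothing device-specific is set.
            return per_device.get(device, object.__getattribute__(self, "dtypes"))
        return object.__getattribute__(self, name)


info = OpInfoSketch({"float32"}, {"CUDA": {"float32", "float16"}})
assert info.dtypesIfCUDA == {"float32", "float16"}
assert info.dtypesIfXPU == {"float32"}  # falls back to the generic dtypes
```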

Test Plan:
 - Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script
 ```python
from torch.testing._internal.common_methods_invocations import op_db
print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']})
```
 - CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905
Approved by: https://github.com/janeyx99
2025-02-13 05:33:17 +00:00
43eb39d7c8 [executorch hash update] update the pinned executorch hash (#145128)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145128
Approved by: https://github.com/pytorchbot
2025-02-13 05:06:44 +00:00
88d0bb0fee [aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020)
Summary:
per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info.

can be useful in identifying issues such as "wrong AOTI Lowering precisions"

Test Plan:
```
 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm
```

Differential Revision: D69547344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-02-13 04:41:00 +00:00
2ff3fdfdae [audio hash update] update the pinned audio hash (#146738)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146738
Approved by: https://github.com/pytorchbot
2025-02-13 04:29:46 +00:00
936df4571b Update test_c10d_object_collectives.py with DistributedTestBase class (#145056)
# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](https://github.com/pytorch/pytorch/pull/138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

Replaced MultiProcessTestCase with the DistributedTestBase class.
Extended test functionality to include support for HPUs.
Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-02-13 03:57:59 +00:00
a9598337b7 [Optimus] Include more corner cases in the select cat aten pass (#146662)
Summary: Thanks to Shuai for reporting the bug in the pattern. We found there's a typo in the pass, where we should make sure all the selects will go to the cat node.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad

Buck UI: https://www.internalfb.com/buck2/2cd0888e-d803-43a8-8530-d97e6bc281b3
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449699305108
Network: Up: 110KiB  Down: 35KiB  (reSessionID-687be0fa-031a-47a0-8780-5ab4cf4bbd94)
Executing actions. Remaining     0/4                                                                              6.6s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:12.0s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D69278487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146662
Approved by: https://github.com/Microve
2025-02-13 03:40:26 +00:00
6ca497a8e5 Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/huydhn
2025-02-13 03:29:39 +00:00
c159723c39 Fix meta impl for topk (#147017)
Topk in this context is always size-like so we should use torch._check_is_size. Fixes some issue in https://github.com/pytorch/pytorch/issues/146990
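
A hedged sketch (not the actual meta kernel) of what marking `k` as size-like with `torch._check_is_size` looks like in a topk-style meta function:

```python
import torch

def topk_meta_sketch(x: torch.Tensor, k: int, dim: int = -1):
    # Marking k as size-like tells the symbolic-shapes machinery that 0 <= k,
    # which avoids data-dependent errors when k is an unbacked SymInt.
    torch._check_is_size(k)
    torch._check(k <= x.size(dim), lambda: f"k ({k}) out of range for dim {dim}")
    out_shape = list(x.shape)
    out_shape[dim] = k
    values = x.new_empty(out_shape)
    indices = x.new_empty(out_shape, dtype=torch.long)
    return values, indices

v, i = topk_meta_sketch(torch.empty(4, 10, device="meta"), 3)
print(v.shape, i.dtype)  # torch.Size([4, 3]) torch.int64
```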

Differential Revision: [D69545983](https://our.internmc.facebook.com/intern/diff/D69545983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147017
Approved by: https://github.com/ydwu4
2025-02-13 03:18:47 +00:00
821422018a [FlexAttention] Make zero_length sequence handling better (#147010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147010
Approved by: https://github.com/Chillee
2025-02-13 03:18:24 +00:00
54e28b2a71 [BE] Turn nextafter into functor (#147018)
This functor is a bit more involved as nextafter is missing for MacOS13
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147018
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993, #147023
2025-02-13 02:10:29 +00:00
aaa46c0625 Add missing autoreleasepool around runUniqueGraph to prevent leaks (#145512)
References were held onto longer than needed. Added autoreleasepool around the runUniqueGraph to allow the memory to be freed.

Fixes #145151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145512
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-13 01:58:18 +00:00
e0ca041ae3 [BE] Toward Metal Iterator (step 2) (#147023)
Add a dense flavor of the binary ops, i.e. if the iterator is contiguous, do not build indices but rather run a different flavor using the same functor, which results in almost a 100% perf gain for a binary operation over 1 mln elements with `torch.fmax`, as one can see from the table below, collected on an M4 Pro Mini using the following benchmarking script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

| Dtype | Time before | Time after |
| ------ | ----------- | ---------- |
| float32  | 0.84 msec  | 0.66 msec |
| float16  |  0.49 msec |  0.23 msec |
| bfloat16  | 0.48 msec  | 0.22 msec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147023
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993
2025-02-13 01:50:43 +00:00
80f146dedf Update addbmm, addmm, addmv and baddbmm description (#146689)
Fixes #146611, following #146482

## Test Result

![image](https://github.com/user-attachments/assets/5c1749be-1f10-4e80-a284-b1929ca340eb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146689
Approved by: https://github.com/mikaylagawarecki
2025-02-13 01:30:50 +00:00
5dab0aeef0 [SkipFiles] Some more cleanup (#147013)
This isn't a no-op but I think it's fine. It changes the case where a
function f1 in a module in MOD_SKIPFILES calls a function f2 in one of
the deleted modules. Previously f2 would have been skipped, now f2 gets
inlined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016, #147012
2025-02-13 01:18:47 +00:00
fddaa2958b [SkipFiles] Some more cleanup (#147012)
I think these are all no-ops.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016
2025-02-13 01:18:47 +00:00
87ebd77b34 Add some more docs to trace_rules.py (#147016)
After discussing with Yanbo we wanted to record the behavior down so we
don't need to rederive them in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016
Approved by: https://github.com/yanboliang
2025-02-13 01:18:39 +00:00
b77a6eb184 [dynamo] Fix tensordict regression (#146995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146995
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819
2025-02-13 00:59:59 +00:00
2ce6de2415 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-13 00:50:33 +00:00
67cbbb29e0 [export] Dedup expression_created logs (#146859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146859
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533, #146534, #146858
2025-02-13 00:21:34 +00:00
59bc5d0d71 [tlparse] Add stacktrace filter utility (#146858)
Added a utility function for capturing the user stack and framework stacktrace.
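
As a generic illustration (not the actual utility added in this PR), splitting a captured stack into user frames and framework frames can be done along these lines:

```python
import traceback

def capture_split_stack(framework_marker="torch/"):
    # Drop the last frame (this helper itself) and bucket the rest by
    # whether the source file lives inside the framework.
    frames = traceback.extract_stack()[:-1]
    user = [f for f in frames if framework_marker not in f.filename]
    framework = [f for f in frames if framework_marker in f.filename]
    return user, framework

user_frames, framework_frames = capture_split_stack()
print(len(user_frames), len(framework_frames))
```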

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146858
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146532, #146533, #146534
2025-02-13 00:21:34 +00:00
43f5566c92 [export] Add additional tlparse logging (#146534)
Added some additional logging so we can also run tlparse on generic export errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146534
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533
2025-02-13 00:21:34 +00:00
b4bdbce1ac [export] Use custom stream logger in draft-export (#146533)
Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:

```python
        if key == "missing_fake_kernel":
            return hash((key, data["op"]))  # Same ops get deduped
        elif key == "mismatched_fake_kernel":
            return hash((key, data["op"], data["reason"]))  # Same op and reason for errors get deduped
        elif key == "propagate_real_tensors":
            return hash((key, json.dumps(data["stack"])))  # Guards appearing on the same stacktrace get deduped
        elif key == "create_unbacked_symbol":
            return hash((key, json.dumps(data["stack"])))  # Unbacked symbols appearing on the same stacktrace get deduped
```

Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code which creates a new unbacked symint + runs into a DDE gets called 800 times, causing 800 new symints to be created, and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this.

The con of this is that if there exist multiple DDEs on the same stacktrace, we will only show the first issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #146532
2025-02-13 00:21:34 +00:00
be387f57b1 [symbolic shapes] Log SymNode id for provenance (#146532)
We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse:
<img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146532
Approved by: https://github.com/bobrenjc93
2025-02-13 00:21:34 +00:00
21c2565f35 Document dynamo (#146736)
Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that.

Note: documentation was AI-generated and could be incorrect, please review carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736
Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519
2025-02-13 00:02:21 +00:00
0344bf8a5a [cuDNN] cuDNN to 9.7.1.26 for CUDA 12.8 (#146957)
rebasing for https://github.com/pytorch/pytorch/pull/146717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146957
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-12 23:43:34 +00:00
d5a2e4c754 [oncall] Change error message to be more readable (#146934)
Summary:
During oncall, I got a debug request where the error message is a bit ambiguous due to multiple colons and the full line being cut off
```
AssertionError: Expected order: 1 for the component: remote_request_only to be >= 2, the max order for all its
```

Update the error message to something like
```
AssertionError: Component remote_request_only order must be >= max order of its upstream components, got component order=1 and max=2
```

Test Plan: CI

Differential Revision: D69482789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146934
Approved by: https://github.com/ColinPeppler
2025-02-12 23:33:09 +00:00
ad4e5bf705 cpp_wrapper: handle mixed-device C-shim fallbacks (#146449)
Fixes an error from test_torch, where a CUDA cpp_wrapper run called a CUDA native C-shim kernel with two CPU tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146449
Approved by: https://github.com/desertfire
2025-02-12 23:21:04 +00:00
076215944a Turn on autograd local caches in fbcode (#146996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146996
Approved by: https://github.com/jamesjwu
2025-02-12 23:04:39 +00:00
c60f587c04 Fix shape_inference for V-schedules (#147000)
I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan.

`self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules:
bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)

Will clean up / delete the use of next_rank / prev rank in follow up PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000
Approved by: https://github.com/wconstab
2025-02-12 22:56:46 +00:00
f954aac6be Add make_dynamo_test (#146491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146491
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/malfet
2025-02-12 22:54:29 +00:00
fd21126007 [ONNX] Deprecation message follow up (#147005)
Follow up on https://github.com/pytorch/pytorch/pull/146923 to address comments.

This pull request includes updates to the `torch/onnx` module, focusing on deprecations and documentation improvements. The most important changes involve moving version change notes within the `export` function, updating deprecation messages, and removing example code in the `dynamo_export` function.

Documentation and Deprecation Updates:

* [`torch/onnx/__init__.py`](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184): Moved version change notes to the correct location within the `export` function's docstring. Updated the deprecation note for the `dynamo_export` function to version 2.7 and removed example code from its docstring. [[1]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184) [[2]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R349-R357) [[3]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L434-R430) [[4]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L445-L475)

* [`torch/onnx/utils.py`](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114): Enhanced deprecation messages for several functions (`select_model_mode_for_export`, `disable_apex_o2_state_dict_hook`, `setup_onnx_logging`, `unconvertible_ops`) to provide clearer guidance on their removal and suggest copying logic if needed. [[1]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114) [[2]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL148-R151) [[3]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL166-R173) [[4]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1180-R1189) [[5]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1190-R1199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147005
Approved by: https://github.com/titaiwangms
2025-02-12 22:48:56 +00:00
f655f840b8 [ONNX][dort] Remove reference to onnxscript rewriter (#147003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147003
Approved by: https://github.com/titaiwangms, https://github.com/gramalingam, https://github.com/shubhambhokare1
2025-02-12 22:02:07 +00:00
995f607c74 fix doc string (#146968)
Fixes a wrong function name in doc string

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968
Approved by: https://github.com/zackycao, https://github.com/H-Huang
2025-02-12 21:43:16 +00:00
06a07f6018 [BE] Towards MetalTensorIterator (#146993)
Further refactor binary kernels to replace individual implementation with a binary_indexing_kernel template that takes functors that implement the logic.

According to godbolt, such refactoring should have no impact on performance, as the compiler should, through dead code elimination, just replace the functor with a direct call to the underlying function, as one can see for the clang CPU compiler here: https://godbolt.org/z/8dxv5jvz7. But to be on the safe side, run the following benchmark
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

That reports roughly identical before and after times (1 msec for float32 and .5 msec for float16)

Another interesting quirk: functors cannot be in an anonymous namespace, otherwise they will not be visible from the library, as one can see by running the following Swift sample (filed FB16490467 to clarify whether this is supported)
```swift
let shader_source = """
struct add_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a + b);
  }
};

namespace {
struct sub_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a - b);
  }
};
} // anonymous namespace

template <typename T, typename F>
kernel void binary_executor(
    constant T* input [[buffer(0)]],
    constant T* other [[buffer(1)]],
    device T* out [[buffer(2)]],
    uint tid [[thread_position_in_grid]]) {
  F f;
  out[tid] = f(input[tid], other[tid]);
}

template
[[host_name("add_float")]] kernel void binary_executor<float, add_functor>(constant float*, constant float *, device float*, uint);

template
[[host_name("sub_float")]] kernel void binary_executor<float, sub_functor>(constant float*, constant float *, device float*, uint);
"""

import Metal
guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:MTLCompileOptions())

// Expect two kernels to be printed, but see only one, with functor in global namespace
for kernel_name in library.functionNames {
  print(kernel_name)
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146993
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146965
2025-02-12 21:40:40 +00:00
de964b9f8b dont specialize symints when testing truthiness (#146731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146731
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146642, #146729
2025-02-12 20:57:10 +00:00
5cda021cac support meta_tensor.to(device='cpu') under fake_mode (#146729)
Fixing this is actually a bit annoying:

(1) FakeTensorMode sees a function where all of its inputs are real tensors, so it tries to run the real compute before converting the output to a FakeTensor

(2) we don't actually want this, because the "real compute" is supposed to error normally when you do `meta_tensor.to(device='cpu')`. Instead, we want FakeTensor to skip constant prop and run the normal FakeTensor implementation, which will not error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146729
Approved by: https://github.com/zou3519, https://github.com/SherlockNoMad, https://github.com/albanD
ghstack dependencies: #146642
2025-02-12 20:57:10 +00:00
ec0b318ddb [poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642)
context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/

This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box.

The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode.

Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :)

```
from torch._subclasses.fake_tensor import FakeTensorMode
import pickle
import io
import torch
from contextlib import nullcontext

use_fake_tensor = True
with FakeTensorMode() if use_fake_tensor else nullcontext():
    obj = [1, 2]
    f = io.BytesIO()
    pickle.Pickler(f).dump(obj)
    byte_storage = torch.ByteStorage._from_buffer(f.getvalue())  # type: ignore[attr-defined]

    t = torch.ByteTensor(byte_storage)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642
Approved by: https://github.com/zou3519
2025-02-12 20:57:10 +00:00
8a975cb247 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 5f2714d5e7cded0eb553d5915002e03c22e01e34.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to mistake on logging ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2654648949))
2025-02-12 19:26:45 +00:00
0de27ee7e0 Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852
Approved by: https://github.com/d4l3k
2025-02-12 18:43:52 +00:00
352484cc83 [BE] Unify kernel templates instantiation (#146965)
By defining a `REGISTER_BINARY_OP` template that can be used to register fmin, fmax, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146965
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-12 18:40:45 +00:00
7f62616a58 [ONNX][reland2] Create deprecation warning on dynamo_export (#146923)
Reland two PRs
- https://github.com/pytorch/pytorch/pull/146425
- https://github.com/pytorch/pytorch/pull/146639

Fixed by removing the deprecation warning on a base class `ExportOptions`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146923
Approved by: https://github.com/titaiwangms
2025-02-12 18:28:37 +00:00
5f2714d5e7 [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" failures. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-12 18:16:49 +00:00
bfcce6984b [ROCm][TunableOp] Close offline tuning results file when offline tuning is disabled. (#146574)
This PR is to fix UT breakage that has been reported internally and is considered high priority. When `tunable.record_untuned_enable(False)` is invoked, we flush the results of the untuned gemm file.

Offline tuning I/O currently doesn't have member functions to set the untuned results filename or to write untuned results to file. When performing back-to-back unit tests, the same ofstream ends up getting reused between UTs. Due to the way the UTs are executed, this can lead to unexpected failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146574
Approved by: https://github.com/jeffdaily
2025-02-12 18:03:06 +00:00
04011304e5 Update dynamo expected 20250210 (#146856)
Update all the ci accuracy expect values to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856
Approved by: https://github.com/yanboliang
2025-02-12 18:01:20 +00:00
d6513f3246 [dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819)
This PR adds support for list subclasses. Among other things are

1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.

If this PR is hard to review, please let me know. I can break it into several small PRs.
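
An illustrative example (not a test from this PR) of the kind of list-subclass code this aims to support under `torch.compile`:

```python
import torch

class MyList(list):
    def __getitem__(self, idx):
        # User override; reconstruction should bypass this and use
        # list.__getitem__ on the underlying storage instead.
        return super().__getitem__(idx)

@torch.compile
def f(xs, t):
    xs.append(t)  # mutation on the list subclass must be replayed
    return t + xs[0]

xs = MyList([torch.ones(2)])
out = f(xs, torch.ones(2))
print(len(xs), out)  # 2 tensor([2., 2.])
```
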

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-02-12 17:46:02 +00:00
6c81435f16 [ONNX] Update CI transformers cache (#146926)
The cached models are outdated because the related tests are all deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146926
Approved by: https://github.com/justinchuby
2025-02-12 17:02:43 +00:00
b894c2824b [ONNX] Support custom axis name through dynamic_shapes (#146321)
Fixes #143443

This PR aims to support custom dynamic axis naming through dynamic_shapes. Currently, _Dim and _DimHint do not support dynamic axis naming (#144273).

1. **the original dynamic shapes guarantee**
The axis renaming is only applied when dynamic shapes include strings instead of only _Dim and _DimHint. Thus, there will not be any behavior inconsistent with torch.export.export as long as the given dynamic shapes follow the torch.export.export format.
2. _DimHint.AUTO is applied to the axes that are specified with custom names to avoid exporter crash. (_DimHint.DYNAMIC crashes when the export fails.)
3.  There's no need to handle cases where kwargs are out of order with the model signature,
    as torch.export.export supports dynamism only when kwargs and dynamic_shapes are provided in order.
    49082f9dba/torch/export/_trace.py (L2034)
4. If `torch.onnx.ExportedProgram` finds the axes share the same constraints, they will have the same name (e.g. s0, s1, ...). Therefore, even if the ONNX users specify them with different custom names, they won't be respected.

Example model:
```python
        class NestedModel(torch.nn.Module):
            def forward(
                self,
                x: torch.Tensor,
                ys: list[torch.Tensor],
                zs: dict[str, torch.Tensor],
                c: torch.Tensor,
            ):
                y = ys[0] + ys[1] + zs["a"] + zs["b"]
                w = 5
                if x.shape[0] < 3 and c.shape[0] != 4:
                    return x + w, x + y, c
                else:
                    return x - w, x - y, c

        input = (
            torch.ones(5),
            [torch.zeros(5), torch.ones(5)],
            {"a": torch.zeros(5), "b": torch.ones(5)},
            torch.ones(6),
        )

        dynamic_shapes = (
            {0: torch.export.Dim("dim_x", min=3)},  # _Dim
            [("custom_name_axis_ys_0",), (torch.export.Dim.AUTO,)],  # custom name
            {
                "a": {0: torch.export.Dim.AUTO},
                "b": ("custom_name_axis_zs_b_0",),
            },  # _DimHint
            {0: "custom_name_axis_c_0"},  # custom name
        )

```
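
For context, a hedged sketch of how the definitions above could then be fed to the exporter; the `dynamo=True` path of `torch.onnx.export` and the surrounding names (`NestedModel`, `input`, `dynamic_shapes`) are assumed from the snippet above:

```python
import torch

onnx_program = torch.onnx.export(
    NestedModel(),
    input,
    dynamic_shapes=dynamic_shapes,
    dynamo=True,
)
print(type(onnx_program))
```
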
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146321
Approved by: https://github.com/justinchuby
2025-02-12 17:00:03 +00:00
9abaaad6a8 [pytree][Easy] preserve dict keys in insertion order in CXX pytree (#130140)
`optree` and JAX pytree traverse the `dict` in sorted key order (see [Key Ordering for Dictionaries](https://github.com/metaopt/optree#key-ordering-for-dictionaries)), while PyTorch Python pytree traverses the `dict` in insertion order. See also:

- #114392

This aligns the behavior of CXX pytree with Python pytree.
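
A small illustration (not from the PR) of the insertion-order behavior that Python pytree already has and that CXX pytree now matches:

```python
import torch.utils._pytree as python_pytree

leaves, spec = python_pytree.tree_flatten({"b": 1, "a": 2})
print(leaves)  # [1, 2] -- insertion order ("b" before "a"), not sorted-key order
```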

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130140
Approved by: https://github.com/zou3519
2025-02-12 16:41:49 +00:00
1f8ff94d4f PEP585: Add noqa to necessary tests (#146391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146391
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-12 15:29:50 +00:00
b61032fcf7 [BE][Ez]: Remove unnecessary type ignores from orderedset (#146902)
After #145783, we can remove some type ignores from the ordered set class
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146902
Approved by: https://github.com/eellison
2025-02-12 15:00:13 +00:00
ce80865f13 Revert "Replace is_same with is_same_v for concise syntax (#145450)"
This reverts commit 5205158c1b0bc5c390b2a9d83fe3b2ec5edbe3f2.

Reverted https://github.com/pytorch/pytorch/pull/145450 on behalf of https://github.com/jeanschmidt due to testing to see if reverting would fix timeout in inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/145450#issuecomment-2653645466))
2025-02-12 13:01:32 +00:00
b0042286d4 [Dynamo] Allow dynamo to handle str.xxx() (#146587)
Fixes #146350
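
A hedged illustration of the kind of string-method call this lets dynamo handle inside a compiled function:

```python
import torch

@torch.compile
def f(x, name: str):
    # str.upper()/str.startswith() style calls on a plain Python string.
    if name.upper().startswith("RELU"):
        return torch.relu(x)
    return x

print(f(torch.randn(3), "relu_1"))
```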

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146587
Approved by: https://github.com/zou3519
2025-02-12 08:54:10 +00:00
98e16012ec [Quant][CPU] add a wrapper op for _weight_int4pack_mm_for_cpu with tensor args (#145245)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle.
The new op is not registered to
- `aten` because it will require changing `native_functions.yaml`, which is not recommended.
- `quantized_decomposed` because it will only have a Python implementation, which cannot be used for cpp wrapper in Inductor.

**Test plan**
```
python test/test_linalg.py -k test__int4_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-02-12 08:46:38 +00:00
ac0f206f3c [dtensor] fix side-effect on dtype for _like ops (#146869)
fixes https://github.com/pytorch/pytorch/issues/146749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869
Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel
2025-02-12 08:42:14 +00:00
d774a6333d [StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493)
%getattr_368.1 : int = prim::dtype(%18267)
%to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted)
%lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8)
```

After optimization:
```
%18297 : int = prim::dtype(%int_77.1)
%18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297)
```

Reviewed By: garroud

Differential Revision: D69373835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931
Approved by: https://github.com/hanyilou123
2025-02-12 08:19:41 +00:00
ae5cc19ba7 [pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866)
This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass.

Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice.

On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR).

**Important Note**: Currently the softmax backward pass consists of an [element-wise multiply operator](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)), followed by [this function](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%.

```
Baseline						This PR
N	size	FP32 bandwidth	FP16 bandwidth		N	size	FP32 bandwidth	FP16 bandwidth		fp32 diff	fp16 diff
0	256	134.340966	70.042039		0	256	133.70146	70.342753		-0.48%	0.43%
1	512	233.501185	129.945803		1	512	234.057145	132.933066		0.24%	2.30%
2	1024	340.667966	229.280464		2	1024	338.833265	226.441699		-0.54%	-1.24%
3	2048	379.643726	337.452058		3	2048	399.559017	338.432284		5.25%	0.29%
4	4096	416.597537	383.625364		4	4096	428.252403	396.137506		2.80%	3.26%
5	6000	431.198241	384.384384		5	6000	457.744577	406.06275		6.16%	5.64%
6	8192	462.811252	427.292573		6	8192	474.791032	428.281563		2.59%	0.23%
7	10000	464.258731	429.050294		7	10000	483.7643	446.849381		4.20%	4.15%
8	10013	465.199701	429.824179		8	10013	464.904407	428.72184		-0.06%	-0.26%
9	10240	477.07359	428.853737		9	10240	485.317024	444.902586		1.73%	3.74%
10	11000	473.038785	430.778663		10	11000	488.161438	453.462162		3.20%	5.27%
11	12000	474.342475	432.594814		11	12000	490.532418	458.427653		3.41%	5.97%
12	16384	487.468854	473.611576		12	16384	488.154406	476.264631		0.14%	0.56%
13	20000	482.029793	465.666186		13	20000	482.147092	483.886193		0.02%	3.91%
14	24000	478.368093	474.159464		14	24000	478.364948	491.447921		0.00%	3.65%
15	32000	476.523796	473.18868		15	32000	476.523796	474.398962		0.00%	0.26%
16	32768	476.104723	477.493634		16	32768	476.704463	477.330606		0.13%	-0.03%
17	36864	477.900663	475.472787		17	36864	477.973279	475.728454		0.02%	0.05%
18	40960	477.707561	475.559064		18	40960	478.445017	476.088067		0.15%	0.11%
19	45056	479.169812	475.865134		19	45056	479.143266	475.878202		-0.01%	0.00%
20	49152	477.804907	475.382982		20	49152	477.868404	475.976377		0.01%	0.12%
21	65536	481.274125	478.171806		21	65536	481.537733	478.703926		0.05%	0.11%
22	66000	481.64652	480.095457		22	66000	481.856013	480.466388		0.04%	0.08%
23	68608	481.745774	479.034704		23	68608	481.917596	478.856209		0.04%	-0.04%
24	80000	483.409361	480.356529		24	80000	483.330481	480.375277		-0.02%	0.00%
25	98304	480.736301	481.396882		25	98304	480.789858	481.320143		0.01%	-0.02%
```

NCU profiler shows lower DRAM fetches with the new kernel:

![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72)

NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%).

I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files:

![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc)

libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%.

I measured the compilation time for incremental development:

```
touch ./aten/src/ATen/native/cuda/SoftMax.cu
time python setup.py develop
real    0m10.083s
user    0m8.197s
sys     0m3.149s
```

Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels:

```
# baseline nvcc time for SoftMax.cu
real    0m35.341s
user    0m33.801s
sys     0m1.289s

# this PR's nvcc time for SoftMax.cu
real    0m36.513s
user    0m34.722s
sys     0m1.408s
```

So the `nvcc` time increases by about 1 second, or ~3% of the baseline.

[1] `nvcc` command is here:
```
# This is the nvcc command
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866
Approved by: https://github.com/ngimel
2025-02-12 07:54:41 +00:00
8c80c13b34 [CD] Add python 3.13t build for xpu (#146614)
Fixes #146451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146614
Approved by: https://github.com/atalman
2025-02-12 07:01:36 +00:00
b30bad710d Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-12 05:36:27 +00:00
6105b6f15f Revert "Update octokit/request-action to 2.4.0 (#146940)"
This reverts commit 7aa629f1268f6944eee6e49e43071b4342bf1669.

Reverted https://github.com/pytorch/pytorch/pull/146940 on behalf of https://github.com/huydhn due to This does not work ([comment](https://github.com/pytorch/pytorch/pull/146940#issuecomment-2652691614))
2025-02-12 05:21:43 +00:00
5a1c7c424d Fix standalone runner for CUTLASS auto-tuning backend (#146764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146764
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #146755
2025-02-12 04:42:08 +00:00
eb655a2d5f Fix CUTLASS 2.x kernels for auto-tuning (#146755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146755
Approved by: https://github.com/henrylhtsang
2025-02-12 04:42:07 +00:00
683bb1242c [export][ez] Update tag_ for union setters. (#146912)
Summary: ez fix to set tag for union type fields.

Test Plan: CI

Differential Revision: D69467715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912
Approved by: https://github.com/yiming0416
2025-02-12 03:52:36 +00:00
06f8f9a017 Update instructions about faster linker (#146750)
This PR adds instructions to specify linker via cmake env `CMAKE_LINKER_TYPE` and also adds `mold` as a linker alternative.

Since 3.29, cmake introduced [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html) that can specify linker without overwriting `ld` file or changing build script.

`mold` is already stable and **the fastest** (afaict) linker out there, and also easier to install compared with `lld`. So I added it here. After switching to `mold`, the time of linking `libtorch_cuda.so` has been reduced from ~7s to ~0.6s locally.

Also note `gold` has been marked deprecated recently[1].

[1] https://lwn.net/Articles/1007541/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146750
Approved by: https://github.com/albanD
2025-02-12 03:14:08 +00:00
28a2ab6b84 Clear CompiledTritonKernel cache after each inductor compile (#146925)
Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created for a triton kernel should only be used once per compile. Otherwise, a long running process that does something like the following:

```
compiled_1 = torch.compile(fn1, mode="max-autotune", fullgraph=True)
# run compiled_1
out_compiled = compiled_1(*example_inputs)
compiled_2 = torch.compile(fn2, mode="max-autotune", fullgraph=True)
```

where fn1 and fn2 are very similar (i.e. they would generate the same triton kernel source code), would result in us using the launcher for the first autotuning run, setting the launcher to None after running, and then reusing the same future/kernel without regenerating the launcher.

Found this bug testing internal inference models.

This does not remove the caching support for @eellison's caching for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408

Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #146417
2025-02-12 02:38:42 +00:00
0acbf8039a [BE] Unskip some tensor creation tests on Mac (#146952)
Followup after https://github.com/pytorch/pytorch/pull/145367

One should never use skip, but rather xfail; otherwise one never knows when the test is finally fixed.

`test_float_to_int_conversion_finite` was fixed on MacOS a while back (presumably since the time Intel builds were disabled), while `test_float_to_int_conversion_nonfinite` is fixed by https://github.com/pytorch/pytorch/pull/145367, which selects architecture-appropriate reference values for the Arm ISA.

Note that the result of casting a floating-point value to an integral type is undefined if the value is outside the integral type's dynamic range.
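For illustration of that caveat (the printed values are platform-dependent, which is exactly why reference values must be architecture-appropriate):

```python
import torch

# Non-finite float-to-int casts have no single "right" answer;
# results differ between x86, Arm, and accelerators.
x = torch.tensor([float("inf"), float("-inf"), float("nan")])
print(x.to(torch.int64))
print(x.to(torch.int32))
```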

"Fixes" https://github.com/pytorch/pytorch/issues/38752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146952
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-02-12 01:59:15 +00:00
78ebd3c502 Revert commit that removed windows testing in VS2019-> update (#146920)
This reverts commit b57b38b52ede2af27d4eb1bf6ba63868a3ee7553.

This commit removed windows testing for the VS build and needs to be added back in with the updated VS2022 build

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146920
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet
2025-02-12 01:12:05 +00:00
df5e232563 [BE] Delete NCCL slimming (#146943)
It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant as nccl is now a dynamic dependency of libtorch_cuda.so

Besides, it does not work with the CXX11 ABI anyway, and creates problems with newer versions of NCCL when two `collectives.o` are packaged into the library archive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-02-12 00:35:55 +00:00
a58f421f4b [CUDA][CUDNN][SDPA] Pass dropout seed and offset to cuDNN in int64 (#146734)
Workaround for limitation in cuDNN that does not accept dropout seed/offset in `int32` for SM 10.0 kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146734
Approved by: https://github.com/Skylion007
2025-02-12 00:24:38 +00:00
281249ba54 [torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324)
Summary:
amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from *only* python that's fine, but when you try to interact with `libamd_smi.so` from native code too this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK.

This means you'll end up with 2 copies of `libamd_smi.so` in your process, and potentially (Murphy's law says you will, as does our CI) violate ODR.

In order to avoid this issue from the PT side of the world we can hook the `dlopen("path/to/bundled/libamd_smi.so")` and try to use the already loaded/SDK version of `libamd_smi.so` first, before proceeding to use the `path/to/bundled/libamd_smi.so`.

Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded

Differential Revision: D69064038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324
Approved by: https://github.com/malfet
2025-02-12 00:01:02 +00:00
7aa629f126 Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-11 23:50:24 +00:00
f50d359ce2 [ c10d ] modify API to get device string from device with torch.device (#146290)
Modify the ```get_default_backend_for_device()``` API to extract the device string using ```torch.device()```
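A quick usage sketch (this assumes the API is exposed as `torch.distributed.get_default_backend_for_device` and that both spellings resolve the same way after this change):

```python
import torch
import torch.distributed as dist

# Both a device string and a torch.device should now be accepted,
# since the device string is extracted via torch.device() internally.
print(dist.get_default_backend_for_device("cuda:0"))
print(dist.get_default_backend_for_device(torch.device("cuda", 0)))
```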

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290
Approved by: https://github.com/guangyey, https://github.com/H-Huang
2025-02-11 23:30:57 +00:00
3a29992ee6 [associative_scan] Lifted arguments (#140043)
This PR implements lifted arguments for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140043
Approved by: https://github.com/ydwu4
2025-02-11 23:25:55 +00:00
f59a56e56f [ARM] Fix test_float_to_int_conversion_nonfinite (#145367)
We have broken tests on Aarch64 which are not enabled upstream, this PR will fix and enable those tests.

```
AssertionError: Tensor-likes are not equal!

Mismatched elements: 2 / 3 (66.7%)
Greatest absolute difference: 1 at index (1,)
Greatest relative difference: 1.0842021724855044e-19 at index (1,)

To execute this test, run the following from the base repo dir:
    python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_float_to_int_conversion_nonfinite_cpu_int64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145367
Approved by: https://github.com/malfet
2025-02-11 22:22:10 +00:00
a20055288f [DTensor][Test] Create a simple unit test for tensordot (#146514)
Fixes #ISSUE_NUMBER

The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
2025-02-11 21:57:56 +00:00
443437648a Revert "Introduce new template heuristic for triton autotune configs (#144985)"
This reverts commit 69301fb10eb3f7fd49af5c681a2e386af115baba.

Reverted https://github.com/pytorch/pytorch/pull/144985 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it needs a small tweak to avoid breaking some internal code ([comment](https://github.com/pytorch/pytorch/pull/144985#issuecomment-2652021045))
2025-02-11 20:42:41 +00:00
b1ff90ae8a remove Windows XPU build workaround. (#144644)
From the RFC: https://github.com/pytorch/pytorch/issues/141946
Fixes https://github.com/pytorch/pytorch/issues/134989

After we land these fixing PRs:
1. https://github.com/pytorch/pytorch/pull/142245
2. https://github.com/pytorch/pytorch/pull/141943

We can remove the Windows XPU workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman
2025-02-11 20:39:51 +00:00
664550ecbf [export] Serialize special values of float into strings for json. (#146490)
Summary: Currently inf is serialized as Infinity in JSON, which is not standard compliant. Instead we will encode all special floating-point values as strings and handle them at the JSON layer.
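To illustrate why strings are used here (standard-library example, not the export serializer itself):

```python
import json

# "Infinity" is not valid JSON, so a strict encoder rejects the raw float...
try:
    json.dumps({"x": float("inf")}, allow_nan=False)
except ValueError as e:
    print(e)  # Out of range float values are not JSON compliant

# ...while the string form round-trips through any compliant parser.
print(json.dumps({"x": "Infinity"}))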

Test Plan:
see D69060784
CI

Differential Revision: D69186425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490
Approved by: https://github.com/yiming0416
2025-02-11 20:01:27 +00:00
110638f702 [inductor] skip _test_insignificant_strides on rocm (#146849)
Check https://github.com/pytorch/pytorch/issues/146848 , the rocm kernel for _scaled_dot_product_attention does not match the meta kernel regarding output shape. cuda kernel is fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146849
Approved by: https://github.com/eellison, https://github.com/atalman, https://github.com/jansel
ghstack dependencies: #145904
2025-02-11 19:55:43 +00:00
b18e3c01aa [Inductor] Unifiy Low Precision FP Legalization for to_dtype_bitcast & constant (#144646)
The upcast in `to_dtype_bitcast()` breaks following operations that only works with the target type (I uses `bitwise_and` in the updated UT).
![image](https://github.com/user-attachments/assets/77a6f3b6-b5e7-4ed8-ab65-09d76f077376)

This PR fixes this problem. Let's check the CI results to make sure it doesn't bring accuracy problems.

- Unified the type promotion of low-precision FP operations in the legalize func, grouping ops into sources (whose results may be promoted) and sinks (whose input may be cast back). (The term of _sink_ and _source_ are from [graph theory](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree).)
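A rough sketch of the failing pattern (the actual unit test differs; this just shows a dtype bitcast followed by an integer-only op, which breaks if the bitcast result is silently upcast to float32):

```python
import torch

def fn(x):
    # view(torch.int16) is the dtype bitcast; bitwise_and only accepts integer inputs.
    return x.view(torch.int16) & 0x00FF

x = torch.randn(8, dtype=torch.float16)
print(torch.compile(fn)(x))
```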

## Test
```bash
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float32_to_int32_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144646
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-11 19:45:04 +00:00
af349047c3 [FlexAttention] Bug fix broken flag (#146872)
# Summary

I somehow broke this... I think claude was trippin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146872
Approved by: https://github.com/BoyuanFeng
2025-02-11 19:42:37 +00:00
ebd992724f Implement serializable getattr support for tensor subclasses (#145772)
builtins.getattr is not serializable, so we replace it with a custom op that has more refined schema.

Differential Revision: [D68899421](https://our.internmc.facebook.com/intern/diff/D68899421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145772
Approved by: https://github.com/bdhirsh
2025-02-11 19:05:14 +00:00
d5d3bdb55a Fix var CUDA_PATH_V128 in cuda128.bat file (#146906)
Followup after: https://github.com/pytorch/pytorch/pull/146653
This should fix upcoming CUDA 12.8 windows builds.
Issue found during pytorch-canary Windows AMI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146906
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-02-11 18:43:55 +00:00
c7515da7b0 Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)
This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361

I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534

Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979
Approved by: https://github.com/eqy, https://github.com/eellison
2025-02-11 18:16:15 +00:00
e3839bd603 [BE] Strip #pragma once when embedding the headers (#146871)
This eliminates compiler warning, for example when compiling Metal shader with embedded headers
```
 with program_source:6:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:81:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:588:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:719:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:829:29: error: use of undeclared identifier 'r0_2'
        auto tmp8 = in_ptr2[r0_2 + 768*x0];
```
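A minimal Python sketch of the stripping step described above (the real logic lives in the build/embedding scripts; the function name and shape here are assumptions):

```python
def strip_pragma_once(header_sources: list[str]) -> str:
    # Concatenate headers into a single embedded source, dropping `#pragma once`,
    # which is meaningless (and warning-prone) inside one translation unit.
    out = []
    for src in header_sources:
        out.extend(line for line in src.splitlines() if line.strip() != "#pragma once")
    return "\n".join(out)
```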

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146871
Approved by: https://github.com/dcci
2025-02-11 16:49:00 +00:00
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
5235a18cd6 [SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876
Approved by: https://github.com/anijain2305
ghstack dependencies: #146854
2025-02-11 15:00:56 +00:00
fc5913b6bf [StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728) (#146855)
Summary:

When a Static Runtime graph node has sub-blocks, the memory planner does not consider the sub-blocks' inputs as inputs of that node. As a result, such inputs' lifetimes are computed incorrectly, and the corresponding tensor memory is released earlier than required, which causes errors.

Differential Revision: D69195886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855
Approved by: https://github.com/swolchok
2025-02-11 13:59:54 +00:00
cyy
15635b14ce [4/N] Remove unnecessary once flag usage (#146783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783
Approved by: https://github.com/albanD
2025-02-11 13:55:06 +00:00
69301fb10e Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor the bulk of mm_common to allow for better device-specific specialisation; e.g. in https://github.com/pytorch/pytorch/pull/143286 we require heavy conditionalisation to get ROCm-specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class; in the future we plan to bring in flex attention configurations as well, so that all of the tuning config logic for templated triton kernels is handled in this file.
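A minimal sketch of the device-dispatch idea (class and method names here are illustrative, not the exact ones in template_heuristics.py):

```python
class ConfigHeuristic:
    def mm_configs(self):
        return [{"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4}]

class ROCmConfigHeuristic(ConfigHeuristic):
    def mm_configs(self):
        # ROCm-specific tuning space lives here instead of conditionals in mm_common.
        return [{"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 8}]

def heuristic_for(device_type: str) -> ConfigHeuristic:
    return {"hip": ROCmConfigHeuristic}.get(device_type, ConfigHeuristic)()
```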

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-11 10:48:09 +00:00
229fb0bc83 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .requires_grad (#146742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146742
Approved by: https://github.com/zou3519
ghstack dependencies: #146571, #146741
2025-02-11 05:39:07 +00:00
f2da810516 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .data (#146741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146741
Approved by: https://github.com/zou3519
ghstack dependencies: #146571
2025-02-11 05:39:07 +00:00
29523aa113 [Dynamo][autograd.Function] Relax backward speculation strict mode a bit (#146571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146571
Approved by: https://github.com/zou3519
2025-02-11 05:39:00 +00:00
a7fe384d0e Remove torch._higher_order_ops from MOD_SKIPLIST (#146853)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853
Approved by: https://github.com/williamwen42
2025-02-11 04:38:26 +00:00
001ebbf734 [MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751)
Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats".

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats
```

https://www.internalfb.com/intern/testinfra/testrun/9007199321947161

Reviewed By: yuhc

Differential Revision: D68989900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751
Approved by: https://github.com/nautsimon
2025-02-11 03:51:48 +00:00
23524699d5 Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417)
### Big idea
This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism.
### Implementation Overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
- Extend @eellison's future cache to a class, mostly as a refactor
- Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to hit the cache on cold starts.
In the diffs after this, I will add more to CompiledTritonKernels so that, on a warm start, TritonBundler automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether.
Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.
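A conceptual sketch of the future-cache idea (the real implementation uses worker processes and triton.compile; everything below is a stand-in):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor()
_futures: dict[int, Future] = {}

def fake_compile(source: str) -> str:          # stand-in for the worker-side compile
    return f"compiled<{hash(source)}>"

def compile_async(source: str) -> Future:
    key = hash(source)
    if key not in _futures:                    # first call happens during codegen
        _futures[key] = _pool.submit(fake_compile, source)
    return _futures[key]                       # load-time call reuses the same future
```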

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417
Approved by: https://github.com/jansel
2025-02-11 03:46:16 +00:00
fe94ece375 Revert "Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)"
This reverts commit 3d604b17d91b928c850ded83b2ec25ea066bb3f6.

Reverted https://github.com/pytorch/pytorch/pull/141791 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141791#issuecomment-2649717140))
2025-02-11 03:17:59 +00:00
30cbf13544 [PGNCCL] Associate tensor allocation support with NCCL version (#146842)
This is a forward fix to #146589.
For NCCL versions lower than 2.19, the previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`.
This PR gates the support by checking link-time NCCL version via `ncclGetVersion`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842
Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #146589
2025-02-11 02:52:52 +00:00
1d81ecfc54 Rename PrimHOPBase to BaseHOP + minor changes (#146727)
This PR:
- renames PrimHOPBase to BaseHOP
- changes the backward pass to always return a tuple (to match the
  forward pass).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146727
Approved by: https://github.com/ydwu4
2025-02-11 02:43:37 +00:00
275c034b16 [SkipFiles] remove some stuff from MOD_SKIPLIST (#146854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2025-02-11 01:34:46 +00:00
5205158c1b Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/Skylion007
2025-02-11 01:34:15 +00:00
f38f1dcd82 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 103c8b44bcb6fbf30b5411c5af19d312427525e7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/huydhn due to This change has been reverted internally D69129334 but the OSS revert failed https://github.com/pytorch/pytorch/pull/146437 ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2649610877))
2025-02-11 01:26:36 +00:00
0c9fdd6cfb [Docs] Fix description of input in torch.addbmm() (#146664)
Fixes #146613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146664
Approved by: https://github.com/mikaylagawarecki
2025-02-11 01:22:09 +00:00
2fafcd37c3 Revert "cpp_wrapper: Precompile device-specific header files (#144002)"
This reverts commit de6efa1feb0e8c9073640a77afdec1a53a477aed.

Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))
2025-02-11 00:42:22 +00:00
d763093b49 [MPS] fix lu factor for large tensors with bs>1 (#146753)
Try this:
```python
import torch

batch_size = 2
A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps")
A_cpu = A.cpu()
LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu)
LU, pivots = torch.linalg.lu_factor(A)
torch.testing.assert_close(LU.cpu(), LU_cpu)
```
You'll get huge difference in LU tensors
<img width="706" alt="Screenshot 2025-02-08 at 12 14 39" src="https://github.com/user-attachments/assets/b45f2b3c-e0a5-49c8-aa07-42792150b781" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-11 00:37:07 +00:00
937b41e3b5 Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472)
In this series of PR we intend to refactor pipeline parallelism test cases to enable to be completely device agnostic.

These changes will include the following approaches to do the same :

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies

This should result in improvement in usability for all devices

For this PR we have shown support for the following devices:

- CPU (wherever applicable)
- CUDA
- HPU
- XPU

To add another device, new users can simply append their device to the device list.
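A small illustration of the device-agnostic pattern mentioned above (device handling simplified; the actual tests go through instantiate_device_type_tests):

```python
import torch

device_type = "cuda" if torch.cuda.is_available() else "cpu"
if device_type != "cpu":
    # torch.get_device_module(...) replaces hard-coded torch.cuda.* calls,
    # so the same code path works for cuda, xpu, hpu, ...
    dev_mod = torch.get_device_module(device_type)
    stream = dev_mod.current_stream()
    dev_mod.synchronize()
```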

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146472
Approved by: https://github.com/H-Huang
2025-02-11 00:13:23 +00:00
b6273d7f4b [ROCm] Update periodic.yml to use 2GPU runners (#146839)
Temporary fix for rocm workflow.
The 4-GPU runners are all taken offline due to a network timeout issue, so we aren't able to run any periodic jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146839
Approved by: https://github.com/jeffdaily
2025-02-10 23:41:11 +00:00
aa1622c0b6 Support ignoring parameters in FSDP2 (#146631)
Differential Revision: D69153051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146631
Approved by: https://github.com/awgu
2025-02-10 23:20:28 +00:00
c2bf3be011 [inductor] Remove _get_grid_fn_str (#146800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146800
Approved by: https://github.com/yanboliang
2025-02-10 23:14:30 +00:00
0d5fb0941f [cutlass backend] check against arch >= 100 (#145812)
Summary:
Want to add a guard against silent fallback to SM90.

GenerateSM100 was just added 3 days ago. https://github.com/NVIDIA/cutlass/blame/main/python/cutlass_library/generator.py#L8896

It should show up in CUTLASS 3.8 (not pinned yet).

Test Plan: ci

Differential Revision: D68748705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145812
Approved by: https://github.com/chenyang78, https://github.com/ColinPeppler, https://github.com/Aidyn-A
2025-02-10 22:41:08 +00:00
bab35eb26a fix intermediate debug information with cpp_wrapper (#145527)
Summary: before fix, code like:
```cpp
    aoti_torch_print_tensor_handle(buf0, "after_launch - triton_poi_fused_randn_0 - buf0");
    aoti_torch_print_tensor_handle(buf1, "after_launch - triton_poi_fused_randn_0 - buf1");
    printf("[  after_launch - triton_poi_fused_randn_0 - 0: %ld  ]", 0); printf("
");
    printf("[  after_launch - triton_poi_fused_randn_0 - 1228800L: %ld  ]", 1228800L); printf("
");
```
was generated, which is a syntax error.

Test Plan:
New unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145527
Approved by: https://github.com/desertfire
2025-02-10 22:24:26 +00:00
681894546b Fix bazel job after #144489 (#146840)
This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610

### Testing

Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840
Approved by: https://github.com/atalman
2025-02-10 22:17:36 +00:00
652880e840 Fix logging and test files which misspell "precision" (#146113)
Noticed this while working on something, decided to submit a quick fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146113
Approved by: https://github.com/drisspg
2025-02-10 21:54:16 +00:00
e65b89e4cd [Feat]: Improve KleidiAI 4 bit kernel performance (#146476)
Description:
1. New thread blocking accelerates GEMVs
2. We increase throughput of the lhs quant pack + matmul pipeline by decoupling two operations.
3. The new blocking strategy blocks ```out_feature``` to accelerate GEMVs

Perf improvements:
12% speedup in LLM prefill phase and upto 16% speedup in autoregressive phase

Perf Benchmarking : https://github.com/pytorch/pytorch/issues/143289#issuecomment-2545773370

Change-Id: Ie574ff8459fdb75701ae366158b4e118c70694e4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146476
Approved by: https://github.com/malfet
2025-02-10 21:30:57 +00:00
4d626c261b Fix workarea compute in lapackSyevd (#146456)
Work-query APIs return floating-point values that can lose precision when converted back to int. Solve this by using `nextafter` and `ceil`.
Add regression test

Fixes #145801
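A pure-Python illustration of the failure mode and the fix (this is not the LAPACK code path; the float32 round-trip is simulated with struct):

```python
import math
import struct

lwork_exact = 2**24 + 1
# A float32 work-size query cannot represent this value exactly:
lwork_f32 = struct.unpack("f", struct.pack("f", float(lwork_exact)))[0]  # 16777216.0
naive = int(lwork_f32)                                      # undersized by 1
safe = int(math.ceil(math.nextafter(lwork_f32, math.inf)))  # 16777217
print(naive, safe)
```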

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146456
Approved by: https://github.com/malfet
2025-02-10 21:29:48 +00:00
8f073065d5 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only support tensor output of cond_fn, this PR changes to also allow a shape expr to be returned in cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
2025-02-10 21:25:40 +00:00
97d4753bd3 [hop][inductor] don't promote arg type for cond and while_loop (#146660)
Hop subgraph codegen assumes arguments's type are not promoted. Otherwise, we might generate wrong kernel.

Differential Revision: [D69279031](https://our.internmc.facebook.com/intern/diff/D69279031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146660
Approved by: https://github.com/zou3519, https://github.com/eellison
2025-02-10 21:24:52 +00:00
da216baaa2 Optimize inductor Self typing (#146669)
Replace method return type with `Self` typing
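For illustration (requires Python 3.11+, or `Self` from typing_extensions on older versions; the class below is hypothetical):

```python
from typing import Self

class KernelBuilder:
    # Returning Self keeps the precise subclass type through chained calls,
    # instead of hard-coding -> "KernelBuilder".
    def with_num_warps(self, n: int) -> Self:
        self.num_warps = n
        return self
```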

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669
Approved by: https://github.com/jansel
2025-02-10 20:39:56 +00:00
86b52f4209 Fix lint (#146846)
Fixes: https://github.com/pytorch/pytorch/actions/runs/13248382636/job/36980294598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146846
Approved by: https://github.com/huydhn, https://github.com/clee2000
2025-02-10 20:00:29 +00:00
3d604b17d9 Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)
As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and un-decomposite it, but this is no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791
Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai
2025-02-10 19:30:19 +00:00
97f6480cf5 Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467)
Fixes https://github.com/pytorch/pytorch/issues/146416

Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation manifest as silent correctness issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467
Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314
2025-02-10 19:15:49 +00:00
3822a88d21 [symbolic shapes] Log symnode id (#146583)
We want to log the symnode id which will help us with provenance tracking between expressions created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146583
Approved by: https://github.com/bobrenjc93
2025-02-10 19:13:06 +00:00
b45e6fa707 Cleanup VS 2019 refs in pytorch (#145863)
Related to: https://github.com/pytorch/pytorch/issues/128835
Follow up on PR: https://github.com/pytorch/pytorch/pull/145319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145863
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn, https://github.com/atalman
2025-02-10 19:05:35 +00:00
c02a1ecc1d [export][ez] Allow math.trunc for serialization. (#146715)
Summary: as title.

Test Plan: CI

Differential Revision: D69317084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146715
Approved by: https://github.com/angelayi
2025-02-10 19:05:07 +00:00
9b7d050600 Move capture_provenance to make_node_impl (#146625)
Previously we were only logging `make_user_impl` implementations, which only gets triggered for operations done on python SymInts, not cpp SymInts. Instead `make_node_impl` will get triggered for both python and cpp SymInt operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146625
Approved by: https://github.com/bobrenjc93
2025-02-10 19:00:51 +00:00
0486a996d2 [sigmoid] Implement a OSS only model runner. (#146440)
Summary: Implement an OSS version of the model runner with clean dependencies. The new OSS model runner removes the Thrift dependency and uses only the JSON header to load the model.

Test Plan: Test will be added in the next diff separately. (D69060784)

Differential Revision: D68846877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146440
Approved by: https://github.com/SherlockNoMad
2025-02-10 18:54:05 +00:00
519f547d05 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 18:34:59 +00:00
ad847da0cf [cutlass backend] fix bug for accuminator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-10 18:20:58 +00:00
ddcc97bb8c Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668)
I think its good to have everything in the .cu file. Especially the nvcc compile command.

Technically, the configuration name can be found in the template already. So let me know if you think its not needed.

Differential Revision: D69281295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668
Approved by: https://github.com/chenyang78
2025-02-10 18:16:44 +00:00
6b3f51f870 use None to slice when list has one element only (#146638)
When autotune_num_choices_displayed is None and the list of choices has length 1, slicing with `[:-1]` means getting all elements except the last one, which resulted in an empty list.

Slicing with `[:None]` works.
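A quick illustration of the difference:

```python
choices = ["only_choice"]

# With a single candidate, [:-1] drops everything, while [:None] keeps the full list.
print(choices[:-1])    # []
print(choices[:None])  # ['only_choice']
```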

Differential Revision: D69265168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146638
Approved by: https://github.com/drisspg
2025-02-10 18:15:45 +00:00
374b762bbf [ez][BE] get rid of the extra printf('\n') (#146726)
Summary: as title

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D69328701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146726
Approved by: https://github.com/ColinPeppler
2025-02-10 17:45:55 +00:00
5fd15a04b7 [ROCm] Enable inductor-periodic testing for MI300 (#144594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-10 17:42:09 +00:00
b8261358ca Revert "windows Magma build for cu128 (#146653)"
This reverts commit d0e70c4fd33d9accca2c66203c19372733a83ea1.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/jeanschmidt due to Seems to have broken some windows tests, reverting to see if it gets green ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2648769150))
2025-02-10 17:36:32 +00:00
cbbb11d967 [dynamo][user-defined] Unify standard and non-standard __new__ codebase (#146737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146737
Approved by: https://github.com/jansel
ghstack dependencies: #146677
2025-02-10 17:31:13 +00:00
ee8a06f1f6 [dynamo][user-defined] User class.__new__ instead of special casing (#146677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146677
Approved by: https://github.com/jansel
2025-02-10 17:31:13 +00:00
de6efa1feb cpp_wrapper: Precompile device-specific header files (#144002)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone.

Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002
Approved by: https://github.com/desertfire
2025-02-10 17:13:09 +00:00
3cadce7af2 [NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633
Approved by: https://github.com/jbschlosser
2025-02-10 16:59:48 +00:00
dfe3b64282 [mps] Implement eager support for spherical_bessel_j0 (#146818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-10 16:58:05 +00:00
5f621c5879 [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710)
Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`.

Test Plan: The test is implemented in the following diff.

Reviewed By: yuhc

Differential Revision: D68988673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710
Approved by: https://github.com/nautsimon
2025-02-10 16:57:09 +00:00
68c9e22ef7 FSDP: avoid resetting version counter of all_gather_output in inference_mode (#146709)
Summary:
FSDP needs to hide VC bumps on its allgather buffer, but it does not need to do this if the allgather buffer was generated under inference mode.

more details here: https://www.internalfb.com/diff/D69115649?dst_version_fbid=1316814572779281&transaction_fbid=849120230625711

Test Plan: CI

Differential Revision: D69311496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146709
Approved by: https://github.com/awgu
2025-02-10 16:56:40 +00:00
6aa924af68 Revert "[ONNX] Create deprecation warning on dynamo_export (#146425)"
This reverts commit 41e6d189a39a40b237ab9b9ab195cec1194b331b.

Reverted https://github.com/pytorch/pytorch/pull/146425 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/146425#issuecomment-2648472579))
2025-02-10 15:54:34 +00:00
1557b7bf9a Revert "[ONNX] Adjust and add deprecation messages (#146639)"
This reverts commit 63c2909ae3e293dee96bca5af88bc51d8ca0ce10.

Reverted https://github.com/pytorch/pytorch/pull/146639 on behalf of https://github.com/atalman due to Sorry Need to revert https://github.com/pytorch/pytorch/pull/146425 ([comment](https://github.com/pytorch/pytorch/pull/146639#issuecomment-2648465047))
2025-02-10 15:51:52 +00:00
a36c22f2ed futher scheduler changes for invoke_quant: prologue low prec, (slightly) more aggressive fusion (#145104)
Respect invoke_quant low precision options, also, be more aggressive in attepmting fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145104
Approved by: https://github.com/shunting314, https://github.com/jansel
ghstack dependencies: #139102
2025-02-10 15:50:19 +00:00
899066eedf Fix round(...) with constants (#146495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146495
Approved by: https://github.com/anijain2305
2025-02-10 15:08:09 +00:00
611ca163fd [MPS] Add bilineard2d_aa implementation (#145526)
An interesting quirk of the algorithm, which is not very well documented, is that the value of align_corners is ignored in antialias mode; see the arguments of
e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)

Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), since it's not implemented on any other accelerator devices at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526
Approved by: https://github.com/dcci
2025-02-10 15:03:14 +00:00
d0e70c4fd3 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 13:48:55 +00:00
6f15a609d3 Test typing of arithmetic operators on Tensor (see #145838) (#146426)
See #145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146426
Approved by: https://github.com/Skylion007
2025-02-10 12:19:56 +00:00
c24038025d [ROCm] Unskip std:bad_alloc failures (#146407)
Flakey MI300 issue related to memory usage should now be resolved after https://github.com/pytorch/pytorch/actions/runs/13007160888?pr=145829.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146407
Approved by: https://github.com/jeffdaily
2025-02-10 11:01:56 +00:00
c88ae00692 fix: replace stderr with stdout for download messages in hub.py (#146475)
This PR addresses an issue where download logs in `hub.py` are sent to `stderr` instead of `stdout`. Hence, when running models with workers, these messages are incorrectly categorized as errors, leading to confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146475
Approved by: https://github.com/mikaylagawarecki
2025-02-10 10:46:10 +00:00
6667e5d786 [dim order] solve broken doc (#146641)
Differential Revision: [D69265340](https://our.internmc.facebook.com/intern/diff/D69265340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146641
Approved by: https://github.com/svekars, https://github.com/Jack-Khuu
2025-02-10 07:51:26 +00:00
c4d835fbab [DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278)
Fixes #142058

## Summary
DTensor `convolution_backward` op throws an exception when the input Tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model.

The ATen convolution_backward op usually returns 3 Tensors (grad_input, grad_weight, grad_bias); `grad_input` is actually an Optional[Tensor], which can be `None` in the case mentioned above.

However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists.

## Fix
allow the `grad_input` to be `None` for `convolution_backward` op.

## Test
`pytest test/distributed/tensor/test_convolution_ops.py`

## Follow-up
The current implementation of DTensor conv op also ignores `output_mask` and this may need further care.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
2025-02-10 07:06:40 +00:00
effc545274 [DDP] Use NCCL allocated memory for gradient bucket (#146589)
So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications.
Less SM usage, less memory contention between NCCL kernel and compute kernels.

Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```

Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
2025-02-10 05:23:11 +00:00
387c993c3b [ca] remove private API: _compiled_autograd_should_lift (#146720)
Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag, which tries to inline make_fx-style tracing in CA's initial pass. There is no more usage internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720
Approved by: https://github.com/zou3519
2025-02-10 04:29:57 +00:00
e8304f08fe Fix torch.take_along_dim param type and default description (#146474)
## Changes

- Change type description to `LongTensor`, consistent with [`torch.take`](https://pytorch.org/docs/stable/generated/torch.take.html)
- Add `dim` param default value description
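A quick usage illustration of the documented behavior (output values assume the flattened-indexing semantics described in the docs when `dim` is omitted):

```python
import torch

x = torch.tensor([[10., 30., 20.], [60., 40., 50.]])
idx = x.argsort(dim=1)                                 # indices must be a LongTensor
print(torch.take_along_dim(x, idx, dim=1))             # rows sorted along dim=1
print(torch.take_along_dim(x, torch.tensor([0, 5])))   # dim omitted -> flattened: tensor([10., 50.])
```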

## Test Result

**Before**
![image](https://github.com/user-attachments/assets/720ce158-2bc1-48b5-a188-56fcc7188d96)

**After**
![image](https://github.com/user-attachments/assets/05fe20bd-9476-4b97-ac2b-9b161d6532a1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146474
Approved by: https://github.com/mikaylagawarecki
2025-02-10 01:19:30 +00:00
298226f358 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-02-10 00:44:23 +00:00
2a55311773 [cuda] Simplify the sinc function a bit. (#146774)
`else` after `return` can be removed & the indentation can be reduced, for readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146774
Approved by: https://github.com/malfet
2025-02-09 20:09:34 +00:00
b133907d0a Update strided test to float32 (#146748)
Fixes #146377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146748
Approved by: https://github.com/BoyuanFeng, https://github.com/leijurv
2025-02-09 17:41:35 +00:00
91c4bf39d3 [mps] Add a shader for spherical_bessel_j0. (#146771)
In preparation for adding the operation to inductor/eager.
Adapted from the CUDA version of the shader.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771
Approved by: https://github.com/malfet
2025-02-09 05:11:17 +00:00
0e83e7d56e [EZ] Add logic to build Metal shader with debug info (#146768)
By appending `-frecord-sources -gline-tables-only` to the compilation command

Helpful when debugging shaders compiled into libtorch

Test plan: Run
`python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm`
And then run following to capture shader and check that it contains debug info
```python
import torch
import os
os.environ["MTL_CAPTURE_ENABLED"]="1"
inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32)
with torch.mps.profiler.metal_capture("bilinear2d"):
    out = torch.nn.functional.interpolate(inp, scale_factor=(1.7,0.9), mode="bilinear")
```
<img width="769" alt="image" src="https://github.com/user-attachments/assets/e0316c1c-07a4-4da5-97b9-886c56857c1d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768
Approved by: https://github.com/dcci
2025-02-08 23:40:23 +00:00
6a9a02acbe Set enable_faithful_generator_behavior flag to True (#142513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142513
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420, #145223
2025-02-08 22:42:12 +00:00
580a305681 Raise MutationError if there are side effects when returning generator (#145223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145223
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420
2025-02-08 22:42:12 +00:00
68cfd36c11 Add CLEANUP_THROW bytecode (#144420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144420
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424
2025-02-08 22:42:12 +00:00
53ab82d8f5 Implement generator.throw(exception) (#144424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144424
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423
2025-02-08 22:42:12 +00:00
8ee095f7c1 Implement generator.close() (#144423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144423
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422
2025-02-08 22:42:12 +00:00
ca9b16e070 Implement generator.send(..) (#144422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144422
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421
2025-02-08 22:42:12 +00:00
d798831167 Implement generator.__iter__() (#144421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144421
Approved by: https://github.com/zou3519
ghstack dependencies: #141055
2025-02-08 22:42:12 +00:00
8603a1c870 Support generators (#141055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055
Approved by: https://github.com/zou3519
2025-02-08 22:42:12 +00:00
ade8fee512 Use c10 version of half/bfloat16 in executorch (#144111)
Summary:
X-link: https://github.com/pytorch/executorch/pull/7040

Accomplished by importing relevant files from c10 into
executorch/runtime/core/portable_type/c10, and then using `using` in
the top-level ExecuTorch headers. This approach should keep the
ExecuTorch build hermetic for embedded use cases. In the future, we
should add a CI job to ensure the c10 files stay identical to the
PyTorch ones.
ghstack-source-id: 260047850
exported-using-ghexport

Test Plan: builds

Differential Revision: D66106969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111
Approved by: https://github.com/malfet
2025-02-08 22:40:14 +00:00
92b7e610ab [Inductor changes] Invoke Quant (#139102)
Adds a `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).

The primary motivations are

- Unifying scattered reasoning for quant operators throughout the code base

- Ease of pattern matching - see this very large pattern match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)), compared to the pattern I have in the tests:

```
        @register_graph_pattern(
            CallFunction(
                torch.ops.aten.mm,
                CallFunction(
                    torch.ops.higher_order.invoke_quant,
                    Ignored(),
                    Ignored(),
                    Ignored(),
                    scheme="nf4",
                ),
                Arg(),
            ),
            pass_dict=test_pass,
        )
```

- Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul.

Example graph:

``` Python
 ===== AFTER POST GRAD =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
         # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4');  repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
             # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1);  mul = arg1_1 = None
            return add
```

The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.

I wasn't sure exactly how the inductor-specific configurations like `codegen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching, so they will be stored as meta on the node itself. Following that, I wanted the invocation of the hop to match how it will show up in the graph, so I decided to have it be an object that is then invoked for the tracing.

```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
Todo: stop requiring that args be packed in a tuple; will do so following https://github.com/pytorch/pytorch/pull/139162.

Feedback welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
2025-02-08 19:30:19 +00:00
a1bfb39a31 [Inductor] Expand Identity ops prior to block pattern matching (#146000)
# Feature

Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell whether it can be represented as a block pointer, that analysis should be invariant to `Identity` wrappers (see the sketch after the list below).

This PR adds a few features to achieve this invariance.
 - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression.
 -  Preprocess the expression with this expansion prior to pattern matching.
 - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
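To make the motivation concrete, here is a rough sympy analogy (the `Identity` below is a hypothetical stand-in, not Inductor's actual class): a wrapper function defeats structural comparison even when the value is unchanged, which is what `expr.expand(identity=True)` is meant to undo.

```python
import sympy
from sympy import Function, symbols

Identity = Function("Identity")  # illustrative stand-in for Inductor's Identity
x, y = symbols("x y")

wrapped = Identity(x * 8) + y        # numerically the same as x*8 + y
print(wrapped == x * 8 + y)          # False: the wrapper blocks structural matching
stripped = wrapped.replace(Identity, lambda arg: arg)
print(stripped == x * 8 + y)         # True once the Identity wrappers are removed
```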

# Test plan
This PR adds a few new unit tests:
 - Added a unit test specifically for `expr.expand(identity=True)`.
 - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.

I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-08 18:11:53 +00:00
eee5622b98 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297
2025-02-08 18:00:49 +00:00
c098385cb3 [inductor] Refactor CaptureIndexing into global scope (#146297)
And inline SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282
2025-02-08 18:00:49 +00:00
d35f6b2339 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257
2025-02-08 18:00:40 +00:00
06604c4ec1 [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255
2025-02-08 18:00:30 +00:00
403db2faee [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
2025-02-08 18:00:17 +00:00
0e31e5932b [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146252
2025-02-08 18:00:08 +00:00
71498aeae3 [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
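A rough sketch of the difference between the two patterns (illustrative classes, not the actual Inductor code):

```python
# old pattern: unknown op names are routed dynamically through __getattr__,
# which makes the handler hard to type-check
class GetattrHandler:
    def __getattr__(self, name):
        def handler(*args, **kwargs):
            return f"{name}({args}, {kwargs})"
        return handler

# new pattern: a DefaultHandler-style base class defines each op explicitly
# and funnels it into a single _default() hook that subclasses can override
class DefaultHandler:
    def _default(self, name, args, kwargs):
        return f"{name}({args}, {kwargs})"

    def add(self, *args, **kwargs):
        return self._default("add", args, kwargs)

    def mul(self, *args, **kwargs):
        return self._default("mul", args, kwargs)
    # ...one explicit stub per op in the real implementation
```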

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
2025-02-08 18:00:00 +00:00
46e83bb637 Fix linter F821 error (#146665)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-08 07:19:37 +00:00
a3ca5c7f4e remove incorrect warnings from min/max documentation (#146725)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-02-08 05:10:08 +00:00
63c2909ae3 [ONNX] Adjust and add deprecation messages (#146639)
Adjust and add deprecation messages to torch.onnx utilities and verification methods because they are only related to torch script and are obsolete.

Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639
Approved by: https://github.com/titaiwangms
2025-02-08 05:09:16 +00:00
2328dcccb9 [MPSInductor] Implement Welford reduction (#146703)
Still a work in progress: the fallback works as expected, but the custom shader is not working yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703
Approved by: https://github.com/jansel, https://github.com/dcci
2025-02-08 05:00:00 +00:00
69feef5a94 Fix broken meta function for flex-attention backwards (#146563)
# Summary

Fixes https://github.com/pytorch/pytorch/issues/146377

So what was the original problem: we were codegening a really weird epilogue:

```Python
        # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
        # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
        xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2
        tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask)
        x5 = (xindex % ks3)
        tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last')
        tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask)
 ```

 This epilogue was writing and then reading from overlapping regions of memory, causing a race condition.

 ### Why were we generating this epilogue

 During the lowering we created a buffer with a different size/stride from the expected return strides. I think this added an implicit node (for doing the permutation of this wrongly strided output to the expected one from the meta func). The scheduler for some reason thought it was okay to fuse this into the epilogue; to be honest, I don't know why.

 This fixes the broken meta func and the original repro. I will add a test, but it is hard to trigger; better than nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563
Approved by: https://github.com/Chillee
2025-02-08 04:13:52 +00:00
9c78fb920d Fix assertion failure in gemm template lowering (#146353)
Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an [assert](fd515e4f59/torch/_inductor/codegen/common.py (L1181)) that a buffer was previously removed.

The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.

The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.
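A minimal sketch of that kind of fix (hypothetical helper, not the actual code in the PR):

```python
import itertools

_acc_counter = itertools.count()

def unique_acc_buffer_name(base: str) -> str:
    # append a unique suffix so a later gemm lowering never reuses a name
    # that an earlier lowering already added to the removed_buffers set
    return f"{base}_acc_{next(_acc_counter)}"

print(unique_acc_buffer_name("buf0"))  # buf0_acc_0
print(unique_acc_buffer_name("buf0"))  # buf0_acc_1
```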

Differential Revision: D68814625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-02-08 01:52:20 +00:00
cyy
6cb2f737ee Enable Windows tests (#146666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666
Approved by: https://github.com/albanD
2025-02-08 00:55:20 +00:00
0ab67299c3 [MPS] lu unpack (#146681)
Implements the lu_unpack function on MPS. No new tests were added because coverage comes from removing lu_unpack from the UNIMPLEMENTED_XFAILLIST in test_mps, which exercises it via the `test_output_match` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-08 00:16:17 +00:00
803661526e Update ET pin to 41e7ffa (#145831)
ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework.

This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831
Approved by: https://github.com/metascroy
2025-02-07 23:52:20 +00:00
dcac3c3e06 [MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659)
Summary:
Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated".

Nit: The file previously contained two unit tests with the same name (due to a wrong revert); I deleted the deprecated one and revamped the correct version.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/12103424065182810

Reviewed By: yuhc

Differential Revision: D68988435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659
Approved by: https://github.com/nautsimon
2025-02-07 23:06:35 +00:00
fa34128435 revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453)
Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass.

Test Plan:
With the change, build of APS model using rcclexp can now pass:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`

Reviewed By: c-p-i-o

Differential Revision: D69149588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
2025-02-07 22:43:52 +00:00
103c8b44bc move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).
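For intuition, a small sympy-flavored illustration (PyTorch names unbacked symints u0, u1, ... and backed ones s0, s1, ...; the real check operates on the actual symbol objects, so this is only an analogy):

```python
import sympy

u0, s0 = sympy.symbols("u0 s0", integer=True, positive=True)

expr = u0.subs(u0, s0)            # the unbacked symint was substituted by a backed one
print(expr.is_symbol)             # True  -- a bare `is_symbol` check still passes
print(str(expr).startswith("u"))  # False -- but the result is no longer an unbacked symbol
```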

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

 D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-07 22:41:19 +00:00
45d35f5f5a Clean up op BC check list (#146577)
Summary: Remove the expired ones

Test Plan: ci

Differential Revision: D69226556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577
Approved by: https://github.com/hl475
2025-02-07 22:40:49 +00:00
908133f682 [TreeSpec] Add custom comparison function (#146442)
Summary:
https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls.

However, this made tests flaky when comparing TreeSpec for objects in local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'.

Type comparison will yield False when local scopes are different due to lru_cache.

Since this comparison is only used for testing purposes, we only check whether the str(type) values are equal.

Test Plan:
```
PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py
```

Differential Revision: D69137706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442
Approved by: https://github.com/angelayi
2025-02-07 22:39:21 +00:00
91dfa82981 [FlexAttention] Fix dynamic shapes in max-autotune (#146657)
# Fixes
https://github.com/pytorch/pytorch/issues/146624

### Updated

From offline discussion, going with size_hint.

However, this does incur guards. I couldn't really think of a fancier way to do this. I was going to use `V.graph.sizevars.size_hint` with some default for num blocks, but we ultimately need some information about the input.

I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe).

 For instance, in the repro, we quickly hit the recompile limit.
 ```Shell
 torch._dynamo hit config.recompile_limit (8)
   function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161)
   last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546
To log all recompilation reasons, use TORCH_LOGS="recompiles".
To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2025-02-07 22:34:28 +00:00
579b9f2ed9 [inductor] Better exception error messages for cache_on_self (#146652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652
Approved by: https://github.com/yanboliang
2025-02-07 21:22:21 +00:00
04ce02182b [inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-02-07 21:21:21 +00:00
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
206ad9f4ad [cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554)
This PR does a few things:
* set fallback to aten to False for most tests. Without this, a lot of tests would fail silently since they would just use aten
* Disable two subprocess related broken tests. They would crash in subprocess. More investigation needed.
* remove/disable the tests on A100. Let me elaborate a bit more.

There are two types of A100 tests.
* normal tests that also test A100, e.g., mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore. GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out since they are of GemmKind.Universal, while only GemmKind.Universal3x is supported in the 3x template.
* tests for A100 only. The mixed mm and sparse semi-structured tests have been failing with "TypeError: can't multiply sequence by non-int of type 'str'" for a while. Disabled them for now. Do let us know if you care about them @alexsamardzic

Differential Revision: D69209929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554
Approved by: https://github.com/chenyang78
2025-02-07 19:59:28 +00:00
f17109bd96 Revert "windows Magma build for cu128 (#146653)"
This reverts commit 9e27d36e2b2a4f037a7e448c2f87a9ebb0d6e628.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))
2025-02-07 19:37:16 +00:00
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
b60f630de8 fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650)
Summary:
needed for https://github.com/pytorch/pytorch/pull/146513
Test Plan:
the existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650
Approved by: https://github.com/xmfan
2025-02-07 18:25:00 +00:00
9e27d36e2b windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate with` .github/scripts/windows/cuda_install.bat`. The later one is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman
2025-02-07 18:09:30 +00:00
23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor.

This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility.

Security-wise, this fully supports weights_only and defaults it to True. It does use pickle for some metadata, but loads that metadata with weights_only as well.

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
68631f6e87 PyWork: preserve Python reference counting when used in functional collectives (#146376)
@fegin  found an issue where torchft is not compatible with functional collectives.

Found in https://github.com/pytorch/torchtitan/pull/806

The root cause is because PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug.

PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way pybind works, the Python object owns the C++ object rather than there being some form of shared ownership. Thus the PyWork Python object gets collected when returned to C++ from the PyProcessGroup, while the C++ PyWork object still exists. When the PyWork object is then used, this causes a deadlock because the corresponding Python object no longer exists.

To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects.

To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use for all collectives in PyProcessGroup.

Test plan:

```
cd pytorch
pytest test/distributed/test_c10d_functional_native.py
```

```
cd torchft
pytest torchft/process_group_test.py -k functional -v -x -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376
Approved by: https://github.com/yifuwang
2025-02-07 18:07:53 +00:00
76c8a2dc48 Fix get_top() to return the base level event of the stack, not the most recently started event (#146649)
`get_top()` is really confusing when talking about a stack, because it can mean the most recently started event on the stack or the toplevel event in perfetto (which displays the stack upside down). Rename it to `get_outermost` and fix the bug associated with it, so that it returns the correct value from the stack.

Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event:
```
tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes | tee out.log
```
![image](https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649
Approved by: https://github.com/masnesral, https://github.com/anijain2305
2025-02-07 18:04:50 +00:00
f138b18d18 [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tends to be contexpr arguments to triton kernels like block size etc.

## Test program

Note, install triton before proceeding (pip install triton)

triton_test.py>>>
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Run it and we should get a trace file my_kernel_trace.json

Output has triton event with the kernel_kwargs attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-02-07 17:44:30 +00:00
ee45ea599d [dynamo] Actionable message on recompilations for fullgraph=True (#146550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
ghstack dependencies: #146553
2025-02-07 17:28:43 +00:00
fa0956951c [dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-02-07 17:28:43 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
076717785c Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222)"
This reverts commit 5ecdc428b230ab5ba44a90678f1c905e314f6ccb.

Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))
2025-02-07 16:19:41 +00:00
eqy
5d7532140f [CUDA][CUDA Graphs] Fix debug mode warning message (#145996)
The real method is `enable_debug_mode()`; `_cuda_enable_graphs_debug_mode` does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996
Approved by: https://github.com/ptrblck, https://github.com/eellison
2025-02-07 08:04:49 +00:00
002accfb8d Check meta strides for expanded dims in effn_attn_bias (#146054)
For `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of such a dim at 0 to avoid materializing a larger tensor than we need. Previously, we checked the stride of the tensor, but if it is not realized that will not work, so we should check the strides of the meta as well.
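For context, an expanded (broadcast) dim is exactly one whose stride is 0, so checking the meta's strides preserves the no-copy property:

```python
import torch

bias = torch.randn(1, 8, 128, 128)        # size-1 batch dim
expanded = bias.expand(4, 8, 128, 128)    # broadcast over batch without copying
print(expanded.stride())                  # (0, 16384, 128, 1): stride 0 on the expanded dim
print(expanded.untyped_storage().nbytes() == bias.untyped_storage().nbytes())  # True: nothing materialized
```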

Note: getting the exact combination of realizing/slicing/requiring_exact_strides right was a little tricky. I pointed out to @exclamaforte an example unable-to-fuse message you get if you do it incorrectly.

Fix for https://github.com/pytorch/pytorch/issues/145760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054
Approved by: https://github.com/shunting314
2025-02-07 06:35:57 +00:00
71e8a2bda4 Expand inductor codegen dtype asserts, fix scan (#146067)
We were codegening intermediary dtype asserts in some places but not all. This expands the assertions and fixes a newly failing assertion in

`TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067
Approved by: https://github.com/shunting314, https://github.com/jansel
2025-02-07 06:35:47 +00:00
cyy
f6bd20e8a2 Enable TemporaryFileName tests on Windows (#146311)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311
Approved by: https://github.com/albanD
2025-02-07 06:06:18 +00:00
1c872803cb [export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378)
Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols are logged, e.g. from the draft export CLI report (soon to be added to tlparse):
1. Guard added:
```
1. Constraint violation error.
    The specified input dynamic_shapes spec was found to be incorrect during tracing.
    Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
    This occured at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward:
            assert a.shape[0] == 3

        Locals:
            a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0)

        Symbols:
           s0: L['args'][0][0].size()[0]
...
```

2. Real tensor propagation:
```
1. Data dependent error.
    When exporting, we were unable to evaluate the value of `u2 < 0`.
    This was encountered 8 times.
    This occurred at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward:
            return res[:c_item]

        Locals:
            res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0)
            c_item: u2
...
```

Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reporting by `TracingContext`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378
Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93
2025-02-07 05:46:05 +00:00
bc40ccf6aa [BE]: Inline special functions for MPS (#146627)
These header functions should be inlined for consistency and to avoid translation unit / symbol issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627
Approved by: https://github.com/dcci
2025-02-07 05:15:15 +00:00
ecf44d1002 Fixed a typo in dataset.py (#146600)
Changed word 'Mult' to 'Multi'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600
Approved by: https://github.com/Skylion007
2025-02-07 05:09:51 +00:00
41e6d189a3 [ONNX] Create deprecation warning on dynamo_export (#146425)
Reland #146003

Deprecation of `torch.onnx.dynamo_export`:

* [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425
Approved by: https://github.com/titaiwangms, https://github.com/atalman
2025-02-07 04:20:46 +00:00
cyy
fa0592b568 Remove some NOLINT (#146610)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-07 01:50:06 +00:00
624d94bdb8 [MPS] Extend torch.special.sinc to complex (#146648)
And to integral data types as well

Was too lazy to deduce the formula myself(or write a sympy script), but ChatGPT did a decent job of doing it, though it forgot that input must be multiplied by $$\pi$$:
```math
\text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x - \cos(x)\sinh(y) y}{x^2 + y^2}
```
```math
\text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x + \sin(x)\cosh(y) y}{x^2 + y^2}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648
Approved by: https://github.com/dcci
2025-02-07 01:12:37 +00:00
9ea1823f96 [ROCm][Windows] Remove external linkage from an anonymous namespace (#146607)
Fixes a clang-cl compiler error related to an attempt to export a symbol that doesn't have any external linkage, since it's declared within a local anonymous namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607
Approved by: https://github.com/jeffdaily
2025-02-06 23:48:20 +00:00
3379c65de6 [ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606)
Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606
Approved by: https://github.com/jeffdaily
2025-02-06 23:47:18 +00:00
0d8fc00e0a [ROCm][Windows] Fix isnan integer overload errors on MS STL (#146605)
Microsoft's STL has a problem with the integer overloads of std::fpclassify used by std::isnan and std::isinf. These functions need a cast to double to function correctly. Otherwise, the call fails with an "ambiguous call to overloaded function" error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146605
Approved by: https://github.com/jeffdaily
2025-02-06 23:44:11 +00:00
3f5ed05688 [Windows][ROCm] Fix c10 hip tests (#146599)
- Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake.
- Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors.
- Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599
Approved by: https://github.com/jeffdaily
2025-02-06 23:41:25 +00:00
e13a544b54 fix tf32 issue in test_inductor_freezing.py unit tests (#146444)
The test is hitting numerical mismatches in NVIDIA internal CI. Add the tf32_on_and_off decorator and update the check to assertEqual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146444
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/eqy
2025-02-06 23:34:28 +00:00
eqy
7bd7f735d4 [CUDA][SDPA] Compute reference in test_triton_scaled_dot_product_attention_block_size_16_cuda_float32 in float64 (#146461)
Currently seems to fail with mismatches in the 1e-4 range, presumably because sdpa calls into the `MATH` backend here, which is less fused than a Triton kernel. Doing the reference computation in `float64` appears to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461
Approved by: https://github.com/drisspg
2025-02-06 23:28:56 +00:00
2834fe5e93 [inductor] Fix test error test_force_cutlass_backend_aoti_cexpr_codegen (#146564)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend -- --exact 'caffe2/test/inductor:cutlass_backend - test_force_cutlass_backend_aoti_cexpr_codegen (caffe2.test.inductor.test_cutlass_backend.TestCutlassBackend)'
```

Differential Revision: D69219873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146564
Approved by: https://github.com/yanboliang
2025-02-06 23:02:41 +00:00
0c81b398ab [BE][Ez]: Enable some additional pylint ruff warnings (#146609)
Some additional code hardening with pylint warnings in ruff that usually indicate bugs. All code currently conforms nicely to them, but this will ensure these errors can be detected statically before running or creating tests.

The following rules are enabled:
* Ban walrus operators where they would have no effect over regular assignment; making intention more clear.
* Statically check for the common error of forgetting to put parens after the `super` call, which will cause an attribute error
* Ban bad string literal args to builtins `open`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146609
Approved by: https://github.com/aorenste
2025-02-06 21:58:08 +00:00
99dd846672 [torch] fix builds for older pybind (#146630)
Summary:
some versions of pybind we build with don't have `py::set_error`.

So just use the underlying python C API.

Test Plan: unit tests

Differential Revision: D69254629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630
Approved by: https://github.com/colin2328, https://github.com/ngimel
2025-02-06 21:22:00 +00:00
3008368b12 Honor Dr.CI classification results on auto commit hash update (#146337)
Disabling `ignore_flaky_failures` was the safer choice, but it seems that this option doesn't work with the current state of the CI. For example, https://github.com/pytorch/pytorch/pull/125806 hasn't been merged since May because there would always be a failure of one type or another. This effectively disables the automation mechanism.

My proposal here is to relax this rule and allow the bot to merge auto commit hash updates with `@pytorchbot merge` like a regular PR. Then we will at least have something working. If this causes issues, we can revert it and take the longer route of improving CI reliability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146337
Approved by: https://github.com/clee2000
2025-02-06 20:33:38 +00:00
44b69b80c2 [ROCm][TunableOp] Future proof TunableOp unit test. (#146548)
TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146548
Approved by: https://github.com/jeffdaily
2025-02-06 20:26:02 +00:00
5cc1b54a91 [2/N][cp][example] flex attention in context parallel (backward pass) (#146397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397
Approved by: https://github.com/fegin
ghstack dependencies: #145896
2025-02-06 19:50:02 +00:00
6220c64aea [1/N][cp][example] flex attention in context parallel (forward pass) (#145896)
**Description**
This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step:
1. backward pass
2. static load balancing for causal masking
3. dynamic load balancing for other general maskings
4. automatic collective insertion solution
5. non-intrusive context parallel APIs

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896
Approved by: https://github.com/fegin, https://github.com/Skylion007
2025-02-06 19:50:02 +00:00
5ecdc428b2 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expression to be returned from cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
ghstack dependencies: #146194, #146195
2025-02-06 19:39:55 +00:00
1b879fd0ea [Inductor] Add a JIT Inductor unit test following #146293 (#146529)
Summary: To follow up https://github.com/pytorch/pytorch/pull/146293, add a JIT Inductor unit test. Other Triton templates may need similar fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146529
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-02-06 19:21:15 +00:00
992388c100 [inductor] use ftz variant of exp (#146216)
The Inductor-generated exp op is compiled by Triton into the following PTX snippet.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.
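As an aside, the `0f3FB8AA3B` constant in both snippets is the float32 encoding of log2(e): `exp(x)` is lowered to `ex2` via exp(x) = 2^(x·log2(e)). A quick check:

```python
import math
import struct

val = struct.unpack("!f", bytes.fromhex("3FB8AA3B"))[0]
print(val)                # ~1.4426950
print(math.log2(math.e))  # 1.4426950408889634
```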

This lets Inductor generate the FTZ variant when the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should benefit more since it does more computation per second (>10% in my testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-06 19:12:35 +00:00
9ee506bd93 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-02-06 19:04:50 +00:00
eqy
07b214402a [CUDA][B200] Update the number of threads in avg_pool2d backward for SM 10.0 (#145669)
Fixes register count issue when launching on SM 10.0, originally authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145669
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-02-06 18:57:33 +00:00
99ddbb4802 [dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527)
Earlier, if there were no ops in the graph, fullgraph=True would also fall back to eager. This hides issues in testing, where we silently fall back to eager and do not test the optimized bytecode. As can be seen in the PR, I had to fix several tests when I forced the use of the optimized bytecode in the absence of a graph. A few failing tests will be fixed in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146527
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
2025-02-06 18:56:07 +00:00
15b1ac3e86 Add torch.func.debug_unwrap (#146528)
Use it to unwrap any functorch-wrapped tensor. I don't recommend using
the output in a program since it breaks the semantics of the transforms,
but it seems useful for debugging.
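A minimal usage sketch (assuming `debug_unwrap` takes the wrapped tensor as its only required argument):

```python
import torch
from torch.func import grad

def f(x):
    # inside the transform, x is a functorch-wrapped tensor; unwrap it only to inspect
    inner = torch.func.debug_unwrap(x)  # assumed call form
    print(type(inner), inner)
    return (x ** 2).sum()

grad(f)(torch.randn(3))
```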

I will note that some people have wanted to get intermediate values out
of an e.g. grad transform, so this might be a way to do that...

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146528
Approved by: https://github.com/Chillee
2025-02-06 18:48:09 +00:00
49082f9dba parallelize sort (#142391)
- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

The performance is measured using the following script:
```python
import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391
Approved by: https://github.com/malfet
2025-02-06 18:06:40 +00:00
7725d0ba12 [METAL] inline bfloat min/max (#146588)
After a recent commit 36c6e09528a7e071edecde083254da70cba26c95, building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max:
```
FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air
LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')!
```

This PR fixes that.

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588
Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:57:31 +00:00
e2e265e27b [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-02-06 17:27:07 +00:00
1090e58687 [mps] Remove a stale comment. (#146619)
The implementation of the function was moved to a shader, but the comment was left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146619
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:25:29 +00:00
46390e9a37 [mps] Implement support for sinc() operator (inductor and eager). (#146539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 16:37:27 +00:00
a14c780c4c [dynamo] fix dynamo_compile logging on RecompileLimitExceeded (#146544)
Logging branches based on whether RecompileLimitExceeded is hit. If we exceed the limit, we fall back to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context:
72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935).

dynamo_config and recompile_reason are both known before we raise RecompileLimitExceeded, so we can add them with the rest of the "common" metrics, which are logged on metric_context decorator exit, which is always called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544
Approved by: https://github.com/masnesral
2025-02-06 16:20:42 +00:00
6ff3383157 Enable CUPTI on Windows (#141454)
Fixes:
- https://github.com/pytorch/pytorch/issues/93855

The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events.
Additionally, the changes can be verified using the following script:

```
import torch
from torch.profiler import profile, ProfilerActivity

def check_cupti_enabled():
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return False

    # Create a simple CUDA tensor
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.randn(1000, 1000, device="cuda")

    try:
        # Use PyTorch profiler to perform a basic check
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            z = x @ y  # Simple CUDA operation

        # Print profiling results
        print("CUPTI is enabled and profiling works.")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
        return True
    except RuntimeError as e:
        # If profiling fails, CUPTI is likely not set up correctly
        print("Error: CUPTI might not be enabled or accessible.")
        print(f"Details: {e}")
        return False

if __name__ == "__main__":
    if check_cupti_enabled():
        print("CUPTI is properly configured in PyTorch.")
    else:
        print("CUPTI is not configured correctly. Check your CUDA installation.")
```

Sample output:
```
CUPTI is enabled and profiling works.
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
     sgemm_128x128x8_NN_vec         0.00%       0.000us         0.00%       0.000us       0.000us       2.086ms       100.00%       2.086ms       2.086ms             1
                   cudaFree         9.67%       9.816ms         9.67%       9.816ms       9.816ms       0.000us         0.00%       0.000us       0.000us             1
     cudaDeviceGetAttribute         0.01%      10.000us         0.01%      10.000us       0.476us       0.000us         0.00%       0.000us       0.000us            21
    cudaGetDriverEntryPoint         0.00%       1.700us         0.00%       1.700us       0.850us       0.000us         0.00%       0.000us       0.000us             2
       cudaGetSymbolAddress        85.15%      86.438ms        85.15%      86.438ms      86.438ms       0.000us         0.00%       0.000us       0.000us             1
                 cudaMalloc         0.43%     433.300us         0.43%     433.300us     144.433us       0.000us         0.00%       0.000us       0.000us             3
           cudaLaunchKernel         2.61%       2.648ms         2.61%       2.648ms       2.648ms       0.000us         0.00%       0.000us       0.000us             1
      cudaDeviceSynchronize         2.13%       2.163ms         2.13%       2.163ms       2.163ms       0.000us         0.00%       0.000us       0.000us             1
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 101.511ms
Self CUDA time total: 2.086ms

CUPTI is properly configured in PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454
Approved by: https://github.com/malfet
2025-02-06 15:58:20 +00:00
FEI
8a4dd763b8 [CCA] remove TODO for hardware_destructive_interference_size (#145591)
@zyan0 @albanD  @houseroad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591
Approved by: https://github.com/albanD
2025-02-06 14:41:25 +00:00
ed309b9156 Re-add stft option to align window for center = false (#146379)
Skips advancing the FC window on https://github.com/pytorch/pytorch/pull/145437, since I just found that there were non-trivial efforts to do so a while ago that were eventually reverted: https://github.com/pytorch/pytorch/pull/73434

Works around the issue by keeping the stft sans center overload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-02-06 14:07:13 +00:00
1b79d47635 Revert "[dynamo] check for incompatible configs (#146513)"
This reverts commit aab7925418be561a8af6adfcb8cf009a8786c31b.

Reverted https://github.com/pytorch/pytorch/pull/146513 on behalf of https://github.com/atalman due to inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/13174131431/job/36772837627) [HUD commit link](4a545eb85d) ([comment](https://github.com/pytorch/pytorch/pull/146513#issuecomment-2639860568))
2025-02-06 13:42:25 +00:00
340cfe4f28 [dynamo][fbcode] Turn on inline_inbuilt_nn_modules (#145407)
As title.

Some internal testing at https://fb.workplace.com/groups/241460628989036/permalink/411650015303429/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145407
Approved by: https://github.com/ezyang, https://github.com/jansel
2025-02-06 13:18:35 +00:00
bd7d4fb2b5 Revert "[DTensor][Test] Create a simple unit test for tensordot (#146514)"
This reverts commit 1f8baf09ea598c97f30731ddb8328b6aa8d31fe9.

Reverted https://github.com/pytorch/pytorch/pull/146514 on behalf of https://github.com/albanD due to The lint failures that you ignored are real right? ([comment](https://github.com/pytorch/pytorch/pull/146514#issuecomment-2639554636))
2025-02-06 11:26:43 +00:00
4a545eb85d Fix torch.nn.functional.one_hot param num_classes optional description (#146470)
The `torch.nn.functional.one_hot` [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) describes the `num_classes` parameter as not optional, but users can call the method without passing it.

![image](https://github.com/user-attachments/assets/4e6d4feb-691f-451f-95b5-4ac11bac7bc2)

```python
>>> import torch
>>> a = torch.arange(0, 5) % 3  # [0,1,2,0,1]
>>> torch.nn.functional.one_hot(a)
tensor([[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]])

```

`num_classes` has a default value of -1:

93d98aca31/aten/src/ATen/native/native_functions.yaml (L6154-L6157)
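Passing `num_classes` explicitly just fixes the width of the result instead of inferring it from the input's maximum value:

```python
>>> a = torch.arange(0, 5) % 3  # [0, 1, 2, 0, 1]
>>> torch.nn.functional.one_hot(a, num_classes=5)
tensor([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0]])
```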

## Test Result

![image](https://github.com/user-attachments/assets/2c7203b7-6226-4ebc-84c8-cbf912fc48e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146470
Approved by: https://github.com/albanD
2025-02-06 07:48:05 +00:00
aab7925418 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42
2025-02-06 07:39:52 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt`: `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default, which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs; see also #120925
+ fixes broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but didn't seem to fully work; here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider (see the sketch below)
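
A minimal usage sketch (hedged: the env-var value is just an example, and whether a given matmul actually routes through cuBLASLt depends on dtype/shape heuristics):

```python
# Sketch only: with this change, the single cuBLAS workspace setting is assumed
# to also govern cuBLASLt matmuls, so no separate CUBLASLT_WORKSPACE_SIZE is needed.
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # must be set before CUDA init

import torch

a = torch.randn(256, 128, device="cuda", dtype=torch.float16)
b = torch.randn(128, 64, device="cuda", dtype=torch.float16)
c = a @ b  # may dispatch to cuBLASLt; reuses the cached per-stream cuBLAS workspace
```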

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
36c6e09528 [MPSInductor] Fix min/max for bfloat16 (#146552)
By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max

Test by running `test_min_max_reduction`
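
A tiny end-to-end sketch of the fixed path (assumes an Apple-silicon machine with an MPS device; the compiled kernel internally upcasts bfloat16 to float):

```python
import torch

@torch.compile  # uses the MPSInductor backend when the tensor lives on "mps"
def minmax(x):
    return x.min(), x.max()

x = torch.randn(1024, dtype=torch.bfloat16, device="mps")
print(minmax(x))
```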

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552
Approved by: https://github.com/dcci
2025-02-06 05:15:00 +00:00
1f8baf09ea [DTensor][Test] Create a simple unit test for tensordot (#146514)
Fixes #ISSUE_NUMBER

The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l
2025-02-06 05:09:34 +00:00
e01a5e9e1e Small improvements to NJT matrix multiplies (#146405)
Fixes #146404

Adds changes to the matmul and matmul_backward operations for nested jagged tensors, to support backpropagation when the output is a regular strided tensor.
This required adding support for the nested matmul operation to work when the nested tensor isn't 'self', i.e.
`A@B` where `A` isn't nested but `B` is.

The operation schemas had to be updated to reflect that either input can be a strided tensor instead (and the gradient), so an extra assertion is added in an edge case where neither input is nested.

Unit tests are also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2025-02-06 04:51:12 +00:00
389c5c0842 print out partial fx graph for all data-dependent errors (#146363)
The previous implementation didn't catch the following type of errors

```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not extract specialized integer from data-dependent expression u2 (unhinted: u2).  (Size-like symbols: none)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146363
Approved by: https://github.com/angelayi, https://github.com/bdhirsh
ghstack dependencies: #146298, #146296
2025-02-06 04:21:34 +00:00
425804db2b [torch] fix exception types in custom class magic setattr/getattr (#146516)
Summary:
`c10::AttributeError` is not automatically converted to a Python AttributeError; it needs some special macros (e.g. `HANDLE_TH_ERRORS`).

Some Python functions like `hasattr` rely on the type of the thrown exception being correct.

We don't need the full generality of those macros, so just do a targeted error type conversion here.
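
A plain-Python analogue of why the exception type matters (illustrative only, not the custom-class binding itself): `hasattr` treats only `AttributeError` as "attribute missing"; any other exception type propagates.

```python
class Good:
    def __getattr__(self, name):
        raise AttributeError(name)      # hasattr() returns False

class Bad:
    def __getattr__(self, name):
        raise RuntimeError(name)        # e.g. an unconverted c10::AttributeError

print(hasattr(Good(), "x"))             # False
try:
    hasattr(Bad(), "x")
except RuntimeError as e:
    print("hasattr leaked an exception:", e)
```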

Test Plan: added unit test

Differential Revision: D69197217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516
Approved by: https://github.com/zdevito
2025-02-06 02:14:11 +00:00
3a6a203b98 [dynamic shapes][real tensor tracing] propagate unbacked hint when creating mod replacement (#146381)
Fixes data-dependent errors for 2 PT2I models in draft export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146381
Approved by: https://github.com/angelayi
2025-02-06 01:48:40 +00:00
c5062cca98 [export] make stack_trace optional in insert_custom_op_guards (#146438)
Summary: Fixes 1 PT2I exportability error

Test Plan: -

Differential Revision: D69132186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146438
Approved by: https://github.com/yiming0416, https://github.com/angelayi
2025-02-06 01:48:26 +00:00
6a985d8b2e Make inductor_utils.requires_gpu accept MPS (#145156)
Not yet ready to set HAS_GPU to true, but can unskip tests that require GPU
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)

- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, `test_mutable_custom_op_fixed_layout2`, otherwise the GPU tests are just running for the _cpu suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add a `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it needs neither CPU nor GPU, but rather a Triton backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
2025-02-06 01:14:36 +00:00
0dc03134d9 [MPS] linalg solve implementation (#146531)
Fixes #98222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 00:57:49 +00:00
495049860b [BE][Metal] Fix signed unsigned comparison warning (#146549)
I wish I knew how to extract Metal warnings during JIT compilation, but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie, as `error:` stays `nil` unless shader compilation fails. But when it does, the following warnings are thrown
```
program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549
Approved by: https://github.com/dcci
2025-02-06 00:40:17 +00:00
e0cf519ade Revert "[inductor] Refactor op handlers part 2 (#146252)"
This reverts commit 13f0436abdff0386f33c7a8c25caa66e9af16dbd.

Reverted https://github.com/pytorch/pytorch/pull/146252 on behalf of https://github.com/atalman due to Sorry need to revert, failing internally ([comment](https://github.com/pytorch/pytorch/pull/146252#issuecomment-2638305417))
2025-02-06 00:04:04 +00:00
c7087d6b14 [BE][EZ][Metal] Do not pass tensor length as arg (#146522)
As all devices capable of running Metal-2 support nonuniform threadgroup sizes, see https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf for more detail
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146522
Approved by: https://github.com/dcci
ghstack dependencies: #146521
2025-02-06 00:03:41 +00:00
54ef029532 [BE][EZ][Metal] Mark constant inputs as constant (#146521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146521
Approved by: https://github.com/dcci
2025-02-06 00:03:41 +00:00
2001066c61 Revert "[inductor] Refactor op handlers part 3 (#146254)"
This reverts commit 8e9bda8d895e80da0fe480d02e100bae8332ed57.

Reverted https://github.com/pytorch/pytorch/pull/146254 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146254#issuecomment-2638300857))
2025-02-05 23:59:50 +00:00
72405b0c0f [ca] refactor compile reasons and log to tlparse (#146386)
This PR accumulates compile reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change.

sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

for compiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]"
]
```

for recompiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]",
  "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386
Approved by: https://github.com/jansel
ghstack dependencies: #146229
2025-02-05 23:33:21 +00:00
68304dba7a Revert "[inductor] Refactor op handlers part 4 (#146255)"
This reverts commit 7aced455c542f629ffcd4f79c6af259bb966add8.

Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))
2025-02-05 23:24:20 +00:00
49effa0deb Revert "[inductor] Refactor op handlers part 5 (#146257)"
This reverts commit d3dd3eeb7f599a2816ba1a067a8fa5a1bb1c84c3.

Reverted https://github.com/pytorch/pytorch/pull/146257 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146257#issuecomment-2638251994))
2025-02-05 23:20:38 +00:00
93e1e6e07c Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282)"
This reverts commit b8a529cca18ae4d21b1681c5ea3a40635aba5a83.

Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))
2025-02-05 23:13:08 +00:00
7dc5cfe2ad Revert "[inductor] Refactor CaptureIndexing into global scope (#146297)"
This reverts commit 7288950bcd4c5851e003dded6ce87da643b93e49.

Reverted https://github.com/pytorch/pytorch/pull/146297 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146297#issuecomment-2638234829))
2025-02-05 23:10:08 +00:00
9555bfce88 Revert "[inductor] Pre-populate cache for simplify_with_ranges return value (#146373)"
This reverts commit 84ba9c6e7844a0b457bc64ca70a9c8cf3655d03d.

Reverted https://github.com/pytorch/pytorch/pull/146373 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146373#issuecomment-2638232033))
2025-02-05 23:07:08 +00:00
8af31e30d7 [Codemod][AddExplicitStrictExportArg] caffe2/torch (#146439)
Differential Revision: D69068432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146439
Approved by: https://github.com/avikchaudhuri
2025-02-05 22:56:54 +00:00
97b64f2e5c Fix workflow for closing nonexistent disable issues (#146447)
The workflow could not update issues because it didn't have permissions, and it looked green because it didn't check return codes.

Tested by running the workflow and seeing that issues did get closed
Fixes https://github.com/pytorch/pytorch/issues/145382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146447
Approved by: https://github.com/huydhn
2025-02-05 22:29:05 +00:00
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs is to remove it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBV schedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path, and we will be able to remove this argument from our classes entirely.

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
98b5d455fd [opcheck] Improve error reporting; allow atol/rtol overrides (#146488)
This PR improves opcheck to:
1. directly use torch.testing.assert_close (without a msg override).
   This allows it to print the absolute and relative differences and the
   number of mismatched elements.
2. take in an atol/rtol tolerance (for when someone just wants to use
   opcheck in their testing); see the sketch below.
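
A hedged sketch of the tolerance override from item 2 (it assumes `atol`/`rtol` are exposed as plain keyword arguments, and `mylib::my_sin` is a stand-in custom op registered elsewhere):

```python
import torch
from torch.library import opcheck

x = torch.randn(8)
# Loosened tolerances for an op whose reference implementation differs slightly.
opcheck(torch.ops.mylib.my_sin, (x,), atol=1e-4, rtol=1e-4)
```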

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488
Approved by: https://github.com/williamwen42
2025-02-05 21:25:06 +00:00
1f6b566d74 [ONNX] Bump onnx and onnxscript versions in CI (#146097)
Bump onnx and onnxscript==0.1 in CI; skipped onnxruntime 1.19 because it has a regression on avgpool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146097
Approved by: https://github.com/malfet
2025-02-05 21:00:25 +00:00
9da376daa6 Add retain-output argument (#145921)
This PR adds a retain-output argument, which enables appending to the output file if it already exists instead of deleting it and creating a new one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921
Approved by: https://github.com/jansel
2025-02-05 19:45:09 +00:00
dd349207c5 Add check that envvar configs are boolean (#145454)
So we don't get unexpected behavior when higher typed values are passed in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145454
Approved by: https://github.com/c00w, https://github.com/jamesjwu
2025-02-05 19:40:10 +00:00
9091096d6c Refactoring Distributed test cases to be device agnostic [1/n] (#145222)
In this series of PRs we intend to refactor distributed test cases to be completely device agnostic.

These changes include the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies
- Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to distributed backends (NCCL, HCCL, etc.) with get_default_backend_for_device (#140536).

This should result in a significant improvement in usability for all devices; a minimal sketch of the pattern follows below.
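
A minimal sketch of the device-agnostic pattern the bullets describe (assuming the helpers referenced in #138216/#140536 are available in the installed build):

```python
import torch
import torch.distributed as dist

device_type = "cuda" if torch.cuda.is_available() else "cpu"
backend = dist.get_default_backend_for_device(device_type)  # e.g. "nccl" or "gloo"
device_mod = torch.get_device_module(device_type)            # torch.cuda / torch.cpu
print(backend, device_mod)
```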

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145222
Approved by: https://github.com/kwen2501
2025-02-05 18:47:09 +00:00
eqy
6f7fda3f49 Bump nn.functional.conv3d tolerances for test_comprehensive (#135719)
`float16` tolerance was previously set to `1e-5` which seemed very low
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719
Approved by: https://github.com/Chillee, https://github.com/albanD
2025-02-05 18:34:12 +00:00
d2a2b9f8a7 Fix constants with non-functional operators (#145593)
Previously, in the non-strict path, we always errored when trying to in-place update a constant tensor because those constant tensors are not actually wrapped by functional tensors. This is correct behaviour in torch.compile, because dynamo makes all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, there is no such step that registers constants as buffers, so AOTDispatcher panics when it sees these dangling constant tensors while functionalizing.

Due to a recent change in the IR, this is no longer an issue in the non-strict path because we don't call AOTDispatcher at the training IR level, but it is now a problem for both strict and non-strict when we lower to inference. (Lowering to inference is very similar to non-strict tracing.) As a result, we have at least one external issue (https://github.com/pytorch/pytorch/issues/141336) and internal issues reported due to this difference.
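
A hedged repro sketch of the kind of pattern described above (names are illustrative; the constant is a plain tensor attribute, neither a parameter nor a registered buffer, that an op updates in place):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.const = torch.ones(3)   # constant: neither a parameter nor a buffer

    def forward(self, x):
        self.const.add_(1)           # in-place update of the constant
        return x + self.const

ep = torch.export.export(M(), (torch.randn(3),))
ep = ep.run_decompositions()         # lowering to inference is where this used to fail
```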

To fix this, there are two ways:
1. Make functionalization aware of constant tensors and map them to functional tensors on the fly. This makes the functionalization invariant uglier and could potentially open the gate for more nasty bugs.
2. Special-case this in export. This seems more aligned with what dynamo does today, so I think we should do it this way. I think the current state could benefit from more refactors to make run_decompositions more similar to strict export (because both of them now handle this constant-registering logic), but it is a bit complicated to do now because the strict-export version of this logic is also incomplete (it doesn't take into account the export graph renaming pass, etc.). I will follow up with more refactors after this PR (T213466691) to unblock users faster.

For future reference:

Why are we not doing "turn constants into non-persistent buffers and never de-register"? The reason is that some internal models rely on module.to reliably moving params/buffers to the correct device. As a result, buffers are moved while constants are not. In the composability meeting, we agreed that export won't do device-agnostic tracing going forward (it will provide a way to specify a FakeTensor on CPU that can be configured to run on GPU), so after that is done, we can always turn constants into non-persistent buffers, which will simplify export's constant handling.

Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593
Approved by: https://github.com/avikchaudhuri
2025-02-05 17:44:19 +00:00
44248c44eb [ROCm] miopen benchmark behavior now better aligns with cudnn (#145294)
The default benchmark setting is now false. With the new miopen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (same behavior as the prior default) and that result is used. When benchmark is enabled, an exhaustive search is performed and any DBs are updated. miopen immediate mode is still available and is used when deterministic is true and benchmark is false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294
Approved by: https://github.com/BrianHarrisonAMD, https://github.com/malfet
2025-02-05 17:19:53 +00:00
f27220e32a Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))
2025-02-05 16:39:37 +00:00
f55c0af37f [inductor] Support non-power-of-2 cooperative RSPLIT (#145689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145689
Approved by: https://github.com/eellison
2025-02-05 16:36:53 +00:00
db22e9d5a2 Implement blend operation for float, double, int in VEC ATen backend for SVE (#146479)
- Added support for SVE vectorized blend operation for float, double, int8_t, int16_t, int32_t and int64_t data types.
- Utilizes SVE ACLE intrinsics (svcntb, svcntw, svcmpne, svsel) to handle different vector lengths (VL) dynamically.
- Ensured compatibility with SVE128, SVE256, and SVE512 hardware configurations.
- Re-enabled the blend SVE vec tests.

**Testing:**
**a) Float DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**b) Double DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**c)Int DType:**
python3 test/inductor/test_cpu_repro.py CPUReproTests.test_vec_remainder
[Test Passed] on Graviton 3 machine (SVE256) and on Graviton 4 machine (SVE128)
<img width="661" alt="grv4_test_case_passed" src="https://github.com/user-attachments/assets/5572fcc0-a861-4bd6-bf9e-356219ffe656" />

Fixes https://github.com/pytorch/pytorch/issues/146309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146479
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-05 16:29:13 +00:00
cd6c0707a8 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixes the following issue when compiling this program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-05 15:43:05 +00:00
1bb977a2a4 [auto_functionalized] Support Tensor(a!)[]? (#145400)
Summary:
This is just updating some of the checks to allow the Tensor(a!)[]? type
through.

Fixes #144072

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145400
Approved by: https://github.com/laithsakka
2025-02-05 14:52:39 +00:00
282d185ec1 Revert "[inductor] use ftz variant of exp (#146216)"
This reverts commit b0b3fe8bcf00f30513e9bb3e197ea4cbcc2beef0.

Reverted https://github.com/pytorch/pytorch/pull/146216 on behalf of https://github.com/atalman due to inductor/test_op_completeness.py::TestOpCompleteness::test_triton_overrides [GH job link](https://github.com/pytorch/pytorch/actions/runs/13152430750/job/36702812599) [HUD commit link](b0b3fe8bcf) ([comment](https://github.com/pytorch/pytorch/pull/146216#issuecomment-2636961317))
2025-02-05 14:13:45 +00:00
8a2000fd42 [MPS] Implement support for zeta (both eager and inductor). (#146465)
A test was failing in inductor (`test_pointwise_zeta`) -- and I realized the operation was also missing from eager.
Implemented for both, leveraging the kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer.
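
A tiny usage sketch (requires a machine with an MPS device):

```python
import torch

x = torch.tensor([2.0, 3.0], device="mps")
q = torch.tensor([1.0, 2.0], device="mps")
print(torch.special.zeta(x, q))  # now backed by a native MPS kernel
```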

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465
Approved by: https://github.com/malfet
2025-02-05 13:55:50 +00:00
fd0cd6a08f [ROCm][TunableOp] Improve identification of fastest solution (#144942)
This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300.

Changes include:
- An improved timer, StreamTimerNoSync
- More aggressive skipping of slow solutions
- Additional statistics that can be used for diagnostics with PYTORCH_TUNABLEOP_VERBOSE=3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942
Approved by: https://github.com/jeffdaily
2025-02-05 11:16:49 +00:00
e20b0c82d1 [ca] no longer require is_traceable annotations for c++ autograd functions (#146229)
This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-05 08:49:17 +00:00
cyy
6293d1446b [2/N] Remove NOLINT suppressions (#146402)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402
Approved by: https://github.com/soulitzer
2025-02-05 08:38:52 +00:00
e5ea7e9cdc add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-05 08:31:38 +00:00
b0b3fe8bcf [inductor] use ftz variant of exp (#146216)
The Inductor-generated exp op is compiled by Triton into the following PTX snippet.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.

Let Inductor generate the FTZ variant if the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should benefit more since it does more computation per second (>10% in my testing).
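
A hedged opt-in sketch, assuming the toggle is an inductor config flag named `use_fast_math` as the description above suggests (the flag name is taken from this text, not verified against the config module):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.use_fast_math = True  # request the FTZ exp lowering

@torch.compile
def two_pass_softmax(x):
    return torch.softmax(x, dim=-1)

y = two_pass_softmax(torch.randn(1024, 1024, device="cuda"))
```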

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel
2025-02-05 07:35:43 +00:00
clr
93d98aca31 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If an nn.Module getattr call throws, we should make sure that we don't crash with an internal error.

Note that I couldn't figure out how to test this, so advice would be awesome.  I have my best case attempt at  https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-02-05 05:49:32 +00:00
eb832b7bcc [export] Fix draft-export logging (#146106)
Summary: Fix issue where the lazyTraceHandler does not exist

Test Plan: CI

Differential Revision: D68928070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146106
Approved by: https://github.com/yiming0416
2025-02-05 05:49:22 +00:00
f242da41c7 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 0144613e6ff6e018ca41085d1509dcceb80987f7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2635695958))
2025-02-05 04:51:39 +00:00
cyy
c6ea4425e5 Enable some tests on Windows (#146243)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146243
Approved by: https://github.com/albanD
2025-02-05 03:54:28 +00:00
f35e60b21c Revert "[cutlass backend] fix bug for accuminator dtype (#146356)"
This reverts commit 7c8ec84dab7dc10d4ef90afc93a49b97bbd04503.

Reverted https://github.com/pytorch/pytorch/pull/146356 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some slow cutlass tests are failing ([comment](https://github.com/pytorch/pytorch/pull/146356#issuecomment-2635594712))
2025-02-05 03:01:50 +00:00
3c0d2bc262 Revert "[Testing] Reduce test_exp flakiness (#146436)"
This reverts commit 4c5a9a5f949ef3019fc3ef095034ccfc973ff13d.

Reverted https://github.com/pytorch/pytorch/pull/146436 on behalf of https://github.com/huydhn due to Some test_exp2 starts failing in trunk I think ([comment](https://github.com/pytorch/pytorch/pull/146436#issuecomment-2635591878))
2025-02-05 02:58:53 +00:00
aafaf4016f [MPS] Add error checking when dispatching kernel (#146458)
Check that the thread-group size does not exceed the maximum thread-group size.
Add a regression test to validate that.
Makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458
Approved by: https://github.com/dcci
2025-02-05 02:56:40 +00:00
9e45bc82e9 [aarch64] CUDA 12.8 aarch64 builds to nightly binaries (#146378)
https://github.com/pytorch/pytorch/issues/145570

Adding CUDA 12.8 and keeping 12.6 for the sbsa build; supported CUDA_ARCH: 9.0, 10.0, 12.0.

Refactor the binaries matrix for the CUDA sbsa build. Previously cuda-aarch64 was hardcoded to CUDA 12.6; now it reads 12.6 and 12.8. New build naming example: [manywheel-py3_9-cuda-aarch64-12_8-build](https://github.com/pytorch/pytorch/actions/runs/13132625006/job/36640885079?pr=146378#logs)

TODO: once 12.8 is stable, remove 12.6 in sbsa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146378
Approved by: https://github.com/atalman
2025-02-05 02:55:21 +00:00
001ad5bef5 [MPSInductor] Scope-down test_prod running in MPS (#146460)
Multi-stage reductions are not yet a thing, and the original `test_prod` just returned 0 for large reductions, so failures were reported as flaky ones. But if one runs the same test with `MTL_DEBUG_LAYER=1`, the failure is obvious
```
2025-02-04 11:51:30.034 Python[16594:289093] Metal API Validation Enabled
test_prod (__main__.MPSBasicTests.test_prod) ... -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1266: failed assertion `(threadsPerThreadgroup.width(1) * threadsPerThreadgroup.height(2050) * threadsPerThreadgroup.depth(1))(2050) must be <= 1024. (device threadgroup size limit)'
```

Fixes https://github.com/pytorch/pytorch/issues/146430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146460
Approved by: https://github.com/dcci
2025-02-05 01:47:01 +00:00
52aaadf379 [BE][Ez]: Enable ruff rule E731. use def instead of anonymous lambda (#146410)
Not sure why this isn't enabled, only 1 fix is needed and it supports autofixes.
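
For reference, a minimal before/after of what E731 flags (and can autofix):

```python
# Flagged by E731: binding a lambda to a name.
square = lambda x: x * x  # noqa: E731

# Preferred form: a named def.
def square(x):
    return x * x
```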
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146410
Approved by: https://github.com/aorenste, https://github.com/albanD
2025-02-05 01:44:41 +00:00
0e060342b6 [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte, https://github.com/jansel
2025-02-05 01:42:33 +00:00
616ac94175 [Dynamo] Fix spammy optimizer warning (#146374)
Fixes https://discuss.pytorch.org/t/torch-compile-optimizer-step-generates-excessive-warning-messages/216067/7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146374
Approved by: https://github.com/anijain2305
2025-02-05 01:03:49 +00:00
8177fc4d33 Make regex error catching compatible with Python 3.12+. (#145945)
In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object".
The old regex would no longer catch the error.

This PR makes it compatible with Python 3.12 while remaining backward compatible.
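
A minimal sketch of the idea (the exact pattern used in the test utilities may differ):

```python
import re

old_msg = "Can't pickle local object 'f.<locals>.g'"   # Python < 3.12
new_msg = "Can't get local object 'f.<locals>.g'"      # Python >= 3.12
pattern = re.compile(r"Can't (?:pickle|get) local object")

assert pattern.search(old_msg) and pattern.search(new_msg)
```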

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945
Approved by: https://github.com/H-Huang
2025-02-05 00:57:36 +00:00
9d5bf38dec [cpp_builder] refactor to reduce libcudart_static logs (#146394)
Want to reduce logs from `log_msg = f'"libcudart_static.a" not found under {path}'`, which was added in https://github.com/pytorch/pytorch/pull/142175

Differential Revision: D69096354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146394
Approved by: https://github.com/benjaminglass1, https://github.com/chenyang78
2025-02-05 00:41:30 +00:00
658e22d495 Revert "add support for capturing provenance of unary operations (#146413)"
This reverts commit bc33d993acdff2637bc6aee5e604fb969b11fc13.

Reverted https://github.com/pytorch/pytorch/pull/146413 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but some export tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/146413#issuecomment-2635440261))
2025-02-05 00:32:40 +00:00
6e03f4f90e [export] Include metadata in FlatArgsAdapter (#146107)
Summary:
With https://github.com/pytorch/pytorch/pull/145956, which introduces
storing a list of namedtuple field names when serializing, we now want to
expose this list to the args adapter so that APS can utilize this information
and remove extraneous inputs.

Test Plan: No-op

Differential Revision: D68928416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146107
Approved by: https://github.com/pianpwk
2025-02-05 00:29:58 +00:00
84ba9c6e78 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282, #146297
2025-02-04 23:36:44 +00:00
7288950bcd [inductor] Refactor CaptureIndexing into global scope (#146297)
And inlines SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282
2025-02-04 23:36:44 +00:00
b8a529cca1 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257
2025-02-04 23:36:34 +00:00
d3dd3eeb7f [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255
2025-02-04 23:36:25 +00:00
7aced455c5 [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254
2025-02-04 23:36:17 +00:00
8e9bda8d89 [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252
2025-02-04 23:36:09 +00:00
13f0436abd [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
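
A generic illustration of the refactor (class and method names here are illustrative stand-ins, not the real inductor classes):

```python
class LegacyHandler:
    def __getattr__(self, name):             # old pattern: dynamic op lookup
        def op(*args, **kwargs):
            return ("default", name, args)
        return op

class DefaultHandler:
    def _default(self, name, args, kwargs):  # single hook every op forwards to
        return ("default", name, args)

    # Explicit per-op methods (generated in the real code) replace __getattr__.
    def add(self, *args, **kwargs):
        return self._default("add", args, kwargs)

    def mul(self, *args, **kwargs):
        return self._default("mul", args, kwargs)
```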

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
ghstack dependencies: #146225, #146226, #146235
2025-02-04 23:36:01 +00:00
67be5953fe [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-04 23:35:53 +00:00
ed03f9ca10 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-04 23:35:43 +00:00
5cac550ddf [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-04 23:35:33 +00:00
7c8ec84dab [cutlass backend] fix bug for accuminator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-04 22:10:17 +00:00
13e17aa106 Make the CUTLASS swizzle options configurable and default to 2. (#146088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146088
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
2025-02-04 22:07:26 +00:00
aac0577796 [TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398)
We have noticed some discrepancy between the ways `test_sparse_semi_structured.py` was called, and in some of them the tests falsely fail because they were attempting to run on the wrong backend. All because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS` as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart 8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)

When I run tests via pytest, just by sheer luck it calls `test_values_backend_cutlass_cuda`, which sets the backend to CUTLASS bb4bd5f00b/test/test_sparse_semi_structured.py (L475) before `test_conversions_all_patterns_cuda_*`:
```
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s]                                                                                          [ 72%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s]                                                                        [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s]                                                                         [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s]                                                                            [ 73%]
```
In this scenario everything is good.

But when run as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is not the same, and the cuSparseLt backend is set just before running `test_conversions_all_patterns_cuda_*`, which causes failures:
```
test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok
...
test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL
test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... FAIL
test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398
Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy
2025-02-04 22:07:12 +00:00
317dae95fa cpp_wrapper: fix CPU cpp_wrapper and max-autotune tests (#145683)
Both of these tests mostly failed due to incorrect assumptions about the generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145683
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654, #145655
2025-02-04 22:05:59 +00:00
e2a029054d cpp_wrapper: enable all CPU repro tests (#145655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654
2025-02-04 22:05:59 +00:00
9873319a42 cpp_wrapper: fix set_.source_Tensor lowering (#145654)
Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654
Approved by: https://github.com/desertfire
ghstack dependencies: #145095
2025-02-04 22:05:59 +00:00
7c0fe7a045 cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095)
Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095
Approved by: https://github.com/desertfire
2025-02-04 22:05:58 +00:00
09b0dfdc90 [metal] Add a missing cast to make the call to copysign unambiguous. (#146422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422
Approved by: https://github.com/Skylion007, https://github.com/Samkm0084
2025-02-04 22:04:25 +00:00
clr
4e194bbfd6 dynamo: fsdp throw unimplemented vs attribute error (#146188)
Rather than throwing a full exception for FSDP, just return unimplemented
and respect the user options (i.e. fullgraph vs. graph break).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146188
Approved by: https://github.com/jansel
2025-02-04 21:45:55 +00:00
4c5a9a5f94 [Testing] Reduce test_exp flakiness (#146436)
By setting `reference_in_float` to false, as `exp(a + b)` could yield significantly different results than `exp(a.half() + b.half())`, as one can see in the following example (which happens to be the random values generated by the MacOS RNG for this test)

```
>>> import torch
>>> x=torch.tensor(2.5599, dtype=torch.half)
>>> y=torch.tensor(0.6970, dtype=torch.half)
>>> (x + y).exp()
tensor(26., dtype=torch.float16)
>>> (x.float() + y.float()).exp()
tensor(25.9799)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146436
Approved by: https://github.com/dcci
2025-02-04 21:24:08 +00:00
bc33d993ac add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 21:16:15 +00:00
07b9fe0690 [Trace PyDispatcher] Add CustomFunctionHigherOrderOperatorVariable (#146272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146272
Approved by: https://github.com/zou3519
ghstack dependencies: #146270, #146271
2025-02-04 20:55:51 +00:00
d23e4f8109 use DTRACE_ENV_VAR as the trace logs directory of set (#146412)
```
(/home/bobren/local/a/pytorch-env) [7:47] devgpu035:/home/bobren/local/a/pytorch TORCH_DTRACE=/tmp/bb python r1.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146412
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 20:54:28 +00:00
7f65a20884 [BE]: Enable ruff SLOT checks (#146276)
This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so instances don't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple or str, of which we have many in our codebase.
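
A small illustration of what the rule catches (generic Python, not a PyTorch-specific case):

```python
class TaggedStr(str):        # flagged: subclass of an immutable type without __slots__
    pass

class SlimTaggedStr(str):    # preferred: no per-instance __dict__ is allocated
    __slots__ = ()

s = TaggedStr("x")
s.extra = 1                  # works, but every instance now carries a __dict__
t = SlimTaggedStr("x")
# t.extra = 1                # would raise AttributeError
```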

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
2025-02-04 19:18:23 +00:00
3525b834f0 [MPSInductor] Implement argmax/argmin (#146429)
TODOs:
 - Find test with NaN
 - Report internal compiler error when running `test_argmax_argmin1` (which is actually not enough shared memory)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429
Approved by: https://github.com/dcci
ghstack dependencies: #146423, #146428
2025-02-04 19:16:06 +00:00
c591ad0c03 dump partial fx graph to stderr when dynamo tracing fails with guard on data-dependent (#146296)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data dependent error

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs)
```

This is what it now prints out

```
class GraphModule(torch.nn.Module):
    def forward(self, L_cache_: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", L_update_: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", L_pos_: "i64[1][1]cpu"):
        l_cache_ = L_cache_
        l_update_ = L_update_
        l_pos_ = L_pos_

         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        getitem: "i64[][]cpu" = l_pos_[0];  l_pos_ = None
        item: "Sym(u0)" = getitem.item();  getitem = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = None
        _check = torch._check(le);  le = _check = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0
        _check_1 = torch._check(ge);  ge = _check_1 = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        before: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = l_cache_.narrow(2, 0, item);  before = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = l_cache_.narrow(2, add_1, sub_1);  l_cache_ = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/torch/_dynamo/utils.py", line 3075, in run_node
    return getattr(args[0], node.target)(*args[1:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)

Caused by: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))  # r1.py:21 in forward (utils/_stats.py:27 in wrapper)
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146296
Approved by: https://github.com/zou3519
ghstack dependencies: #146298
2025-02-04 19:12:39 +00:00
8f861a8dfb [experimental] filter logs by subgraph (#146047)
```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[1/0]" python r4.py
```

```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[0/0],[1/0_1]" python r4.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146047
Approved by: https://github.com/laithsakka
2025-02-04 19:11:44 +00:00
7d60235aa6 [Metal] Small speedup for sum/prod (#146428)
As they can not really be invoked over empty arrays
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146423
2025-02-04 19:10:33 +00:00
b1663b31e1 [Metal][BE] Add #pragma once to all headers (#146423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-04 19:10:33 +00:00
292af3cc89 [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408)
Apply the ruff rule about implicit string concatenation; this autofixes strings that are all the same type and on the same line. These lines were likely broken up by autoformatters in the past. All fixes are automated using the autofixes in ISC001.
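
A minimal example of what ISC001 autofixes:

```python
# Flagged: implicit concatenation of two same-type literals on one line.
msg = "first half of a message " "second half"

# Autofixed form: a single literal.
msg = "first half of a message second half"
```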

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2025-02-04 19:07:04 +00:00
f38a2ea0d4 [Dynamo] Better unsupported message for Fake Tensor Exception (#146357)
I cannot repro this. But this line shows up in internal logs, and I want
to know what the exception is and the context inside it. All of the
exceptions_allowed_to_be_fallback are dataclasses, so they should print
nicely.

Test Plan:
- code reading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146357
Approved by: https://github.com/williamwen42
2025-02-04 18:52:11 +00:00
b0fe975521 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before this PR, we were getting an undefined symbol error in the output code when an unbacked symint is **only** used in the hop, because we didn't correctly record the dependency on the unbacked symbols for hops and it got DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-02-04 18:47:34 +00:00
157d81c201 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See inline doc for their exact semantic
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork
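A hedged Python-level sketch of the distinction the two hooks expose, using the existing CUDA helpers as a stand-in (the accelerator-generic spellings are not shown in this message):

```
import torch

# "built": was support compiled into this binary? Cheap, and does not
# initialize the driver, so it is safe in a parent process that later forks.
print(torch.backends.cuda.is_built())

# "available": is a device actually usable right now? This may initialize the
# CUDA context, which is what can poison a subsequent fork.
print(torch.cuda.is_available())
```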

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-02-04 18:23:24 +00:00
23fffb54d5 Use OrderedSet in _functorch/partitioners (#146102)
In an attempt to make partitioning more deterministic, change all sets in partitioners.py to OrderedSets. Note that this change does not fix the non-determinism we're seeing in the internal model. But let's at least eliminate this potential source of non-determinism before investigating any changes to the mincut approach?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146102
Approved by: https://github.com/oulgen
2025-02-04 17:43:07 +00:00
53759ccca8 [AOTI] Fix an unaligned memory access issue in mm_template (#146293)
Summary: Fixes a corner case in the Triton MM template, where the dimension M (a dynamic size) being smaller than BLOCK_M (and similarly for the N dimension) can trigger an unaligned memory access error.

Differential Revision: D69034578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146293
Approved by: https://github.com/chenyang78, https://github.com/jansel
2025-02-04 17:12:04 +00:00
87a63a9886 Add @nikitaved to torch.linalg CODEOWNERS/persons_of_interest (#141803)
As per title. I hope there is no objection :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141803
Approved by: https://github.com/albanD
2025-02-04 16:11:31 +00:00
e9f6e273e7 [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145916
2025-02-04 16:05:39 +00:00
7a5239afd7 [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
2025-02-04 16:05:39 +00:00
5d81bc3696 [MPSInductor] Implement prod reduction (#146396)
Mostly reusing `sum` reduction logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396
Approved by: https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380, #146389
2025-02-04 14:08:04 +00:00
bbe95341d9 [MPSInductor] Implement min and max reductions (#146389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380
2025-02-04 14:04:10 +00:00
106acf0eec Revert "[aoti] Assign proxy call args by name, and support default values. (#146263)"
This reverts commit 11f69808c64a65c68a4452250ba7719dcff27c78.

Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))
2025-02-04 12:57:55 +00:00
e0f22e54e8 [ROCm][TunableOp] Support leading dimensions in TunableOp signature. (#146358)
This is a feature enhancement that:
- May improve performance by distinguishing GEMMs with different leading dimensions.
- Fix correctness issues reported by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146358
Approved by: https://github.com/jeffdaily
2025-02-04 10:27:43 +00:00
cyy
3f63f2bced Use std::string_view in tests (#146120)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146120
Approved by: https://github.com/albanD
2025-02-04 09:51:36 +00:00
8444fe019a [export] Fix requires_grad deserialization (#146351)
Test Plan: CI

Differential Revision: D69072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146351
Approved by: https://github.com/zhxchen17
2025-02-04 08:02:38 +00:00
bb4bd5f00b [Metal][BE] Fix the arguments of polygamma (#146382)
In the public API, order comes before input, while here they're
 reversed. Match for consistency (and make this less error prone).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-04 06:40:34 +00:00
54ceb7c565 [MPSInductor] Add support for sum reduction (#146380)
- Add `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses barrier to compute the reductions

TODOs:
 - Implement efficient reduction using cooperative functions such as `simd_shuffle_down`
 - Figure out how to merge several sum reduction together
 - Implement `reduction_store` that will only write results from the first thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370
2025-02-04 06:23:44 +00:00
cyy
1c16cf70c3 Apply ruff fixes to tests (#146140)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146140
Approved by: https://github.com/albanD
2025-02-04 05:41:01 +00:00
cyy
71e3575525 Remove unactivated test (#146233)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146233
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-04 05:26:04 +00:00
e68f5087d8 update _unsafe_set_version_counter to accept lists of tensors (#137921)
See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without wanting to suffer the overhead of going from python->C++ separately for every tensor in the list.

I left the binding in pybind, and used a `std::vector`. If we **really** need to optimize overhead even further, we could write a manual cpython binding.

I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921
Approved by: https://github.com/awgu, https://github.com/albanD
2025-02-04 04:51:11 +00:00
425aca40a4 Fix random crash in PyPer (#146327)
Summary: PyPer saw random crashes when writing into the ET file. This DIFF checks that the output file is in good condition before writing into it, and catches the exception if something bad happens, instead of crashing.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D69065509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327
Approved by: https://github.com/sraikund16
2025-02-04 04:50:40 +00:00
0c37c332da [export] Additionally save pytree namedtuple field names (#145956)
If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context=”class_fqn”, children_spec=[*, *])`

The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field?

This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able just drop those additional fields and the program will run successfully. This is needed/used in APS where they use unflattener's adapter to adapt the inputs based on the previously saved treespecs.

There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter.
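A small sketch of how a namedtuple input is flattened today (the printed TreeSpec formatting is approximate):

```
import collections
import torch.utils._pytree as pytree

Point = collections.namedtuple("Point", ["x", "y"])

leaves, spec = pytree.tree_flatten(Point(1, 2))
print(leaves)  # [1, 2]
print(spec)    # a TreeSpec for the namedtuple type with two leaf children

# if Point later gains a field "z", inputs flattened against the new class no
# longer line up with the saved spec -- hence the side table of field names
```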
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956
Approved by: https://github.com/zhxchen17
2025-02-04 04:42:30 +00:00
487400f47f [dynamo] Support functools.partial variables through inspect.signature (#146339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146339
Approved by: https://github.com/jansel
ghstack dependencies: #146322, #146116
2025-02-04 04:39:39 +00:00
9756c7d788 [benchmark] Remove ONNX (#146325)
ONNX exporter experiments in benchmark is obsolete and unmaintained. This PR removes it to unblock https://github.com/pytorch/pytorch/pull/146003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325
Approved by: https://github.com/titaiwangms
2025-02-04 04:02:47 +00:00
a79d8f8ba4 [ROCm] Tune 3d tensor sums when not using fastest dimension (#146170)
Tune 3d tensor sums when not using fastest dimension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146170
Approved by: https://github.com/jeffdaily
2025-02-04 04:02:16 +00:00
7997ecf809 [BE] reduce log spew from test_triton_kernels.py (#145895)
One of the tests in this file was setting `self._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file.

This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.
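A minimal sketch of the pattern; the helper name is made up, and it assumes that calling `torch._logging.set_logs()` with no arguments restores the defaults:

```
import contextlib
import torch._logging

@contextlib.contextmanager
def output_code_logs():
    # enable Inductor output-code logging only inside the block
    torch._logging.set_logs(output_code=True)
    try:
        yield
    finally:
        # reset to defaults so later tests are unaffected
        torch._logging.set_logs()
```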

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
2025-02-04 03:44:23 +00:00
5f53889850 [dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116
Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #146322
2025-02-04 03:36:07 +00:00
762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was for a safety check, but it could be costly as things scale, and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys, otherwise the API may hang.

In addition, we move the check to a utility function: `utils.assert_same_keys`. Users uncertain about state dict uniformity can optionally call this API to check.

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
7f796eb8b7 Revert "[inductor] Add typing to common.KernelArgs (#145916)"
This reverts commit 68cf36d5ab6165372160f65eb84e13d0f8dbc5dc.

Reverted https://github.com/pytorch/pytorch/pull/145916 on behalf of https://github.com/atalman due to Failing internally, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/145916#issuecomment-2632715678))
2025-02-04 03:07:12 +00:00
d3c7e4bb9c Revert "[inductor] Add typing to common.CSE (#145993)"
This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7.

Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))
2025-02-04 03:04:01 +00:00
ecbc725fad Revert "[inductor] Finish typing common.py (#146225)"
This reverts commit 3a67c0e48d29578aeeaa872275e730020bb5cbc2.

Reverted https://github.com/pytorch/pytorch/pull/146225 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146225#issuecomment-2632709707))
2025-02-04 03:01:36 +00:00
0061eb5b70 Revert "[inductor] Refactor CSEProxy into global scope (#146226)"
This reverts commit 18380ab877711f2e651c69c78675f0d0b31d2ceb.

Reverted https://github.com/pytorch/pytorch/pull/146226 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146226#issuecomment-2632707618))
2025-02-04 02:58:50 +00:00
cyy
f397c72697 Remove NOLINTNEXTLINE (#146238)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238
Approved by: https://github.com/albanD
2025-02-04 02:45:32 +00:00
5451c9b7c9 [MPSInductor] Add support for any reduction (#146370)
- Add `_new_accvar` function that creates a threadgroup variable
- As threadgroup variables can not be initialized in place, add explicit initialization for reduction var

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146370
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #146369
2025-02-04 02:45:03 +00:00
71179772cd [MPSInductor] Prep change for reduction support (#146369)
Add `group_pos` parameter as well as set `group_size` when invoking reduction kernels
Separates loads and stores and inserts a threadgroup barrier if a reduction is in place

Should be a no-op right now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146369
Approved by: https://github.com/dcci, https://github.com/jansel
2025-02-04 02:38:07 +00:00
3dcbd04d1d [cutlass backend] Add instantiation level for generating configs (#146230)
Passing through instantiation level to generate more configs.

I do see some C++ compilation errors, but running is fine. Using 2222 generates 1k+ configs.

Differential Revision: D68989194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146230
Approved by: https://github.com/Chillee, https://github.com/mlazos
2025-02-04 02:36:04 +00:00
0e49f35e3d Integrate sympy expression provenance logging with structured logs (#145848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145848
Approved by: https://github.com/angelayi
2025-02-04 01:21:37 +00:00
4168982dad PEP585: .github release triggers (#145708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145708
Approved by: https://github.com/malfet
2025-02-04 01:02:46 +00:00
cf6c5b8fa8 [mps/inductor] Adjust more tests that expect float64 as input. (#146366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146366
Approved by: https://github.com/malfet
2025-02-04 00:48:02 +00:00
2f40f789da Revert "[inductor] Refactor op handlers part 1 (#146235)"
This reverts commit 204be4e0a2e4509bd2457bfb295c429dd92c241f.

Reverted https://github.com/pytorch/pytorch/pull/146235 on behalf of https://github.com/atalman due to Breaks lint, sorry: Definition of polygamma in base class MetalOverrides is incompatible with definition in base class OpsHandler. Please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/146235#issuecomment-2632444514))
2025-02-04 00:00:08 +00:00
3aeccf2a28 DeepSpeed github repo move sync (#146320)
DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed

This PR updates this repo to use the new URL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320
Approved by: https://github.com/awgu
2025-02-03 23:20:49 +00:00
204be4e0a2 [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-03 23:15:13 +00:00
18380ab877 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-03 23:15:13 +00:00
0bc036a9e9 use copy2d in h2d/d2h copy when possible (#146256)
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking

In this PR copy2d behaves in exactly the same way as copy does with respect to those additional syncs, except that it makes a different underlying cuda call.

Tests are added for multiple cases that go through copy2d, and for cases that avoid the copy2d path due to unsatisfied conditions.
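For illustration, a hedged example of the kind of copy that is a candidate for the 2D path (whether copy2d is actually taken depends on the conditions above):

```
import torch

if torch.cuda.is_available():
    # a column slice keeps a fixed pitch between rows, so the transfer can be
    # expressed as one pitched 2D memcpy instead of a memcpy per row
    src = torch.randn(1024, 1024, pin_memory=True)[:, :512]  # non-contiguous, pinned
    dst = src.to("cuda", non_blocking=True)                  # h2d copy
```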
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 23:07:54 +00:00
35af193408 [easy] Add type annotation for autotune_num_choices_displayed (#146323)
Test Plan: ci

Differential Revision: D69064447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146323
Approved by: https://github.com/ColinPeppler
2025-02-03 23:04:21 +00:00
0463cb6ca5 [mps/inductor] Add support for digamma(). (#146292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146292
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-03 22:48:13 +00:00
178531c95e [ONNX] torch.onnx.export(dynamo=True) changes optimization to default (#146187)
Fixes #145897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146187
Approved by: https://github.com/justinchuby
2025-02-03 22:44:54 +00:00
d69c181d77 log out partial fx graph when guarding on a data-dependent expression during non-strict tracing (#146298)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data dependent error

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs, strict=False)
```

This is what it now prints out

```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", arg1_1: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", arg2_1: "i64[1][1]cpu"):
         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        select: "i64[][]cpu" = torch.ops.aten.select.int(arg2_1, 0, 0);  arg2_1 = None
        item: "Sym(u0)" = torch.ops.aten.item.default(select);  select = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = le = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0;  ge = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        narrow: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = torch.ops.aten.narrow.default(arg0_1, 2, 0, item);  narrow = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = torch.ops.aten.narrow.default(arg0_1, 2, add_1, sub_1);  arg0_1 = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/r1.py", line 45, in <module>
    torch.export.export(repro, example_inputs, strict=False)
  File "/data/users/bobren/a/pytorch/torch/export/__init__.py", line 368, in export
    return _export(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 2079, in _export
    return _export_for_training(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1944, in _export_for_training
    export_artifact = export_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1879, in _non_strict_export
    aten_export_artifact = _to_aten_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1665, in _export_to_aten_ir_make_fx
    gm, graph_signature = transform(_make_fx_helper)(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1809, in _aot_export_non_strict
    gm, sig = aot_export(wrapped_mod, args, kwargs=kwargs, **flags)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1585, in _make_fx_helper
    gm = make_fx(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2194, in wrapped
    return make_fx_tracer.trace(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2132, in trace
    return self._trace_inner(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2103, in _trace_inner
    t = dispatch_trace(
  File "/data/users/bobren/a/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_dynamo/eval_frame.py", line 749, in _fn
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1136, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1692, in trace
    res = super().trace(root, concrete_args)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 834, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1191, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
  File "<string>", line 1, in <lambda>
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1488, in wrapped_fn
    return tuple(flat_fn(*args))
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/utils.py", line 184, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 879, in functional_call
    out = mod(*args[params_len:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1793, in forward
    tree_out = mod(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/r1.py", line 21, in forward
    after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1239, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1286, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_export/non_strict_utils.py", line 654, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 866, in handler
    return torch._library.utils.handle_dispatch_mode(
  File "/data/users/bobren/a/pytorch/torch/_library/utils.py", line 296, in handle_dispatch_mode
    return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1341, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 910, in proxy_call
    out = func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 749, in __call__
    return self._op(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146298
Approved by: https://github.com/bdhirsh
2025-02-03 22:16:03 +00:00
0da07a6d1d [dynamo][skip-function] Add missing unimplemented line (#146322)
This is a missing line from the merged PR in the stack below. Lets try to get this in quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146322
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
2025-02-03 22:11:55 +00:00
00dc5b10f6 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 2fd1b6b3610eb84cd615360a8fd23756a7f2e743.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))
2025-02-03 22:04:28 +00:00
15e12d5ec3 [Trace PyDispatcher] Support temporarily_pop_interpreter_stack ctx manager (#146271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146271
Approved by: https://github.com/zou3519
ghstack dependencies: #146270
2025-02-03 21:47:54 +00:00
bd8d7b1b74 [Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146270
Approved by: https://github.com/zou3519
2025-02-03 21:47:54 +00:00
fd73ae2068 [Utilization] Convert timestamp to str for datetime64 (#145985)
Convert all timestamps (float) to int timestamps in the data pipeline for the db type datetime64.
Float does not work when trying to insert into ClickHouse using jsonExtract.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985
Approved by: https://github.com/huydhn
2025-02-03 21:05:18 +00:00
1d4adf4e1f [dynamo] log recompile reason to dynamo_compile (#146117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146117
Approved by: https://github.com/bobrenjc93
2025-02-03 21:04:04 +00:00
11f69808c6 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixes the following issue when compiling this program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-03 20:15:59 +00:00
e67ce67498 [cutlass backend] update try_import_cutlass to accomodate for pip install (#145891)
The goal of this PR is to provide 3 ways for people to try out the CUTLASS backend:
1. fbcode / internal
2. pip install torch (nightly) and pip install nvidia-cutlass
3. build from source

I will go into more detail on the combinations of building from source vs. downloading via pip for torch and cutlass.

repro:
```
import torch
import torch.nn as nn

import torch._inductor.config as config

config.force_disable_caches = True
config.max_autotune = True
config.max_autotune_gemm_backends = "CUTLASS"
# the following is only needed if you use a custom cutlass library
# config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass"

class TestModule(nn.Module):
    def forward(self, A, B):
        return A @ B

model = TestModule().cuda()
M, K, N = 2048, 2048, 2048
A = torch.randn(M, K).cuda().half()
B = torch.randn(K, N).cuda().half()

C = torch.compile(model, fullgraph=True)(A, B)
```

## pre-requisite
Assuming you have the right cuda toolkit. Recommend 12.4. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are good.

## combo 1: pip install torch + pip install nvidia-cutlass
Check https://pytorch.org/get-started/locally/ for **nightly** install command.
```
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install nvidia-cutlass
```
Then try running the script above. It should work.

## combo 2: build torch from source + pip install nvidia-cutlass
This is going to be pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences.

## combo 3: build torch from source + use pytorch/third_party/cutlass
This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own.

## combo 4: any torch version + cutlass library from somewhere else
This is probably the only case where you need to pass in cutlass_dir. Just set cutlass_dir to the cutlass repo directory. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891
Approved by: https://github.com/Chillee, https://github.com/ColinPeppler
2025-02-03 20:05:41 +00:00
f237172768 Fix not inlining functions used in metal files (#146316)
Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231
```
FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air
LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')!
[3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o
ninja: build stopped: subcommand failed.
```

AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316
Approved by: https://github.com/malfet, https://github.com/atalman
2025-02-03 19:33:52 +00:00
674e0b668a Add non-strict export while_loop test back (#146195)
This is fixed by https://github.com/pytorch/pytorch/pull/145762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146195
Approved by: https://github.com/zou3519
ghstack dependencies: #146194
2025-02-03 19:28:22 +00:00
1138d0c4f6 [hop] enable while_loop return torch.ones with unbacked symbol expression. (#146194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146194
Approved by: https://github.com/zou3519
2025-02-03 19:28:22 +00:00
57b1fc35f6 [dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146219
Approved by: https://github.com/zou3519
ghstack dependencies: #146075, #146283
2025-02-03 18:02:48 +00:00
64fc9ff09c Revert "[ONNX] Create deprecation warning on dynamo_export (#146003)"
This reverts commit e6c39d37e90242692cf25ea849abd47d11932cd7.

Reverted https://github.com/pytorch/pytorch/pull/146003 on behalf of https://github.com/atalman due to Broke internally ([comment](https://github.com/pytorch/pytorch/pull/146003#issuecomment-2631599314))
2025-02-03 17:17:14 +00:00
041e08f9dc Add buffers to parameterizaiton rule (#145991)
Differential Revision: [D68959513](https://our.internmc.facebook.com/intern/diff/D68959513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145991
Approved by: https://github.com/bdhirsh
2025-02-03 16:49:03 +00:00
c0979d72b5 Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)"
This reverts commit 68a363548409a3ff17965770304ee5e12fe718d9.

Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))
2025-02-03 16:25:58 +00:00
01554c7b5a fix incorrect literal strings / accidental tuples (#146037)
* `expr,` is short for `(expr,)`
* literal strings over multiple lines need to escape the newline `\` or use `(...)`.
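Both bullets correspond to patterns like the following (illustrative names):

```
# accidental tuple: the trailing comma makes this a 1-tuple, not a string
error_message = "tensor sizes must match",

# correct multi-line literal: parenthesize (or escape the newline) so the parts
# form one string instead of silently becoming separate expressions
error_message = (
    "tensor sizes must match "
    "for this operation"
)
```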

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037
Approved by: https://github.com/Skylion007
2025-02-03 15:08:11 +00:00
550441a87b Update slow tests (#146301)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146301
Approved by: https://github.com/pytorchbot
2025-02-03 11:37:16 +00:00
08b14936ae Disable has_relational_guards check for dict_tag optimization for now (#146232)
has_relational_guards evaluates to true almost always, and leads to a
slowdown in guards runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232
Approved by: https://github.com/anijain2305
2025-02-03 07:56:06 +00:00
e3643e1e0e [MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279)
Requested in #77764

This PR adds support for linalg.det on MPS and fixes lu_factor for non-contiguous tensors; the current implementation crashed on any kind of non-contiguous tensor with the following error:
```
-[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      python det.py
```
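A short sketch of what should now work, guarded on MPS being available:

```
import torch

if torch.backends.mps.is_available():
    a = torch.randn(4, 4, device="mps").t()  # transpose => non-contiguous
    print(torch.linalg.det(a))               # previously hit the encoder assertion above
    lu, pivots = torch.linalg.lu_factor(a)   # also fixed for non-contiguous inputs
```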
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 06:06:43 +00:00
1580f47bf4 [export][ez] Fix generated header file. (#146208)
Summary: as title.

Test Plan: CI

Differential Revision: D68978788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208
Approved by: https://github.com/yiming0416
2025-02-03 06:01:05 +00:00
cyy
7b512095ef Enable some tests on MacOS (#146268)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146268
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-03 05:04:24 +00:00
fa48757180 [dynamo] misc fixes for inspect (#146283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146283
Approved by: https://github.com/jansel
ghstack dependencies: #146075
2025-02-03 04:26:10 +00:00
cyy
6ac8bc0cd2 Remove unused import in tests (#146266)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146266
Approved by: https://github.com/Skylion007
2025-02-03 03:40:18 +00:00
d80eef7c6d [inductor] Guard a member variable with a define. (#146278)
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:

/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
  412 |   size_t blob_size_;
      |          ^
1 warning generated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-02-03 02:20:08 +00:00
c0ec2e0a0d [dynamo][functions] Improve getattr on functions (#146075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146075
Approved by: https://github.com/jansel
2025-02-03 02:01:57 +00:00
d28fe3ed47 [metal] Move digamma to special_math.h (#146284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284
Approved by: https://github.com/jansel
2025-02-03 01:29:14 +00:00
1f21f699ba [metal] Refactor digamma in preparation for moving it. (#146281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146281
Approved by: https://github.com/jansel
2025-02-02 23:54:45 +00:00
511d0dd558 [Dynamo][Trace PyDispatcher] Support calling id function over class (#146269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146269
Approved by: https://github.com/anijain2305
2025-02-02 22:29:30 +00:00
02fd4868d6 Fix unreachable code (#146262)
Fixes #146261

Removed unreachable code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146262
Approved by: https://github.com/Skylion007
2025-02-02 21:35:26 +00:00
5d55a6585d [MPS] lu factor ex implementation (#144651)
Implements `torch.linalg.lu_factor_ex`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651
Approved by: https://github.com/malfet
2025-02-02 15:09:49 +00:00
0144613e6f move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

Differential Revision: D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-02 10:43:55 +00:00
a44a8a7d3a [audio hash update] update the pinned audio hash (#145988)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145988
Approved by: https://github.com/pytorchbot
2025-02-02 04:19:29 +00:00
cyy
8543d8395b [2/N] Enable ruff F841 on distributed tests (#146132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146132
Approved by: https://github.com/Skylion007, https://github.com/rec
2025-02-02 03:44:48 +00:00
cef856faa9 [dynamo][enum] Trace through enum.py for enum construction (#146070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146070
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258, #146214
2025-02-02 03:12:36 +00:00
31fb691782 [dynamo] Graph break on tensor.retain_grad (#146214)
Fixes https://github.com/pytorch/pytorch/issues/146212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146214
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258
2025-02-02 03:12:36 +00:00
529eb8d558 [dynamo] Add return to python_type (#146258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146258
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198
2025-02-02 03:12:36 +00:00
7854299b27 [mps/inductor] Implement support for polygamma(). (#146259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259
Approved by: https://github.com/jansel
2025-02-02 01:54:23 +00:00
d89c7ea401 add WaitCounter type interface and get rid of type errors (#146175)
Summary: as titled

Differential Revision: D68960123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146175
Approved by: https://github.com/andriigrynenko, https://github.com/Skylion007
2025-02-01 23:24:52 +00:00
3a67c0e48d [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-01 22:53:35 +00:00
dca5cc0255 [mps] Move polygamma to special_math.h. (#146253)
In preparation to implement it in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 21:45:23 +00:00
07dbd539b4 [BE][Ez]: Make c10/special arrays constexpr (#146246)
No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246
Approved by: https://github.com/dcci, https://github.com/malfet
2025-02-01 21:03:18 +00:00
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
f97307f463 [Docs] Add clarification for target types in CrossEntropyLoss doc (#145444)
The CrossEntropyLoss function requires that targets given as class indices are provided with a long dtype and targets given as class probabilities are provided with a float dtype.

The CrossEntropyLoss function distinguishes the two scenarios (indices and probabilities) by comparing the shapes. When the input and target shapes are the same it is the probabilities case; otherwise the target is treated as class indices, as already covered in the doc. The related code is here:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624

I think the current documentation is great, but it can confuse users about types, as reported in the issue, so this PR adds a bit more clarification.
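For reference, the two target conventions being clarified:

```
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(3, 5)                    # (batch, num_classes), float

# class indices: shape (batch,), integer (long) dtype
target_indices = torch.tensor([1, 0, 4])
print(loss_fn(logits, target_indices))

# class probabilities: same shape as the input, float dtype
target_probs = torch.softmax(torch.randn(3, 5), dim=1)
print(loss_fn(logits, target_probs))
```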

Fixes #137188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145444
Approved by: https://github.com/mikaylagawarecki
2025-02-01 18:55:58 +00:00
5ed5793016 Temp disable MKL in DistributionKernels.cpp (#146174)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

This is a reland of https://github.com/pytorch/pytorch/pull/132532 but excluding internal builds that already has some hardcoded values

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146174
Approved by: https://github.com/ngimel
2025-02-01 18:53:11 +00:00
e56dcf2772 [CPUInductor] Fix SVE256 detection (#146207)
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate, but slightly different, API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256
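For reference, the stable API in question (the exact set of returned strings is an assumption here):

```
import torch

# returns a capability string such as "DEFAULT", "AVX2", "AVX512", or an SVE
# variant on ARM, replacing the removed private torch.cpu._is_arm_sve_supported()
print(torch.backends.cpu.get_cpu_capability())
```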

Fixes https://github.com/pytorch/pytorch/issues/145441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
8c657ae4be [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915, #145916
2025-02-01 16:34:18 +00:00
68cf36d5ab [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915
2025-02-01 16:34:18 +00:00
8e56d713c9 [inductor] Add typing to common.OpDecompositions (#145915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145915
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914
2025-02-01 16:34:11 +00:00
79f9f62e3a [inductor] Combine regexp checks in OpOverrides.paren (#145914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145914
Approved by: https://github.com/Skylion007
ghstack dependencies: #145913
2025-02-01 16:34:03 +00:00
4c004caa76 [inductor] Add types to DeviceOpOverrides (#145913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145913
Approved by: https://github.com/Skylion007
2025-02-01 16:33:49 +00:00
0f768c7866 Barebones flat_apply HOP (#146060)
This PR:
- adds pytree.register_constant for registering a class to be treated as
  a constant by torch.compile/torch.fx
- adds a very barebones flat_apply HOP. This should be sufficient to get
  mark_traceable working. A lot more work is necessary to get the custom
  operator case working (when make_fx sees a custom operator with PyTree
  arg types, it needs to emit a call to the flat_apply HOP).
- I expect the flat_apply HOP to change a lot, I want to ship this in
  the current state to unblock the mark_traceable and custom ops
  work.

Test Plan:
- It's kind of difficult to test the barebones flat_apply HOP "works" so
  I added a really simple test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
ghstack dependencies: #146059
2025-02-01 16:17:48 +00:00
373606928b Add torch.utils._pytree.register_dataclass (#146059)
This is an API that registers a dataclass as a pytree node.
It directly calls torch.export.register_dataclass, but we should
eventually inline that implementation here. I want to use this API for
something in compile and feel weird calling
torch.export.register_dataclass.
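
A hedged usage sketch of the API described above (flatten/unflatten behavior is assumed from how pytree nodes generally work, not from this diff):

```
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

@dataclass
class Pair:
    a: torch.Tensor
    b: torch.Tensor

pytree.register_dataclass(Pair)  # treat Pair as a pytree container node

leaves, spec = pytree.tree_flatten(Pair(torch.ones(2), torch.zeros(2)))
restored = pytree.tree_unflatten(leaves, spec)
```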

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146059
Approved by: https://github.com/StrongerXi, https://github.com/angelayi, https://github.com/yanboliang
2025-02-01 16:17:48 +00:00
cyy
2fd1b6b361 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-02-01 12:33:41 +00:00
2b00d211f0 Build RowwiseScaledMM.cu for SM89 (#145676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy
2025-02-01 11:44:58 +00:00
f40e013787 Fix aten.to when input is a tensor constant (#146220)
Summary:
Fix aten.to when input is a tensor constant.

In this case, `args_unwrapped` could just be a constant, so not a functional tensor.
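
A minimal repro sketch of the situation described (module and shapes are illustrative, not taken from the diff):

```
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c = torch.tensor([1.0, 2.0])  # a tensor constant, not a Parameter or buffer

    def forward(self, x):
        # aten.to sees a tensor constant here rather than a functional tensor
        return x + self.c.to(torch.float64)

ep = export(M(), (torch.randn(2, dtype=torch.float64),))
```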

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  tensor_constant_aten_to
```

Differential Revision: D68984244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146220
Approved by: https://github.com/JacobSzwejbka
2025-02-01 11:07:33 +00:00
30f091da44 add speculation log divergence test (#145659)
Followup from a SEV. Confirmed that this breaks when stacked on top of https://github.com/pytorch/pytorch/pull/145660 (offending PR that caused the SEV)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145659
Approved by: https://github.com/laithsakka
2025-02-01 09:39:22 +00:00
a4e4368157 add node mapping processing (#146103)
Summary:
Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)`, to produce a `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create provenance visualization.

`inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty. So it's removed from some of the `test_structured_trace` tests.

Test Plan:
CI
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance

buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing

python test/dynamo/test_structured_trace.py
```

Differential Revision: D68190173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103
Approved by: https://github.com/chenyang78
2025-02-01 08:29:29 +00:00
f38d5b4a74 Update TorchBench commit to main (#145455)
I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use the latest TorchBench commit.

The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584

The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 regresses several models. This needs to be investigated further, so I pin the version to unblock this change.

* `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279; not sure why it fails accuracy on A10G, but it works fine on A100
* `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455
Approved by: https://github.com/kit1980
2025-02-01 06:44:26 +00:00
a97a906dd9 Add "//caffe2:libtorch" to minifier TARGET file (#146203)
Summary: as title. To avoid errors like "undefined symbol: aoti_torch_device_type_cpu" when compiling minifier_launcher.py

Test Plan: CI

Differential Revision: D68978430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146203
Approved by: https://github.com/desertfire
2025-02-01 05:37:23 +00:00
bcd0ba0f69 Adding the best autotuner config (#146121)
Summary: Adding logs to log the best config for autotune configs

Test Plan:
Testing in Mast : aps-omnifmv1-5_32_test_with_best_config-c5e9ceccf8

 {F1974838864}

Reviewed By: oulgen

Differential Revision: D68931164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146121
Approved by: https://github.com/oulgen
2025-02-01 03:43:33 +00:00
549e230c33 [draft_export] Clear pending unbacked symbols when overriding mismatched fake kernels (#146089)
Summary:
When encountering a mismatched fake kernel that also creates unbacked symbols, draft export will fail with `PendingUnbackedSymbolNotFound` error.

Clearing `shape_env.pending_fresh_unbacked_symbols` fixes this issue.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_override_mismatched_fake_kernel_with_unbacked_symbols
```

Differential Revision: D68920990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146089
Approved by: https://github.com/pianpwk
2025-02-01 03:32:50 +00:00
cyy
4d2056efb5 Enable ruff F841 on numpy tests (#146126)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146126
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:07:28 +00:00
cyy
985a78e9df Enable ruff F841 on distributed tests (#146131)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146131
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:06:16 +00:00
1de41e6918 [dynamo][exceptions][3.10] Clean symbolic stack on exception handling (#146198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146198
Approved by: https://github.com/williamwen42
ghstack dependencies: #146062
2025-02-01 02:51:44 +00:00
6023684311 [export] Fix symfloat serialization (#146112)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146112
Approved by: https://github.com/pianpwk
2025-02-01 02:28:44 +00:00
8326d27093 [inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515)
This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants).
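
An illustrative shape for the metadata described above (a hand-written sketch, not actual Inductor output):

```
triton_meta = {
    # for constant args the signature entry is the string "constexpr"
    "signature": {"in_ptr": "*fp32", "out_ptr": "*fp32", "n_elements": "i32", "BLOCK": "constexpr"},
    # tl.constexpr kwargs (and "specialized" 1/None values) are keyed by kwarg name
    "constants": {"BLOCK": 1024},
}
```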

Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between.

To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515
Approved by: https://github.com/SamGinzburg
2025-02-01 02:11:48 +00:00
6e734bab93 execution trace export supports gzip format (#146179)
As above, allows Chakra Execution Trace observer to support compressing files.
Usage is straightforward: just add a ".gz" suffix to the output file name
```
from torch.profiler import ExecutionTraceObserver

et = ExecutionTraceObserver()
et.register_callback("my_trace.json.gz")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146179
Approved by: https://github.com/shengfukevin, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-01 01:25:25 +00:00
57c45340e7 include entire GraphModule instead of current node when erroring inside of fx interpreter (#146197)
This seems like it would make it easier to diagnose PT2 issues where the user cannot easily repro, and we need more info in the backtrace, e.g. in https://github.com/pytorch/pytorch/issues/134182#issuecomment-2628076114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146197
Approved by: https://github.com/jamesjwu
2025-02-01 01:09:27 +00:00
73d90d66a4 Cap size of thread pool in select_algorithm to cpu count (#146071)
Summary: With changes from https://github.com/pytorch/pytorch/pull/144829, we can see more autotune configs, and the size of the thread pool can get out of hand when using the cutlass backend.
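
A rough sketch of the capping idea with made-up names (the actual change lives in Inductor's select_algorithm):

```
import os
from concurrent.futures import ThreadPoolExecutor

configs_to_benchmark = list(range(128))  # placeholder for the autotune configs
# cap the pool size at the number of CPUs instead of the number of configs
max_workers = min(len(configs_to_benchmark), os.cpu_count() or 1)
pool = ThreadPoolExecutor(max_workers=max_workers)
```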

See internal discussion at: https://fburl.com/workplace/7g4vz0zy

Test Plan: `python test/inductor/test_cutlass_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146071
Approved by: https://github.com/Chillee
2025-02-01 00:41:36 +00:00
cde5ddfd14 fix internal error with reorder submodules (#146181)
Test Plan: hard to isolate as a small repro

Differential Revision: D68963033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146181
Approved by: https://github.com/angelayi
2025-02-01 00:30:42 +00:00
35f113e2a0 torch/nn/utils/rnn.py: docs: improvements (#138628)
Fix constants highlighting in generated documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138628
Approved by: https://github.com/mikaylagawarecki
2025-02-01 00:10:30 +00:00
a78c796f0b [AOTI] Support composed dynamic shape constraint (#146044)
Summary: Fixes https://github.com/pytorch/pytorch/issues/145500. When export takes a dynamic shape constraint as an expression containing a symbol, we should be able to solve the symbol at run time.
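
A hedged example of the kind of constraint meant here, i.e. a dynamic dim expressed as a function of another symbol (module and sizes are illustrative):

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return torch.cat([x, y], dim=0)

d = Dim("d", min=1, max=64)
ep = export(
    M(),
    (torch.randn(4, 3), torch.randn(8, 3)),
    dynamic_shapes={"x": {0: d}, "y": {0: 2 * d}},  # y's first dim is an expression of the symbol d
)
```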

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146044
Approved by: https://github.com/angelayi
ghstack dependencies: #146043
2025-02-01 00:02:12 +00:00
43372e70c2 enhance logging statically known by adding size_oblivious(..) (#145354)
after the diff
```
[0/0_1] eval size_oblivious(Eq(s1, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(u0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0*s1*u0, 0)) == False [statically known]
```
before
```
[0/0_1] eval (Eq(s1, 1)) == False [statically known]
[0/0_1] eval (Eq(u0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0*s1*u0, 0)) == False [statically known]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145354
Approved by: https://github.com/ezyang
2025-01-31 23:26:37 +00:00
f25f1163dc [dynamo] Support frozenset({..}).__contains__ (#146062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146062
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-01-31 23:22:58 +00:00
eb029fba13 [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794)
Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-01-31 22:39:56 +00:00
af2a39849d [AOTI] Refactor codegen_input_symbol_assignment (#146043)
Summary: Extract the common logic for size and stride symbol generation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146043
Approved by: https://github.com/angelayi
2025-01-31 21:55:18 +00:00
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
a7cc6d3e84 Manylinux 2.28 migration - remove pre-cxx11 abi libtorch builds (#146200)
Related to: https://github.com/pytorch/pytorch/issues/123649
Removing pre-cxx11 abi builds.
As per announcement : https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146200
Approved by: https://github.com/kit1980, https://github.com/huydhn
2025-01-31 21:43:12 +00:00
8203894eff Resolve affine quantization namespace collision with torchao (#145941)
Summary:
https://github.com/pytorch/pytorch/pull/141421
duplicated affine quantization custom ops from torchao into
the PT2E quantization flow, but these ops are registered under
the same namespace with the same name, causing "Duplicate
registration" errors for the new ops for use cases that import
from both repos. This commit fixes this by moving the PT2E
versions of the ops to a new namespace. In the long term,
we expect to migrate PT2E into torchao so users can migrate
back to the old namespace if they wish to.

Test Plan: python test/test_quantization.py -k test_channel_group_quantization

Differential Revision: D68838437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145941
Approved by: https://github.com/cccclai
2025-01-31 21:29:47 +00:00
781aceee9c [dynamo] Revert abc change due to internal failures (#146177)
xref - https://www.internalfb.com/tasks/?t=191383874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146141
2025-01-31 21:28:06 +00:00
a0d1393b1a [MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842)
Differential Revision: D68560256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145842
Approved by: https://github.com/chaos5958, https://github.com/nautsimon
2025-01-31 21:21:00 +00:00
06850e624a [ca][hop] test CA on all HOPs (#145429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145429
Approved by: https://github.com/zou3519
ghstack dependencies: #145422
2025-01-31 20:45:22 +00:00
2e197c8a2d [dynamo][hop] test torch.compiling all HOPs (#145422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145422
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-31 20:45:22 +00:00
5b1abdbf5d [dynamo] remove always-failing eval_frame.c debug check (#145982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145981
2025-01-31 20:40:59 +00:00
49df8de8be [dynamo] disable eval_frame callback in _TorchDynamoContext __enter__/__exit__ (#145981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145981
Approved by: https://github.com/jansel
2025-01-31 20:40:59 +00:00
3a4e7a589b [CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099)
To match the expected tensor (i.e. the 2nd element in the array), make rank 0 receive [False, False].

Fixes one of the issues reported in #146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099
Approved by: https://github.com/eqy
2025-01-31 20:31:13 +00:00
8b8c596503 Remove trivial dispatch_key_allowlist_check function (#146169)
Hmmm...this _is_ removing a public function from a public C++ file. But the GH counts for this function total 83, seemingly all copying pytorch: https://github.com/search?q=dispatch_key_allowlist_check&type=code&p=1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146169
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-01-31 19:59:40 +00:00
ec2522e200 [MPS] optimize cholesky (#145722)
Followup to #145701

Optimizes the SYRK and TRSM kernels of the Cholesky decomposition on MPS. The SYRK kernel now does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and the TRSM kernel uses vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR).

Script to collect perf
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")

        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```

Observed speedups on M1 Pro
![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145722
Approved by: https://github.com/malfet
2025-01-31 19:52:31 +00:00
6a0138fcc1 Torch device backend autoload fix (#145611)
This causes an import failure if an external backend imports a module that uses `torch._as_tensor_fullprec` when it is being loaded.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145611
Approved by: https://github.com/albanD
2025-01-31 19:27:42 +00:00
cyy
18380836eb Remove outdated test skipif conditions for Python3.9 (#146144)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146144
Approved by: https://github.com/albanD
2025-01-31 19:01:04 +00:00
68a3635484 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is **only** used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-01-31 18:29:27 +00:00
aad9f44b2e [export] Sync model container types to schema.py (#145959)
Summary: Synced from D68840230

Test Plan: No behavior changes to existing API. Will be tested internally.

Differential Revision: D68846532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959
Approved by: https://github.com/yiming0416
2025-01-31 18:17:56 +00:00
16f44fee25 Revert "[inductor/profiler] add kernel kwargs instrumentation (#145573)"
This reverts commit 720b8d0d8dac98f89499bc6b251d1f34dbf68dfe.

Reverted https://github.com/pytorch/pytorch/pull/145573 on behalf of https://github.com/ZainRizvi due to Sorry, but this is failing internally. It's a bit weird since this PR doesn't really appear related at first glance, but despite retries it fails pretty consistently. Please see D68930742 for details ([comment](https://github.com/pytorch/pytorch/pull/145573#issuecomment-2628013872))
2025-01-31 18:13:23 +00:00
67ed47d886 Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-31 17:51:27 +00:00
d0748566b4 s390x ci: ensure CI starts correctly if token pipe is not removed (#145840)
Mark stop actions as "may fail".
The container is expected to stop on its own in the normal case.

Remove "may fail" mark from token generation steps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145840
Approved by: https://github.com/huydhn
2025-01-31 17:46:09 +00:00
44ecbcbd5a s390x: disable test_model_exports_to_core_aten.py test (#145835)
It often gets killed by OOM.
Disable it while investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145835
Approved by: https://github.com/huydhn
2025-01-31 17:45:10 +00:00
667b94d1c2 [hotfix][dynamo] Skip linecache due to a flaky issue (#146141)
A large number of jit + dynamo wrapped tests fail in linecache tracing.
We need further debugging. Skipping for now to stem the bleeding.

https://github.com/pytorch/pytorch/issues/146076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141
Approved by: https://github.com/StrongerXi
2025-01-31 17:45:06 +00:00
c3f71eb61b Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error.  @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))
2025-01-31 17:43:09 +00:00
f5a61ba0a3 Revert "inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)"
This reverts commit d100e9ae744322a74d9fd05d0851caaf36f19c24.

Reverted https://github.com/pytorch/pytorch/pull/145122 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. See D68924977 for details ([comment](https://github.com/pytorch/pytorch/pull/145122#issuecomment-2627880860))
2025-01-31 17:39:23 +00:00
eb5a0718c2 S390x nightly builds timeouts (#146041)
Sometimes the build times out at the end.
This should be fixed by an increased timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146041
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-01-31 17:29:11 +00:00
001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where the storage order was changed to be non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine them.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)
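
A hedged usage sketch of the new knob (the config path is taken from the description above; the file name is a placeholder):

```
import torch
from torch.utils.serialization import config

# compute record offsets instead of issuing extra random reads per storage
config.load.calculate_storage_offsets = True
state_dict = torch.load("checkpoint.pt", mmap=True, weights_only=True)
```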

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.

Note that we always use the `miniz` writer in zip64 mode per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)). A zipfile record written by miniz looks as such:

```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer  |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`

- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for the filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined based on [this snippet ](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following
- We keep track of where the "cursor" is in the file using `current_offset`, after each persistent_load call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
2d6f6637d3 Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ...19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in the order of insertion (numerically sorted)
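
A one-liner illustration of the two orderings mentioned above:

```
keys = [str(i) for i in range(12)]
print(sorted(keys))           # ['0', '1', '10', '11', '2', ...]  lexicographic order
print(sorted(keys, key=int))  # ['0', '1', '2', ..., '11']        numeric / insertion order
```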

This makes it such that the order storages are written in is the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads

* __->__ #143879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-31 17:00:23 +00:00
9232355bb0 Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix (#145792)
https://github.com/pytorch/pytorch/issues/145570

Adding cuda 12.8.0 x86 builds first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145792
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2025-01-31 16:12:02 +00:00
a7c2d85c18 Add overloads to diagonal docs (#144214)
Fixes #126827. Refactored doc to demonstrate when none of the optional values are passed in. Added another example so that all overloads of the function are covered.
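
For reference, a few of the call forms the updated doc covers (standard `torch.diagonal` usage):

```
import torch

x = torch.arange(9).reshape(3, 3)
torch.diagonal(x)                            # main diagonal, all defaults
torch.diagonal(x, offset=1)                  # first super-diagonal
torch.diagonal(x, offset=0, dim1=1, dim2=0)  # pick the two dims explicitly
```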

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144214
Approved by: https://github.com/albanD
2025-01-31 15:53:59 +00:00
2af876707b [AOTI] Fix a memory leak in package boxed_run (#146100)
Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100
Approved by: https://github.com/cpuhrsch
2025-01-31 13:32:28 +00:00
7b07415aaa [export] nested terms in nn_module_stack deserialization (#145901)
Summary: accounting for terms like "getattr(getattr(a[0], b), c)".

Test Plan: test_serialize

Differential Revision: D68784736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145901
Approved by: https://github.com/angelayi
2025-01-31 10:00:13 +00:00
1f1a9965d5 fix a small typo in comments (#145323)
A minor typo fix.
The description was confusing with the typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145323
Approved by: https://github.com/Skylion007
2025-01-31 06:45:44 +00:00
c55af2b567 [CMake] Delete Caffe2 inspect_gpu binary (#146105)
It's unbuildable right now, as the headers it depends on are gone

Fixes https://github.com/pytorch/pytorch/issues/146042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146105
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-01-31 06:42:52 +00:00
e84bf88dde [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a re-base PR to my previous one #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after | time before | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-01-31 06:42:08 +00:00
eeb5e1bf20 [AOTI] Cache treespec_loads calculation (#145815)
Summary: The treespec can be reused instead of being recalculated from its string form on every AOTI module call. Using the cached result saves 0.2ms per module call.

Test Plan:
Before:
{F1974751578}

After:
 {F1974751667}

Differential Revision: D68749539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145815
Approved by: https://github.com/henrylhtsang
2025-01-31 06:38:21 +00:00
57d8278ab9 pickler for GraphModule (#141659)
Pickling a GraphModule needs some special handling for wrapping things that normally can't be pickled, but async compile needs to pass them across a wire, so we need to be able to serialize it; add some helpers to enable that.

Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-31 05:34:28 +00:00
f9227e7c33 Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087)
Summary: TSIA

Test Plan: Tested up the stack but existing unittests

Reviewed By: suo

Differential Revision: D68917233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087
Approved by: https://github.com/suo
2025-01-31 05:08:11 +00:00
cf2de4e230 Introduce aoti_call_delegate HOP (#145630)
Summary:
Previously, the AOTI compile node was represented as a kernel-less custom op in the exported program. The node was not eagerly runnable, and eager running is a common practice for numerical validation during lowering.

I introduce a new HOP to address this.

The schema is following
```
aoti_call_delegate(lower_module: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor])
```

There are a few problems exposed by HOP
- AOTI expects an FX graph with weights as getattr nodes, aka a stateful graph. HOPs expect graph_module arguments to be stateless. The export serializer also expects a stateless graph. Currently, to make AOTI happy, I am making `original_gm` stateful and bypassing the serialization for `original_gm`.
- As a result, the HOP is not re-traceable, as functionalization on a stateful graph module argument will fail.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test

Reviewed By: zhxchen17

Differential Revision: D68359391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630
Approved by: https://github.com/zou3519
2025-01-31 04:57:36 +00:00
f358d4d004 [ONNX] Migrate test_torch_export_with_onnxruntime.py to test_small_models_e2e.py (#146095)
With [the deprecation of torch.onnx.dynamo_export](https://github.com/pytorch/pytorch/pull/146003), this PR turns the torch.export related tests toward torch.onnx.export(..., dynamo=True), and places them in test_small_models_e2e.py

NOTE: test_exported_program_as_input_from_file and test_onnx_program_supports_retraced_graph are not kept, because they mostly test whether the exported program stays the same after save/load and retrace. However, in torch.onnx.export(..., dynamo=True), we focus more on the export from nn.Module to ONNX proto.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146095
Approved by: https://github.com/justinchuby
2025-01-31 03:40:26 +00:00
27e35de6c2 [export] Add distributed test (#146050)
Reland https://github.com/pytorch/pytorch/pull/145886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146050
Approved by: https://github.com/avikchaudhuri
2025-01-31 02:56:42 +00:00
ffb424eab6 [dynamo/export] call local_scalar_dense when full() value is scalar tensor (#144999)
Fixes https://github.com/pytorch/pytorch/issues/144907
```
        class Foo(torch.nn.Module):
            def forward(self, val):
                return torch.full((80, 2), val, dtype=torch.float32)

        export(Foo(), args=(torch.tensor(1),))
```

When we have a `torch.full` call like above, where the fill value is a scalar Tensor and not a scalar value, the FX graph from `_dynamo.export()` contains a single node: the full op. We run into a `PendingUnbackedSymbolNotFound` error, because the `item()` call is implicit; the UnbackedSymInt is extracted but goes directly into the data of the output tensor value, and we're then unable to locate it when we try to compute unbacked bindings.

On the other hand, non-strict export doesn't face this, because an explicit `item()`, or `local_scalar_dense` node is inserted, and the unbacked binding is directly the example value of that node.

This adds a dynamo handler to imitate what happens in non-strict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144999
Approved by: https://github.com/angelayi
2025-01-31 02:45:43 +00:00
e01c898e51 [Customized Optimus] Add select cat aten pass (#145918)
Summary: This is a follow-up to D68695717; we further reduce the number of cat kernels in the backward by designing a new pass at the aten level.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/6943087f-91be-4dbd-9693-df0a11a50b73
Test UI: https://www.internalfb.com/intern/testinfra/testrun/11821949087998233
Network: Up: 101KiB  Down: 132KiB  (reSessionID-60e898af-f366-4247-a9f7-d8d7cd129fe0)
Analyzing targets. Remaining      0/78148
Executing actions. Remaining      0/476147
Command: test.     Finished 2 local
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to add the config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
          "select_cat_aten_pass": {},
        }
```

{F1974778773}

baseline:

aps-recgpt_ranking_1115_pt2_optimus-e52c1f277e

proposal

aps-recgpt_ranking_1115_pt2_optimus-1b0047ee0e

Differential Revision: D68803384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145918
Approved by: https://github.com/Yuzhen11
2025-01-31 02:35:10 +00:00
08d88127fe Use Magma-cuda 12.8 for libtorch (#146019)
https://github.com/pytorch/pytorch/issues/145570

Build failure for libtorch wheel
`CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
collect2: error: ld returned 1 exit status`

Unsure if this is related, fixing as a start
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146019
Approved by: https://github.com/eqy
2025-01-31 02:19:23 +00:00
2811f33d12 Fix code cache + freezing compile-time regression (#145868)
Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor's value+metadata vs. metadata only.
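
A hedged sketch of the metadata-only hashing idea (names are made up; this is not the actual Inductor cache code):

```
import hashlib
import torch

def hash_constant(t: torch.Tensor, is_frozen: bool) -> str:
    h = hashlib.sha256()
    # metadata always contributes to the cache key
    h.update(str((t.dtype, tuple(t.shape), tuple(t.stride()), t.device.type)).encode())
    if not is_frozen:
        # only non-frozen constants contribute their actual values
        h.update(t.detach().cpu().contiguous().numpy().tobytes())
    return h.hexdigest()
```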

Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868
Approved by: https://github.com/eellison
2025-01-31 02:04:15 +00:00
bf9d053fb8 [Break XPU] Fix Inductor cuda bias UT (#145934)
# Motivation
[Break XPU] inductor ut: `inductor/test_inplace_padding.py::InplacePaddingTest::test_pad_non_zero - RuntimeError: Expected to find "empty_strided_cuda((2048, 2048), (2048, 1), torch.float32).as_strided((2048, 2047), (2048, 1))" but did not find it`

With this PR, `test_pad_non_zero` will pass on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145934
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/desertfire
2025-01-31 01:39:39 +00:00
ccd27e8129 Turn on fx graph cache and automatic dynamic pgo local caches in fbcode (#146065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146065
Approved by: https://github.com/jamesjwu
2025-01-31 01:11:48 +00:00
3fae5c8509 torchgen: support exception boundary for ExecuTorch functions (#144341)
Needed for ExecuTorch diff D67904052.

Differential Revision: [D67906411](https://our.internmc.facebook.com/intern/diff/D67906411/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144341
Approved by: https://github.com/Jack-Khuu
2025-01-31 01:05:21 +00:00
cyy
d94d816d96 Simplify handling of max jobs in CMake builds (#145820)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145820
Approved by: https://github.com/malfet
2025-01-31 00:55:39 +00:00
c70362fac8 [AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011)
[D68734067](https://our.internmc.facebook.com/intern/diff/D68734067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-31 00:48:51 +00:00
1e3d1738a4 [dynamo][polyfills]Support getrecursionlimit (#145989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145989
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987, #145994
2025-01-31 00:47:31 +00:00
e7bb608d02 [dynamo][dicts] Support construction of types.MappingProxyType (#145994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145994
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987
2025-01-31 00:47:31 +00:00
4665bc2cc0 [dynamo][functions] Support id on function (#145987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145987
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #145986
2025-01-31 00:47:23 +00:00
56307dc370 [dynamo][dicts] Raise exception on pop (#145986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145986
Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-01-31 00:47:13 +00:00
e6704a2447 Allow replacing unbacked with very large upperbound by returning no-op for FloorToInt(int) (#146001)
* Let's say x is an integer beyond 2^53 where Python floats lose precision i.e. can't increment by 1.
* Therefore, float(x) will lose precision and won't retain the exact value of x even though it's an integer.
* That means `FloorToInt(very_large_number)` will lose precision if we cast it to float
```
>>> int(float(1000000007999999992))
1000000008000000000
```
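
For reference, the 2^53 boundary mentioned above can be checked directly (quick illustration, not part of the PR):

```
>>> 2.0**53 == 2.0**53 + 1   # the increment can no longer be represented
True
>>> int(float(2**53 + 1))    # rounds to the nearest representable value
9007199254740992
```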

This means when we try to do this in set_replacement():
32bb6f83d5/torch/fx/experimental/symbolic_shapes.py (L6011-L6019)

We run into this:
```
TORCH_LOGS="+torch.fx.experimental.symbolic_shapes" pytest -s test_export.py -k test_replace_unbacked_with_very_large_upperbound

  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6258, in _maybe_guard_rel
    self._set_replacement(rhs, self._find(lhs), "trivial_rhs")
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6039, in _set_replacement
    assert tgt_bound.issubset(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2*s0,)), FakeTensor(..., size=(u0,))), **{}):
tgt_bound=VR[4, 1000000008000000000] not a subset of src_bound=VR[4, 1000000007999999992]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146001
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145898
2025-01-31 00:25:20 +00:00
c72b536420 Add manual override flag for core ATen op detection during bc check (#146052)
Fixes https://github.com/pytorch/pytorch/issues/146049

Today the bc detection logic ignores allow_list for core ATen ops (a PR landed 4 months ago to enable this). The problem is that if I have a PR that removes an op, the script can no longer check whether that op is a core ATen op (today we just error out).

With my fix: (1) we conservatively assume the op is a core ATen op in such cases, and (2) the user can specify in their ALLOW_LIST entry that their op is not a core ATen op.

Test plan:
- This is tested 2 PRs above

016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052
Approved by: https://github.com/albanD
2025-01-30 23:57:01 +00:00
720b8d0d8d [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, such as block size.

## Test program

Note, install triton before proceeding (pip install triton)

triton_test.py:
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Running it should produce a trace file `my_kernel_trace.json`.

The output contains a triton event with the `kernel_kwargs` attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-01-30 23:51:44 +00:00
8117656162 nonzero_static with symint size (#146006)
Summary: Previously `nonzero_static` would force specialization on the `size` argument. This PR enables it to be used with a dynamic `size` argument.

Test Plan: added test
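
For context, a minimal eager example of the op in question (the PR is about letting `size` stay symbolic under tracing instead of being specialized to a constant):

```python
import torch

x = torch.tensor([0, 3, 0, 5])
# The output always has exactly `size` rows; missing rows use fill_value (-1).
print(torch.nonzero_static(x, size=4))
# tensor([[ 1],
#         [ 3],
#         [-1],
#         [-1]])
```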

Differential Revision: D68874784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146006
Approved by: https://github.com/angelayi
2025-01-30 23:42:42 +00:00
9fdc20809a [PGNCCL] Simplify support macro definition (#145964)
- Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
ghstack dependencies: #145893
2025-01-30 23:26:32 +00:00
4280232f21 Revert "Advance past fc window for stft center (#145437)"
This reverts commit 3ef1551f5a745c1d37ff421eb4678814ef4483e4.

Reverted https://github.com/pytorch/pytorch/pull/145437 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks some slow trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145437#issuecomment-2625840742))
2025-01-30 23:14:16 +00:00
f85e4c1360 Enable C++ API parity tests on AArch64 (#145370)
Re-enables C++ API parity tests on AArch64 which now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145370
Approved by: https://github.com/albanD
2025-01-30 22:42:49 +00:00
2f60f12f8b [Torch] Extract arange_out resizing logic into a helper function that can be used by other devices (#145747)
Summary: We want to use the resizing implementation for arange_out in other devices (in this case MTIA), to make sure that the computations match and to avoid off-by-one-errors.

Test Plan: Existing CI tests pass.

Differential Revision: D68694489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145747
Approved by: https://github.com/mortzur
2025-01-30 22:37:00 +00:00
99a0940991 [MPS] Fix regression in con-contig bitwise ops (#146085)
Caused by https://github.com/pytorch/pytorch/pull/128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous.

Fixes https://github.com/pytorch/pytorch/issues/145203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085
Approved by: https://github.com/dcci
2025-01-30 22:36:56 +00:00
e2917245fb [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-01-30 22:33:50 +00:00
7391cea857 Revert "[triton] Update pin to tip of 3.2 release (#145867)"
This reverts commit 5e5da9bd9afdbb51da3dcc39947347279ccd9130.

Reverted https://github.com/pytorch/pytorch/pull/145867 on behalf of https://github.com/ZainRizvi due to Sorry, this PR may have been written correctly, but something is clearly broken with the infra that's making CI very unhappy with this new triton version.  Since this has been blocking viable/strict upgrades for a couple days now, I'm reverting this PR.  I'll sync with @atalman on how we should fix this. ([comment](https://github.com/pytorch/pytorch/pull/145867#issuecomment-2625720817))
2025-01-30 22:24:09 +00:00
23695ea002 Fix dynamo use of list[int] in graph break (#145554)
This reintroduces the change backed out by #145393 and fixes the underlying problem.

Although using a BuiltinVariable was better than nothing when we saw a GenericAlias, it had problems when a graph break forced us to reconstruct the original Python code: BuiltinVariable reconstructed it as a plain `list` instead of `list[int]`.

This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct.

Original commit changeset: 77b9193acb23

python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance
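
A minimal sketch of the pattern that test exercises (illustrative; it assumes `torch.jit.isinstance` still graph-breaks here, and uses the eager backend):

```python
from typing import List

import torch

def fn(x, ys):
    # On a graph break here, dynamo has to rebuild `List[int]` in the resumed
    # bytecode; reconstructing it as plain `list` changes the check's meaning.
    if torch.jit.isinstance(ys, List[int]):
        return x + len(ys)
    return x - 1

opt_fn = torch.compile(fn, backend="eager")
print(opt_fn(torch.ones(3), [1, 2, 3]))   # tensor([4., 4., 4.])
```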

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552, #145553
2025-01-30 22:21:40 +00:00
fbb076cc45 Fix call to create_load_global (#145553)
There is no version of create_load_global() that takes three parameters - any use of this function will fail. I think this is probably the correct fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145553
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552
2025-01-30 22:21:40 +00:00
ccbbc88bbb Turn on mypy for _dynamo/variables/builtin.py (#145552)
The fact that mypy errors were ignored was hiding several bugs in builtin.py (for example the previous diff's incorrect override and use of `call_getattr`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145552
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #145551
2025-01-30 22:21:32 +00:00
f3120f6d26 Remove incorrect BuiltinVariable.call_hasattr() (#145551)
BuiltinVariable.call_hasattr() overrides the base class - but actually behaves differently. The base is `obj.call_hasattr(tx, attr)` but BuiltinVariable's version is `<unused>.call_hasattr(tx, obj, attr)`.

The BuiltinVariable version is used as a pattern from `call_self_handler()` for `BuiltinVariable(hasattr)`. I think the other version is just used for internal `hasattr(obj, name)` so I renamed that one to `call_obj_hasattr`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145551
Approved by: https://github.com/anijain2305
2025-01-30 22:21:19 +00:00
clr
d100e9ae74 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If an nn.Module getattr call throws, we should make sure that we don't crash with an internal error

Note that I couldn't figure out how to test this, so advice would be awesome.  I have my best case attempt at  https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-01-30 21:55:29 +00:00
08ff11e9d0 initialize device when pinning memory on this device, short circuit is_pinned if device is not initialized (#145752)
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize device and thus doesn't poison the fork (but it complains about `device` arg being deprecated). To not need `device=` arg we'd need to fix get_accelerator to not initialize device.
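
A small illustration of the behavior this targets (assuming a CUDA build; the exact outcome depends on this patch):

```python
import torch

x = torch.randn(4)
# With this change, querying pinned-ness should not initialize CUDA,
# so a later os.fork() (e.g. in a DataLoader worker) is not poisoned.
print(x.is_pinned())                # False
print(torch.cuda.is_initialized())  # expected to stay False
```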

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
1252c1933d Update to remind users to use torch.compile template (#145960)
Users have been submitting fuzzer issues without meeting the requirements outlined in the torch.compile issue template. This updates the note to remind users to use the torch.compile template for torch.compile bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145960
Approved by: https://github.com/eellison
2025-01-30 21:34:40 +00:00
d14046b58d Update fuzzer guidance to include rng (#145962)
Add another condition to fuzzer issue guidance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145962
Approved by: https://github.com/eellison
2025-01-30 21:33:57 +00:00
7e7341bddd [hop] fix unbacked_bindings meta for while_loop (#143559)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143559
Approved by: https://github.com/zou3519
2025-01-30 21:33:09 +00:00
9f9904172d [scan] scan dim handling in user-facing scan() (#145179)
This PR makes the user-facing scan() call handle the scan dim: internally, the scan dim is always shifted to dim 0 and the scan is performed over that dim.

This is a follow-up PR from https://github.com/bohnstingl/pytorch/pull/3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145179
Approved by: https://github.com/ydwu4
2025-01-30 21:09:07 +00:00
70f6aaa786 [OSS] Add kwargs to fsspec reader/writer (#145845)
Summary: Add kwargs to fsspec reader/writer. This will be used when reading/writing from huggingface because it needs a token to access the repositories

Test Plan: https://fburl.com/anp/agkrlas1 ability to read write to hf with fsspec

Differential Revision: D68738777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145845
Approved by: https://github.com/mhorowitz
2025-01-30 21:00:58 +00:00
e6c39d37e9 [ONNX] Create deprecation warning on dynamo_export (#146003)
Deprecation of `torch.onnx.dynamo_export`:

* [`torch/onnx/_internal/_exporter_legacy.py`](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86): Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. [[1]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR83-R86) [[2]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR231-R234) [[3]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR442-R445) [[4]](diffhunk://#diff-4d1eb96fe68ea904dcd1f8211318b9ff882dbfe4c3cb725ffc164b6c5a58b74cR700-R703)

This PR also removes the `**_` kwarg on onnx.export so that users get an error when they supply an unexpected argument.

Updated to emit a DeprecationWarning because it is more appropriate: https://docs.python.org/3/library/exceptions.html#DeprecationWarning
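
A hedged example of the migration this deprecation points to (the module and shapes are made up for illustration):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# Deprecated:  torch.onnx.dynamo_export(M(), torch.randn(2, 3))
# Replacement suggested by the warning:
onnx_program = torch.onnx.export(M(), (torch.randn(2, 3),), dynamo=True)
```
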
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146003
Approved by: https://github.com/titaiwangms
2025-01-30 20:13:32 +00:00
1fdb4d65c0 [MPS] Extend torch.mm/torch.bmm to integral types (#145809)
By using the `naive_mm` kernel, while making sure that accumulation is done over int32 for smaller int types (and over float for half and bfloat), and adding a `naive_bmm` kernel that follows the same pattern.
Remove stale restriction on `torch.dot` (which works fine on MacOS-14/15)
This also enables integer op flavors for:
- `addmv`
- `einsum`
- `inner`
- `linalg.multi_dot`
- `matmul`
- `mv`
- `tensordot`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809
Approved by: https://github.com/dcci
2025-01-30 19:35:25 +00:00
3ef1551f5a Advance past fc window for stft center (#145437)
Long overdue follow-up on https://github.com/pytorch/pytorch/pull/73432/files#diff-5f3d4caa0693a716fc46fd7f6339312f1b5f0bf89e3a3ff58e9dc13a9486b17aR719

Onnx stft doesn't support centering, [and all of the existing tests are for center = False](https://github.com/pytorch/pytorch/blob/main/test/onnx/test_pytorch_onnx_onnxruntime.py#L8026). I will open a follow-up issue to address this, this is just a nice-to-have.

Pr chain:
- -> [Advance past fc window for stft center #145437](https://github.com/pytorch/pytorch/pull/145437)
- [Add stft option to align window for center = false #145324](https://github.com/pytorch/pytorch/pull/145324)
- [Add istft option to align window for center = false](https://github.com/pytorch/pytorch/pull/145510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145437
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-01-30 19:09:18 +00:00
a3698ebd5c [while_loop] specialize when cond_fn return constants (#144515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515
Approved by: https://github.com/zou3519
2025-01-30 19:02:34 +00:00
16420a78eb [AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039)
Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now.

Test Plan: CI

Differential Revision: D68871245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039
Approved by: https://github.com/yushangdi, https://github.com/hl475
2025-01-30 19:01:19 +00:00
d1143c4b37 [export] fix non-strict pre_dispatch exporting while_loop (#145762)
fix https://github.com/pytorch/pytorch/issues/145737.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145762
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519, https://github.com/avikchaudhuri
2025-01-30 18:58:34 +00:00
clr
f746bb6311 config: Don't spam warnings about reference type configs (#145800)
Summary:
https://github.com/pytorch/pytorch/issues/145755

The is_dynamic check for reference types was subtly broken, causing log spam
after it was accessed

Added an explicit type for is_default for reference types to make sure this
behaviour is correct
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145800
Approved by: https://github.com/eellison
2025-01-30 18:57:16 +00:00
5a527fa5ee Make sure not using cpp wrapper when setting nvtx training annotation (#145538)
Longer term it would be good to add this as a feature to cpp_wrapper, but this makes sure it doesn't fail on main.

Not sure if this needs a test because it's not meant to compose, but will add one if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538
Approved by: https://github.com/desertfire
2025-01-30 18:34:22 +00:00
3ee655e4d4 [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846)
There's a sleep that is issued in order to "nudge" CUDA into the right scheduling decision, but it is issued on iteration number 2. However, when the world size is 2, we never reach that iteration, which led to suboptimal scheduling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846
Approved by: https://github.com/yifuwang
2025-01-30 18:26:34 +00:00
51ee9b154e [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in the torch C++ layer and then exposes it via pybind, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-30 18:19:00 +00:00
7796e308d0 Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
ghstack dependencies: #145953
2025-01-30 16:54:08 +00:00
967cf85f3a Revert "Update mi300 labels to account for multiple clusters. (#145923)"
This reverts commit 3e135993bd0fa08cbff565ae76bb15cb08e1d6d0.

Reverted https://github.com/pytorch/pytorch/pull/145923 on behalf of https://github.com/atalman due to reverting back to one cluster ([comment](https://github.com/pytorch/pytorch/pull/145923#issuecomment-2625022826))
2025-01-30 16:45:50 +00:00
1c3df9ca8c Fix signif_strides_equal for symints, dedupe (#145953)
Previous impl would take a size hint, which was failing internally with a
```
strides1 = [V.graph.sizevars.size_hint(strides1[i]) for i in non_1_indices]
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/torch/_inductor/sizevars.py", line 554, in size_hint
    return int(out)
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
```

There are unbacked tests in test_triton which should exercise this, as well as other tests for these functions when they were added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145953
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-30 16:44:32 +00:00
aaddfc5a7f Add TORCHINDUCTOR_VEC_ISA_OK env var for vec_isa_ok (#134667)
Adds a `TORCHINDUCTOR_VEC_ISA_OK` env var for `vec_isa_ok`, for A/B testing purposes. The setup is similar to `fx_graph_remote_cache`, allowing a default of `None`.

No tests were present for any other config settings here, nor for `vec_isa_ok` so I didn't add any.

Motivation:
PyTorch uses a filelock with a timeout to determine whether the CPU supports particular intrinsics: pytorch/torch/_inductor/cpu_vec_isa.py
Therefore, if 2 processes are running and each encounters the HAS_CPU test, a process that cannot acquire the lock for checking vec_isa_ok puts its main thread to sleep. Hence there is a bias towards non-sleeping processes (i.e. newly spawned processes) in acquiring the lock.

To avoid this, use an env variable so that each process knows the answer without going through the check.
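
For example (illustrative; the value semantics are assumed here, following the usual boolean env-var handling in the inductor config):

```python
import os

# Assumed semantics: "1" forces the vec-ISA check to succeed, "0" forces it to
# fail, and leaving it unset keeps the filelock-guarded runtime detection.
os.environ["TORCHINDUCTOR_VEC_ISA_OK"] = "1"

import torch  # import after setting the env var so the config picks it up
```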

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134667
Approved by: https://github.com/eellison
2025-01-30 16:22:48 +00:00
5fa28bbe40 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 18a7a04c4adecda3be17dd364d48d484fd1dcdba.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))
2025-01-30 16:01:52 +00:00
50086ab537 [ONNX] Delete rename_dynamic_shapes_with_model_inputs (#146002)
Basically, this function brings more cons than pros.

It was nice to have automation that helps users convert the top-level keys of dynamic shapes to arg names. However, this function has a bug when the model inputs happen to have the same count as dynamic_shapes:

```python
input_names
# 'input_ids', 'past_key_values.0.key', 'past_key_values.0.value', 'past_key_values.1.key', 'past_key_values.1.value', 'past_key_values.2.key', 'past_key_values.2.value', 'past_key_values.3.key', 'past_key_values.3.value', 'past_key_values.4.key', 'past_key_values.4.value', 'attention_mask', 'position_ids'

inspect.sig(model.forward).parameters
# mappingproxy(OrderedDict([('input_ids', <Parameter "input_ids: Optional[torch.LongTensor] = None">), ('past_key_values', <Parameter "past_key_values: Union[transformers.cache_utils.Cache, Tuple[Tuple[torch.Tensor]], NoneType] = None">), ('attention_mask', <Parameter "attention_mask: Optional[torch.FloatTensor] = None">), ('token_type_ids', <Parameter "token_type_ids: Optional[torch.LongTensor] = None">), ('position_ids', <Parameter "position_ids: Optional[torch.LongTensor] = None">), ('head_mask', <Parameter "head_mask: Optional[torch.FloatTensor] = None">), ('inputs_embeds', <Parameter "inputs_embeds: Optional[torch.FloatTensor] = None">), ('labels', <Parameter "labels: Optional[torch.LongTensor] = None">), ('use_cache', <Parameter "use_cache: Optional[bool] = None">), ('output_attentions', <Parameter "output_attentions: Optional[bool] = None">), ('output_hidden_states', <Parameter "output_hidden_states: Optional[bool] = None">), ('return_dict', <Parameter "return_dict: Optional[bool] = None">), ('cache_position', <Parameter "cache_position: Optional[torch.LongTensor] = None">)]))
```

In the above case, the given input_names follow the ONNX graph, yet they happen to have the same length as the torch model's forward signature. This kind of case is difficult to detect and to automate for users.

On the other hand, the error message from torch.export.export is informative enough that I believe users will know how to proceed from there:

```python

import torch

class Model(torch.nn.Module):
    def forward(self, x=None, y=None):
        return x + y

dim = torch.export.Dim("x", min=1, max=6)
onnx_program = torch.export.export(
    Model(),
    (),
    kwargs={"x": torch.randn(2, 3), "y": torch.randn(2, 3)},
    dynamic_shapes={"custom_input_x": {0: dim}, "custom_input_y": {0: dim}},
)

# torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['x', 'y'] of `inputs`, but here they are ['custom_input_x', 'custom_input_y']. Alternatively, you could also ignore arg names entirely and specify `dynamic_shapes` as a list/tuple matching `inputs`. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146002
Approved by: https://github.com/justinchuby
2025-01-30 16:01:38 +00:00
894ef8c1e3 [torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623)
Issue:
https://github.com/pytorch/pytorch/issues/144888

Torchbench accuracy for the timm lcnet_050 model fails with `--freezing` `--inference` `--bfloat16`:
`res_error==0.12`
If convolution constant folding in Inductor is turned off: `res_error==0.016`

`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`

Convolution folding increases the error by almost an order of magnitude.

I think we should revisit this and try to improve the accuracy of conv folding, e.g. by doing conv folding at compilation time in float64.

For now I am adding counters to identify whether convolution folding happened, and in the bfloat16 + conv folding case I increase the tolerance multiplier to the max level (10) to pass the accuracy test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
2025-01-30 12:46:35 +00:00
ffa628169d [ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 (#145728)
The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not re-usable on newer devices without modifications. This PR adds a guard for this matmul to be sm_90 specific. Once the kernel is there, the guard may be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145728
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-30 11:19:58 +00:00
6bd19e65b1 add inductor_triton_kernel_mapping_post_grad.json to tlparseadd changes (#145954)
Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same).

Add debug trace artifact to inductor_triton_kernel_mapping_post_grad.json (debug artifact for provenance tracking) to tlparse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954
Approved by: https://github.com/YUNQIUGUO
2025-01-30 06:18:48 +00:00
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING has effect only when value is 1 (#145905)
Fixes #145661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
58cc6693cb [BE] Type annotate wrapper_benchmark.py and cuda_combined_scheduling.py (#145542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145542
Approved by: https://github.com/eellison
2025-01-30 03:53:52 +00:00
8cc6f17334 [CD] Install OpenMP from homebrew (#145889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145889
Approved by: https://github.com/atalman
ghstack dependencies: #145871, #145870
2025-01-30 03:19:51 +00:00
0d5f0a81c5 [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder

Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #145871
2025-01-30 03:19:51 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
933b6d9830 cpp_wrapper: enable in aarch64 and x86 nightly dashboard performance runs (#145791)
Adds `cpp_wrapper` mode to the nightly inductor benchmark runs, as well as optionally for manually triggered runs. This is justified by `aot_inductor` already being in those runs.

Additionally, re-enables `aot_inductor` in the nightly aarch64 runs. It was disabled 5 months ago to deal with a performance instability, which has likely gone away at this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145791
Approved by: https://github.com/desertfire
2025-01-30 02:55:45 +00:00
32bb6f83d5 Make sure that benchmark_harness is set before running (#145532)
Running torch compile with these options causes an error, because the benchmark code isn't generated but is still called.
```
options={'profile_bandwidth_output': 'foo', 'benchmark_harness': False}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145532
Approved by: https://github.com/eellison
2025-01-30 01:25:53 +00:00
25ca05eebf [PGNCCL] Correct some ifdef's (#145893)
`create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893
Approved by: https://github.com/c-p-i-o
2025-01-30 01:05:21 +00:00
73dde451b7 [pytorch] Sprinkle in a few template keywords (#145877)
Summary:
These seem to be necessary to get compilation working on Windows with
CUDA 12.8. I'm not sure whether this means that all of the previous compilers
were broken, and the new one is better, or whether this is a regression in NVCC
12.8. Either way, as long as the CI passes for existing versions, this should
unblock us from CUDA 12.8 enablement on Windows.

See D68663662 for more details on the CUDA 12.8 enablement.

Test Plan: CI!

Reviewed By: akrieger

Differential Revision: D68787925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145877
Approved by: https://github.com/Skylion007
2025-01-30 00:57:40 +00:00
72699950b0 Copy model before benchmark warmup runs (#145858)
Fixes https://github.com/pytorch/pytorch/issues/144772

The eager warmup runs cause the model to change state, so that when we later export it, the model is different from what we would export directly out of the box. For some reason exporting the model with the changed state causes issues, while exporting the initial model is fine. This is why the accuracy checks pass but the performance check fails when exporting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858
Approved by: https://github.com/desertfire
2025-01-30 00:36:33 +00:00
clr
6b41f310c2 config: Support str env variables (#145980)
Summary:
This allows us to use environment variables to set string values. We've added
tests for the specific functionality implemented here. Note that we already
accidentally started setting up configs to use this, so we're just adding the
feature.

Additionally, we're not fully validating the underlying type when we set the
value (and in general, it's more difficult than we would like to do this). Let
me know if people feel strongly, and we can add a PR to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980
Approved by: https://github.com/yushangdi, https://github.com/oulgen
2025-01-30 00:13:02 +00:00
a9ed7bd78e [utilization] pipeline to create clean db records (#145327)
Add upload_utilization_script to generate db-insert-ready records to S3:
- generates two files, metadata and timeseries, in ossci-utilization buckets
- converts log records to db-format ones
- adds a unit test job for tools/stats/

Related Prs:
setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310
add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595
add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-29 23:48:50 +00:00
18a7a04c4a [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in the torch C++ layer and then exposes it via pybind, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 23:20:22 +00:00
b60120d0df Revert "[ATen][CUDA] Implement 128 bit vectorization v2 (#145746)"
This reverts commit 81685d81eb86595d169f55a564da26eaafb2ddf5.

Reverted https://github.com/pytorch/pytorch/pull/145746 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking in trunk. See functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/13032483748/job/36358184032) [HUD commit link](81685d81eb) ([comment](https://github.com/pytorch/pytorch/pull/145746#issuecomment-2623108958))
2025-01-29 23:02:23 +00:00
521588519d re-use FloorDiv for RShift (#145898)
I encountered this C++ compilation error.
```
  579 |     int64_t var_6 = (static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))) | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                     |                                                                                                         |
      |                                                                     int64_t {aka long int}                                                                                    double
```

Then, I figured out where this std::floor came from with the help of Bob's guard provenance tool. It comes from RShift which is used in `triton.next_power_of_2`.

---
Before, we used `std::floor`
```
int64_t var_6 = (
   static_cast<int64_t>(std::floor((1.0/2.0)*u0)) |
   static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0)))))
   | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0))             # no cast to int here.
   | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
```

Now, we use `c10::div_floor_integer` instead
```
int64_t var_6 = (
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) |
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))) |
   (c10::div_floor_integer(static_cast<int64_t>((c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L)))
   | (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))), static_cast<int64_t>(16L)));
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145898
Approved by: https://github.com/desertfire, https://github.com/bobrenjc93
ghstack dependencies: #145802
2025-01-29 22:50:22 +00:00
3df961d99b give emulate_precision_casts an envar (#145948)
this was requested internally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145948
Approved by: https://github.com/mlazos
2025-01-29 22:43:32 +00:00
2e5886dcc4 Add fake_impl for unique_consecutive (#145649)
Summary:
It's fairly similar to torch.unique and torch.unique_dim.

Test Plan:
New test
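
Eager semantics of the op whose fake impl is added (the output length is data-dependent, which is what the fake kernel has to model with an unbacked symbolic size):

```python
import torch

x = torch.tensor([1, 1, 2, 2, 3, 1, 1])
print(torch.unique_consecutive(x))  # tensor([1, 2, 3, 1]) -- length depends on the data
```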

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145649
Approved by: https://github.com/ezyang, https://github.com/eellison
2025-01-29 22:33:16 +00:00
1e57154af3 Require that all HOPs be imported at import torch time (#145939)
E.g. torch.ops.higher_order.cond does not exist until it is imported,
which is bad if it shows up in an FX graph or is used in some code
somewhere.

This PR also makes some more HOPs get imported at `import torch` time.
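
A small check of the property being enforced (assuming a build that includes this change):

```python
import torch

# The higher-order op namespace is populated at `import torch` time, so this
# lookup should succeed without importing anything else first.
print(torch.ops.higher_order.cond)
```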

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939
Approved by: https://github.com/ydwu4
ghstack dependencies: #145938
2025-01-29 22:27:52 +00:00
2141c1aebe Better hop_db comment; move test to a non-export test file (#145938)
Goal is for people to better test their HOPs.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145938
Approved by: https://github.com/ydwu4
2025-01-29 22:27:52 +00:00
e02c038a23 [dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590)
FIXES https://github.com/pytorch/pytorch/issues/144775 frfr

See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385
We fixed some silent incorrectness, but it results in fewer nodes being DCE'd. The benchmark iteration loop had some dead code which could contain side-effecting ops that aren't safe to DCE. The regression is expected.

This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and aligns with the benchmarking used by performance tests

New benchmark results:
```python
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364  # after https://github.com/pytorch/pytorch/pull/144319
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257  # before https://github.com/pytorch/pytorch/pull/144319
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590
Approved by: https://github.com/jansel
ghstack dependencies: #145447
2025-01-29 22:14:47 +00:00
793dfc27e0 [inductor] Add some typing to triton.py (#145688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145688
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #145671, #145695
2025-01-29 21:56:40 +00:00
5db0ad92e3 [inductor] Remove mask_str from IndexingOptions (#145695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145695
Approved by: https://github.com/eellison
ghstack dependencies: #145671
2025-01-29 21:56:40 +00:00
23ff899164 [inductor] Fix handling of fixed XBLOCK larger than xnumel=1 (#145671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145671
Approved by: https://github.com/eellison
2025-01-29 21:56:32 +00:00
bb2fb554a9 [BE]: Update CUTLASS submodule to 3.7.0 (#145172)
* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy, https://github.com/henrylhtsang
2025-01-29 21:48:01 +00:00
d0aa1386b8 Disable AOTAutogradCache for triton version < 3.2 (#145937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937
Approved by: https://github.com/bdhirsh
2025-01-29 21:32:16 +00:00
1185b81c51 Revert "[dynamo] Use polyfill to implement comparison operators (#144485)"
This reverts commit d1f82de2bf4ce4d4461791a9c9b2e759202db0bb.

Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))
2025-01-29 21:30:42 +00:00
953e80936e [linter] Grep linter batches long command (#145950)
If the command is too long, the linter fails with
```
Failed due to OSError:
[Errno 7] Argument list too long: 'grep'
```
Fix this by batching the command so it is shorter

The limit of 750k was chosen because `getconf ARG_MAX` returns ~1M on my Mac. My guess is that most people shouldn't hit this unless they run --all-files and the combined path length is long.
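
A minimal sketch of the batching idea (not the linter's actual code):

```python
def batch_args(paths, limit=750_000):
    # Yield chunks of paths whose combined length stays under `limit`, so each
    # grep invocation's argument list fits comfortably within ARG_MAX.
    batch, size = [], 0
    for p in paths:
        if batch and size + len(p) + 1 > limit:
            yield batch
            batch, size = [], 0
        batch.append(p)
        size += len(p) + 1
    if batch:
        yield batch

print([len(b) for b in batch_args(["a" * 10] * 5, limit=25)])  # [2, 2, 1]
```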

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950
Approved by: https://github.com/wdvr
2025-01-29 21:23:27 +00:00
a6e3f294f1 Don't use mypy daemon in CI (#145961)
This is an attempt to fix flaky mypy errors in CI that look like:

```
dmypy status --verbose
connection_name         : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock
pid                     :      32233
error                   :  timed out
Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill
```

"Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961
Approved by: https://github.com/malfet
2025-01-29 21:15:29 +00:00
40ccb7a86d cpp_wrapper: Move #includes to per-device header files (#145932)
Summary:
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts.

Co-authored-by: Benjamin Glass <[bglass@quansight.com](mailto:bglass@quansight.com)>

Differential Revision: D68656960

Pulled By: benjaminglass1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932
Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1

Co-authored-by: bglass@quansight.com <bglass@quansight.com>
2025-01-29 21:08:45 +00:00
8bd7bf3269 [Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894)
### Summary

`RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data.

Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels.

### Caveat

#### _Before_
No corresponding results in PyTorch profiler profiling data

#### _After_

| Inductor config settings |  What kernel name looks like in profiling data | Comments|
|-------------------|------------------------------------|--------------------|
| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code | `graph_x_cpp_fused_y` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
|  `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| Both `inductor_config.cpp.descriptive_names = "inductor_node"`  & Inductor CPP Wrapper | `graph_x_cpp_fused_flex_attention_y`| Easy to interpret data |
| Neither of the two configs  | `graph_x_kernel`| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-01-29 20:54:46 +00:00
bb4964013f Add determinmistic kernel for reflection2d (#136241)
Adds feature for #98925

Tests pass for both existing reflectionpad2d and the new one I inserted.

**Summary of the work:**

A simple conditional check for deterministic mode dispatches to a different kernel. This kernel does not use any atomic operations and produces deterministic results: instead of the output-to-input (1:1) relationship, I do the opposite and go from each input to all of its outputs, which is 1 to many. These operations are done in the same order on every execution, as I simply traverse the data set with a grid-stride loop and use simple linearized indexing into the input tensor.

So each thread computes the 4 conditionals, which are then used to see whether the input has an output in each of the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, and bottom right.

I did not focus on performance for this PR, as that would expand the scope heavily. If there are any performance questions, though, I can answer them.
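
A hedged usage sketch (run on CPU here; the new kernel is the CUDA backward path taken when deterministic algorithms are requested):

```python
import torch

torch.use_deterministic_algorithms(True)
pad = torch.nn.ReflectionPad2d(1)
x = torch.arange(9.0).reshape(1, 1, 3, 3).requires_grad_(True)
# The backward accumulates several output grads into each input position,
# which is where the non-atomic, input->outputs kernel comes in on CUDA.
pad(x).sum().backward()
print(x.grad)
torch.use_deterministic_algorithms(False)
```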

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241
Approved by: https://github.com/eqy, https://github.com/albanD
2025-01-29 20:34:03 +00:00
2b8c28099a [OSS] Add no dist as an argument to DCP top level apis (#145754)
Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP

Test Plan: existing tests pass

Differential Revision: D68714246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754
Approved by: https://github.com/daulet-askarov
2025-01-29 20:33:37 +00:00
2d5d022594 Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-29 20:27:39 +00:00
6aed6c042e [CD] Install ninja and setuptools from PyPI (#145871)
As well as typing extensions, they are available from PyPI, no reason to install them from Anaconda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
2025-01-29 19:47:16 +00:00
b80482988f Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870)"
This reverts commit c26bb9ba5bd40d256a25436212279bc7e4b436ae.

Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
b52e8d521e Revert "[CD] Install ninja and setuptools from PyPI (#145871)"
This reverts commit eea7d395e5faa9a4be5b60f6668c0bdf5163e3a0.

Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
082fab0fc7 [64-bit] Int64 casting for UpSampleNearest3D (#144865)
Fixes #144855

Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865
Approved by: https://github.com/eqy
2025-01-29 19:30:09 +00:00
1c9014a135 [export] Add tlparse to draft-export (#145810)
Dependent on https://github.com/ezyang/tlparse/pull/87/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810
Approved by: https://github.com/pianpwk
2025-01-29 19:26:00 +00:00
6371c25b91 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 9fd6722fc9068eeaa176754acb315fc7e0f6416c.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))
2025-01-29 18:30:30 +00:00
e0525dbca9 Revert "inductor.config.descriptive_names = False is not actually supported (#145523)"
This reverts commit edf266e9bbbf6063f7c4a336ffb50234e11a0a82.

Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))
2025-01-29 18:27:44 +00:00
284f217011 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 97b3b73f3e96bb8684064715b93c825ba0395475.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))
2025-01-29 18:24:29 +00:00
0d6343347f Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448)"
This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed.

Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))
2025-01-29 18:07:12 +00:00
1a613c3342 bump counters for unbacked binding names (#145882)
Instead of bumping symint counters when we process unbacked bindings during deserialization, it's better to bump them at the beginning, based on the symbols that were in the original shape env before serialization. This allows symbols in unbacked bindings to have "gaps" that bumping alone would not be able to match.

Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen).

Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882
Approved by: https://github.com/pianpwk
2025-01-29 17:46:21 +00:00
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

Simultaneously, **another motivation**, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
d1f82de2bf [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-01-29 17:37:40 +00:00
3e135993bd Update mi300 labels to account for multiple clusters. (#145923)
We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923
Approved by: https://github.com/jeffdaily
2025-01-29 16:56:43 +00:00
4499d60d56 [dynamo][builin-skipfiles-cleanup] Remove types (#145909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909
Approved by: https://github.com/zou3519
ghstack dependencies: #145856, #145875, #145878, #145892
2025-01-29 16:47:02 +00:00
ed141d7d1a dont assign a size to _assert_scalar in partitioner (#143877)
Fixes https://github.com/pytorch/pytorch/issues/143876

Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877
Approved by: https://github.com/zou3519
2025-01-29 16:21:37 +00:00
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

Now I generalize the logic as below (a sketch follows the list):
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).
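
A toy Python sketch of that selection policy (the actual implementation is C++ device-enumeration code; the names here are made up):

```python
from collections import namedtuple

Dev = namedtuple("Dev", "kind")

def select_xpu_devices(platforms):
    # 1. Prefer the first platform that contains a dGPU; enumerate only its dGPUs.
    for p in platforms:
        dgpus = [d for d in p if d.kind == "dGPU"]
        if dgpus:
            return dgpus
    # 2. Otherwise take all iGPUs of the first platform that contains an iGPU.
    for p in platforms:
        igpus = [d for d in p if d.kind == "iGPU"]
        if igpus:
            return igpus
    # 3. No GPU found.
    return []

print(select_xpu_devices([[Dev("iGPU")], [Dev("dGPU"), Dev("dGPU")]]))  # the two dGPUs
```
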
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
5e5da9bd9a [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte
2025-01-29 15:17:58 +00:00
81685d81eb [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a re-base PR to my previous one #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after | time before | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-01-29 13:32:59 +00:00
354fe48db9 Add magma cuda build 12.8 (#145765)
https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765
Approved by: https://github.com/malfet
2025-01-29 08:43:38 +00:00
501c5972f0 [pytorch] raise exception when calling dim order on sparse tensor (#145888)
This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor.

Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888
Approved by: https://github.com/Jack-Khuu
2025-01-29 06:15:44 +00:00
2e8c080ab1 [inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583)
Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`.

This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583
Approved by: https://github.com/jansel
2025-01-29 05:46:05 +00:00
3f77002b96 [dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875, #145878
2025-01-29 05:30:06 +00:00
236793684d [dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875
2025-01-29 05:30:06 +00:00
a479656cd2 [dynamo][builtin-skipfiles-removal] Remove logging (#145875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875
Approved by: https://github.com/williamwen42
ghstack dependencies: #145856
2025-01-29 05:29:58 +00:00
64ee57847b [dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856)
[dynamo][builtin-skipfiles-cleanup] Remove more builtins

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856
Approved by: https://github.com/zou3519
2025-01-29 05:29:47 +00:00
7178b827d7 PEP585: Missed conversions (#145342)
Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342
Approved by: https://github.com/bobrenjc93
2025-01-29 05:24:36 +00:00
8696e59ae2 add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821)
Followup from https://github.com/pytorch/pytorch/issues/130290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821
Approved by: https://github.com/eellison, https://github.com/ezyang
2025-01-29 04:36:32 +00:00
776bdb962c [ONNX] Support subgraphs with 1+ outputs (#145860)
Fixed a bug in _handle_output_node where additional output values were not added as graph outputs

Fixes #145734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860
Approved by: https://github.com/titaiwangms
2025-01-29 04:13:23 +00:00
cyy
fd515e4f59 Fix C++20 Wambiguous-reversed-operator warnings (#144126)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126
Approved by: https://github.com/albanD
2025-01-29 03:13:57 +00:00
90a6db4a9c [be][pytorch] Fix backend in autocast (#145859)
Summary: fixing backend typo (BAKCNEDS -> BACKENDS)

Test Plan: ci

Differential Revision: D68573324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859
Approved by: https://github.com/jvandebon
2025-01-29 03:13:08 +00:00
9be2e88d41 Fix lowering to inductor IR for triton CPU (#144389)
Example failing test:
 `pytest -s test_torchinductor_opinfo.py  -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU.

Failure:
```shell
triton.compiler.errors.CompilationError: at 10:11:
def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 1.0
    tl.static_assert(tmp1.dtype == tl.float32)
    tmp2 = ops.polygamma(tmp1, tmp0)
           ^
NameError('ops is not defined')
```
This occurs because the registered triton fallbacks are not used during the lowering to inductor IR.

Marked the problematic code in the excerpt below from 6bc17b0725/torch/_inductor/lowering.py (L572)

```python
def make_pointwise(
    fn,
    override_return_dtype=None,
    override_device=None,
    override_fn_when_input_bool=None,
    override_fn_when_gpu_float64=None,
    allow_alpha=False,
    triton_fallback=None,
):
    def inner(*inputs: TensorBox, alpha=None):
        if triton_fallback is not None and any(
            isinstance(inp, IRNode) and is_triton(inp) for inp in inputs <--- is_triton should return True when using triton CPU
        ):
            assert not allow_alpha  # not implemented
            return triton_fallback(*inputs)

        inputs = promote_constants(inputs, override_return_dtype)
        if allow_alpha:
            if alpha is not None and alpha != 1:
                inputs = list(inputs)

```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389
Approved by: https://github.com/jansel
2025-01-29 03:10:53 +00:00
50f834f134 [export] allow bit shift builtin ops (#145802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802
Approved by: https://github.com/pianpwk
2025-01-29 03:05:48 +00:00
f4ca98950e Add CUDA 12.8 libtorch image (#145789)
https://github.com/pytorch/pytorch/issues/145570

Builds 12.8 libtorch docker/deprecate 12.1 meanwhile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-29 02:59:37 +00:00
9330b6d098 Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-29 02:52:55 +00:00
9fd6722fc9 [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 02:48:56 +00:00
29521256e1 [Customized Optimus][Inductor] Add split cat pattern in aten level (#145721)
Summary:
Thanks to Microve for discovering that recGPT has some repeated similar kernels that might be optimized through Optimus. After investigation, I designed a pattern at the ATen level to remove such redundant kernels.

trace: https://fburl.com/perfdoctor/82fauil7
tlparse: https://fburl.com/98q6tadx

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567
Network: Up: 341KiB  Down: 359KiB  (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d)
Executing actions. Remaining     0/3
Command: test.     Finished 2 local
Time elapsed: 3:04.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local run

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115
```

https://www.internalfb.com/mlhub/pipeline/1630903954173593

# E2E

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18]
```

### how to add the config
Add the following patterns to the dynamo config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
        }
```

{F1974700331}

baseline:
aps-recgpt_ranking_1115_pt2_5-8cb4905c7d

{F1974700216}

proposal:

Differential Revision: D68695717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721
Approved by: https://github.com/Yuzhen11
2025-01-29 01:59:06 +00:00
331f49057d Removes threadfence from topk kernel to improve AMD performance (#145536)
Also marginally improves cuda perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536
Approved by: https://github.com/eqy
2025-01-29 01:29:15 +00:00
6f5c8fb128 [DTensor] Add pointwise ops strategy for aten.minimum (#145816)
Need it for Shampoo optimizer.
9c5700ad5e/matrix_functions.py (L240-L242)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816
Approved by: https://github.com/XilunWu
2025-01-29 01:19:01 +00:00
15e37e4253 [export] don't always print GM in serdes logging (#145857)
Summary: Didn't realize print_readable() would also print and not just return string

Test Plan: .

Differential Revision: D68781525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857
Approved by: https://github.com/angelayi, https://github.com/yiming0416
2025-01-29 01:03:02 +00:00
a24b25942a Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Case 1: no `eps`  is passed to python frontend:
Use the `eps` associated with `opmath_t` instead of the `eps` associated with `scalar_t` for the intermediate computation.

Case 2: `eps` is passed to python frontend
Avoid downcasting `eps` to `scalar_t` and then upcasting it again implicitly in the `rqrst_input` computation
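
A small illustration (mine, not from the PR) of why keeping `eps` in a higher-precision type matters; the rounded values in the comments are approximate:

```python
import torch

eps = 1e-6
# Round-tripping eps through a low-precision dtype changes its value,
# which is the kind of drift the fix avoids in the intermediate computation.
print(torch.tensor(eps, dtype=torch.float32).item())   # ~1.000e-06
print(torch.tensor(eps, dtype=torch.float16).item())   # ~1.013e-06 (subnormal rounding)
print(torch.tensor(eps, dtype=torch.bfloat16).item())  # ~9.98e-07
```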

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/albanD
2025-01-29 01:01:44 +00:00
ae0f305bf9 [inductor] Make triton kernel autotune config defaults backward-compatible (#145494)
If a model was torch.packaged using triton<=3.1, any user-defined
autotuned kernels will have reps/warmups burned in with the old defaults
(100/25).  If this model is loaded with triton>=3.2, inductor's checks for
unsupported non-default autotune args will fail, because triton.Autotuner's
defaults for these parameters have changed to `None`. Let's explicitly support
those values for backward compatibility with these older models.
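
A hypothetical sketch of the compatibility rule described above (names and structure are illustrative, not Inductor's actual code):

```python
# Old triton.Autotuner defaults burned into packaged kernels vs. the new default (None).
LEGACY_DEFAULTS = {"rep": 100, "warmup": 25}

def is_supported_autotune_arg(name: str, value) -> bool:
    # Accept the new default (None) as well as the old burned-in defaults,
    # so models packaged with triton<=3.1 keep loading under triton>=3.2.
    return value is None or value == LEGACY_DEFAULTS.get(name)
```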

Differential Revision: [D68561014](https://our.internmc.facebook.com/intern/diff/D68561014/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145494
Approved by: https://github.com/aorenste
2025-01-29 00:31:39 +00:00
9036a22c83 [Inductor][Triton] Change propagated dtype for fp16/bf16 unwrapped 0d tensors (#145613)
Fixes TestInductorOpInfoCPU.test_comprehensive_max_binary_cpu_float16 and related tests for Triton CPU. TestInductorOpInfoCPU is currently not run in the CI. See https://github.com/pytorch/pytorch/pull/144389#issuecomment-2608050755 for some additional context.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145613
Approved by: https://github.com/davidberard98, https://github.com/eellison, https://github.com/jansel
2025-01-29 00:23:44 +00:00
2f24f2eb46 Make sure to evaluate annotation strings in the context of where the prototype was created (#145667)
This was incorrectly evaluating the annotation in the context of infer_schema - make sure to evaluate annotation strings in the context of where the prototype was created instead.
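
A minimal sketch of the underlying idea, assuming only the standard `typing` machinery: resolve string annotations against the globals of the module where the function was defined, rather than where the schema inference happens. `resolve_annotations` is my own illustrative helper, not the actual infer_schema code:

```python
import typing

def resolve_annotations(fn):
    # Evaluate string annotations in the defining module's namespace,
    # not in the namespace of the code doing the inference.
    return typing.get_type_hints(fn, globalns=fn.__globals__)
```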

Fixes #145481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145667
Approved by: https://github.com/zou3519
2025-01-29 00:14:45 +00:00
82859f6185 [associative_scan] scan dim handling in user-facing associative_scan() (#139864)
This PR implements the user-facing dim change, i.e., that the scan dim provided by the user is always moved to dim 0 and then the associative_scan operation always operates on dim 0.
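
A rough sketch of the dim handling described above; `scan_dim0` stands in for the real dim-0 implementation and is hypothetical:

```python
import torch

def associative_scan_user_dim(combine_fn, xs, dim):
    # Move the user-provided dim to the front, scan along dim 0, then move it back.
    xs = torch.movedim(xs, dim, 0)
    out = scan_dim0(combine_fn, xs)  # hypothetical: always operates on dim 0
    return torch.movedim(out, 0, dim)
```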

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139864
Approved by: https://github.com/ydwu4
2025-01-28 23:58:10 +00:00
7ca156f0ee partitioner: avoid inserting duplicates into heap (#145082)
Fixes https://github.com/pytorch/pytorch/issues/145081

This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There's some code in the partitioner that iteratively adds users of a node to a heap, and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, then this can result in (a de-duplication sketch follows the list below):
(1) a node getting added to the heap many times
(2) each time we pop that node, we add (duplicates of) each of that node users to the heap
(3) repeat with each user
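
A minimal sketch of the de-duplication pattern that avoids the blow-up above (illustrative; not the partitioner's actual data structures):

```python
import heapq

heap, in_heap = [], set()

def push_user(rank, node_name):
    # Push each user at most once; re-pushing duplicates is what made the loop quadratic.
    if node_name not in in_heap:
        in_heap.add(node_name)
        heapq.heappush(heap, (rank, node_name))
```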

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145082
Approved by: https://github.com/xmfan
2025-01-28 23:44:45 +00:00
02dd7a7803 Extend abi-stable nitpick message to all the c stable files (#145862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145862
Approved by: https://github.com/ezyang
2025-01-28 23:22:23 +00:00
049f042e52 Update build_wheel.sh 2025-01-28 15:14:41 -08:00
eea7d395e5 [CD] Install ninja and setuptools from PyPI (#145871)
Rather than Conda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
ghstack dependencies: #145870
2025-01-28 23:09:38 +00:00
c26bb9ba5b [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or just searching in that folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007
2025-01-28 23:09:37 +00:00
f388ba5986 Update CUDNN frontend submodule to 1.10.0 (#145780)
Update to CUDNN 1.10. Most of this release is about supporting some new APIs needed for Blackwell integration and new features in the corresponding CUDNN version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet
2025-01-28 22:54:24 +00:00
af43b445a5 [ONNX] Set USE_EXPERIMENTAL_LOGIC to True (#137296)
This sets dynamo_export to use the new export logic. The legacy dynamo export logic will be removed as a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137296
Approved by: https://github.com/titaiwangms
2025-01-28 22:35:11 +00:00
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix.
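
A minimal stand-in for such a capability check (the exact helper name added by the PR may differ, so treat this as an assumption; the known fact is that TF32 requires an Ampere-or-newer GPU, i.e. compute capability >= 8.0):

```python
import torch

def cuda_supports_tf32() -> bool:
    # TF32 tensor cores exist only on compute capability 8.0+ devices.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 8
```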

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
1ffed44b42 [aotinductor] update unbacked symint runtime assertion msg (#145569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145569
Approved by: https://github.com/chenyang78
2025-01-28 21:42:58 +00:00
a06a18b1bb [ATen] Implement exception handling for hipsolver APIs (#145839)
Summary: TSA

Test Plan: CI

Differential Revision: D68741194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145839
Approved by: https://github.com/Mellonta
2025-01-28 21:37:23 +00:00
9003d81144 change the test wheel to release wheel when release wheel available (#145252)
change the test wheel to release wheel when release wheel available

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-28 21:23:53 +00:00
4f949f282d [c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855)
While working on PGNCCL I found that the code triggers some lint warnings so this PR is to address them or add lint suppressor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2025-01-28 21:19:49 +00:00
6bcb545d9c [CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793)
Make CI cuSPARSELt installation consistent with the nightly binary.
Remove cu121-related docker build jobs and inductor runs. Update test failures relating to cu121.

Retry of https://github.com/pytorch/pytorch/pull/145696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793
Approved by: https://github.com/eqy, https://github.com/tinglvv
2025-01-28 21:01:58 +00:00
ccc2878c97 Fix fractional_max_pool lowering in inductor (#144395)
Fixes https://github.com/pytorch/pytorch/issues/141538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-28 21:00:18 +00:00
ef28df5c9e [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)
Reland of #137843 , after checking the code again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-28 20:51:49 +00:00
3481c2aec4 Revert "[dynamo] save/restore system random state more carefully (#145750)"
This reverts commit e3d3f2b22e4b75c64eaa2f940a2dd80c1e43435c.

Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))
2025-01-28 20:51:07 +00:00
28982ceb3b [aarch64] Rebuild everything with ArmPL (#145742)
Summary: Rebuild everything that used OpenBLAS with ArmPL

Test Plan: CI, prod test

Reviewed By: Nicoshev

Differential Revision: D68219559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145742
Approved by: https://github.com/malfet
2025-01-28 20:48:42 +00:00
edf266e9bb inductor.config.descriptive_names = False is not actually supported (#145523)
Summary:
This config is not supported (it throws an error when set), and doesn't really make sense imo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145523
Approved by: https://github.com/eellison
2025-01-28 20:22:23 +00:00
515e55e692 Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764)
This could be BC-breaking, because there was a period of time when we used py_limited_api=True but didn't enforce the flag, and now that we will start enforcing the flag, people's custom extensions may fail to build.

This is strictly still better behavior, as it is sketchy to claim CPython agnosticism without the flag, but calling this out as a potential source of complaints. Ways to mitigate this risk, and reasons this may not be too big a deal (a build sketch follows the list below):
- People haven't known much about py_limited_api for extensions due to the lack of docs from Python, so usage is low right now
- My current tutorial is on track to have new users of py_limited_api pass this flag, so it'd be a no-op for them.
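
A hedged sketch of what a limited-API extension build might look like under this behavior; whether `py_limited_api` is accepted as a keyword (and the wheel-tagging option) depends on the torch/setuptools versions, so treat the exact spellings here as assumptions:

```python
# setup.py (sketch)
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_ext",  # hypothetical extension name
    ext_modules=[
        CppExtension(
            "my_ext",
            ["my_ext.cpp"],
            py_limited_api=True,  # per this PR, this now also implies -DPy_LIMITED_API
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    options={"bdist_wheel": {"py_limited_api": "cp39"}},  # tag the wheel as abi3 (assumption)
)
```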

Test plan:
* Locally I'm confident, as I tried rebuilding ao with this change and it reliably failed (because importing torch/extension.h is a no-no)
* Unit test wise, the normal python_agnostic one I added should work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2025-01-28 20:11:05 +00:00
8d91bfd965 [BE] Include CheckFunctionExists in FindBLAS.cmake (#145849)
It's used in the script, so it must be included
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849
Approved by: https://github.com/Skylion007
2025-01-28 19:47:05 +00:00
eaff13275e [dynamo] Properly branch on an unspecialized NN module (#145786)
User-defined NN modules might have their own `__len__` or `__bool__`
methods which Dynamo needs to trace through, so that side effects and/or
reads of buffered writes are properly handled.

This patch removes the special `UnspecializedNNModuleVariable` branch in
Dynamo's branch handling, and lets these cases fall into the
`UserDefinedObjectVariable` branch, which handles the aforementioned
cases correctly.

Fixes #145284.
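
An example (mine, not from the PR) of the kind of module this affects; truth-testing the module falls back to its user-defined `__len__`, which Dynamo now traces like any other user-defined object method:

```python
import torch

class Bag(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.items = [1, 2, 3]

    def __len__(self):
        return len(self.items)

@torch.compile
def f(m, x):
    if m:          # truth test calls Bag.__len__ under the hood
        return x + 1
    return x - 1

print(f(Bag(), torch.zeros(2)))
```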

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145786
Approved by: https://github.com/williamwen42
2025-01-28 19:45:17 +00:00
d9ffa5da65 Log info for AOTAutogradCache bypasses instead of warning (#145768)
Fixes #145767

FxGraphCache also logs to info instead of warning, so let's do that here as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145768
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2025-01-28 19:25:36 +00:00
6c09954a9e Windows builds with VS2022 (#145319)
Fixes https://github.com/pytorch/pytorch/issues/128835
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145319
Approved by: https://github.com/huydhn
2025-01-28 19:07:24 +00:00
cbc4094298 [draft_export] add LOC for data-dep error logging (#145443)
Summary:
Maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like:
```
2. Data dependent error.
    When exporting, we were unable to evaluate the value of `u306`.
    This occurred at the following stacktrace:
    File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl:
         `if offsets[-1] > len(input):`
    As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph.
    Please add `torch._check(...)` to the original code to assert this data-dependent assumption.
    Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details.
```

This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing

Test Plan: .

Differential Revision: D68534017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443
Approved by: https://github.com/angelayi
2025-01-28 18:55:16 +00:00
c32bafeb0b [ROCm] Bump AOTriton to 0.8.2b (#145508)
We received reports AOTriton kernels mishandles the bias pointer and it causes NaN during fine-tuning llama3.2-11b vision model. This PR will fix the problem.

Note: AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases, but this support is considered experimental and will not be enabled right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
2025-01-28 18:34:25 +00:00
621604ce46 Maintain multiple configs (#145103)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Previously, we would finalize the config of a triton template after its first fusion. This maintains multiple configs, in case we epilogue fuse, then prologue fuse, and prologue fusion has a new, better config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145103
Approved by: https://github.com/jansel, https://github.com/shunting314
ghstack dependencies: #143408
2025-01-28 18:32:14 +00:00
eaec97ab1f [dynamo] Properly prune dead input cell object (#145781)
This patch models input cell object as "newly created" rather than
"pre-existing" python object (see added documentation for why this
actually captures the semantics more accurately).

This enables the `SideEffects.prune_dead_object_new` algorithm to prune
away writes to input cell objects which are no longer relevant; this
didn't happen prior to this patch because we modelled them as
pre-existing objects, which forces us to codegen their attribute
mutations.

Fixes #145564.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145781
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-28 18:28:13 +00:00
8e258e2ecd Parallelize epilogue/prologue benchmarking (#143408)
When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked.

This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node.

Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel).

Compilation speedups:

<img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" />

This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-01-28 18:18:24 +00:00
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t` to be used for reduction (and int mms)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
5382ab57d7 Move trunk windows builds to CUDA-12.4 (#145844)
Same as : https://github.com/pytorch/pytorch/pull/130446

That should catch build regressions that were previously only detectable during the nightly builds for 12.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145844
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-28 18:00:51 +00:00
56915b093a Fix environment deployment spam (#145823)
With https://github.com/pytorch-labs/pytorch-gha-infra/pull/598 in place, the environment can now be removed.

Fixes https://github.com/pytorch/pytorch/issues/145704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145823
Approved by: https://github.com/clee2000
2025-01-28 17:46:31 +00:00
cfbb27462e Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)"
This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3.

Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))
2025-01-28 17:46:11 +00:00
dbef2a9bc9 Revert "Remove lexicographical sorting of storage keys in torch.save (#143879)"
This reverts commit 7db0afabaaff17dd37cf846cd786610ebf6aedd3.

Reverted https://github.com/pytorch/pytorch/pull/143879 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68746524 for details ([comment](https://github.com/pytorch/pytorch/pull/143879#issuecomment-2619661492))
2025-01-28 17:40:16 +00:00
097ccd9c39 Move ROCm MI300 jobs to unstable to make CI green (#145790)
This is a temporary change to reduce intermittent test failures. Jobs can be moved back once those machines get better runner isolation.

This also sneaks in a small fix to all the rocm job's build step to be run on Linux Foundation runners (the get-label-type dependency).  The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790
Approved by: https://github.com/yangw-dev
2025-01-28 17:25:15 +00:00
7eb51e5464 Ensure GPU isolation for kubernetes pod MI300 runners. (#145829)
Fixes the reason behind moving the tests to unstable initially (https://github.com/pytorch/pytorch/pull/145790).
We ensure GPU isolation for each pod within Kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the `docker run` in PyTorch here.
Now we stick with the GPUs assigned to the pod in the first place, and there is no overlap between the test runners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829
Approved by: https://github.com/jeffdaily
2025-01-28 17:20:46 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
ac87388e61 [AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639)
Summary: Remove #ifdef USE_CUDA macros through some refactor

Test Plan: Refactor code, existing tests.

Differential Revision: D68636743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639
Approved by: https://github.com/desertfire
2025-01-28 16:46:00 +00:00
6967ef1b07 [ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227)
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that size is required only for MI300.
Fix condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2025-01-28 16:19:27 +00:00
80a0412b76 [dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753, #145826
2025-01-28 16:14:34 +00:00
6824a4a75d [dynamo][builtin-skipfiles-cleanup] Remove re (#145826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753
2025-01-28 16:14:34 +00:00
4307e6c008 [dynamo][builtin-skipfile-cleanup] Remove signal (#145753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753
Approved by: https://github.com/zou3519
ghstack dependencies: #145744
2025-01-28 16:14:23 +00:00
3a56089217 fix unbacked + view incorrectness (#145548)
fix for https://github.com/pytorch/pytorch/issues/143498

We were incorrectly using contiguous strides for a non-contiguous tensor. There are two separate causes:

1. https://github.com/pytorch/pytorch/pull/110520 made it so we turn Views contiguous with unbacked symints because
`dynamic_reshape_indexer below will fail due to the size_hint's inability to process unbacked SymInts`. Seems like we should fix that. Regardless, it will make the input contiguous if the input is unbacked to work around this.

2. We weren't actually making it contiguous! I filed an issue for this here: https://github.com/pytorch/pytorch/issues/145561.

This is still worth landing as a fix, even though we should fix those issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145548
Approved by: https://github.com/desertfire
2025-01-28 16:03:45 +00:00
97b3b73f3e [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-01-28 15:21:12 +00:00
a08f7f3266 OpenReg: fix issue of pin_memory (#145046)
Fix issue of `pin_memory` when rewrapping a storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145046
Approved by: https://github.com/albanD
2025-01-28 09:41:04 +00:00
bdf6dfa17d [chore][ez] change alloc buffer size from 4000 to 4096 (#145759)
Summary:
Allocations typically happen as a power of 2 anyway.
Change the default alloc size to 4096 so eek out a bit more perf.

Test:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756, #145757
2025-01-28 09:14:07 +00:00
5c5306e8bc [dynamo][builtin-skiplist-cleanup] Remove weakref (#145744)
WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744
Approved by: https://github.com/jansel
2025-01-28 07:55:12 +00:00
45f64e770a relax assertion to warning for unbacked binding names (#145777)
Summary:
Quick fix following up on https://github.com/pytorch/pytorch/pull/144894 to unblock internal tests.

Will keep investigating a more principled fix.

Test Plan: Failures in T213563826 now pass

Differential Revision: D68731710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145777
Approved by: https://github.com/angelayi
2025-01-28 07:52:40 +00:00
0a8a0ef767 [inductor] Fix crash running wrapper_benchmark with no device (#145644)
Fixes #145434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145644
Approved by: https://github.com/shunting314
2025-01-28 07:31:36 +00:00
a699034eec Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
2025-01-28 07:07:14 +00:00
0f5a68344a [BE][Inductor] Simplify custom_op tests (#145814)
Not sure what the motivation was behind repeating the same function over and over again for different backends.
Change `test_custom_op_[123]` from accepting separate (but identical) implementations for CPU, CUDA and XPU to taking just `fn` and `fn_meta` args.

Test that it is also extendable to MPS (see the sketch below).
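
Roughly the shape of a test case under the new signature (my own sketch using the public custom-op API; the `fn`/`fn_meta` names mirror the argument names described above):

```python
import torch

# One real implementation and one meta/fake implementation, reused for every
# backend instead of duplicating per-device variants.
@torch.library.custom_op("mylib::twice", mutates_args=())
def fn(x: torch.Tensor) -> torch.Tensor:
    return x * 2

@fn.register_fake
def fn_meta(x):
    return torch.empty_like(x)
```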

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145814
Approved by: https://github.com/jansel
2025-01-28 05:58:51 +00:00
23eb0a3201 Improve typing in torch/types.py (#145237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-28 05:29:12 +00:00
8e46d0f595 [BE]: Update typing of OrderedSet ancestor (#145783)
Now that the minimum Python version is 3.9, we can properly use generics in the superclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145783
Approved by: https://github.com/eellison
2025-01-28 04:43:49 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
01a4d86b31 add pt2 callbacks for backward pass and prevent duplicate callbacks (#145732)
Summary: This change adds callbacks for lazy backward compilation while preventing duplicate callbacks from being fired.

Differential Revision: D68577593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145732
Approved by: https://github.com/mlazos
2025-01-28 03:50:02 +00:00
1a26cdd5cb [cond] remove warning for unsupported tuple returns (#145766)
I guess this is supported now
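
An example of the pattern in question (mine, using `torch.cond`; availability and the exact calling convention depend on the torch version):

```python
import torch

def true_fn(x):
    return x.sin(), x.cos()

def false_fn(x):
    return x.cos(), x.sin()

x = torch.randn(4)
# Branches returning a tuple no longer trigger the "unsupported tuple returns" warning.
a, b = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```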
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145766
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-28 03:13:36 +00:00
9010649292 Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)"
This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53.

Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))
2025-01-28 03:07:17 +00:00
78f02bf07c [bug] handle case when remote peer closes connection (#145757)
Summary:
In the case where the remote peer closes the connection, nread returns 0. In
this case, we still want to free up the allocated buffer.
Also, reorder the if so that the likely success case (nread > 0) is at
the top of the function with an early return.

Test Plan:
unit tests

Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756
2025-01-28 03:06:38 +00:00
4be831ba2d [draft_export] fix dense-in-memory check for inferring fakes (#145653)
Test Plan: fixes check for dense tensors with size-1 dimensions

Differential Revision: D68644028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145653
Approved by: https://github.com/zou3519
2025-01-28 02:52:14 +00:00
7c1fc0a047 Log cache state for AOTAutograd in title of file (#145715)
Differential Revision: [D68692755](https://our.internmc.facebook.com/intern/diff/D68692755/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145715
Approved by: https://github.com/bobrenjc93
2025-01-28 02:14:18 +00:00
78a94c9114 [inductor] Remove type ignores from scheduler.py (#145712)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145712
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
ghstack dependencies: #145692
2025-01-28 01:44:32 +00:00
2df2f9d895 [inductor] Change type of get_backend_features to OrderedSet (#145692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145692
Approved by: https://github.com/yanboliang
2025-01-28 01:44:32 +00:00
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with the wrong device context when torch.cuda.set_device() is not called (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
e3d3f2b22e [dynamo] save/restore system random state more carefully (#145750)
Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up.

Note: I have verified that the previously failing internal tests now pass internally.

Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750
Approved by: https://github.com/StrongerXi
2025-01-28 01:34:13 +00:00
f16ce3c7e9 Refactor fuzzer and add support for Dynamo (#145565)
## Summary:
Dynamo now works with config fuzzer.

For BE week, we also found and fixed 5 different bugs (in inductor):
- https://github.com/pytorch/pytorch/pull/145426
- https://github.com/pytorch/pytorch/pull/145523
- https://github.com/pytorch/pytorch/pull/145527
- https://github.com/pytorch/pytorch/pull/145532
- https://github.com/pytorch/pytorch/pull/145538

## Test Plan:
New Dynamo Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145565
Approved by: https://github.com/masnesral
2025-01-28 00:44:27 +00:00
6eb74fbec6 Updates NCCL user buffer registration test for NCCL 2.24.3 (#145285)
NCCL 2.24.3 changed the content of the debug output for NVLS registration. We use this debug output in our test suite to check if NVLS was successfully registered or not. Hence we need to specialize for the NCCL version in the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145285
Approved by: https://github.com/kwen2501
2025-01-28 00:24:53 +00:00
5a4d959cdb [dynamo] Properly model torch profiler context objects (#145537)
Prior to this patch, Dynamo conveniently modelled torch profiler context
objects (e.g., `torch.profiler.profile`) as `NullContextVariable`
because `torch.compile` ignores the effect of these profiler contexts.

However, the semantics of these profiler contexts diverge from
`contextlib.nullcontext` in the `__enter__` function, where the former
returns `self` and the latter returns `None`. This causes subtle errors
as observed in #125021.

This patch adds back a `ProfilerContextVariable`, which addresses the
aforementioned semantic discrepancy.

Fixes #125021.
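
The behavioral difference in plain Python (my own illustration): `nullcontext.__enter__` yields its (default `None`) argument, while the profiler context yields the profiler object itself, which user code often relies on:

```python
import contextlib
import torch

with contextlib.nullcontext() as a:
    print(a)                  # None

with torch.profiler.profile() as prof:
    print(prof is not None)   # True: __enter__ returns the profiler itself
    # user code commonly calls prof.step(), prof.export_chrome_trace(), etc.
```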

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145537
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-01-28 00:03:36 +00:00
db3685a35c Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where storage order was changed to non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)
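
A usage sketch based on the option named above (the file path is a placeholder; per the description, the option only takes effect for checkpoints that carry the new `.format_version` entry):

```python
import torch
from torch.utils.serialization import config

# Compute storage offsets instead of issuing extra random reads per record.
config.load.calculate_storage_offsets = True
state_dict = torch.load("checkpoint.pt", mmap=True, weights_only=True)  # placeholder path
```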

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-27 23:57:30 +00:00
7db0afabaa Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in insertion order (numerically sorted).

This makes it such that the order storages are written in is the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads.
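
A quick illustration of the two orderings (plain Python, not the serialization code itself):

```python
keys = [str(i) for i in range(12)]   # insertion order: '0', '1', ..., '11'
print(sorted(keys))   # lexicographical: ['0', '1', '10', '11', '2', '3', ...]
print(keys)           # insertion (numeric) order, which pickling/unpickling relies on
```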

Differential Revision: [D67673025](https://our.internmc.facebook.com/intern/diff/D67673025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-27 23:57:30 +00:00
c1161957a4 inductor_config_logging: Don't drop keys (#144700)
This bit me while I was trying to debug some trace issues.
In general this config is already quite large when dumping, so adding
more fields doesn't make it significantly worse.

Also, a number of the items we are type checking for (except the test
configs) don't even show up. Primarily this will help us when debugging
rocm, halide, and trace configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700
Approved by: https://github.com/ezyang
2025-01-27 23:47:25 +00:00
7d01f6e6f2 Add ignorable commits on run_test.py to git blame ignore (#145787)
Chanced upon it while searching through cpp_extension related code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145787
Approved by: https://github.com/malfet
2025-01-27 23:24:48 +00:00
3ce68dc61e [c10d] Flush file in file recorder (#145458)
Summary:
Flushing file to hopefully prevent file corruptions as reported in
https://github.com/pytorch/pytorch/pull/145125

Test Plan:
Couldn't get file corruption to occur in my tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458
Approved by: https://github.com/kwen2501
2025-01-27 23:15:52 +00:00
5534c270db [chore] fix new linter (#145756)
Summary:
Fix new linter that's complaining when I made changes to this file:
class 'LibUVStoreDaemon' defines a non-default destructor but does not
define a copy constructor, a copy assignment operator, a move
constructor or a move assignment operator

Test Plan:
make lint passes

Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756
Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-27 22:48:12 +00:00
2de53b3b65 Revert "pickler for GraphModule (#141659)"
This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5.

Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))
2025-01-27 22:39:30 +00:00
006397fac3 Remove FBGEMM sccache hack (#145664)
Testing https://github.com/pytorch/pytorch/actions/runs/12959358756, sccache is working correctly now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145664
Approved by: https://github.com/wdvr
2025-01-27 22:00:06 +00:00
69e82d02d3 [inductor][3/N] triton support post-#5512, tt.divisibility format (#145575)
1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}`. This was an oversight in the first PR I added. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR.
2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575
Approved by: https://github.com/SamGinzburg
2025-01-27 21:48:58 +00:00
993b229665 [dynamo][dicts] Fix dict.__new__ bug (#145723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145723
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547, #145558
2025-01-27 21:42:43 +00:00
7e1c7253e9 [dynamo][builtin-skipfile-cleanup] Support tuple.__new__ (#145558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145558
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547
2025-01-27 21:42:43 +00:00
1ba1b7b597 Support remaining *_like factory functions for NJT (#144889)
Fixes #144761

This PR adds NJT impls for those *_like functions that were previously missing:
* `full_like()`
* `rand_like()`
* `randint_like()`

It also fixes a bug in existing *_like functions when a new device is specified. Fix is to also transfer `offsets` / `lengths` to the new device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889
Approved by: https://github.com/soulitzer
2025-01-27 21:33:51 +00:00
3a23d75b37 [MPS] Fix c10::metal::log_gamma correctness on M4 (#145740)
To work around a bug where the `abs` method call seems to be ignored before calling log, which could be reproduced by running the following code (submitted as FB16415011)
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);

  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
60f98262f1 PEP585: .github (#145707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145707
Approved by: https://github.com/huydhn
2025-01-27 21:21:01 +00:00
bfaf76bfc6 [dynamo] clear out traced frames at the start of test_log_traced_frames (#145640)
The test was being flaky in CI, and this patch fixes it.

Fixes #137461.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145640
Approved by: https://github.com/williamwen42
2025-01-27 20:49:59 +00:00
93dd6bc4d8 Add CUDA 12.8 installation and manylinux-cuda12.8 (#145567)
Breaking https://github.com/pytorch/pytorch/pull/145557 into two parts.
Need to have manylinux-cuda12.8 in order to build magma.

Issue: https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145567
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-27 20:49:07 +00:00
64cd81712d torch.distributions: replace numbers.Number with torch.types.Number. (#145086)
Fixes #144788 (partial)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145086
Approved by: https://github.com/malfet
2025-01-27 20:24:55 +00:00
2f8ad8f4b9 Run inductor perf benchmark on ROCm (#145763)
This requires https://github.com/pytorch/pytorch/pull/144594.  The test run on PT2 dashboard is at https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2020%20Jan%202025%2019%3A46%3A14%20GMT&stopTime=Mon%2C%2027%20Jan%202025%2019%3A46%3A14%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=rocm&lBranch=144594&lCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b&rBranch=144594&rCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145763
Approved by: https://github.com/jeffdaily
2025-01-27 20:19:03 +00:00
66631bc84b [dynamo] Fix read/write conflicts in a cuda test (#145658)
Prior to this patch, the `test_cuda_event_created_outside_of_graph`
is flaky in CI, and that's because we have read and write to the same
`foo` tensor buffer from 2 different streams. This patch eliminates that
by adding a synchronization to wait till read finishes before starting
the write.

Fixes #133837, #133828.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145658
Approved by: https://github.com/yifuwang
2025-01-27 19:55:57 +00:00
c986eba560 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See  D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))
2025-01-27 19:38:26 +00:00
9728e900dc [Inductor][CPP] fix torch logit decomposition (#145576)
**Summary**

Fix issue https://github.com/pytorch/pytorch/issues/145379. The current decomposition uses `self = torch.clamp(self, lo, hi)`, which gives a wrong result when `lo` is larger than `hi` compared to the eager implementation: cd68d54911/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L165)
Align their behavior in this PR.
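
A small repro sketch of the divergence, assuming the documented `torch.clamp` behavior of collapsing to `max` when `min > max`:

```python
import torch

x = torch.tensor([0.25])
eps = 0.9                  # deliberately > 0.5 so that lo > hi
lo, hi = eps, 1.0 - eps    # lo = 0.9, hi = 0.1

# The old decomposition effectively did this; clamp collapses everything to `hi`
# when lo > hi, while the eager kernel checks the lower bound first.
decomposed_input = torch.clamp(x, lo, hi)
print(decomposed_input)            # tensor([0.1000])
print(torch.logit(x, eps=eps))     # eager reference result
```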

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_torch_logit
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145576
Approved by: https://github.com/jgong5, https://github.com/eellison
2025-01-27 19:37:51 +00:00
635b98fa08 Add nitpick warning that aoti_torch/c/shim.h is ABI stable (#145745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145745
Approved by: https://github.com/albanD
2025-01-27 19:25:37 +00:00
bc377c503e [Custom Ops] Fix f-strings in custom ops error message (#145673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145673
Approved by: https://github.com/zou3519
ghstack dependencies: #145588
2025-01-27 19:22:43 +00:00
ec91b7720f [Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588)
Fixes #137033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588
Approved by: https://github.com/zou3519
2025-01-27 19:22:43 +00:00
f951d216e0 [autocast][pytorch] Support autocast for MTIA (policy) (#145666)
Summary: Add autocast support for MTIA (policy)

Reviewed By: egienvalue

Differential Revision: D68604796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145666
Approved by: https://github.com/chaos5958
2025-01-27 18:26:04 +00:00
1835e1eb98 [BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145307
Approved by: https://github.com/zou3519, https://github.com/FindHao
2025-01-27 18:12:39 +00:00
835e770bad Use typing.IO[bytes] instead of io.BytesIO in annotations (#144994)
Fixes #144976

Using approach ① `IO[bytes]`, but could also try with a protocol.

## Notes:

- moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike` (a rough sketch of the alias follows this list)
- Use `FileLike` annotation where it makes sense
- made sure those functions also support `os.PathLike`
- Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate.
- Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`)
- needed to make `torch.serialization._opener` generic to avoid LSP violations.
- skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly...)
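
A rough sketch of the alias shape described above (the actual definition in `torch.types` may differ):

```python
import os
from typing import IO, Union

FileLike = Union[str, os.PathLike, IO[bytes]]  # assumption: approximate shape only

def write_payload(f: FileLike, payload: bytes) -> None:
    # Path-like inputs are opened; file-like inputs are written to directly.
    if isinstance(f, (str, os.PathLike)):
        with open(f, "wb") as fh:
            fh.write(payload)
    else:
        f.write(payload)
```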

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-01-27 18:08:07 +00:00
abf28982a8 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-27 18:05:23 +00:00
30dea8429d [MPS][BE] Use convenience methods to set args (#145736)
It's better to call `mtl_setArgs` rather than set arguments one by one with the risk of making a typo

Also, all interactions with MTLCommandBuffer must be serialized, which is commonly done using dispatch queues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145736
Approved by: https://github.com/Skylion007
2025-01-27 17:42:01 +00:00
7db20ffd68 Remove public_allowlist from TestPublicBindings.test_correct_module_names and ensure private_allowlist-ed things are actually private (#145620)
This passes locally, also sanity checked importing these modules on [colab](https://colab.research.google.com/drive/1edynWX1mlQNZIBxtb3g81_ZeTpAqWi19?usp=sharing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145620
Approved by: https://github.com/albanD
2025-01-27 17:30:02 +00:00
5d01a2874f Increase the number of perf benchmark shards (#145534)
Per the discussion on https://github.com/pytorch/pytorch/issues/140332#issuecomment-2610805551, this adds 2 more shards for HF, 2 more for TorchBench, and 1 more for TIMM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145534
Approved by: https://github.com/jeanschmidt
2025-01-27 16:20:42 +00:00
639dd54ef7 [BE] Use copy_method to import all tests (#145718)
Less chances for typo when doing the imports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145718
Approved by: https://github.com/dcci
2025-01-27 16:01:12 +00:00
2e80093306 setitem node shouldn't be deadcode eliminated (#145714)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/145697. The `operator.setitem` has been eliminated as dead code, causing a correctness issue. Mark it as impure in this PR to avoid this side effect.

**TestPlan**
```
python -u -m pytest -s -v test/fx/test_dce_pass.py -k test_keep_setitem
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145714
Approved by: https://github.com/ezyang
2025-01-27 15:08:21 +00:00
0674ab7e33 solve apl dependency issue (#145215)
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries.

When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls.

However, the current APL implementation uses the libomp.dll (LLVM) variant.

As a result, there are unexpected behaviors at runtime.

---

For Example:

```python
import torch

# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
#  [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)

# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()

# Expected Output (Dense Tensor):
# [[0, 1],
#  [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```

However, it prints unexpected outputs such as:

```python
# [[0, 11],
#  [10, 0]]
```

The issue arises because the following code does not function as expected at runtime:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30

```c++
// returns 1, however since OpenMP is enabled it should return the total number of threads
int64_t num_threads = omp_get_num_threads();
```

---

In the runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) is causing unexpected behaviours.

So, we've changed libraries from `_mp` to non `_mp` versions and we used `vcomp` for OpenMP calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-27 13:02:16 +00:00
7b6029dcc2 Update slow tests (#145206)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145206
Approved by: https://github.com/pytorchbot
2025-01-27 11:40:39 +00:00
e6c1e6e20e simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480)
While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function.

Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480
Approved by: https://github.com/ezyang
2025-01-27 07:19:42 +00:00
e90cf4abcf [inductor] Add some typing to common.py (#145691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145691
Approved by: https://github.com/malfet
ghstack dependencies: #145690
2025-01-27 06:27:13 +00:00
ddae87f792 [inductor] Add some typing to simd.py (#145690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145690
Approved by: https://github.com/malfet
2025-01-27 06:27:13 +00:00
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other that CI)
Run
```python
import torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that printed values are close to 0 and 1

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
87fdadde1d Remove FFT from stride incorrect ops (#145080)
I gotta say, the FFT implementation is completely insane, there's gotta be a better way to do this than repeatedly inplace restriding the output tensor. Anyway, this is a faithful translation of both the MKL and cuFFT paths to Python.

Fixes https://github.com/pytorch/pytorch/issues/135087

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145080
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #145530
2025-01-27 04:26:04 +00:00
b75afa2e2e [MPS] cholesky implementation (#145701)
Requested in #77764

Closed #144193  due to a lot of conflicts when rebasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701
Approved by: https://github.com/malfet
2025-01-27 01:53:03 +00:00
c6ad08357b pickler for GraphModule (#141659)
Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-26 19:29:13 +00:00
f3ddc08ddc Additional operators in operator benchmark (#145625)
The list of added operators:
add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where

This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625
Approved by: https://github.com/jeffdaily
2025-01-26 19:20:02 +00:00
6a4fb4b615 Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit cb814c0b961369a7ab154c58856c730cafaa2307.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see 5cd2b34e82/1 ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822))
2025-01-26 17:49:05 +00:00
5cd2b34e82 [inductor] Adjust test_log_fp64 to only run when float64 is supported. (#145686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145686
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-26 15:58:19 +00:00
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 has almost fixed the issue that test binaries' runpath has not been set correctly, with few cases left.

This PR fixes the rest.

The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
c4523999a1 Fix incorrect type comparison (#145449)
Summary: This change was incorrectly made as part of #145166

Differential Revision: D68536221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145449
Approved by: https://github.com/bobrenjc93
2025-01-26 04:40:26 +00:00
09ae69a364 Revert "Fix type annotation of Linear.bias (#142326)"
This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a.

Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see 73622fc5fa/1 ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))
2025-01-26 03:41:00 +00:00
73622fc5fa Fix Throughputbenchmark issue (#144669)
Fixes [144461](https://github.com/pytorch/pytorch/issues/144461)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669
Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel
2025-01-26 03:37:20 +00:00
cb814c0b96 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-26 01:56:40 +00:00
90448f0128 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-26 01:07:22 +00:00
76bec878da Remove unnecessary HPUHooksInterface method (#145272)
getDefaultHPUGenerator is no longer necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145272
Approved by: https://github.com/ezyang
2025-01-26 01:06:34 +00:00
3cf7874ebe [MPS][BE] Implement bilineard2d as shader (#145581)
That significantly improves performance and addresses a correctness problem (to the extent permitted by reducing the precision of the scale factor computation to float32). The uint8 scaling algorithm mimics the CPU/Pillow implementation
569b785371/src/libImaging/Resample.c (L306-L309)
I.e. it uses fixed-precision integral arithmetic and rounds the results of the horizontal interpolation back to integers before performing the vertical one, which results in technically less accurate results.

But even with those changes, `atol`, `rtol` must be tweaked to `1, 0` when the scale factor is `1/3` or `2/3` because of the difference in representation of those values as floats and doubles.

Changes in the performance could be measured using the following script
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS                 |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results before
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  False |  True  |  True  |  True  |  True
Average Time (us) :  277.3  | 197.2  | 188.0  | 163.5  | 302.8  | 248.1  | 308.7  | 650.9
```
After(almost **100x** perf gain):
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    1.7  |   1.5  |   1.7  |   1.5  | 296.5  | 236.0  | 310.8  | 642.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581
Approved by: https://github.com/Skylion007
ghstack dependencies: #145578
2025-01-25 21:09:46 +00:00
0afdee4c39 [dynamo] raise IndexError when inserting into a full deque (#139379)
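The eager CPython behavior being matched here:

```python
from collections import deque

d = deque([1, 2], maxlen=2)
try:
    d.insert(0, 3)
except IndexError as e:
    print(e)  # "deque already at its maximum size"
```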
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139379
Approved by: https://github.com/jansel
2025-01-25 18:04:49 +00:00
513f889a36 [Rocm][Inductor][CK] silence ck package not installed warning when CK backend is not used to autotune bmm (#145626)
As titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145626
Approved by: https://github.com/coconutruben
2025-01-25 08:44:35 +00:00
c5216d2b6c [ca] add test_reset for 2.6 release validation (#145549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145549
Approved by: https://github.com/atalman
2025-01-25 06:28:58 +00:00
bbe7f53218 Save integral tensor data for ET (#144508)
Summary:
et_replay uses random data to run operators; however, operators that use index tensors to access memory won't work with random data. This usually ran into two exceptions: 1. illegal memory access, since an index is out of range; this has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE to record the min/max value of index tensors. 2. unaligned memory access, since FBGEMM ops have special requirements for the memory layout.

To fix the second exception, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added to allow users to specify the node names, separated by commas, so ET will save the integral tensor data for these nodes. The saved data will be used in et_replay.

Be careful to turn on this option since it will use more space to save the extra data.
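
A hypothetical sketch of opting in before collecting a trace; the node names below are placeholders, and the value "1" for the range flag is an assumption:

```python
import os

# Record min/max values of integral (index) tensors so replay stays in range.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE"] = "1"

# Save full integral tensor data for specific nodes (comma-separated, placeholder names).
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR"] = (
    "fbgemm_lookup_node,index_select_node"
)
```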

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda

Differential Revision: D67989856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508
Approved by: https://github.com/briancoutinho
2025-01-25 05:38:10 +00:00
3d506491b9 [inductor] Fix duplicate detection in _dynamic_scale_rblock (#145577)
Before this the code was doing nothing because Config doesn't define `__hash__` or `__eq__` (so it was based on object id).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145577
Approved by: https://github.com/shunting314
ghstack dependencies: #142026
2025-01-25 04:58:54 +00:00
9007eb5f8e [inductor] Kernel memory analysis for use in heuristics (#142026)
This computes statistics about each kernel's memory usage that should allow us to write more precise heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142026
Approved by: https://github.com/eellison
2025-01-25 04:58:54 +00:00
cc1ecead07 [Dynamo] Allow format() to handle int (#144956)
Fixes #144830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144956
Approved by: https://github.com/jansel
2025-01-25 04:12:45 +00:00
b2a0feac85 Update OSS nested tensor docs to focus on NJT (#145402)
Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include:
* High-level description of NST vs. NJT + a recommendation to use NJT
* General NJT construction / usage
* torch.compile() integration w/ dynamic shapes
* Common errors and how to fix them
* Contribution guide
* Data layout / shape information (with diagram)
* Links to more extensive tutorials involving Transformers / SDPA / FlexAttention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402
Approved by: https://github.com/soulitzer
2025-01-25 04:08:19 +00:00
392dc177a9 OpenReg: Refactor impl_registry (#145465)
Refactor impl_registry to use `driver.exec` as fallback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145465
Approved by: https://github.com/albanD
2025-01-25 03:31:49 +00:00
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
ef60de07a0 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145509
2025-01-25 03:01:18 +00:00
42b8e233d9 serde unbacked bindings (#144894)
Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata.

Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) doesn't do so, and not doing so makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names.

This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because of running AOTAutograd as an additional step after Dynamo. So we have to sync those back.

Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894
Approved by: https://github.com/pianpwk
2025-01-25 02:34:27 +00:00
5725462cd8 Update NJT linear_backward to return non-aliased tensor bias grad (#145399)
Fixes https://github.com/pytorch/pytorch/issues/141292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145399
Approved by: https://github.com/jbschlosser
ghstack dependencies: #145520, #145531, #145533
2025-01-25 00:58:04 +00:00
3a3e2cf90a Remove det_singular OpInfo (#145533)
Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044

From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this.

Some older attempts:
- https://github.com/pytorch/pytorch/pull/102581
- https://github.com/pytorch/pytorch/pull/109249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533
Approved by: https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #145520, #145531
2025-01-25 00:58:03 +00:00
c7ca1df37e Disable slow gradcheck for nn.Transformer ModuleInfo (#145531)
Fixes https://github.com/pytorch/pytorch/issues/117140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145531
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #145520
2025-01-25 00:58:03 +00:00
9e0ee152e5 Fix allow_mutation_on_saved_tensors for inplace foreach (#145520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145520
Approved by: https://github.com/albanD
2025-01-25 00:58:03 +00:00
clr
b4fe3c159d inductor: Explicitly test that torch.compile(option=...) does something (#145321)
This would have prevented https://github.com/pytorch/pytorch/pull/139833 from dropping the triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145321
Approved by: https://github.com/jansel
2025-01-25 00:48:26 +00:00
efebec5ef5 [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr, https://github.com/albanD
ghstack dependencies: #145528
2025-01-25 00:14:07 +00:00
f2ad2cdf1c [utils] add try_import method for importing optional modules (#145528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528
Approved by: https://github.com/albanD
2025-01-25 00:14:07 +00:00
f3304571fc [BE][Ez]: FURB148 - remove useless enumerate calls (#145619)
Remove useless enumerate calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145619
Approved by: https://github.com/drisspg
2025-01-24 23:37:15 +00:00
0741963e01 [CI][CUDA][Blackwell] sm_\d\d no longer matches sm_100. (#145641)
Therefore making it sm_\d+

Fixes this unit test failure: python test/test_cpp_extensions_jit.py -k TestCppExtensionJIT.test_jit_cuda_archflags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145641
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-24 23:20:22 +00:00
4cc5e880f9 Add accuracy issue support in AOTI Minifier (#145539)
Summary:

Add three more repro levels for AOTI minifier (level 2 already exists). They are the same as the existing dynamo minifier repro levels.

Now AOTI minifier can minify and repro programs that have numerical accuracy issues as well.

1: Dumps the original graph out to repro.py if compilation fails
2: Dumps a minifier_launcher.py if aoti fails.
3: Always dumps a minifier_launcher.py. Good for segfaults.
4: Dumps a minifier_launcher.py if the accuracy fails.
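
A hypothetical sketch of enabling the dump described above; only `aot_inductor.dump_aoti_minifier` is named in this message, and the repro-level knob name is an assumption to check against the config module:

```python
import torch._inductor.config as inductor_config

# Dump a minifier_launcher.py when AOTI fails (config name taken from the message above).
inductor_config.aot_inductor.dump_aoti_minifier = True

# Hypothetical knob name: select one of the repro levels listed above, e.g. 4 for accuracy failures.
# inductor_config.aot_inductor.repro_level = 4
```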

Refactor AOTI minifier unit tests to be cleaner and better re-use the existing minifier testing code. We do not need to manually patch {"aot_inductor.dump_aoti_minifier": True} to each test now, this config is generated in the test code.

Differential Revision: D68294638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145539
Approved by: https://github.com/desertfire
2025-01-24 23:07:19 +00:00
5b988ac4fa [Easy] Replace paper description with link to make a concise description. (#145031)
The descriptions on the [Transformer](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), and [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain author and paper details that seem redundant for users who just want to know how to use them; replace them with a link to the paper, so users can go to the paper details if they want to learn more.

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490)
![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f)
![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed)

**After**
![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968)
![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744)
![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031
Approved by: https://github.com/mikaylagawarecki
2025-01-24 23:01:02 +00:00
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
46e06e1d09 Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586, #144587
2025-01-24 22:45:01 +00:00
81e370fc6b Fix type annotation of Linear.bias (#142326)
Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__` which types it as `Tensor | Module`. This has two issues:

- It hides the fact that `bias` is optional, and can be `None`, which in turn can hide actual bugs on user side.
- It blurs the type due to having `Module` in the union, which can require an unnecessary `isinstance(linear.bias, Tensor)` on the user side.

This PR types the `bias` attribute explicitly to fix these issues.
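
A small sketch of what the explicit annotation enables on the user side (the exact annotated type is whatever this PR declares; the `Optional` part is the key point):

```python
import torch

linear = torch.nn.Linear(4, 2, bias=False)
bias = linear.bias            # may be None, and the annotation now says so

if bias is not None:          # no isinstance(bias, torch.Tensor) narrowing needed
    print(bias.shape)
else:
    print("no bias")
```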

CC @ezyang @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326
Approved by: https://github.com/ezyang
2025-01-24 22:43:52 +00:00
70577d335e [ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602)
This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines.

Special thanks to @Fuzzkatt for co-authoring these changes!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>
2025-01-24 22:27:39 +00:00
5bf5ce0e15 Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478)
Summary:
Since `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler during compile time to loading it dynamically), we change the logic for enabling the profiler to use `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for NCCL Collective Communication Profiler.

For HCCL, we still keep the same logic
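
A minimal sketch, assuming the variable is read when the profiler initializes Kineto activities and that a value of "1" enables it:

```python
import os

# Must be set before the profiler starts up.
os.environ["TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING"] = "1"
```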

Test Plan: See  https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler

Differential Revision: D68515945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478
Approved by: https://github.com/sraikund16
2025-01-24 22:21:09 +00:00
d4171b724e Let tensor_a.new_tensor() be on tensor_a.device by default (#144958)
Fixes #144957
Closes #73838 cc @albanD @ezyang

Currently, `tensor_a.new_tensor()` will return an on-CPU tensor no matter where `tensor_a` is. This differs from the documentation and is a side effect of https://github.com/pytorch/pytorch/pull/41984.

See #144957 for how the current logic breaks dynamo.

This PR restores the documented behavior and adds tests for `new_tensor`.
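
A quick illustration of the restored behavior (assumes a CUDA device is available):

```python
import torch

a = torch.ones(3, device="cuda")
b = a.new_tensor([1.0, 2.0, 3.0])
print(b.device)   # with this change: cuda:0, matching the documented behavior
```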

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958
Approved by: https://github.com/ezyang
2025-01-24 22:12:31 +00:00
2a70de7e92 [CUDA] Change slim-wheel libraries load order (#145638)
There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in https://github.com/pytorch/pytorch/pull/145614#issuecomment-2613107072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 22:00:56 +00:00
FEI
615bdd9c81 Improve the caching allocator test for raw alloc (#145269)
1. Prevent blocks allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests
2. Additionally, tested the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114

@jeffdaily @albanD @houseroad @eqy @aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily
2025-01-24 21:07:17 +00:00
d79c6f4946 Improve torchrun documentation (#144354)
Fixes #142042:
- #142042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354
Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang
2025-01-24 20:40:05 +00:00
caf60395f4 [torchbench] Increase tolerance for amp only poolformer_m36 (#145375)
https://github.com/pytorch/pytorch/issues/144893

```
python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validation --training --amp --device cuda --backend inductor
```

`--float32`, `--bfloat16` - passes the accuracy
`--disable-cudagraph` does not change the result

accuracy_fail happens only for `--amp` and gives `0.048` res_error on a 1-element result Tensor.

This fails with `0.01` tolerance.

If the tolerance is increased to 0.04 it passes. I have not reproduced "eager_two_runs_differ" on H100.
I think this is a true distribution of results with `--amp`, so increasing the tolerance to 0.04 for this case only makes it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375
Approved by: https://github.com/desertfire
2025-01-24 19:56:21 +00:00
457facf7e2 [caffe2] Use the manifold cache backend as the default (#144773)
Test Plan: CI

D68155591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144773
Approved by: https://github.com/izaitsevfb
2025-01-24 19:48:34 +00:00
c16866a582 [BE] mv test/inductor_skips/* to test/inductor_expected_failures/ (#145572)
Summary: I think skipping these tests is suboptimal. If we categorize as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As a skip, they quietly continue to skip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145572
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-24 19:41:38 +00:00
cf063d41f8 Spruce up docs for emulate_precision_casts (#145579)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145579
Approved by: https://github.com/gchanan
2025-01-24 19:28:37 +00:00
96149a201a [Inductor] be able to disable cache for test (#141195)
Let TORCHINDUCTOR_FX_GRAPH_CACHE=0 be respected in unit tests. This is helpful if I want the compilation to happen for testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that will cause the generated wrapper file to be deleted, but we may want to check those files after running a test.
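
A minimal sketch of how a test might opt out, assuming the variable is read when Inductor's FX graph cache is first consulted:

```python
import os

# Disable the FX graph cache so the test exercises a real compilation.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "0"
```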

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141195
Approved by: https://github.com/masnesral, https://github.com/desertfire
2025-01-24 19:15:55 +00:00
2fd2a950e6 [torchbench] Add meta function for _cudnn_rnn_flatten_weight (#145488)
https://github.com/pytorch/pytorch/issues/144989

This fixes tts_angular model on torchbench for `--export-aot-inductor`

I put meta function in cpp, as shape calculation requires cudnn API calls.
I've extracted shape calculation to be used in implementation as this logic has some non-trivial actions and comments.

```
└─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda
loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
loading model: 0it [00:01, ?it/s]
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
cuda eval  tts_angular
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145488
Approved by: https://github.com/eqy, https://github.com/zou3519
2025-01-24 19:08:14 +00:00
ad36f4f42c Revert "Add generator parameter to rand*_like functions (#136780)"
This reverts commit c7b2f7dd142fc97c8ce4ad7ad591687cf295fcda.

Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933))
2025-01-24 19:00:21 +00:00
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
6cda572c98 [mps] Hoist erfinv logic out of the kernel in preparation for moving. (#145568)
Will be used in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145568
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-24 18:51:09 +00:00
8eea554332 [Dynamo] Fix names collisions with foreach decomps (#145479)
Fixes https://github.com/pytorch/pytorch/issues/138698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145479
Approved by: https://github.com/yanboliang
2025-01-24 18:46:58 +00:00
e57cdb8402 [ROCm] trunk.yml only runs pre-merge via ciflow/trunk label (#145629)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145629
Approved by: https://github.com/jeffdaily
2025-01-24 18:31:33 +00:00
b8087747f5 [inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)
Differential Revision: D68278174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373
Approved by: https://github.com/Skylion007
2025-01-24 17:59:13 +00:00
74cfb4f364 [dynamo][refactor] Move collections.namedtuple out of SkipFunctionVariable (#145547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145547
Approved by: https://github.com/zou3519
ghstack dependencies: #145519
2025-01-24 17:39:33 +00:00
97c0b7cb0a Add unique identifer to bmm thread_mm functions (#145303)
Summary:
The bmm template generates code like this

```
template<bool accum>
void cpp_fused_bmm_66_micro_gemm(...) {
    ...
}

void single_thread_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void threaded_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void cpp_fused_bmm_66(...)
{
    ...
    single_thread_mm(...);
    ...
    threaded_mm(...);
    ...
}
```

The generated  `fused_bmm` and `fused_bmm_microgemm` functions both have unique identifiers added to their names, but the `single_threaded_mm` and `threaded_mm` do not.

This diff adds unique identifiers to those generated functions as well. The identifier is based on the kernel name. So for the example above we would generate a bmm template name like `cpp_fused_bmm_66_single_thread_mm()`.

Differential Revision: D68364772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-01-24 17:35:50 +00:00
547c18ee9f Add Torchao docs link to Pytorch libraries (#145412)
Add Torchao docs link to the libraries section in torch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412
Approved by: https://github.com/svekars
2025-01-24 17:11:20 +00:00
ce371ab4c6 [ROCm] Create inductor-rocm-mi300 (#145621)
- Adds an mi300 inductor workflow to main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145621
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-24 17:04:17 +00:00
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to a frequent TCPStore check in the watchdog thread. The fix is to move the
TCPStore check to the monitoring thread.

If unhealthy, the user should be able to get the type of errors, e.g.,
timeout, NCCL error or remote error.

This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective

Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
9132f4b7ce [dynamo][guards] Log guard latency to tlparse (#145509)
Example
![image](https://github.com/user-attachments/assets/1503ee59-ff35-46d9-9b61-16352a4a30e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145509
Approved by: https://github.com/ezyang
2025-01-24 16:33:29 +00:00
1335882b2a If mypy fails it should report the error back to lintrunner (#145550)
This happened to me because I had a bad LD_LIBRARY_PATH and mypy was failing to run (.so load error) - but lintrunner was silent about the underlying problem.

Differential Revision: [D68593081](https://our.internmc.facebook.com/intern/diff/D68593081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145550
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-01-24 15:40:30 +00:00
7c314bfed4 [Intel GPU] Add TORCH_API macro to export symbol NestedTensor_to_mask for libtorch_xpu (#145467)
Part of https://github.com/intel/torch-xpu-ops/issues/1141.

The `TORCH_API` macro is added to export the symbol `NestedTensor_to_mask`, which is needed by libtorch_xpu for `NestedTensor_softmax_dropout_xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145467
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-01-24 15:38:46 +00:00
5d24a9a274 Advance docker release latest verison to cuda 12.4 (#145566)
Fixed the latest tag in ghcr.io to be the CUDA 12.4 docker image. TODO: need to add it to https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD

Will need to check if we can automate this by introducing cuda_stable variable or something like this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145566
Approved by: https://github.com/nWEIdia, https://github.com/kit1980, https://github.com/malfet
2025-01-24 15:27:25 +00:00
5c64aaea40 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has been landed to the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-24 14:45:12 +00:00
bc62930765 Work around buggy use_const_ref_for_mutable_tensors (#145530)
See https://github.com/pytorch/pytorch/issues/145522 for context

This doesn't fix the problem with use_const_ref_for_mutable_tensors and the boxed wrapper, instead it just gets all of our out kernels off of this flag so that the mutable matching pattern works correctly. I also add a check in torchgen to prevent people from making this mistake in the future.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145530
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2025-01-24 14:38:49 +00:00
9d6927715f Revert "Fix triton masked loading for non-block tl.loads (#144782)"
This reverts commit 31c2f36989e35ccf023a8e35c4bc21aca077d344.

Reverted https://github.com/pytorch/pytorch/pull/144782 on behalf of https://github.com/ezyang due to This regresses compile time for one of our internal models by 20%, internal xref https://fb.workplace.com/groups/1075192433118967/posts/1591490218155850 ([comment](https://github.com/pytorch/pytorch/pull/144782#issuecomment-2612660287))
2025-01-24 14:28:48 +00:00
cyy
6a35d9aaa4 Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-24 12:22:13 +00:00
f08b9bc7e4 [WIP] Move XNNPACKQuantizer from PyTorch to ExecuTorch (#144940)
Summary:
This replicates XNNPACKQuantizer from PyTorch to ExecuTorch.

Rationale:
Main motivation is to avoid pytorch pin update in OSS after updating XNNPACKQuantizer, which can be rather frequent.
Other impact and considerations:
The PT2e flow (which lives in PyTorch) relies heavily on XNNPACKQuantizer for an "example" quantizer implementation and, more importantly, tests. For now, we will keep torch.ao.quantization.xnnpack_quantizer as is but mark it as not BC and deprecated, to discourage future new dependencies on it.
Other OSS repositories using XNNPACKQuantizer from PyTorch now have to take an additional dependency on ExecuTorch.

Differential Revision: D68191752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144940
Approved by: https://github.com/jerryzh168, https://github.com/mcr229
2025-01-24 10:06:07 +00:00
d3989ca636 Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-24 10:04:24 +00:00
10bdd0a1cc [BE][export] Fix hop tests with flaky memory leak (#145391)
Summary:
As title. Added `torch._dynamo.reset()` for each test

This should fix several flaky tests in `test_hop.py` such as https://github.com/pytorch/pytorch/issues/139073

Test Plan:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/export/test_hop.py TestHOPCUDA.test_serialize_export_scan_simple_cuda_float32
```

Differential Revision: D68506280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145391
Approved by: https://github.com/ydwu4
2025-01-24 09:53:21 +00:00
72da0a8a42 [Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502)
# Context

Prototyped here: https://github.com/pytorch/pytorch/pull/144120, we are going to make flash-attention a 3rd party submodule. We will then use the c++ sources and include into our build of libtorch.so

This requires various changes, both external and internal, to work. Since internal changes are needed we have to co-dev, and in the co-dev environment I haven't found a way to sync submodule changes together with internal-only changes.

This is unused for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502
Approved by: https://github.com/Skylion007
2025-01-24 09:21:41 +00:00
d62e900d8c [CI][CUDA][MultiGPU][Regression] Skip a failure due to https://github.com/pytorch/pytorch/issues/139520 (#145318)
Related: https://github.com/pytorch/pytorch/issues/139520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145318
Approved by: https://github.com/eqy
2025-01-24 06:58:05 +00:00
0e98b26b28 [CI][CUDA][Dynamic Shape] xfail: DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda (#145204)
python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

failed to generate triton kernels, causing assert failures on 2x H100 systems (and 2x Grace H100 systems).

Failures like below:

F
inline_call []
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]

FAIL: test_linspace4_dynamic_shapes_cuda (__main__.DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 12212, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/testing.py", line 420, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 2603, in test_linspace4
    self.common(fn, (torch.Tensor([]),))
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 424, in common
    return check_codegen(
           ^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 82, in check_codegen
    self.assertTrue("def triton" in code, f"Failed to find triton kernel\n{code}")
AssertionError: False is not true : Failed to find triton kernel

# AOT ID: ['0_inference']
from ctypes import c_void_p, c_long, c_int
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
from torch._inductor.utils import maybe_profile
from torch._inductor.codegen.memory_planning import _align as align
from torch import device, empty_strided
from torch._inductor.async_compile import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels
from torch._inductor.codegen.multi_kernel import MultiKernelCall

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
_quantized = torch.ops._quantized
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool
async_compile = AsyncCompile()
empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p

async_compile.wait(globals())
del async_compile

def call(args):
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((0, ), (1, ), torch.float32)
    return (buf0, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    fn = lambda: call([])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145204
Approved by: https://github.com/eellison
2025-01-24 06:57:35 +00:00
817fd14714 [BE] Type annotation for _inductor/dependencies.py (#145311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145311
Approved by: https://github.com/eellison
2025-01-24 06:32:48 +00:00
2ce70da96c [cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421)
## Description
Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is only `False` when `requires_grad=False`, and since PP's [shape inference](d95a6babcc/torch/distributed/pipelining/stage.py (L1387)) happens under the `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation.

## Test
- Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan

- `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa`

Before:
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/872ff583-295e-4751-a280-cf7f2d41c61a" />

After:
<img width="2988" alt="image" src="https://github.com/user-attachments/assets/4bdcc2e5-22a5-427a-91a5-82206d5bd78f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421
Approved by: https://github.com/H-Huang, https://github.com/tianyu-l
2025-01-24 06:17:54 +00:00
53fc921ce2 [dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2025-01-24 06:02:03 +00:00
9752c7c1c8 [CD] Fix slim-wheel cuda_nvrtc import problem (#145582)
Similar fix as: https://github.com/pytorch/pytorch/pull/144816

Fixes: https://github.com/pytorch/pytorch/issues/145580

Found during testing of https://github.com/pytorch/pytorch/issues/138340

Please note that both nvrtc and nvjitlink exist for CUDA 11.8, 12.4 and 12.6, hence we can safely remove the if statement. Preloading can apply to all supported CUDA versions.

CUDA 11.8 path:
```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py  __pycache__  libnvrtc-builtins.so.11.8  libnvrtc-builtins.so.12.4  libnvrtc.so.11.2  libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py  __pycache__  libnvJitLink.so.12
```

Test with rc 2.6 and CUDA 11.8:
```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```

Thank you @nWEIdia for discovering this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 04:47:57 +00:00
732c4998f3 [NVIDIA] Full Family Blackwell Support codegen (#145436)
More references:
https://github.com/NVIDIA/nccl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436
Approved by: https://github.com/ezyang, https://github.com/drisspg
2025-01-24 04:36:00 +00:00
c184055743 [BE] Use value_or in layer_norm.cpp (#145417)
Now that we have proper optional, no need to do `if (has_value) value else default_value;`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145417
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-24 04:02:23 +00:00
4799ebf326 [MPS][BE] Turn bicubic2d into generic metal template (#145578)
In preparation for more metal shaders to come
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145578
Approved by: https://github.com/Skylion007
2025-01-24 04:01:23 +00:00
68a1505985 serde and_ operator (#145506)
Differential Revision: [D68565887](https://our.internmc.facebook.com/intern/diff/D68565887/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145506
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2025-01-24 03:48:03 +00:00
29ddf9a63e Document dispatch trace build flag (#145517)
Ok, the build flag seems to have been broken for a while since the function it calls doesn't exist anymore.
Repurposed it to enable dispatcher printing (which requires a full (and slow) debug build otherwise).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145517
Approved by: https://github.com/bdhirsh
2025-01-24 03:19:39 +00:00
a40ead1fd6 Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513)
Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions.
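A minimal sketch of the intended behavior (illustrative only; the helper name below is hypothetical, and the real change lives inside inductor's fresh_inductor_cache):

```python
import shutil


def _cleanup_tmp_dir(path: str) -> None:
    # Best-effort cleanup: failing to remove the temporary directory should
    # not propagate and fail the caller, but unrelated errors still should.
    try:
        shutil.rmtree(path)
    except OSError:
        pass
```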

Fixes #141761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513
Approved by: https://github.com/eellison
2025-01-24 03:17:03 +00:00
36fcf98db6 [cutlass backend tests] Manually clear cache, test more tests in fbcode and limit configs in some tests (#145545)
Summary:
Manually clear cache:
You want to clear the cache in most tests; otherwise the link command won't work because there are multiple .o files, and you get something like `ld.lld: error: duplicate symbol: cuda_fused_0`.

test more tests in fbcode:
A few tests have been skipped in fbcode. Unskip them.

limit configs in some tests:
to reduce time spent on each test

Differential Revision: D68584071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145545
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-01-24 03:06:59 +00:00
386650353b [ARM] Fix bf32 and tf32 precision for tensordot unit test (#141136)
Fixes unit test failure on aarch64 ( neoverse-v1 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141136
Approved by: https://github.com/malfet
2025-01-24 02:59:45 +00:00
d6bea398ac Only include RMSNorm.h in layer_norm.cpp for MPS (#145524)
Test Plan: CI

Differential Revision: D68578213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145524
Approved by: https://github.com/malfet
2025-01-24 02:08:49 +00:00
d5629889f1 cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910)
Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910
Approved by: https://github.com/desertfire
2025-01-24 02:06:35 +00:00
47e65077b1 OpenReg: Remove REGISTER_GENERATOR_PRIVATEUSE1 (#144841)
Replace REGISTER_GENERATOR_PRIVATEUSE1 with new API in AcceleratorHooksInterface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144841
Approved by: https://github.com/albanD
2025-01-24 01:52:10 +00:00
cd68d54911 Inductor cache: Revamp how we handle frozen params (#143808)
Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing):
1) When creating a cache entry, store the names of any frozen params, but the values of any other constants.
2) On a cache hit, restore the values of the frozen params by looking up in the current GM.
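A rough sketch of the split described in 1) and 2) above (names and structure are illustrative, not the actual inductor cache code):

```python
def make_cache_entry(gm_constants, frozen_param_names):
    # Keep only the *names* of frozen params (restored from the current
    # GraphModule on a cache hit) and the *values* of every other constant.
    entry = {"frozen_params": [], "constants": {}}
    for name, value in gm_constants.items():
        if name in frozen_param_names:
            entry["frozen_params"].append(name)
        else:
            entry["constants"][name] = value
    return entry
```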

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2025-01-24 01:20:07 +00:00
54e2f4b201 Fix lerp weight type promotion (#141117)
Fixes #140601

Enable `promote_inputs_to_common_dtype` when the tensors do not share the same dtype when invoking the `lerp` function.

For `lerp_Tensor`
- Check whether the tensors share the same `dtype`; enable promotion if not
- Remove type check assert

For `lerp_Scalar`
- It already seems to enable `promote_inputs_to_common_dtype` by default, so just remove the type check. Make sure the promotion behavior is consistent with `lerp_Tensor`

`lerp_Scalar` gets its TensorIteratorConfig from here:
c37185c76a/aten/src/ATen/TensorIterator.cpp (L979-L985)

**Test Result**
Test case in issue passed

```python
>>> import torch
>>>
>>> x = torch.ones(2, 2, dtype=torch.float64)
>>> w = torch.ones(2, 2, dtype=torch.float64)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)

>>> x = torch.ones(2, 2, dtype=torch.float16)
>>> w = torch.ones(2, 2, dtype=torch.float16)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float16)

```

```bash
$ pytest test/test_binary_ufuncs.py -k 'test_lerp_tensor_type_promotion or test_lerp_scalar_type_promotion'
```
![image](https://github.com/user-attachments/assets/288a5294-a9ee-47f3-bbf7-d4ff986f3ba8)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d469836f-5c49-4d89-a2fd-379cad4db3af)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141117
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-24 01:18:20 +00:00
b2c89bc115 [inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits.

What this PR fixes:
* in triton_kernel_wrap.py, AST->TTIR parsing was to be updated for the new triton API
* ir.py - don't remove None args when using newer triton versions
* wrapper.py - update signature & constant handling

What this doesn't fix:
* correct None handling - I want to do a closer look at constant handling (including None, equal_to_1, and other constants).
* cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels)

test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: 1374074098)
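For context, a minimal user-defined Triton kernel of the kind wrapped by triton_kernel_wrap.py (a generic illustration, not code from this PR):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


@torch.compile
def add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```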

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348
Approved by: https://github.com/jansel
ghstack dependencies: #145051
2025-01-24 00:34:01 +00:00
b963ab5325 [inductor][1/N] triton support post-#5512, main components (#145051)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed.

The main changes in 5220 and 5512 that need to be supported:
* AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now.
* "signature" changes: the signature now contains _all_ args, including constexpr and constant args.
* ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants".

What this PR supports:
* Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025
* (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling.

What this PR doesn't support (TODO in follow-up PRs):
* Triton versions between Dec 9, 2024 and before Jan 1, 2025
* (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch)
* (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation)
* (triton jan 1+) AOTI / cpp wrapper

thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051
Approved by: https://github.com/jansel
2025-01-24 00:34:01 +00:00
714f64329b Revert "Add multi env variable support to configs (#145288)"
This reverts commit a8b7cb6a2ddbba4924b6b2531f1ecd2f5ed6d512.

Reverted https://github.com/pytorch/pytorch/pull/145288 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint from a landrace with some recent PEP585 changes ([comment](https://github.com/pytorch/pytorch/pull/145288#issuecomment-2611278428))
2025-01-24 00:20:00 +00:00
6a2b4db0a1 Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)"
This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79.

Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))
2025-01-24 00:17:34 +00:00
6f60c65a3a Revert "[dynamo] Log guard latency (#145132)"
This reverts commit 0a310d738819ae000f49b32298305724117634c2.

Reverted https://github.com/pytorch/pytorch/pull/145132 on behalf of https://github.com/anijain2305 due to CI failures observed after PR was merged ([comment](https://github.com/pytorch/pytorch/pull/145132#issuecomment-2611268421))
2025-01-24 00:11:50 +00:00
f0e9f87a9b [BE/mps] Mark input args as constant to prevent incorrect usage. (#145535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145535
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 00:11:44 +00:00
6aaae9d78f Make torchelastic etcd rendezvous publicly importable (#145396)
Make torchelastic publicly importable by raising the error on `import etcd` lazily, [BE task, row 7](https://docs.google.com/spreadsheets/d/1TtATnLJf1rVXaBQd3X3yYqm9xNN9BIWG7QqRgrFiRRI/edit?gid=1748512924#gid=1748512924)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145396
Approved by: https://github.com/albanD
ghstack dependencies: #145387
2025-01-23 23:56:45 +00:00
f8a4f16634 [c10d] fix memory leak on shutdown (#145507)
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

on window 1:
```
vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```

without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks
```
With fix, no leak

Differential Revision: D68566104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507
Approved by: https://github.com/XilunWu, https://github.com/d4l3k
2025-01-23 23:36:15 +00:00
6dd8283381 Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)"
This reverts commit 5531fafffefc45cd894040b2b07b0d5227430082.

Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
c3fadacf84 Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)"
This reverts commit 8c7c5f7bfcbc55638a0e4aed6eaa27f6194dbebe.

Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
9553301ade Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)"
This reverts commit 784bb2127ca9729c646f1650ecc2cf946a583da8.

Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
16c4f8c395 Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)"
This reverts commit ec820fe57c2d6a2847569a107856e7fcff87dc5c.

Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
3f6cfd0156 Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417)"
This reverts commit 99dd1bf1b93bc26080e611af54497a73a618e02a.

Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
ab082863a1 Revert "[compiled autograd] support Tensor Subclasses in AOTBackward (#144115)"
This reverts commit 082c28c3c655984ce65c13336cff822db95ee470.

Reverted https://github.com/pytorch/pytorch/pull/144115 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
0a310d7388 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145351, #145420
2025-01-23 23:30:07 +00:00
bf62222d81 Revert "[compiled_autograd] Rename interface to pyinterface (#145495)"
This reverts commit e1407f5aeb658c8c959d33158f465e975799a3d0.

Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))
2025-01-23 23:07:17 +00:00
a8b7cb6a2d Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-23 23:00:23 +00:00
dad9bc3461 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))
2025-01-23 22:59:25 +00:00
cyy
42f4fda2eb Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-23 22:47:18 +00:00
6f07847efe Bail on checking internal overlap when dealing with unbacked symints (#145385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145385
Approved by: https://github.com/ezyang
2025-01-23 22:31:31 +00:00
e1407f5aeb [compiled_autograd] Rename interface to pyinterface (#145495)
Summary: interface is a reserved word in some MSVC variants.

Test Plan: build

Differential Revision: D68561379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495
Approved by: https://github.com/xmfan
2025-01-23 21:40:59 +00:00
302b07f166 Implement deepcopy for AOTICompiledModel (#145423)
Summary:

Fix https://github.com/pytorch/pytorch/issues/145411

Support deepcopying AOTICompiledModel. The `loader` is shallow copied.

Test Plan:
```
buck2 run fbcode//mode/opt //caffe2/test/inductor:aot_inductor_package -- -r deepcopy
```

Differential Revision: D68524673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145423
Approved by: https://github.com/desertfire
2025-01-23 21:05:30 +00:00
e924ddbef1 [BE] [mps] Refactor UnaryConstants to be its own kernel. (#145230)
In preparation for using this file for inductor (for erfinv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145230
Approved by: https://github.com/malfet
2025-01-23 20:58:43 +00:00
881eb86692 Fix staging for CPU tensors in OSS DCP async_save (#145408)
Fix staging for CPU tensors in OSS DCP async_save (#145408)

Summary:

As found in
https://github.com/pytorch/pytorch/issues/144657
for CPU tensors we accidentally skip copying during staging because the offload-to-CPU helper is a no-op for CPU tensors. This means that if the trainer changes the original source CPU tensor's value after launching the async save but before the actual writing/uploading to the destination commences, the writing/uploading logic will accidentally pick up the latest state of the tensor, while it should have dealt with its own dedicated copy saved earlier. Dropping _offload_state_dict_to_cpu in favor of _copy_state_dict fixes this bug.

Test Plan:
Running the user script from the linked GitHub issue verifies the fix:
```
import os

import torch

import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        return x * self.weight

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"

dist.init_process_group()

model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")

try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)

        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )

    future.result()

    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
passes all asserts with this fix. If the script is run in trunk, confirmed that it fails the first assert.

Differential Revision: D68518689
2025-01-23 12:49:26 -08:00
6a44a61514 [BE] Bump TIMM pin (#145320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145320
Approved by: https://github.com/Skylion007
2025-01-23 20:44:26 +00:00
99367ecbed [draft export] count how many times a data-dep error shows up (#145030)
Summary: maybe this is helpful?

Test Plan: draft_export

Differential Revision: D68303934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145030
Approved by: https://github.com/angelayi
2025-01-23 20:27:31 +00:00
5ebca3015d [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-23 20:18:13 +00:00
d7b6746470 Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347)"
This reverts commit c27dd9cf72265161f85a18c0b19f365097f7a1ac.

Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))
2025-01-23 20:06:07 +00:00
d53f2067fe [BE][export] add "+export" logging to de/serialization (#145283)
adds de/serialization debug logging to `TORCH_LOGS="+dynamic"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145283
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2025-01-23 19:47:48 +00:00
ce4a097bf7 Revert "Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)"
This reverts commit 55084443cabbaf6c28d8c546d8988cf3ed0f3d1c.

Reverted https://github.com/pytorch/pytorch/pull/144829 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/144829#issuecomment-2610855579))
2025-01-23 19:37:54 +00:00
527101fa95 Move Windows arm64 scripts from pytorch/builder (#144317)
This PR moves the Windows Arm64 scripts from the builder repository to the main repository. The corresponding PR to pytorch/builder that removes them is here : https://github.com/pytorch/builder/pull/2058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144317
Approved by: https://github.com/Skylion007, https://github.com/seemethere

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:29:29 +00:00
66bf7da446 Enable sleef for Win Arm64 (#144876)
Sleef module was disabled for Windows Arm64 on b021486405
This PR enables it again since the issue is no longer valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:22:58 +00:00
991a4b5925 [dynamo] Add --profile-details and --export-perfdoctor option (#144751)
Summary:
Add `--profile-details` option to add shapes and other details to the Kineto profile.

Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor
```

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz

Differential Revision: D68134547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751
Approved by: https://github.com/drisspg
2025-01-23 19:09:40 +00:00
5b37249259 Enable fp16 linear layers in PyTorch via ACL (#144992)
This pull request aims to enable the use of linear layers with the fp16 data type through the ACL.

On a Graviton3 instance running with 16 threads, `torch.randn(2048, 4096, dtype=torch.half)` will take 50+% less time to complete compared with `torch.randn(2048, 4096, dtype=torch.float32)`.
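A rough way to compare fp16 vs fp32 linear layers on CPU (a sketch only; the sizes come from the description above, and the measured speedup depends on whether the ACL fp16 kernels are actually used):

```python
import time

import torch


def bench_linear(dtype, iters=50):
    x = torch.randn(2048, 4096, dtype=dtype)
    layer = torch.nn.Linear(4096, 4096, dtype=dtype)
    with torch.no_grad():
        layer(x)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            layer(x)
    return (time.perf_counter() - start) / iters


for dt in (torch.float32, torch.float16):
    print(dt, f"{bench_linear(dt) * 1e3:.2f} ms/iter")
```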

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144992
Approved by: https://github.com/ng-05, https://github.com/digantdesai, https://github.com/malfet
2025-01-23 19:07:54 +00:00
6d4f5f7688 [Utilization][Usage Log] Add data model for record (#145114)
Add data model for consistency and data model change in the future.

The data model will be used during the post-test-process pipeline
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
2f317bbdbc Missing autorelease in lstm_mps caused a ton of leaked memory (#145503)
The dictionary held onto the new MPSGraphTensorData objects and MPSNDArrays.  Regression caused by https://github.com/pytorch/pytorch/pull/95137

Fixes #145374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-23 18:54:30 +00:00
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue.

1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392 )and Fixes KleidiAI mirror issue.
2. KleidiAI is now cloned from github mirror instead of arm gitlab

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
34b8d8b0c0 update compile time benchmarks to dump compile times to stdout and csv (#145447)
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```

```python
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447
Approved by: https://github.com/ezyang
2025-01-23 18:49:19 +00:00
629fb1590c [BE] Type annotate pad_mm.py (#145409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145409
Approved by: https://github.com/Skylion007
2025-01-23 18:34:24 +00:00
015c6d6fdb [dynamo][guards] Turn on profiling of guard manager (#145420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420
Approved by: https://github.com/ezyang
ghstack dependencies: #145351
2025-01-23 18:17:43 +00:00
fef92c9447 Fix IdentationError of code example (#145251)
I found there is an IndentationError when trying to copy-paste the example of inference with torch.compile.
This PR fixes the formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-23 18:17:11 +00:00
9a5bc7b6dd [BE] Type annotate metrics.py (#145418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145418
Approved by: https://github.com/Skylion007
2025-01-23 18:13:59 +00:00
bdc2c2a237 [be] fix flaky test aot_export_ cond caused by free symbol lifting and automatic dynamic shape (#145330)
Fixes https://github.com/pytorch/pytorch/issues/139998#issuecomment-2605908426.

It seems to be an issue caused by the interaction between dynamo-traced HOPs, automatic dynamic shapes, and auto-lifting of free symbols. The immediate error is that the assertExpectedInline of the graph can sometimes differ, e.g. see https://hud.pytorch.org/flakytest?name=test_aot_export_with_torch_cond&suite=TestAOTExport&limit=100, where sometimes the shapes are lifted as inputs to the cond and sometimes they're not.

The root cause of the flakiness is that the two invocations of torch.cond trigger two torch.compile runs on the same code object ([code](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/cond.py#L192)), and this triggers automatic dynamic shapes because in test_aot_export_with_torch_cond, x has shape (3, 4) while the pre_dispatch one has shape (2, 2). Because we auto-lift free symbols for dynamically shaped inputs, cond sometimes has the shapes as arguments and sometimes not.

This PR adds a simple fix by adding a _dynamo.reset before each torch.cond test. This fixes the error by not triggering automatic dynamic shapes.
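A minimal sketch of the pattern involved (illustrative only, not the actual test): exporting the same torch.cond-using module twice with different input shapes re-traces the shared cond code object, which is what could flip automatic dynamic shapes on.

```python
import torch


def true_fn(x):
    return x.sin()


def false_fn(x):
    return x.cos()


class M(torch.nn.Module):
    def forward(self, x):
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))


# Two exports with different shapes hit the same cond code object; without a
# _dynamo.reset() in between, the second trace may mark the sizes dynamic.
ep_a = torch.export.export(M(), (torch.randn(3, 4),))
torch._dynamo.reset()  # the fix: reset before each cond test
ep_b = torch.export.export(M(), (torch.randn(2, 2),))
```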
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145330
Approved by: https://github.com/zou3519
2025-01-23 18:12:58 +00:00
3c247ee8c4 [hop][be] add utils for more comprehensive input alias and mutation (#145298)
This PR implements the idea from @zou3519 of checking input mutations through tensor versions and checking aliasing via storage. Previously, we relied on whether there is an in-place op that takes a placeholder input, which doesn't take views into account.

When writing the PR, I also noticed a bug in the previous input-mutation checking logic: we were checking for mutating operators on functionalized_f, where all the mutating ops have already been replaced, so we would not be able to detect anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298
Approved by: https://github.com/zou3519
2025-01-23 18:12:28 +00:00
b0f3597133 Add fused rms_norm implementation for MPS backend (#145301)
Adding a fused rms_norm implementation for the MPS backend. This eliminates most of the current CPU overhead, making this computation GPU-bound and improving the latency of rms_norm by **30x-40x** on the MPS backend.
The metal shader was adapted from MLX: e6a7ab9675/mlx/backend/metal/kernels/rms_norm.metal

The numbers below are averages over 1000 runs of RMSNorm, obtained on an M1 Pro.

Benchmarking Results (Before):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  140.5  | 171.0  | 170.4  |  10.9  |  13.3  |  13.5
```

Benchmarking Results (After):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    4.0  |   3.9  |   3.9  |  10.0  |  12.4  |  13.0
```

Profiling Results (Before):
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::rms_norm         2.35%       3.284ms       100.00%     140.038ms     140.038us          1000
                    aten::mul        33.61%      47.068ms        33.61%      47.068ms      23.534us          2000
                    aten::pow        17.04%      23.868ms        17.43%      24.402ms      24.402us          1000
                   aten::add_        16.52%      23.130ms        16.78%      23.497ms      23.497us          1000
                   aten::mean        15.82%      22.151ms        15.82%      22.151ms      22.151us          1000
                  aten::rsqrt        13.63%      19.085ms        13.71%      19.198ms      19.198us          1000
                   aten::item         0.46%     639.370us         0.56%     788.376us       0.394us          2000
                aten::type_as         0.21%     295.507us         0.27%     371.291us       0.371us          1000
                     aten::to         0.13%     177.742us         0.13%     177.742us       0.059us          3000
    aten::_local_scalar_dense         0.11%     149.006us         0.11%     149.006us       0.075us          2000
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 140.038ms
```

Profiling Results (After):
```
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
         aten::rms_norm        63.21%     832.875us       100.00%       1.318ms       1.318us          1000
       aten::empty_like        16.06%     211.631us        36.79%     484.681us       0.485us          1000
    aten::empty_strided        20.72%     273.050us        20.72%     273.050us       0.273us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.318ms
```

Benchmarking and profiling script:
```python
import torch
import torch.nn as nn
from torch.profiler import profile
import time

def benchmark(device, dtype):
    model = nn.RMSNorm(2048, device=device)

    # Create example inputs
    x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)
    w = torch.randn(2048, requires_grad=False, device=device, dtype=dtype)
    eps = 1e-5

    # Check output
    y = torch.ops.aten.rms_norm(x, [2048], w, eps)
    z = torch.ops.aten.rms_norm(x.cpu(), [2048], w.cpu(), eps)
    outputs_match = torch.allclose(y.cpu(), z)

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        with torch.no_grad():
            y = model(x)
            torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return outputs_match, average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

print("\nBenchmarking Results:")
print("---------------------")
print("Device            :            MPS            |           CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
print(f"Outputs Match     :  ", "  |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))

device = "mps"
dtype = torch.float32
model = nn.RMSNorm(2048, device=device)
x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)

# Run and profile the model
with profile() as prof:
    with torch.no_grad():
        for _ in range(1000):
            y = model(x)
            torch.mps.synchronize()

# Print profiling results
print("\n\nProfiling Results (MPS/FP32):")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145301
Approved by: https://github.com/malfet
2025-01-23 18:07:10 +00:00
a86fa779ce [BE] Fix edge case in translation validation bisector (#145414)
This patch fixes a small bug for the binary-search algorithm in
translation validation bisector. Fixes #131303.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145414
Approved by: https://github.com/ysiraichi, https://github.com/zou3519
2025-01-23 17:35:28 +00:00
045698653a [BE] Remove test_ops_gradients from FIXME_inductor_dont_reset_dynamo (#145308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145308
Approved by: https://github.com/zou3519
ghstack dependencies: #145306
2025-01-23 17:25:04 +00:00
3a8d3785f7 [ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482)
PR #141153 exposed the option to collect sizes as dynamic. After this
change, the function set_autograd_compiler returns a PyTuple object which
is populated using the PyTuple_SET_ITEM function. Yet, that function steals
the reference to the object and doesn't INCREF it. So currently we are
missing an INCREF on prior_compiler when it is Py_None and an INCREF on
prior_dynamic, which is either Py_False or Py_True. This bug may lead to
memory corruption.

@xmfan @jansel @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
2025-01-23 17:13:56 +00:00
c6707734de Enable non power of 2 head_dim for FlexAttention (#133495)
# Summary
- Adds support for non-power of 2 headdim by launching blocks w/ head_dim rounded to the next valid power.
- Other option I considered was building up the final dot_products with smaller blocks (this would probably work but for sake of code complexity going with this option for now)

### Corollary
We had a bug in our backwards kernel where we were using index_k instead of index_v. This should have shown up for the qk_head_dim != v_head_dim cases.
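A minimal usage sketch with a non-power-of-two head_dim (assumes a CUDA build; flex_attention is normally wrapped in torch.compile for performance):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# head_dim = 96 is not a power of two; with this change the kernels launch
# blocks with the head dimension rounded up to the next valid power of two.
q, k, v = (torch.randn(1, 4, 128, 96, device="cuda") for _ in range(3))
out = torch.compile(flex_attention)(q, k, v)
```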

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495
Approved by: https://github.com/Chillee
2025-01-23 17:05:38 +00:00
bf4f8919df Fix test_modules_can_be_imported (#145387)
The `test_modules_can_be_imported` test is currently failing due to a few missing private modules, and this PR gets it working before I start to clean up the public allow list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145387
Approved by: https://github.com/albanD
2025-01-23 16:03:00 +00:00
768ad0886f Revert "Binary upload checksum (#144887)"
This reverts commit 2efa98d69d362e4ee6f15938ec8ded30bf5c40fd.

Reverted https://github.com/pytorch/pytorch/pull/144887 on behalf of https://github.com/atalman due to Broke nightly index ([comment](https://github.com/pytorch/pytorch/pull/144887#issuecomment-2610066277))
2025-01-23 15:10:42 +00:00
0802e78315 [CD] Disable Kineto for XPU Windows CD (#145255)
Due to issue #145155, disable Kineto for XPU Windows CD temporally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145255
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-23 14:09:52 +00:00
629840e038 Backout PEP585 use of Iterable (#145438)
Summary:
Importing Iterable from collections.abc here causes an internal product to fail
MRO discovery causing a collision between Iterable and Generic.

This fixes the failure on D68461304

Differential Revision: D68531443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438
Approved by: https://github.com/izaitsevfb
2025-01-23 11:45:37 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
b6941d4e42 [inductor] fix autotuning memory usage (#145410)
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` needs to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:

6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)

Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB peak memory usage.
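A condensed version of the same experiment (a sketch assuming a CUDA device; not the linked gist itself):

```python
import torch

src = torch.randn(8192, 8192, device="cuda")[:, ::2]  # non-contiguous view
dst = torch.empty(src.shape, pin_memory=True)

torch.cuda.reset_peak_memory_stats()
dst.copy_(src)  # materializes a contiguous temporary on the GPU first
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak")
```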

Note that, with all the following efforts
1. donated buffer
2. inplace padding
3. this PR

We save 3GB peak memory (18.6GB -> 15.5GB) for GPT2 model for torch.compile.

The peak memory of GPT2 is like a '...\_M\_...' shape. There are 2 places where we reach the peak. The donated buffer removes the first peak by computing grad_softmax in place, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding.

Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)

With 1 & 2 & 3, we save 3GB peak memory  (18.6GB -> 15.5GB) no matter if autotuning happens or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
2025-01-23 09:34:23 +00:00
638903aeee Adapt Dynamo tests to HPUs using instantiate_device_type_tests (#144387)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators can also extend the functionality by adding the device in the devices list. ( For eg: xpu )

**CHANGES**

- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (see the sketch after this list)
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
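A minimal sketch of the pattern (illustrative; the real tests and device lists live in the Dynamo test files):

```python
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests


class MiscDeviceTests(TestCase):
    def test_add(self, device):
        x = torch.ones(2, device=device)
        self.assertEqual((x + x).sum().item(), 4.0)


# Generates MiscDeviceTestsCPU, MiscDeviceTestsCUDA, MiscDeviceTestsHPU, ...
# for whichever backends are registered and listed in only_for.
instantiate_device_type_tests(MiscDeviceTests, globals(), only_for=("cpu", "cuda", "hpu"))

if __name__ == "__main__":
    run_tests()
```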

Previously we had submitted some changes in https://github.com/pytorch/pytorch/pull/140131 . However, deleted that PR due to merge conflicts and other issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387
Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey
2025-01-23 09:24:42 +00:00
d3f196909d [inductor] let inplace-padding support cpp-wrapper (#145325)
Some context: Inplace padding is an optimization to do padding in place. E.g., if a tensor has size [2048, 2047] and stride [2048, 1], when we need to pad one extra element onto the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place. This saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and cause OOB memory access (although I have never seen this happen, maybe due to overallocation in the CUDACachingAllocator, it should better be fixed).

The fix is when we allocate the tensor, instead of doing something like:
```
  buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small overallocation:
```
  buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
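In plain torch, the over-allocation amounts to something like this (illustrative sketch):

```python
import torch

# Allocate the full 2048x2048 storage, then view it as 2048x2047 with the same
# strides, so padding one extra element per row in place stays within bounds.
storage = torch.randn(2048, 2048)
buf0 = storage.as_strided((2048, 2047), (2048, 1))
```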

cpp_wrapper needs special handling since memory allocation goes thru different code path to python wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
2025-01-23 09:22:38 +00:00
f52901a0a7 [ONNX] Remove LegacyDynamoStrategy (#145442)
It's legacy. So remove. Shouldn't affect anything and will facilitate cleaning up our legacy code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145442
Approved by: https://github.com/titaiwangms
2025-01-23 07:56:04 +00:00
28c251dd0b [BE] Remove test_modules from FIXME_inductor_dont_reset_dynamo (#145306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145306
Approved by: https://github.com/zou3519
2025-01-23 06:37:21 +00:00
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
c58198184b [dynamo][dicts] Insert LENTGH guard on an if condition on dict (#145432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145432
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-23 04:40:56 +00:00
faa10faa2c [ROCm] CK SDPA - Move arch check to CK patch (#144777)
__gfxXXX__ should only be visible by device code. Move the check to the ck kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144777
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell, https://github.com/jianyuh
2025-01-23 04:12:25 +00:00
5e6451ea78 [c10] catch c10 error and log message (#145413)
Summary:
Explicitly catch c10 error and log the error message only.

The standard exception `e.what()` below ends up logging the stack trace that is confusing users.
See S477887 for details.

Test Plan:
tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB  Down: 4.7GiB  (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets.   Remaining      0/3024                                                                                                                                 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining      0/31483                                                                                                                                1481904 actions, 1719048 artifacts declared
Executing actions. Remaining      0/250391                                                                                                                               77:11:29.7s exec time total
Command: test.     Finished 2031 local, 45445 remote, 51473 cache (52% hit)                                                                                              20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D68516080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
2025-01-23 03:45:47 +00:00
719938c77f Generalize pin memory logic for accelerator when non blocking copy happened (#143783)
# Motivation
fix https://github.com/pytorch/pytorch/issues/143641
Generalize the pin-memory logic for accelerators when a non-blocking copy happens. Each accelerator has its own implementation of `empty_strided`; an accelerator without a pin-memory mechanism can ignore or mimic the request when `pin_out` is True.
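
As a loose illustration (not part of the PR), the generalized path is the one exercised by a non-blocking device-to-host copy, where the CPU destination is allocated with `pin_out=True`; the CUDA device below is just a stand-in for any accelerator:

```python
import torch

# Minimal sketch: non-blocking device-to-host copy. Internally the CPU output
# is allocated via empty_strided with pin_out=True; accelerators without a
# pin-memory mechanism may ignore or mimic that request.
if torch.cuda.is_available():  # stand-in; the same path applies to other accelerators
    dev_t = torch.randn(1024, device="cuda")
    host_t = dev_t.to("cpu", non_blocking=True)
    torch.cuda.synchronize()  # wait for the async copy before reading host_t
    assert torch.equal(host_t, dev_t.cpu())
```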

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143783
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #144959
2025-01-23 03:43:05 +00:00
28b6430823 Introduce a new API isAcceleratorExcluded (#144959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144959
Approved by: https://github.com/albanD
2025-01-23 03:43:05 +00:00
5a18f1e1eb [dynamo] Support fx map_aggregate (#145351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351
Approved by: https://github.com/zou3519
2025-01-23 03:19:30 +00:00
d95a6babcc Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit 0bff37788043626ee5e472389f88cbbbf7add997.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019))
2025-01-23 01:10:31 +00:00
0d28188cc8 Move privateuse1 test out of test_utils and make them serial (#145380)
Fixes https://github.com/pytorch/pytorch/issues/132720

The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-01-23 00:31:39 +00:00
c9e12d6a3b [ROCm] Update rocm.yml and add rocm-mi300.yml (#145398)
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- Required to get an idea of how PRs are landing on our mi200s and mi300s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-23 00:07:50 +00:00
1e32842324 Improve softmax's perf in cuda (#144679)
Fixes #144645

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144679
Approved by: https://github.com/eqy
2025-01-23 00:02:57 +00:00
d0a2e11284 [BE][export] Change custom_op registration style (#145315)
Summary:
`test_unbacked_bindings_for_divisible_u_symint` has been flaky for a while due to

```
Tried to register an operator (mylib::foo(Tensor a, Tensor b) -> Tensor) with the same name and overload name multiple times.
```

This likely happens when all variants of this test (non-strict, retrace, serdes) are run simultaneously: in later tests, the operator has already been registered.

In this diff, we change registration style.

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test:test_export -- -r test_unbacked_bindings_for_divisible_u_symint
```

Differential Revision: D68465258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145315
Approved by: https://github.com/zou3519
2025-01-22 23:46:51 +00:00
4803e20bc7 [S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322)
Summary: As titled

Test Plan:
- Passed CI tests

```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled
```

https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/

Differential Revision: D68463242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322
Approved by: https://github.com/yuhc, https://github.com/albanD
2025-01-22 23:39:43 +00:00
35c8c31f11 Fix for failure in D68425364 (#145304)
Summary: Back out change from #145166 which causes an internal model to fail.

Differential Revision: D68459095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145304
Approved by: https://github.com/izaitsevfb
2025-01-22 23:33:02 +00:00
e6a84be3d3 [PyTorch] Add backend aot_eager_decomp_partition_with_mode (#143250)
Summary:
## Why
To make it possible to run torch dispatch mode inside compiled modules. This is to enable running MemoryTrackerMode (in next diff) to collect memory usage of compiled modules.

## What
Add a backend aot_eager_decomp_partition_with_mode.
Add an enable_log option to the backend to control compilation logging (which can be very verbose and slow down runs that use the mode).
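
For context, a minimal sketch of the kind of torch dispatch mode this backend is meant to host; the `MemoryTrackerMode` name comes from the summary above, but this toy implementation is an assumption, not the actual diff:

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class MemoryTrackerMode(TorchDispatchMode):  # toy stand-in for the real tracker
    def __init__(self):
        super().__init__()
        self.bytes_seen = 0

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        if isinstance(out, torch.Tensor):
            # Track storage sizes of outputs produced inside the compiled module
            self.bytes_seen += out.untyped_storage().nbytes()
        return out
```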

Test Plan:
unittest

E2e tested in the next diff which shows the memory read from the mode passed to this backend is very close to the actual job's memory snapshot.

Differential Revision: D67227144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143250
Approved by: https://github.com/bdhirsh
2025-01-22 23:20:59 +00:00
f0a210bf5d Revert "Output of nonzero is transposed, fix fake tensor (#144695)"
This reverts commit 693d8c7e945cc494bd31ad694ae4f4b6ff13b82a.

Reverted https://github.com/pytorch/pytorch/pull/144695 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68461259 ([comment](https://github.com/pytorch/pytorch/pull/144695#issuecomment-2608443589))
2025-01-22 23:04:50 +00:00
de945d78da [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-22 22:42:48 +00:00
6e53588789 Revert "[BE]: Simplify set add with set update (#145152)"
This reverts commit 0cb9b2284a31fa497d684dbc2f56398c1d1e3114.

Reverted https://github.com/pytorch/pytorch/pull/145152 on behalf of https://github.com/davidberard98 due to land race with https://github.com/pytorch/pytorch/pull/145165 broke lint ([comment](https://github.com/pytorch/pytorch/pull/145152#issuecomment-2608378172))
2025-01-22 22:14:26 +00:00
dddf52b1b9 Revert "Enable grep_linter to use -a (#144589)"
This reverts commit 3c55669b8814237e018a613a494564da5bea9f15.

Reverted https://github.com/pytorch/pytorch/pull/144589 on behalf of https://github.com/clee2000 due to the line parameter is kind of important and -a is not as important as I thought it was so I'm going to revert this ([comment](https://github.com/pytorch/pytorch/pull/144589#issuecomment-2608349155))
2025-01-22 21:55:27 +00:00
082c28c3c6 [compiled autograd] support Tensor Subclasses in AOTBackward (#144115)
Compiled autograd's initial trace traces through the AOTBackward
epilogue. The Tensor Subclass code is not traceable. This PR changes it
so that when we see Tensor Subclass constructors, we proxy nodes for
their construction into the graph.
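
A rough sketch of the scenario exercised by the new test (private APIs; the exact entry point and TwoTensor construction details may differ from the test):

```python
import torch
from torch.testing._internal.two_tensor import TwoTensor  # test-only wrapper subclass

x = TwoTensor(torch.randn(4), torch.randn(4)).requires_grad_(True)
loss = (x * 2).sum()

# Run backward under compiled autograd; subclass construction in the
# AOTBackward epilogue is now proxied into the captured graph.
with torch._dynamo.compiled_autograd.enable(torch.compile(backend="eager")):
    loss.backward()
```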

Test Plan:
- New basic test with TwoTensor
- Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144115
Approved by: https://github.com/jansel, https://github.com/xmfan, https://github.com/bdhirsh
ghstack dependencies: #143296, #143304, #143387, #143405, #143417
2025-01-22 21:51:07 +00:00
99dd1bf1b9 [compiled autograd] stop specializing on metadata during initial trace (#143417)
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.

While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).

This involved fixing various sites where we do specialize on the
metadata, such as:
- we change CopySlices's apply_with_saved to proxy some calls
  into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
  can make use of it.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
2025-01-22 21:51:07 +00:00
ec820fe57c [compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)
We will always proxy autograd.Function nodes in compiled autograd's
initial graph capture (previously there was an
option to proxy vs trace into the autograd.Function)

We have some requirements for the AOTBackward. Compiled Autograd runs
accumulate grad reordering passes on the AOTBackward graph directly
after the initial graph capture, so we can't just proxy a single node for it.

Instead, we:
- proxy the AOTBackward prologue function into the CA graph
- copy-paste the AOTBackward graph into the CA graph
- trace directly through the epilogue (the traced nodes go into the CA
  graph).

Tracing through the epilogue is safe (assuming no Tensor subclasses)
because the only thing the epilogue does is drop some outputs. The
Tensor subclass situation was already broken so this doesn't regress
anything but this PR sets it up to be fixed (in a followup, where we
will proxy "make_subclass" calls into the graph from the epilogue).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387
2025-01-22 21:50:56 +00:00
784bb2127c [compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)
We define a functional version of a C++ torch::autograd::Function. The
functional version reconstructs the ctx object and then calls
backward with it.

Some more details:
- we define how to pack/unpack ctx.saved_data into an IValue. It's a
  Dict[str, IValue], so it wasn't difficult.
- every call to CppNode::apply_with_saved binds a new function to
  Python. This is because we're unable to reuse a previously bound
  function (the schema may change depending on what the user
  actually puts into their Dict[str, IValue]).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304
2025-01-22 21:50:47 +00:00
8c7c5f7bfc [compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)
CopyBackwards is a manual C++ torch::autograd::Node; we update its
apply_with_saved to proxy a functional version of it into the graph instead
of inlining into it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304
Approved by: https://github.com/xmfan, https://github.com/jansel
ghstack dependencies: #143296
2025-01-22 21:50:37 +00:00
5531fafffe [compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.

This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.

We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.

Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to
  call into Python via libtorch_python. There is an indirection
  (PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
  our lives easier, every codegen'ed apply_with_saved passes a C++
  function with the same signature
  `(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
  IValuePacker struct and codegen functional variants of each builtin
  C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).

MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de

MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
2025-01-22 21:50:29 +00:00
0cb9b2284a [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-22 21:31:13 +00:00
9f150786bb [dynamo] Fix numpy test accuracy error induced by randomness divergence (#145293)
Previously `TestGradient.test_second_order_accurate` was failing because
of a small tolerance error (0.03... which is above the 0.03 tolerance).

Upon investigating, `np.random.random` caused some divergence between
eager and compiled randomness because in compiled we are not using
`np.random`'s random seed, rather we end up using `torch`'s. This in
turn caused numerical divergence and aforementioned accuracy issue.

This patch fixes the failure by patching the test case with
`use_numpy_random_stream=True`, which forces a graph break on
`np.random.random()` and thereby falling back to eager to ensure
consistency of the numpy randomness.
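
A small sketch of the divergence (illustrative only): under `torch.compile`, `np.random.random()` is normally traced and ends up drawing from torch's RNG rather than numpy's seeded stream, so eager and compiled runs can disagree unless the config below forces a fallback.

```python
import numpy as np
import torch

@torch.compile
def noisy(x):
    return x + np.random.random()

np.random.seed(0)
eager = torch.zeros(1) + np.random.random()

np.random.seed(0)
compiled = noisy(torch.zeros(1))  # may differ from `eager`: compiled tracing
                                  # does not consume np.random's seeded stream

# The test fix sets this, which graph-breaks on np.random.* and falls back to eager:
# torch._dynamo.config.use_numpy_random_stream = True
```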

Fixes #116746.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145293
Approved by: https://github.com/lezcano
2025-01-22 20:53:02 +00:00
2efa98d69d Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-22 20:46:04 +00:00
a57133e3c7 [NVIDIA] Jetson Thor Blackwell Support codegen (#145395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-22 20:13:19 +00:00
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
95ff9f0340 [Doc] Add period at the end of the sentence (#145384)
Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/145384/generated/torch.compiler.disable.html#torch-compiler-disable
Fixes https://github.com/pytorch/pytorch/issues/145365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145384
Approved by: https://github.com/huydhn, https://github.com/svekars, https://github.com/kit1980
2025-01-22 19:56:31 +00:00
3917053f63 [audio hash update] update the pinned audio hash (#145328)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145328
Approved by: https://github.com/pytorchbot
2025-01-22 19:39:03 +00:00
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
b81209557b Fix tests broken by #145176 (#145393)
#145176 broke
test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes
test/dynamo/test_repros.py::ReproTests::test_graph_break_on_jit_isinstance

This backs out the offending change until it can be fixed properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145393
Approved by: https://github.com/ZainRizvi
2025-01-22 19:33:16 +00:00
e8e3c03f96 [Test][Inductor] Fix test_tma_graph_breaks (#145271)
Per title. Before these changes, below tests:
```
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
```

fail with the following message:
```
__________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test
    test(self, **param_kwargs)
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks
    eager_out = f(a, b)
                ^^^^^^^
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f
    t.element_size(),
    ^
UnboundLocalError: cannot access local variable 't' where it is not associated with a value

To execute this test, run the following from the base repo dir:
    python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271
Approved by: https://github.com/jansel
2025-01-22 19:18:59 +00:00
ac8ddf1150 [export][be] Clean up local imports from export [1/n] (#145287)
Summary: as title

Test Plan: CI

Differential Revision: D68449844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145287
Approved by: https://github.com/pianpwk
2025-01-22 19:09:17 +00:00
30717d25fe Move Dynamo test to skip from expected_failures (#145390)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/116105
This test is consistently failing. It shouldn't be marked as a flaky
test in the CI using the disabled tests mechanism. I'm skipping the test for now.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145390
Approved by: https://github.com/williamwen42
2025-01-22 19:06:39 +00:00
0bff377880 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.
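
A small sketch of the behavior being aligned (the output spatial size shown in the comment is illustrative):

```python
import torch

conv = torch.nn.ConvTranspose2d(in_channels=3, out_channels=0, kernel_size=3)
x = torch.randn(1, 3, 8, 8)
out = conv(x)      # previously raised on CPU; now returns an empty result like CUDA
print(out.shape)   # e.g. torch.Size([1, 0, 10, 10])
```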

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-22 17:52:53 +00:00
698106951e [dynamo] Re-enable test_fs family for dynamo (#145302)
Fixes #91467.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145302
Approved by: https://github.com/zou3519
2025-01-22 17:50:05 +00:00
057d9aff39 [S481486] [MTIA] Correct mtia.device_count() API (#145338)
Summary:
Prev: Count the number of "general" accelerators

Curr: Count the number of MTIA devices by using the MTIA runtime API

Test Plan:
```
buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r  test_get_device_count
```

https://www.internalfb.com/intern/testinfra/testrun/8162774572631995

Reviewed By: BoyueZheng

Differential Revision: D68472668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338
Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue
2025-01-22 17:45:15 +00:00
c27dd9cf72 Fix deprecated pytorch_sphinx_theme editable installation (#145347)
Fixes https://github.com/pytorch/pytorch/issues/145221

Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457.  The fix here is just to remove it and install `pytorch_sphinx_theme` normally.

### Testing

Doc build is working with the change:

* PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347
* Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347
Approved by: https://github.com/ZainRizvi
2025-01-22 17:28:16 +00:00
288f21cc11 [MPS][BE] Prepare Gamma funcs to be moved ot headers (#145309)
----
- Use `float y = 1.0 + metal::frac(x)` instead of complex
```metal
float y = x;
int n = 0;
bool less_than_one = (y < 1.0);
// Add or subtract integers as necessary to bring y into (1,2)
if (less_than_one) {
  y += 1.0;
} else {
  n = static_cast<int>(floor(y)) - 1;
  y -= n;
}
```
- Declare them all as templates, to avoid instantiation
- Move global arrays to be local to the specific functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145309
Approved by: https://github.com/dcci
2025-01-22 16:14:06 +00:00
c2b401933f [torchbench] Fix mobilenetv2 inductor freezing fail_accuracy (#145296)
Issue: https://github.com/pytorch/pytorch/issues/144891

inductor freezing effectively enables inductor conv-batchnorm fusion. This fusion increases the accuracy error.

More context about this: https://github.com/pytorch/pytorch/issues/120545
For Timm models that are run through benchmarks/dynamo/timm_models.py with TimmRunner, the tolerance was increased here:
https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L367

If the conv-batchnorm fusion is commented out, as Elias suggested in the context issue, the accuracy is back.

=>
Increasing the tolerance for mobilenetv2 to the same value by introducing a special tolerance configuration that applies only when freezing is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145296
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-01-22 15:54:09 +00:00
0dbff7e4be Add MKLDNN support for Half GELU (#145339)
Add MKLDNN support for Half GELU to align with BFloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145339
Approved by: https://github.com/yanbing-j, https://github.com/leslie-fang-intel, https://github.com/Skylion007
2025-01-22 15:14:51 +00:00
0efa843392 Dynamic shape guards in C++ (#139899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #143385, #143164
2025-01-22 14:58:35 +00:00
fbaef0ac03 Add a language option for symbolic shape guards (#143164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143164
Approved by: https://github.com/ezyang
ghstack dependencies: #143385
2025-01-22 14:58:35 +00:00
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
079a3e0f75 [BE] Add type annotations to cudagraph_utils.py and test_cases.py (#145291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145291
Approved by: https://github.com/Skylion007
2025-01-22 14:54:45 +00:00
31c2f36989 Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-01-22 14:30:56 +00:00
3cbc8c54fd [BE][export] Remove disabled floordiv test in export (#145292)
Summary:
Removing `test_slice_with_floordiv` as it doesn't raise the Runtime Error as expected and it has been disabled since the time it was added https://github.com/pytorch/pytorch/issues/131101

For the case that we expect to fail, it actually returns an empty tensor. This is consistent with the following snippet which prints an empty tensor

```
a = torch.ones(4)
print(a[5:])
```

Test Plan: CI

Differential Revision: D68450650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145292
Approved by: https://github.com/pianpwk
2025-01-22 05:17:56 +00:00
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
40e27fbcf2 Refactor CPUReproTests to be more vector-length agnostic (#141245)
This changes the hardcoded assumption of a `256-bit` vector length to querying it from `cpu_vec_isa`, and changes the relevant tests to share the logic.

Also refactored the `config.cpp.simdlen != 1` into the assertion so we stop duplicating it throughout the test cases.

Fixes issues on `128-bit` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141245
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-01-22 04:24:45 +00:00
dcd9de79e7 [dynamo] Re-enable a AOT-Dispatch test with Dynamo (#145299)
Fixes #124590.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145299
Approved by: https://github.com/zou3519
2025-01-22 03:47:05 +00:00
3a58512613 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests that do in-place updates; thanks to functionalization, this works fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature:  178.278ms per batch, 183.9K tokens/s. There are 3K tokens/s increase.

Perf test shows a compilation time regression. I'm not sure if that's real; will debug more. But a good thing is that there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration.

Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-22 03:37:06 +00:00
46851022ff [Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187)
## Summary

Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner.
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).

Also increased coverage of the GEMM template for uint8-activation, int8-weight GEMM UTs where the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some such UTs have to be explicitly enabled with the `max-autotune` Inductor config.
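
As a reference for the math the template fuses (plain PyTorch ops, not the codegened kernel; shapes are scaled down from the benchmark below):

```python
import torch

M, K, N = 32, 256, 512  # the benchmark below uses M=32, K=4096, N=14336
a_int8 = torch.randint(-128, 128, (M, K), dtype=torch.int8)
w_int8 = torch.randint(-128, 128, (K, N), dtype=torch.int8)
a_scale = torch.rand(M, 1)   # per-token activation scale (scalar for per-tensor / SmoothQuant)
w_scale = torch.rand(N)      # per-channel weight scale
bias = torch.rand(N)

# int8 x int8 -> int32 accumulation, then scales and bias applied as the epilogue;
# the AMX template performs this per output block in a single fused pass.
acc = a_int8.to(torch.int32) @ w_int8.to(torch.int32)
out = acc.to(torch.float32) * a_scale * w_scale + bias
```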

## Performance data

The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.

For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.

Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5

Co-authored-by: Leslie Fang <leslie.fang@intel.com>
2025-01-22 02:27:53 +00:00
27598cd154 [fx] move DCE rand check to import time (#145118)
Mitigates the deterministic benchmark regression: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844. and maybe the dashboard issue.

fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, EXCEPT we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (>O(parameters)) for the subgraphs traced by the pattern matcher.
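
For context, the hot path is the per-node `is_impure()` check inside `Graph.eliminate_dead_code()`; a minimal illustration:

```python
import torch
import torch.fx as fx

def f(x):
    unused = x + 1          # dead code: result is never used
    return x * 2

gm = fx.symbolic_trace(f)
gm.graph.eliminate_dead_code()  # queries Node.is_impure() for every node in the graph
gm.recompile()
print(gm.code)                  # the x + 1 node is gone
```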

I considered addressing this problem by reducing the amount of times DCE is called, but I think we can only trim the ones from the pattern matcher, which will require some refactor/caching solution that I leave out of this PR.

torch.Tag.nondeterministic_seeded is provided by native_functions.yml and is implemented as a list. Most of the time, it has <=2 elements, so it's not really worth it to turn it into a set for fast lookup.

Using the deterministic instruction count benchmarks
```python
# before
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058
# after
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-01-22 02:23:02 +00:00
f2cfe8b59f PEP585 update - mostly toplevels (#145178)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178
Approved by: https://github.com/bobrenjc93
2025-01-22 02:21:14 +00:00
1ce533867f Teach dynamo to handle GenericAlias without a graph break (#145240)
Dynamo wasn't handling the new PEP585 type annotations:
```
x = list[Foo]
```
Although this worked in py3.9, it was causing an `unimplemented` (Unexpected type in sourceless builder) in py3.12.

This fixes it to treat them as a BuiltinVariable.
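
A minimal sketch of the pattern that used to graph-break on py3.12 (illustrative; `Foo` is a placeholder class):

```python
import torch

class Foo:
    pass

@torch.compile(fullgraph=True)  # fullgraph turns a graph break into a hard error
def f(x):
    alias = list[Foo]           # PEP 585 generic alias, now handled as a BuiltinVariable
    return x + 1

f(torch.ones(3))
```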

Fixes #145226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145240
Approved by: https://github.com/anijain2305
2025-01-22 01:55:51 +00:00
2d1649bc2a Revert "[triton] Update triton pin to include warp specialization support (#145120)"
This reverts commit e261629dc85c061ee35f539ee8bd35aec9971215.

Reverted https://github.com/pytorch/pytorch/pull/145120 on behalf of https://github.com/ZainRizvi due to Reverting since the test failures area about not being able to find a version of triton to install, and this is breaking trunk as well ([comment](https://github.com/pytorch/pytorch/pull/145120#issuecomment-2606107792))
2025-01-22 01:52:36 +00:00
f2d7fe12d8 [BE][MPS] Mark gamma inputs as const (#145295)
Doubt it will change the perf, but it's good to correctly mark const inputs as const
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145295
Approved by: https://github.com/manuelcandales
ghstack dependencies: #145289
2025-01-22 01:00:53 +00:00
c106e9b4c6 [BE][MPS] Move Gamma kernels to its own file (#145289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145289
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-01-22 01:00:53 +00:00
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
266fd35c58 Fix ExecuTorch, XLA, Triton hash updates (#145314)
Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb

* XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding job names that are no longer correct and that make the bot wait forever there.
* The Triton commit hash hasn't been updated automatically since 2023, and people have been updating the pin manually with their own testing from time to time, so I doubt it would be a useful thing to keep.

The vision update failures look more complex though, and I would need to take a closer look. So, I will keep it in another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314
Approved by: https://github.com/izaitsevfb
2025-01-21 23:24:21 +00:00
1e8d6d6f0e [SkipFiles] New modules added to torch.* are inlined by default (#145279)
This PR:
- makes it so that new modules added to torch are inlined by default
- adds a list of the previously "skipped by default" modules to avoid
  regressing anything. This is a new MOD_SKIPLIST list that is consulted
  in trace_rules.check_file.
- Follow-up work will go through this list, one-by-one, and try to delete
  modules. I think we should be able to delete almost everything,
  except for torch._dynamo.

Test Plan
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279
Approved by: https://github.com/yanboliang
2025-01-21 23:24:12 +00:00
e261629dc8 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has landed in the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-21 22:14:13 +00:00
19c3ba44a2 Use TORCH_CHECK instead of std::runtime_error in stack.h and ivalue.h (#145280)
TORCH_CHECK will preserve the stack trace when TORCH_CPP_SHOW_STACKTRACES=1, whereas std::runtime_error will not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145280
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-21 21:58:59 +00:00
7dd9d1f243 Update clickhouse-connect to 0.8.14 (#144915)
Corresponds to https://github.com/pytorch/test-infra/pull/6177

I only tested the slow test script but I also did testing on the new version with scripts in https://github.com/pytorch/test-infra/pull/6177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144915
Approved by: https://github.com/huydhn
2025-01-21 21:43:18 +00:00
35f5668f7e [NVIDIA] RTX50 Blackwell Support codegen (#145270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270
Approved by: https://github.com/ezyang
2025-01-21 21:10:05 +00:00
895659cb41 Revert "Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)"
This reverts commit 07e23653cd9ef8cfda01773d94d9f76e5072528d.

Reverted https://github.com/pytorch/pytorch/pull/142848 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68355212 ([comment](https://github.com/pytorch/pytorch/pull/142848#issuecomment-2605734067))
2025-01-21 21:04:45 +00:00
bac62341eb PEP585 update - torch/_inductor (#145198)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:33 +00:00
2f9d378f7b PEP585 update - torch/utils (#145201)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:10 +00:00
693d8c7e94 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-21 20:50:09 +00:00
323fb4dad0 Unconditionally exclude upper bound in all size oblivious tests (#144867)
I was thinking about https://github.com/pytorch/pytorch/pull/144471 some more and I thought, "Hmm, why not just always exclude the constant upper bound." So here it is.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144867
Approved by: https://github.com/bobrenjc93
2025-01-21 20:44:09 +00:00
df67ac4c86 [CI][CUDA][Distributed][FSDP] Remove hardcoded world size of 2 (#145195)
as these unit tests would fail if run on a single GPU (i.e., `skip_if_lt_x_gpu(2)` seems to view the world size as 2 even on platforms with 1 GPU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145195
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-01-21 20:25:52 +00:00
505ade7471 [inductor] Simplify mode options, only apply CompilerBisector changes once (#145232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145232
Approved by: https://github.com/yanboliang
2025-01-21 19:25:46 +00:00
85811631d7 [Intel CPU] Fix issue #143489. (#145062)
Fix issue in https://github.com/pytorch/pytorch/issues/143489.
kernel_height * kernel_width can cause a floating point exception, so we divide by them one at a time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145062
Approved by: https://github.com/soulitzer
2025-01-21 18:38:33 +00:00
128f3627b1 Implement backward for NJT matmul (#144587)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula.
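
A small sketch of the now-supported backward (jagged-layout NJT multiplied by a dense weight):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)],
    layout=torch.jagged,
    requires_grad=True,
)
w = torch.randn(4, 5, requires_grad=True)

out = torch.matmul(nt, w)        # NJT handles matmul directly, not via bmm
out.values().sum().backward()    # backward through NJT matmul now produces gradients
print(w.grad.shape)              # torch.Size([4, 5])
```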

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586
2025-01-21 18:27:50 +00:00
af204135d8 Fix NJT fill.Scalar for contiguous inputs (#144586)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586
Approved by: https://github.com/soulitzer
2025-01-21 18:22:08 +00:00
efa88e04e1 Don't overspecialize float when propagating cache guards to ShapeEnv (#145078)
Fixes https://github.com/pytorch/pytorch/issues/142507

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078
Approved by: https://github.com/Skylion007
2025-01-21 18:05:43 +00:00
b3e90c8c33 Add support for torch function on dtype arguments (#145085)
Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085
Approved by: https://github.com/vmoens, https://github.com/Skylion007
2025-01-21 17:44:47 +00:00
eb553ae3cf Fix broken gpt_fast micro benchmark after #144315 (#145235)
The benchmark is failing with the following error

```
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module>
    main(output_file=args.output, only_model=args.only)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main
    lst = func(device)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu
    us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs'
```

An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555

I also assign `oncall: pt2` as the owner of this job going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235
Approved by: https://github.com/nmacchioni
2025-01-21 17:42:24 +00:00
2cffbff7da Add 3.13t Windows and MacOS binary builds (#141806)
Related to: https://github.com/pytorch/pytorch/issues/130249

For conda, this uses the approach described here:
https://conda-forge.org/blog/2024/09/26/python-313/

Create Python 3.13t conda env like so:
```
conda create -n py313 python=3.13 python-freethreading  -c conda-forge
```

For the Windows executable installation, we need to pass an additional parameter to enable 3.13t:
```
Include_freethreaded=1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141806
Approved by: https://github.com/albanD
2025-01-21 17:16:19 +00:00
0afd335174 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-21 16:57:27 +00:00
803017f3cb [inductor] fix MA on poor gpu (#145133)
Found this bug when debugging an MA issue in CI that cannot be reproduced on devgpu.

On GPUs with fewer than 68 SMs (like the NVIDIA L4 used in CI), running torch.compile in max-autotune mode may result in the following confusing error https://gist.github.com/shunting314/370f42f547e3367a3773237942725a86 complaining about layout:
```
torch._inductor.exc.InductorError: LoweringException: AssertionError: convert FlexibleLayout to FixedLayout first
```
The reason is that even if we don't pick the Triton template, Inductor still returns a MultiTemplateBuffer for the tuned addmm. MultiTemplateBuffer.get_reads, called from Reduction.num_splits, may index a FlexibleLayout, which results in the aforementioned error.

The issue does not appear on devgpu because we freeze the layout of addmm inputs when rendering triton templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145133
Approved by: https://github.com/jansel
2025-01-21 09:31:34 +00:00
b5655d9821 PEP585 update - .ci android aten (#145177)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145177
Approved by: https://github.com/Skylion007
2025-01-21 06:31:26 +00:00
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
c6986ca2e1 Revert "[dcp] Add ZStandard transformer (#143360)"
This reverts commit 7b56b039afe2b4a4038c09d8b6cb1597823f3d5f.

Reverted https://github.com/pytorch/pytorch/pull/143360 on behalf of https://github.com/atalman due to Broke 3.13t builds please test with ciflow/binaries label attached ([comment](https://github.com/pytorch/pytorch/pull/143360#issuecomment-2603433066))
2025-01-21 01:10:16 +00:00
5fd881a5b6 Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)"
This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6.

Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))
2025-01-21 00:49:55 +00:00
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
54a00af2c6 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:59 +00:00
bd97ce0b45 PEP585 update - torch/ao (#145199)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145199
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:35 +00:00
cf05f6a134 [BE]: Improve typing for torch/fx/_pytree.py and torch/utils/_pytree.py (#145173)
Improve type inference in _pytree.py utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145173
Approved by: https://github.com/bobrenjc93
2025-01-20 22:18:19 +00:00
225a10febe [CI] Add xpu linux build into pull workflow (#145084)
To mitigate the XPU build failure risk introduced by non-XPU-specific PRs. Refer to #144967 & #143803.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145084
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-01-20 19:31:48 +00:00
d0100050dd [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [2/n] (#145091)
Summary: Following up D68122536 to remove configurable aot_mode for inner_compile

Test Plan: CI

Reviewed By: desertfire

Differential Revision: D68158512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145091
Approved by: https://github.com/ydwu4
2025-01-20 19:09:10 +00:00
0b2a3687b9 PEP585 update - torch/fx (#145166)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145166
Approved by: https://github.com/bobrenjc93
2025-01-20 18:11:54 +00:00
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279bc179a6bb63f0226e24ab42a07b394.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
57b2b64acf Fix always true scaled_mm test (#143912)
Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with scales.
Scales were optional arguments before PR https://github.com/pytorch/pytorch/pull/128683.
After that, test_float8_scale started comparing two identical results and lost its meaning.
Reason for making scales required: https://github.com/pytorch/pytorch/pull/128683#issuecomment-2169146402

This PR uses scale=1.0 to compare result with scaled matmul
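
A sketch of the corrected comparison (assumes fp8-capable hardware; the reference path simply upcasts, and the tolerances here are illustrative):

```python
import torch

device = "cuda"  # requires fp8-capable hardware (e.g. H100 / MI300)
a = torch.randn(32, 32, device=device).to(torch.float8_e4m3fn)
b = torch.randn(32, 32, device=device).to(torch.float8_e4m3fn).t()  # column-major operand
one = torch.tensor(1.0, device=device)

out_scaled = torch._scaled_mm(a, b, scale_a=one, scale_b=one, out_dtype=torch.bfloat16)
out_ref = a.to(torch.bfloat16) @ b.to(torch.bfloat16)   # unscaled reference (scale == 1.0)
torch.testing.assert_close(out_scaled, out_ref, rtol=1e-2, atol=1e-2)
```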

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143912
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony
2025-01-20 16:17:46 +00:00
53e2408015 Improve cleanup of cancelled jobs on s390x for tests too (#144968)
Follow up to https://github.com/pytorch/pytorch/pull/144149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968
Approved by: https://github.com/huydhn
2025-01-20 12:56:07 +00:00
92b9da1fc2 fix torch.atan for torch.complex datatypes on CPU (#144749)
Fix https://github.com/pytorch/pytorch/issues/141487.
This issue is caused by the lack of special handling of the case where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `atan`. For correctness, I temporarily fall back the implementation of `atan` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:45:03 +00:00
ed669a9db7 fix torch.div for torch.complex datatypes on CPU (#140375)
Fix https://github.com/pytorch/pytorch/issues/135428.
Fix https://github.com/pytorch/pytorch/issues/106845.
These two issues are caused by the lack of special handling of the case where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `div`. For correctness, I temporarily fall back the implementation of `div` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140375
Approved by: https://github.com/mingfeima
2025-01-20 08:34:29 +00:00
c922ccb7c4 fix sigmoid for torch.complex datatypes on CPU (#140391)
Fix https://github.com/pytorch/pytorch/issues/135777.
This issue is caused by the lack of special handling of the case where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `reciprocal`. For correctness, I temporarily fall back the implementation of `reciprocal` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
ghstack dependencies: #140358
2025-01-20 08:23:58 +00:00
507bf65c6a fix torch.exp for torch.complex datatypes on CPU (#140358)
Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063.
These two issues are caused by the lack of special handling of the case where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `exp`. For correctness, I temporarily fall back the implementation of `exp` to the scalar implementation.
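
A minimal illustration of the kind of special-case inputs involved (Inf/NaN components that the vectorized path mishandled):

```python
import torch

x = torch.tensor(
    [complex(float("inf"), 0.0), complex(0.0, float("nan")), complex(float("-inf"), 1.0)]
)
print(torch.exp(x))  # the scalar fallback follows the complex special-value rules
```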

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:03:17 +00:00
972d4a154d Add facility to run dynamo UTs for non-cuda devices (#140929)
This is in line with the changes introduced in https://github.com/pytorch/pytorch/pull/130714; additional files are included to support non-CUDA devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140929
Approved by: https://github.com/kwen2501, https://github.com/EikanWang, https://github.com/guangyey
2025-01-20 05:56:38 +00:00
2b809e58ad PEP585 update - torch/onnx (#145174)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145174
Approved by: https://github.com/justinchuby
2025-01-20 05:48:52 +00:00
19584b28fd [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-20 04:42:06 +00:00
980c75fe6e [MPSInductor] Add TrueDiv and Round[Int|Decimal] (#145160)
That fixes `test_builtins_round_float_ndigits_neg` and `test_builtins_round`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145160
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-20 04:29:42 +00:00
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
b6c5562c1f PEP585 update - torch/export (#145165)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145165
Approved by: https://github.com/bobrenjc93
2025-01-19 20:56:55 +00:00
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
371a361db9 Enable bfloat16 testing on MacOS14+ (#145159)
As Metal-3.1 supports this dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145159
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #145157
2025-01-19 19:35:31 +00:00
97d4d3c40a PEP585 update - torch/_export (#145138)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145138
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145154
2025-01-19 18:48:35 +00:00
cd8d0fa20c Tweak schema_check to handle annotated builtin types (#145154)
As of Python 3.9, annotated lists can be written as `list[T]`, and `List[T]` has been deprecated. However, schema_check was converting `list[T]` to simply `list`. This change teaches it to handle `list[T]` the same as `List[T]`.
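
A quick illustration of why the two spellings should be treated identically (standard-library behavior, not PyTorch code):

```python
import typing

# Both carry the same origin and element type; schema inference should not
# collapse list[T] down to a bare `list`.
print(typing.get_origin(list[int]), typing.get_args(list[int]))                 # <class 'list'> (<class 'int'>,)
print(typing.get_origin(typing.List[int]), typing.get_args(typing.List[int]))  # <class 'list'> (<class 'int'>,)
```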

A couple small drive-by changes I noticed as well:
- Path concatenation should use `os.path.join`, not `+`
- Spelling in error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145154
Approved by: https://github.com/bobrenjc93
2025-01-19 18:48:35 +00:00
9e0437a04a PEP585 update - torch/ao/quantization (#145140)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145140
Approved by: https://github.com/bobrenjc93
2025-01-19 10:20:00 +00:00
78bff1e8c1 PEP585 update - torch/_functorch (#145139)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145139
Approved by: https://github.com/bobrenjc93
2025-01-19 07:06:10 +00:00
10e4d3aebb [DCP] Fix fsspec fsync bug on .finish() (#144753)
Fixes #144752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144753
Approved by: https://github.com/Skylion007, https://github.com/saumishr
2025-01-19 03:21:00 +00:00
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
893ca1dfe1 PEP585 update - torch/_inductor/[_-i]* (#145137)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137
Approved by: https://github.com/bobrenjc93
2025-01-19 01:22:47 +00:00
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
Maybe to be reused later from the eager op as well.

Also, didn't know that Metal already has type_traits.
Also, use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation for the specific architecture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
5b5766665d PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145105
2025-01-18 20:47:12 +00:00
a79100ab11 PEP585 update - torch/_dynamo (#145105)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105
Approved by: https://github.com/bobrenjc93
2025-01-18 20:47:11 +00:00
c95efc37ba PEP585 update - torch/distributed/tensor (#145141)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141
Approved by: https://github.com/bobrenjc93
2025-01-18 20:01:59 +00:00
4f8237dbad [mps/inductor] Skip "double" tests as 64-bits FP is not supported. (#145123)
257 tests failed (before) -> 242 tests failed (after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145123
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-18 19:13:34 +00:00
5802be698e Revert "parametrized test name handles class arguments (#133546)"
This reverts commit 4e4b8592a32f701b4974679ab1381ba7cccd4844.

Reverted https://github.com/pytorch/pytorch/pull/133546 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but trying to disable the new tests does seem to fully cover all the cases and some are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/133546#issuecomment-2599814339))
2025-01-18 18:12:18 +00:00
b63b81410c Fix NJT frexp() to handle both outputs (#144585)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs.
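
A hedged sketch of the fixed behavior (the factory-call details are illustrative): frexp on a jagged NJT now returns both outputs wrapped as NJTs.

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3), torch.randn(5)], layout=torch.jagged
)
mantissa, exponent = torch.frexp(nt)
print(mantissa.is_nested, exponent.is_nested)  # expected: True True
```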
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583, #144584
2025-01-18 15:59:56 +00:00
3ee531f8b9 Support NJT chunk() backward on batch dim (#144584)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients.
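
A hedged sketch of the newly supported pattern (shapes and the loss are illustrative, not from the PR):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)],
    layout=torch.jagged, requires_grad=True,
)
a, b = nt.chunk(2, dim=0)       # split along the batch dimension
(a.sum() + b.sum()).backward()  # gradients flow back to nt via unbind + copy_
print(nt.grad is not None)      # expected: True
```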
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583
2025-01-18 15:58:24 +00:00
8a57234033 [MPSInductor] Implement i0 and i1 ops (#145092)
Using shared definitions with eager op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145092
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #145023, #145087
2025-01-18 15:41:02 +00:00
1d9fc9df38 Downgrade ignored guard to info level (#145075)
Fixes https://github.com/pytorch/pytorch/issues/101265

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145075
Approved by: https://github.com/Skylion007
2025-01-18 15:30:01 +00:00
5e4cf3e6ad Moved .all() checks for distributions to _is_all_true (#145029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145029
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-18 07:55:48 +00:00
2bf772d1ba PEP585 update - torch/_inductor/codegen (#145106)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106
Approved by: https://github.com/bobrenjc93
2025-01-18 06:56:03 +00:00
4bf29f44b7 [aoti] Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass (#145028)
Summary:
Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass because this op is blocking fusion.

This should not have any effect on the result, because the op would not show up in the final aoti compiled model anyway (the assertion has no effect).

A real example where this improves performance:

In the example below, the post grad graph would contain `torch.ops.aten._assert_tensor_metadata.default`, because of PR  https://github.com/pytorch/pytorch/pull/142420. This op is added when functionalizing aten.to.

We want the `add` node from `linear` to be fused with the rest of the pointwise ops, instead of fused with the `mm` from `linear`.

```
import torch
import torch.nn as nn

class Model(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Model, self).__init__()
        self.linear = nn.Linear(input_dim, hidden_dim).half()
        self.rms_norm = nn.RMSNorm(hidden_dim)

    def forward(self, x):
        linear_458 = self.linear(x)  # Linear layer with weights
        # mimic the torchtune rms norm: /torchtune/torchtune/modules/rms_norm.py
        linear_458 = linear_458.to(torch.float32)
        rms_norm_34 = self.rms_norm(linear_458)  # RMS Normalization
        sigmoid_168 = torch.sigmoid(rms_norm_34)  # Sigmoid activation function
        mul_168 = sigmoid_168 * rms_norm_34  # Element-wise multiplication

        return mul_168

def main():
    with torch.no_grad():
        input_dim = 512
        hidden_dim = 256
        batch_size = 32
        model = Model(input_dim, hidden_dim).to("cuda")
        example_inputs = (
            torch.randn(batch_size, input_dim).to("cuda").to(torch.float16),
        )
        ep = torch.export.export(model, example_inputs)
        package_path = torch._inductor.aoti_compile_and_package(ep)
```

Test Plan:
CI

Differential Revision: D68303114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145028
Approved by: https://github.com/angelayi
2025-01-18 06:06:25 +00:00
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse when building Metal shaders for both eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses the headers and embeds them into the shader source used for dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
2859b11bdb [pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045)
Summary:
PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided
by NCCLX and RCCL) assumes that a single contiguous buffer is used.
These are fundamentally mismatched.  The list of tensors might not be
contiguous or even ordered (buffer addresses might not be in
increasing order).

This patch removes the ncclAllToAllv specialization for PTD
all_to_all, and instead lets it directly call ncclSend/ncclRecv.

Co-authored by @pavanbalaji
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
2025-01-18 05:26:55 +00:00
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix, first we need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
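
For reference, this is the kind of rewrite UP006 performs on annotations:

```python
# Before: typing-module generics (deprecated by PEP 585)
from typing import Dict, List

def bucketize(xs: List[int]) -> Dict[int, List[int]]: ...

# After: builtin generics, valid on Python 3.9+
def bucketize(xs: list[int]) -> dict[int, list[int]]: ...
```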

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
2c4281d7da Make MultiProcContinuousTest timeout configurable (#145099)
Allows test classes using MPCT to set their own timeout as a class
property, which is good enough since the processgroup is shared across
test instances and the timeout is set at processgroup init.

Also sets a default timeout of 2 minutes, which is probably (?) long
enough for reasonable tests, but can be changed if it causes flakiness.
It's preferable to keep the default timeout as short as possible, since
getting a timeout quickly helps when debugging tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
ghstack dependencies: #145010, #145011
2025-01-18 04:37:12 +00:00
bdfeda5c9a composability test cleanup (#145011)
Minor changes to test the public PP API instead of the internal/private one, and
also to save a few lines of code for microbatch splitting in the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
ghstack dependencies: #145010
2025-01-18 04:37:12 +00:00
4eea2f7496 [inductor] Fix ignored options for torch.compile (#145131)
#139833 broke `torch.compile(options=...)` so that many (all?) options passed in get completely ignored.  @alexreinking pointed this out when `options={"cpu_backend":"halide"}` did nothing.
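
A minimal sketch of the kind of call that was affected (the option key comes from the report above):

```python
import torch

def f(x):
    return x.sin() + 1

# Before this fix, options passed here could be silently dropped.
compiled_f = torch.compile(f, options={"cpu_backend": "halide"})
print(compiled_f(torch.randn(8)))
```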

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145131
Approved by: https://github.com/exclamaforte
2025-01-18 03:39:49 +00:00
668fb7dfba [ca] Use aot_eager on flex attention test (#145097)
FIXES https://github.com/pytorch/pytorch/issues/144912

The flex attention lowering incompatibilities are covered by https://github.com/pytorch/pytorch/blob/main/test/inductor/test_flex_attention.py. For the CA + flex integration, we don't actually need to test the lowering, only the frontend graph capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145097
Approved by: https://github.com/drisspg
2025-01-18 02:47:13 +00:00
55084443ca Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-18 02:39:22 +00:00
2f51d06210 basic InductorBenchmarker (#133058)
This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle, and very closely follows Triton's `do_bench` implementation, with slight changes such as flushing the exact L2 cache size (Triton defaults to 256MB), using buffer zeroing for warmup (Triton uses the benchmarked kernel itself; I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things; currently Inductor picks the median).
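
A rough sketch of the do_bench-style loop described above (buffer-zero warmup, L2 flush, min of the measured times); the constants and structure are illustrative, not the actual `InductorBenchmarker` code:

```python
import torch

def benchmark_gpu(fn, warmup=25, reps=100, l2_cache_bytes=40 * 1024 * 1024):
    cache = torch.empty(l2_cache_bytes, dtype=torch.int8, device="cuda")
    for _ in range(warmup):
        cache.zero_()          # warm up with buffer zeroing, not fn itself
    times = []
    for _ in range(reps):
        cache.zero_()          # flush L2 so each measurement starts cold
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    return min(times)          # return the minimum measured runtime
```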

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058
Approved by: https://github.com/eellison
ghstack dependencies: #144315
2025-01-18 02:35:00 +00:00
ee3e89190a refactor benchmarking to use dynamo_timed (#144315)
use dynamo_timed for all our wrapped calls, instead of our custom timer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144315
Approved by: https://github.com/eellison
2025-01-18 02:35:00 +00:00
17c3a10cbd PEP585 update - torch/_inductor/fx_passes (#145107)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145107
Approved by: https://github.com/oulgen, https://github.com/bobrenjc93
2025-01-18 02:04:29 +00:00
8e4539245e Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112)
TSIA, it looks like an upstream change, but I'm not sure from where yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-01-18 01:55:34 +00:00
0b151f260f [AOTI] Add an option to skip optimizing generated wrapper code (#144866)
Summary: In some cases, generated wrapper code suffers from long C++ compilation times. To alleviate this, this PR adds an option to skip C++ compiler optimizations for the generated main wrapper function body.

D68174038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144866
Approved by: https://github.com/chenyang78, https://github.com/hl475
2025-01-18 01:44:21 +00:00
7c1fb9b1ae [inductor] Refactor CachingAutotuner so that it can pickle (#144044)
These are refactors needed for #144288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044
Approved by: https://github.com/eellison
2025-01-18 01:44:16 +00:00
02385ed625 [Break XPU][Inductor UT] Fix broken XPU CI introduced by community changes (#145058)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145058
Approved by: https://github.com/jansel
2025-01-18 01:30:24 +00:00
c434a64f31 Delete torch._library.register_functional_op (#145110)
Fixes #117816, #117834, #117871

This has been superseded by auto_functionalized_v2. There are no
internal usages and this is a private API, so it is safe to delete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145110
Approved by: https://github.com/williamwen42
ghstack dependencies: #145109
2025-01-18 00:58:25 +00:00
712d9882d2 Skip test responsible for causing flakiness (#145109)
Investigation is a separate issue. For now I want to get the CI back up
and running on the other tests. The problem seems to be that
IncludeDispatchKeyGuard doesn't actually reset the state, which seems
very, very wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145109
Approved by: https://github.com/williamwen42
2025-01-18 00:58:25 +00:00
c338dda6be fix test_rng bisector test (#143662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143662
Approved by: https://github.com/zou3519
2025-01-18 00:15:38 +00:00
d02c396fbb add fp8 support to index_cuda (#144747)
Fixes #133605

**Summary**

This PR adds support for FP8 data types to the `index_cuda` op.

It uses `AT_DISPATCH_V2`, a new macro that can handle an arbitrary number of dtypes, as opposed to the old implementations which had a separate macro for each possible number of dtype arguments (e.g. `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3,4,5...}`).

**Test plan**

Updated test `index_cuda_with_cpu` in `test/test_fake_tensor.py` to have cases for all dtypes handled by `index_cuda`, including fp8 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144747
Approved by: https://github.com/vkuzo
2025-01-17 22:53:23 +00:00
4e4b8592a3 parametrized test name handles class arguments (#133546)
Previously, parametrized tests with class arguments, for example

```
@parametrize("this_cls", (Foo, Bar))
```

would create parametrized tests with names `test_foo_this_cls0` and `test_foo_this_cls1`. With this change, we instead should get `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133546
Approved by: https://github.com/eellison
2025-01-17 22:48:38 +00:00
64e54d5af6 [Pipelining] Relax scale_grads assert (#145010)
The assert felt morally valid: if no gradients are scaled, then something
is definitely wrong with the setup.  In one instance, PP +
optimizer-in-backward (in torchtitan) resulted in grad=None after
running .backward() and before scaling grads.

On the other hand, the existing assert is too restrictive.  It's
possible that a model used with pipelining would have some parameters
that do not receive gradients, and we shouldn't hard-error in these
cases.  (E.g. if the parameter is literally not used, or is frozen).
In the extreme case, the whole stage could be frozen.  So we do not
complain if no grads are scaled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010
Approved by: https://github.com/mori360, https://github.com/tianyu-l
2025-01-17 21:33:28 +00:00
07e23653cd Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Before this change, we create a `scalar_t eps_val;` variable, and `eps` is typically a double scalar passed from the Python frontend, like 1e-6.

While we do `eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();` or `eps_val = eps.value();`, we downcast this epsilon to match the input tensor dtype (`scalar_t`); in the case of BFloat16, the 1e-6 double would be cast to `1.00136e-05`.

However, when we compute `auto rqrst_input = rsqrt(at::pow(upcasted_input, 2).mean(dims_to_reduce_ref, /*keepdim=*/true).add_(eps_val));`, we upcast `eps_val` to match `opmath_t`. The conversion between these two dtypes is UNNECESSARY, so we could just make `eps_val` an `opmath_t` instead of `scalar_t`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/mikaylagawarecki
2025-01-17 21:30:54 +00:00
a8ef423fed Fix NJT min / max backward() for non-ragged reductions (#144583)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

`value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582
2025-01-17 20:57:11 +00:00
cac10b8190 Fix NJT OpInfo entry for nn.functional.prelu (#144582)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582
Approved by: https://github.com/soulitzer
2025-01-17 20:36:15 +00:00
eaef613688 Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda (#145067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia

Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>
2025-01-17 20:31:25 +00:00
0eda02a94c Prevent legacy_load when weights_only=True (correctly) (#145020)
Only prevent `legacy_load` (.tar format removed in https://github.com/pytorch/pytorch/pull/713), not the whole of `_legacy_load` (.tar format + _use_new_zipfile_serialization=False)

Differential Revision: [D68301405](https://our.internmc.facebook.com/intern/diff/D68301405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145020
Approved by: https://github.com/kit1980, https://github.com/albanD
2025-01-17 20:10:22 +00:00
2ef7b68666 [inductor] fix TORCH_LOGS="benchmarking" (#144997)
Saw this error with TORCH_LOGS="benchmarking"
```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 37, in wrapper
    result = fn(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper
    return fn(self, *args, **kwargs)
torch._inductor.exc.InductorError: TypeError: Benchmarker.benchmark() missing 1 required positional argument: 'fn_kwargs'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144997
Approved by: https://github.com/eellison, https://github.com/nmacchioni
2025-01-17 19:41:18 +00:00
d996d7ec13 upgrade to sccache 0.9.1 - dealing with nvcc -E correctly (#145012)
sccache 0.9.1 should be dealing with `nvcc -E` correctly

see https://github.com/mozilla/sccache/pull/2300

If this works as expected, we can get rid of this code:
https://github.com/pytorch/pytorch/pull/142813/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145012
Approved by: https://github.com/malfet
2025-01-17 19:26:01 +00:00
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
18638b91fe Adding more compile time logging in pad_mm (#144884)
Summary: As title

Test Plan:
[midin@6262.od /data/sandcastle/boxes/fbsource/fbcode (99e64d2e4)]$ tlp buck run mode/opt caffe2/test/inductor:pad_mm -- -r test_exclude_padding

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json&local_cache_key

 {F1974355662}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144884
Approved by: https://github.com/oulgen
2025-01-17 17:35:55 +00:00
567552b98b fix typo in doc and import for torch._library.triton (#144882)
Previously, the doc suggested `from torch._library.triton import wrap_triton, triton_op`, which doesn't work because wrap_triton is not imported in torch/_library/__init__.py, while `from torch.library import wrap_triton` works. This PR imports wrap_triton and fixes the doc.
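
For reference, the import path that works (and that the doc now points to):

```python
from torch.library import triton_op, wrap_triton
```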

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144882
Approved by: https://github.com/zou3519
2025-01-17 17:32:12 +00:00
18eba9575f [Accelerator] Use uniform GetAllocator for devices in new_qtensor function (#144849)
Fixes #144848
This PR is intended to use a uniform `GetAllocator()` call across all accelerators in the `new_qtensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144849
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-01-17 16:37:37 +00:00
a215e174a1 [BE] Remove conda from scripts and build files Part 2 (#145015)
Continuation of https://github.com/pytorch/pytorch/pull/144870

Remove conda logic from scripts:

1. Remove conda build from triton build script
2. Remove conda checks from setup.py
3. Remove conda from release scripts
4. Script read_conda_versions.sh is not used (checked via git grep)

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145015
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-17 16:26:24 +00:00
b7af210d8d Add SM89 support for f8f8bf16_rowwise() (#144348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144348
Approved by: https://github.com/drisspg
2025-01-17 15:12:35 +00:00
f522502b97 Revert "patch for block-wise quantization + pt2e (#144492)"
This reverts commit 1d43b8150852cdfcbe754edcf027d6e25f33ac63.

Reverted https://github.com/pytorch/pytorch/pull/144492 on behalf of https://github.com/albanD due to Broke a few things in trunk ([comment](https://github.com/pytorch/pytorch/pull/144492#issuecomment-2598485291))
2025-01-17 14:27:53 +00:00
dbed747aae Add Intel GPU specific CMake files to merge rules (#135110)
As the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135110
Approved by: https://github.com/atalman
2025-01-17 09:44:13 +00:00
a0d2c09115 Add flop formula for _scaled_mm (#144973)
This will make it work correctly with the partitioner's AutoAC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144973
Approved by: https://github.com/jeffdaily
2025-01-17 09:38:30 +00:00
96c0dbbe97 Enhance running pr time benchmarks locally experience. (#144838)
Summary: title

Test Plan: NA

Differential Revision: D68195894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838
Approved by: https://github.com/huydhn
2025-01-17 07:57:40 +00:00
465a1cfe2e update get start xpu (#143183)
- Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html)
- Support vision/audio prebuilt wheels on Windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-17 06:31:40 +00:00
fd8e0e3e10 [mps/inductor] Introduce is_mps_backend/skip_if_mps decorators. (#145035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145035
Approved by: https://github.com/jansel
2025-01-17 05:36:38 +00:00
cfd9cc19a3 [executorch hash update] update the pinned executorch hash (#145022)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145022
Approved by: https://github.com/pytorchbot
2025-01-17 04:51:56 +00:00
f13c864eda Fuzzer Improvements (#144952)
Added more tests and cleaned up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144952
Approved by: https://github.com/masnesral
2025-01-17 04:46:58 +00:00
1d43b81508 patch for block-wise quantization + pt2e (#144492)
Summary: As title; needed to enable the qcom block-wise quantization kernel

Test Plan: local test

Differential Revision: D67985303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144492
Approved by: https://github.com/angelayi, https://github.com/billmguo
2025-01-17 04:10:49 +00:00
adbbcd87d9 OpenReg: Split Allocator (#144843)
Split the Allocator into HostAllocator and DeviceAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144843
Approved by: https://github.com/albanD
2025-01-17 03:38:15 +00:00
43a00d73b3 [Trace Python Dispatcher] Support FuncTorchInterpreter (#144444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #144439
2025-01-17 02:26:37 +00:00
5d02575aa1 [Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439
Approved by: https://github.com/zou3519
2025-01-17 02:26:36 +00:00
3a50aba7d3 [dynamo] add option to not skip on empty graph (#144885)
Temporary fix to https://github.com/pytorch/pytorch/issues/144360.

Turning the config on globally will cause a bunch of tests to fail, which needs to be addressed in followups.

I had a previous attempt at https://github.com/pytorch/pytorch/pull/144712, but this is a more complicated change and will likely be absorbed into work to refactor Dynamo's exception handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144885
Approved by: https://github.com/jansel
2025-01-17 02:12:20 +00:00
7b56b039af [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr
ghstack dependencies: #143358, #143359
2025-01-17 01:51:37 +00:00
9c909bf3bb [dcp] Integrate stream extensions into DCP impl (#143359)
Summary: Updates FileSystemReader/Writer, Planner, DefaultLoad/SavePlanner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143359
Approved by: https://github.com/saumishr
ghstack dependencies: #143358
2025-01-17 01:51:37 +00:00
ba3f1c29ee [dcp] Add extension mechanism (#143358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358
Approved by: https://github.com/saumishr
2025-01-17 01:51:37 +00:00
176cde6240 Use torch with statement in torch distributed module (#144951)
# Motivation
In https://github.com/pytorch/pytorch/pull/137678, we used the device-agnostic APIs to generalize the distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the with statement of `torch.Stream` once https://github.com/pytorch/pytorch/pull/140138 has landed.
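
A hedged sketch of the device-agnostic pattern this enables, assuming the with-statement support from #140138:

```python
import torch

s = torch.Stream(device="cuda")
with s:                       # device-agnostic replacement for torch.cuda.stream(s)
    y = torch.randn(4, device="cuda") * 2
s.synchronize()
```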

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-01-17 01:49:28 +00:00
a61a65ff82 [MPSInductor] Add Worker.current_device method (#145023)
That just returns 0, as multi-gpu is not currently supported by MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145023
Approved by: https://github.com/dcci
2025-01-17 01:41:01 +00:00
55b0819bee Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit d2a77f48c9dc6df056051de270ce5875d8d2edd0.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/kit1980 due to breaking internal builds, max autotune tests a failing, see D68297606 ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2597250605))
2025-01-17 01:36:14 +00:00
45e6647268 [FSDP2] Make post-backward condition more robust (#144781)
Fixes https://github.com/pytorch/pytorch/issues/144755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144781
Approved by: https://github.com/fegin
2025-01-17 01:28:56 +00:00
6077102415 [DSD][BE] Rewrite some tests to remove with_comms (#143241)
Summary:
This saves ~1 minute of test time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143241
Approved by: https://github.com/mori360, https://github.com/XilunWu
ghstack dependencies: #143240
2025-01-17 01:15:55 +00:00
5d54e7b812 [Pipelining] move scale_grads to base class, add docs (#144833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833
Approved by: https://github.com/H-Huang
2025-01-17 01:07:12 +00:00
3afc5170d4 [Submodule] Upgrade to Cutlass 3.6 part deux (#144911)
# Summary
Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269)
Main change is that we identified and fixed the FA2 regression. More details can be found here https://github.com/pytorch/pytorch/issues/144729 and have landed that before this here: [D68194635](https://www.internalfb.com/diff/D68194635)

Differential Revision: D68194470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-17 00:53:42 +00:00
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286fd161af48af57a675827f4f849914d.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
42c64bd35c [MPSInductor] More is_dtype_supported gating (#144981)
This makes `GPUTest.test_scalar_cpu_tensor_arg_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144981
Approved by: https://github.com/dcci
ghstack dependencies: #144971
2025-01-17 00:48:02 +00:00
94c0f15302 Revert "cpp_wrapper: Move #includes to per-device header files (#143909)"
This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125.

Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch‎/_inductor‎/codegen‎/aoti_runtime‎/implementation.cpp‎ ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))
2025-01-17 00:36:38 +00:00
5e6e6200bf Revert "[dynamo][dicts] Consolidate dict(..) construction (#144342)"
This reverts commit a54a784b8207617d2b99fbded9bb34c94fb6dd23.

Reverted https://github.com/pytorch/pytorch/pull/144342 on behalf of https://github.com/kit1980 due to breaking internal builds, see D68125388 ([comment](https://github.com/pytorch/pytorch/pull/144342#issuecomment-2597184167))
2025-01-17 00:32:09 +00:00
cyy
2ea394ba29 Modernize C++ code (#144603)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603
Approved by: https://github.com/malfet
2025-01-17 00:25:18 +00:00
c3fcb3606d Profile compile_inner instead of _compile_inner (#144930)
Summary: title

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D67990492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144930
Approved by: https://github.com/jamesjwu
2025-01-16 23:59:27 +00:00
573fc42f25 [BE][CP] Use run_subtests instead of parametrize (#143240)
Summary:
This provides a 15X speedup in test execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143240
Approved by: https://github.com/XilunWu
2025-01-16 23:55:05 +00:00
fea9d18d5a [Utilization Log] Concurrently collect aggregate data during the output interval (#143235)
# Overview
Add a worker to collect metrics at short intervals:
1. Worker: add a worker to collect usage metrics, by default every 500ms; note this is configurable.
2. Aggregation: calculate the avg and max as a data point, by default every 5 seconds.

# Other
Clean up the log format to what is actually needed; currently we do not need to track GPU processors etc., or all pids from psutil.
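
A small illustrative sketch (not the actual CI logger) of the two-level sampling scheme: raw readings every 500 ms, aggregated avg/max data points every 5 s:

```python
import threading
import time

import psutil

def utilization_worker(stop_event, collect_interval=0.5, aggregate_interval=5.0):
    samples = []
    last_emit = time.monotonic()
    while not stop_event.is_set():
        samples.append(psutil.cpu_percent(interval=None))  # raw reading
        time.sleep(collect_interval)
        if time.monotonic() - last_emit >= aggregate_interval and samples:
            print({"avg": sum(samples) / len(samples), "max": max(samples)})
            samples.clear()
            last_emit = time.monotonic()

stop = threading.Event()
threading.Thread(target=utilization_worker, args=(stop,), daemon=True).start()
```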
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235
Approved by: https://github.com/huydhn
2025-01-16 23:52:43 +00:00
288d67d6c2 [inductor] [bug fix] align avg_pool with eager when handling uint (#144313)
Fixes #144310

~~We just need to add a check in lowering~~

Updated: we added the error checking in the `meta registration` instead.

### UT
```
 pytest -s -v test/inductor/test_torchinductor.py -k test_avg_pool_errors_with_uint
```
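
A hedged sketch of the aligned behavior (the exact error type and message are assumptions, not taken from the PR):

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 255, (1, 3, 8, 8), dtype=torch.uint8)
try:
    F.avg_pool2d(x, kernel_size=2)   # eager rejects uint inputs
except RuntimeError as e:
    print("eager:", e)
# After this change, the compiled path raises at meta registration as well,
# instead of silently diverging from eager.
```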

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144313
Approved by: https://github.com/jansel, https://github.com/jgong5
2025-01-16 23:37:51 +00:00
d2a77f48c9 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 23:04:56 +00:00
clr
171fb7f358 easy: Fix missing tab in test/dynamo/test_compile.py (#145013)
It turns out that if you request a merge on a pytorch PR, and then push a fix for a bad rebase, and the test is
relatively new, the merge will go through with the previous commit and not notice the test break.

Explicitly running the test now passes instead of failing, and this is just the last missing commit from https://github.com/pytorch/pytorch/pull/144817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145013
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-01-16 22:51:51 +00:00
181d93b4f2 [BE] Move is_device_supported to helper function (#144971)
And extend `test_inf` to check half (explicitly instead of check_lowp) and bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144971
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
2025-01-16 22:44:03 +00:00
a33e02cb26 [executorch hash update] update the pinned executorch hash (#144813)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144813
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-16 22:39:00 +00:00
7c7bcb1e33 update IS_JETSON check (#144725)
update IS_JETSON check to include the latest SM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144725
Approved by: https://github.com/eqy
2025-01-16 22:34:48 +00:00
95c363cc9b dynamo: Don't crash with internal error if getattr on a tensor fails (#144817)
This prevents crashes when getattr is called on a tensor for something
which doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144817
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 22:04:06 +00:00
0e6d44df3f Add heuristic to fail block pointer match early (#144681)
This PR adds a heuristic to potentially fail the block pointer match early. Expressions like the one below take a long time to match using sympy (e.g. > 100 seconds):
```python
# torch._inductor.config.triton.use_block_ptr = True
# torch._inductor.config.triton.prefer_nd_tiling = True
# Expression from pytest -k test_max_pool2d1_dynamic_shapes_cuda:
 ((xindex//ps1))*((s2 - 3//2))**2 + 2*((xindex//ps1))*((s2 - 3//2)) + ((xindex//ps1)) + ((s2 - 3//2))*(ModularIndexing(xindex, ps0, ps0)) + (ModularIndexing(xindex, 1, ps0)) + (ModularIndexing(xindex, ps0, ps0))
```
Additionally, the heuristic for the number of dimensions based on the indexing expression is refined to only add dimensions for FloorDiv(index, denom) and ModularIndexing(index, denom, modulo) instead of including FloorDiv/ModularIndexing expressions that don't involve the index.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144681
Approved by: https://github.com/jansel
2025-01-16 21:57:30 +00:00
46b92c025d Revert "Cholesky mps implementation (#144193)"
This reverts commit 727ae1331820bb3d83d70e9cd3c9d3cd4c79ff56.

Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see aa4a1ff027/1 ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))
2025-01-16 21:37:32 +00:00
aa4a1ff027 Revert "Prevent _legacy_load with weights_only=True (#144914)"
This reverts commit 7c3aa1da1c97812af54d41f3f0eff2ef922c0f32.

Reverted https://github.com/pytorch/pytorch/pull/144914 on behalf of https://github.com/izaitsevfb due to breaking inductor on trunk ([comment](https://github.com/pytorch/pytorch/pull/144914#issuecomment-2596922781))
2025-01-16 21:29:50 +00:00
4ea189422d Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))
2025-01-16 21:12:41 +00:00
3a5bf0bc36 expose extra torch_python apis (#144746)
Fixes #144302
After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746
Approved by: https://github.com/albanD
2025-01-16 20:50:31 +00:00
577708e6de Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-16 20:46:06 +00:00
a9bfc5f70c Fix boundary conditions for hardswish backward (#143899)
Fixes #136345.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899
Approved by: https://github.com/jgong5, https://github.com/ezyang
2025-01-16 20:26:27 +00:00
aad5f600ff [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 19:51:45 +00:00
3908be676c Fix loading older state_dict into AdamW after refactor (#144972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144972
Approved by: https://github.com/albanD
2025-01-16 19:50:31 +00:00
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes the `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instantiated
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
  them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
7c3aa1da1c Prevent _legacy_load with weights_only=True (#144914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144914
Approved by: https://github.com/malfet, https://github.com/albanD
2025-01-16 19:33:46 +00:00
cf28d613f1 Allow ROCm runner to upload benchmark results if found (#144710)
https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database. This will unblock AMD when they try to run MI300 benchmarks on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144710
Approved by: https://github.com/kit1980
2025-01-16 19:31:45 +00:00
31a73eb712 fix acquire pattern in topk (#144945)
Similar to #128455, topk needs another threadfence to complete the acquire pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144945
Approved by: https://github.com/Skylion007
2025-01-16 19:20:43 +00:00
3004b657f0 [Inductor][FlexAttention] Supports dynamic shapes with custom kernel options (#144938)
Fixes #144815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144938
Approved by: https://github.com/drisspg
2025-01-16 19:02:35 +00:00
e32d2bf853 Document decoupled_weight_decay for Adam for consistency with N/RAdam (#144984)
Followup from #144972 and #143710
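
For reference, a minimal sketch of the option being documented (argument values are illustrative):

```python
import torch

model = torch.nn.Linear(4, 4)
# decoupled_weight_decay=True makes Adam apply weight decay directly to the
# weights (AdamW-style) instead of folding it into the gradient.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2,
                       decoupled_weight_decay=True)
```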

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144984
Approved by: https://github.com/albanD
2025-01-16 18:58:29 +00:00
ad15436db6 Fix pt2-bug-report.yml formatting (#144987)
This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574

Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"`
Before it printed
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': ''}}
```
After
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}}
```

Fixes https://github.com/pytorch/pytorch/issues/144970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-16 18:58:07 +00:00
829c4570ca Revert "[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)"
This reverts commit 1b34665767fcc35ae4a8f211945a24701c79df79.

Reverted https://github.com/pytorch/pytorch/pull/144877 on behalf of https://github.com/malfet due to Actually no, lint is red ([comment](https://github.com/pytorch/pytorch/pull/144877#issuecomment-2596385712))
2025-01-16 18:10:37 +00:00
13d35ea67a [BE] Add missing throw of std::runtime_error in scrc/cuda/utils.cpp (#144962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962
Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 17:35:39 +00:00
53256edff9 [export] Support module inputs for non strict mode. (#143925)
Summary:
Add experimental support for torch.nn.Module as input types.

Before this change, we don't support module inputs but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where we directly pass in a module input for different variants of the same models.

Since we don't really care about non-param or non-buffer states in non strict mode, we don't care about those either and pretend they are like plain constants during tracing. We treat any module input like a nested container of tensor, and each time we will automatically register a pytree handler for these module types to flatten its state dict into a group of tensors. We will just inline any module method call during tracing like we did for `self` module in export_for_training. This will make input modules' behavior very similar to the training module in typical case, except that we don't record the inputs as parameter or buffers but rather just plain user inputs.
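
A hedged sketch of the use case (a module passed as a plain input in non-strict export); the class names are illustrative:

```python
import torch

class Variant(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

class Runner(torch.nn.Module):
    def forward(self, variant, x):
        return variant.proj(x)

ep = torch.export.export(Runner(), (Variant(), torch.randn(2, 4)), strict=False)
```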

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input

Differential Revision: D67680827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925
Approved by: https://github.com/tugsbayasgalan
2025-01-16 17:30:36 +00:00
519269a415 [BE] - Remove conda test and upload scripts and env variables from Workflows Part 1 (#144870)
Remove conda test and upload scripts and env variables from Workflows

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144870
Approved by: https://github.com/malfet
2025-01-16 17:20:14 +00:00
727ae13318 Cholesky mps implementation (#144193)
Requested in #77764

PR is still in draft because it needs some cleanups and optimizations to at least match CPU performance. Tasks:
- [x] Make `upper=True` work, only `upper=False` works now
- [x] Code cleanup
- [x] Optimizations (though I might need some help on this; tried my best, maybe there is still some more to squeeze out)
- [x] Checks for positive definite input
- [x] Support for (*, N, N) input, currently only supports (B, N, N) input
- [x] Support other dtypes(float16, bfloat16)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 16:26:46 +00:00
1b34665767 [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-16 16:22:06 +00:00
3d29de3ac8 [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [1/n] (#144709)
Summary:
According to angelayi, these two flags indicated different things when we had two-pass codegen, but since we now basically keep the two flags the same, we should merge them.

This can prevent bugs (e.g. changing the value of aot_mode without covering branches like `if V.aot_compilation is True`) from happening when we're trying to add different code paths to tweak the value of aot_mode in the future.

Test Plan: CI

Differential Revision: D68122536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144709
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-01-16 16:02:18 +00:00
241a8a101b Fix erroneous at_vreinterpretq_u16_bf16 call (#144883)
Here, `mask` is definitely a `uint16x8_t`, not an `at_bfloat16x8_t`, so we shouldn't be reinterpreting it. Candidate fix for #144818.

Differential Revision: [D68224128](https://our.internmc.facebook.com/intern/diff/D68224128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144883
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 15:16:28 +00:00
6559374494 Revert "Add flop formula for _scaled_mm (#144872)"
This reverts commit f31452268bf9f7e395f263cd8a9d693633ea75ce.

Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))
2025-01-16 15:16:18 +00:00
6470b0ea6f Update torch-xpu-ops commit pin (#144739)
Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](22cc419e4e), includes:

- Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279
- Aten operator coverage improvement

Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-01-16 15:12:37 +00:00
f31452268b Add flop formula for _scaled_mm (#144872)
This will make it work correctly with the partitioner's AutoAC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872
Approved by: https://github.com/vkuzo
2025-01-16 13:57:54 +00:00
1c290912e4 Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit 9e568cbaa22df89b77e112f1a373d82acb2e6219.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2595210355))
2025-01-16 10:59:30 +00:00
0c0583254e [inductor] fix index.Tensor fallback (#144736)
The original issue is that we see an accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/). The debugging is hard, but the root cause is relatively simple: the model has mixed-device inputs for index.Tensor, which causes Inductor to fall back, and the meta kernel for index.Tensor returns a tensor whose strides are inconsistent with the eager kernel.

The following code snippet
```
import torch
from torch._subclasses import FakeTensorMode

device = "cuda"

x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last)
x = x.view(2, 12, 16, 32, 32)

i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]

print(f"Eager stride: {x[i1, i2].stride()}")

mode = FakeTensorMode()
with mode:
    f_x = mode.from_tensor(x)
    f_i1 = mode.from_tensor(i1)
    f_i2 = mode.from_tensor(i2)
    f_out = f_x[f_i1, f_i2]
    print(f"Meta stride: {f_out.stride()}")
```

would output:
```
Eager stride: (49152, 16384, 1, 512, 16)
Meta stride: (49152, 16384, 1024, 32, 1)
```

In this PR, I fix the problem by running the eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change the meta/eager kernel implementations so that their output layouts match, but I'm not sure how to properly do that.
In the index.Tensor meta kernel, we always produce dense output:  6d56277682/torch/_meta_registrations.py (L3184). The eager kernel, by contrast, seems to leverage TensorIteratorBase to decide some dimension permutation: 6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308). We can duplicate this logic in the meta kernel implementation if we really want meta to match eager. I can follow up on this if people have a strong opinion about it.

And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. With that, the issue debugged here would be much easier to root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736
Approved by: https://github.com/jansel
2025-01-16 09:38:29 +00:00
57d5659c3b XFAIL test_save_load_checkpoint (#144927)
Fixes https://github.com/pytorch/pytorch/issues/137771

The issue keeps showing up, and rerunning the disabled tests couldn't reproduce it. So, XFAIL it while waiting for the distributed team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144927
Approved by: https://github.com/kit1980, https://github.com/malfet
2025-01-16 07:31:56 +00:00
7d8c087e24 [Pipelining] Improve shape inference debug logging (#144929)
Remove the log that just said "running forward" since that is not so useful
in itself; replace it with a somewhat equivalent log that reports both input
and output shapes after running forward.

Note: enabled by `TORCH_LOGS=+pp`

Example:
```
[rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929
Approved by: https://github.com/H-Huang
2025-01-16 07:30:11 +00:00
0b17c09893 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-16 06:46:26 +00:00
49bdc418be Add strict kwarg to nn.Module.set_submodule and fix bug for non dot delineated strings (#143455)
Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix, it will not create a new attribute if the target is a non-dot-delimited string. If you want to create leaf nodes on `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`.

Fixes https://github.com/pytorch/pytorch/issues/143441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455
Approved by: https://github.com/mikaylagawarecki
2025-01-16 05:06:33 +00:00
e3c4d1b7d6 [c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916)
When we introduced partial match, we accidentally started marking a mismatch for the full-match case. This is wrong, and this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916
Approved by: https://github.com/c-p-i-o
2025-01-16 04:36:30 +00:00
9e568cbaa2 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 04:29:44 +00:00
52a620845b OpenReg: Use device agnostic API (#144840)
Use `torch.accelerator.device_count()` to get the number of devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144840
Approved by: https://github.com/albanD
2025-01-16 03:31:52 +00:00
1230de4c1b [Quant][Inductor][X86] Separate binary post op fusion and lowering for qconv (#144318)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qconv so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224, #144312
2025-01-16 03:30:36 +00:00
cyy
843627b7b1 Remove unnecessary once flag usage (#143255)
Static local variables in C++11 are guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration)
```
If multiple threads attempt to initialize the same static local variable concurrently,
the initialization occurs exactly once
(similar behavior can be obtained for arbitrary functions with std::call_once.
Usual implementations of this feature use variants
of the double-checked locking pattern,
which reduces runtime overhead for already-initialized local statics
 to a single non-atomic boolean comparison.
```
Given that a static c10::once_flag was used before, why not just use the associated function to initialise the related static variables directly? That is the motivation behind this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255
Approved by: https://github.com/albanD
2025-01-16 02:36:11 +00:00
41ec2e8d3e [MPSInductor] Fix codegen regression (#144924)
Caused by https://github.com/pytorch/pytorch/pull/144649

Do not try to insert anything into the header if wrapper is not ready yet

Fixes `test_sort_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144924
Approved by: https://github.com/dcci
ghstack dependencies: #144827, #144917
2025-01-16 02:12:42 +00:00
05505771a0 [MPSInductor] Properly convert index (#144917)
By calling `self.index_to_str` from `load`,`store` and `check_bounds` in order to properly handle sizevars variables renames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144917
Approved by: https://github.com/dcci
ghstack dependencies: #144827
2025-01-16 02:12:41 +00:00
d595b96059 Revert "restore rng generation for fbcode (#144819)"
This reverts commit 2bc18a905544f4e25cfbd354351418b36a0f5fc1.

Reverted https://github.com/pytorch/pytorch/pull/144819 on behalf of https://github.com/ngimel due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/144819#issuecomment-2594298941))
2025-01-16 01:52:29 +00:00
6492851125 symbolic_convert: Don't fail when we hit a undefined name (#144784)
We're using a Python builtin NameError here
instead of throwing an Unsupported exception. This causes the
NameError to get wrapped in an InternalTorchDynamoError
instead of just causing a graph break and letting the user code fail
directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144784
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 01:47:48 +00:00
c8bcb22e5f Default Copies are not vectorized in v3.6.0 of cutlass (#144837)
Summary:
FlashAttentionV2 perf tanked in v3.6.0; see https://github.com/pytorch/pytorch/issues/144729 for more details.

This PR makes it possible to land the v3.6.0 update and fixes the perf regression. See https://github.com/pytorch/pytorch/issues/144729#issuecomment-2591644076 for analysis; we also have various internal tests to verify.

Differential Revision: D68194635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144837
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-16 01:12:46 +00:00
926f9056a9 speculation_log: Raise a unique error for divergence issues (#144785)
This is primarily sent for discussion and to see what tests fail due to
it. The idea is that rather than capturing this as a regex on the
fail_reason, we just give it a unique failure type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144785
Approved by: https://github.com/ezyang
2025-01-16 00:49:43 +00:00
b90231a189 [inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807)
motivation: Ed's advice to avoid `except ImportError` (i.e. based on the fact that your target module/class might in fact exist, but you might run into some different ImportError whose stacktrace you now ignore).

additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern:
```
try:
   ...
except ImportError:
    try:
        ...
    except ImportError:
        try:
            ...
```

suggestions on better ways to do this would be appreciated!
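
One possible alternative, sketched purely for illustration (the candidate module paths below are assumptions, not an authoritative list): probe the known locations explicitly and only swallow `ModuleNotFoundError`, so unrelated import failures keep their stack traces.

```python
import importlib

# Locations where AttrsDescriptor has lived across triton versions (illustrative).
_CANDIDATES = (
    "triton.backends.compiler",
    "triton.compiler.compiler",
)

def find_attrs_descriptor():
    for modname in _CANDIDATES:
        try:
            mod = importlib.import_module(modname)
        except ModuleNotFoundError:
            continue  # this location doesn't exist in the installed triton
        cls = getattr(mod, "AttrsDescriptor", None)
        if cls is not None:
            return cls
    return None
```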

test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in june, when AttrsDescriptor was still in triton.compiler.compiler)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807
Approved by: https://github.com/ezyang
2025-01-16 00:32:29 +00:00
cyy
ee97d80be2 Apply Ruff fixes and pyupgrade to torch/jit (#144208)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144208
Approved by: https://github.com/davidberard98
2025-01-16 00:28:50 +00:00
774f21a370 [export] handle buffer/input mutations for joint-graph (#144806)
Summary: previous construction of GraphSignature output specs didn't consider buffer/user input mutations

Test Plan: test_experimental

Differential Revision: D68177409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144806
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2025-01-16 00:22:16 +00:00
d7f45fc575 dynamic shape support for interpolate(antialias=True) backward (#141198)
Fixes https://github.com/pytorch/pytorch/issues/141187
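
A small sketch of the kind of program this enables (shapes and flags are illustrative):

```python
import torch
import torch.nn.functional as F

def downsample(t):
    return F.interpolate(t, scale_factor=0.5, mode="bilinear", antialias=True)

x = torch.randn(1, 3, 64, 64, requires_grad=True)
# Compile with dynamic shapes and run the antialiased backward.
torch.compile(downsample, dynamic=True)(x).sum().backward()
```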

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198
Approved by: https://github.com/ezyang, https://github.com/Chillee
ghstack dependencies: #141161
2025-01-16 00:08:25 +00:00
4831f89790 support numbers as tensors for aten.copy(Tensor, Tensor) (#141161)
Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`.
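
A hedged sketch of the resulting behavior, calling the ops directly:

```python
import torch

x = torch.zeros(3)
torch.ops.aten.copy_(x, 1.0)     # in-place variant already accepted a Python number
y = torch.ops.aten.copy(x, 2.0)  # functional variant gets the same treatment
print(y)                         # tensor([2., 2., 2.])
```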

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161
Approved by: https://github.com/ezyang
2025-01-16 00:08:25 +00:00
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
fb4b5a9299 [ONNX] Use python_dispatcher in type promotion (#144801)
Fix #143118

Use python_dispatcher in the type promotion pass to preserve symbolic shapes according to @angelayi 's suggestions. (Thanks!)

Tested locally. I wasn't able to create a minimal repro except for using the full model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144801
Approved by: https://github.com/titaiwangms
2025-01-15 23:25:19 +00:00
7265dc0622 Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/huydhn
2025-01-15 23:20:10 +00:00
4e1834f5f3 use cooperative schedule in scaled_mm for fast_accum=false (#144809)
This improves perf for large matrices by more than 2x; a more detailed benchmark is coming.
On master
![image](https://github.com/user-attachments/assets/fc6a0987-5b82-475d-a2ff-b46641bb17dc)
On this branch
<img width="601" alt="image" src="https://github.com/user-attachments/assets/7f55152b-1110-45e4-b2ea-6f274d543869" />
A plot similar to https://github.com/pytorch/ao/pull/1325#discussion_r1868193786
<details>
  <summary>Benchmarking code:</summary>

```python
import torch
from triton.testing import do_bench
import itertools

def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

def fn_aten(a, b, scale, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

for i,j,k in itertools.product(range(9, 15), range(9, 15), range(9, 15)):
    m = 2**i
    n = 2**j
    k = 2**k

    a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32)
    scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32)
    scale_0 = torch.randn((), device="cuda", dtype=torch.float32)

    ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50)
    ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50)

    ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50)
    ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50)

    print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow /ms_tensor_fast}, ratio_rw={ms_rowwise_slow / ms_rowwise_fast}")

```
</details>

Higher N/K values still incur about a 40% penalty; perhaps some additional heuristic tweaks would be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144809
Approved by: https://github.com/drisspg
2025-01-15 23:04:14 +00:00
0f051eaf66 Revert "Fix global namespace pollution in ATen/Dispatch.h (#138626)"
This reverts commit 326c7cae28783f29c577b5a5d3ac38a3b61188bd.

Reverted https://github.com/pytorch/pytorch/pull/138626 on behalf of https://github.com/malfet due to This broke inductor tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](https://github.com/pytorch/pytorch/pull/138626#issuecomment-2594021436))
2025-01-15 21:59:04 +00:00
Sam
c7b2f7dd14 Add generator parameter to rand*_like functions (#136780)
Fixes #128786
Fixes #101974
Fixes #27072
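
A minimal usage sketch (the keyword name `generator` is assumed to match the other random factory functions):

```python
import torch

g = torch.Generator().manual_seed(0)
x = torch.empty(3, 4)

# Reproducible *_like sampling without touching the global RNG state.
a = torch.rand_like(x, generator=g)
b = torch.randn_like(x, generator=torch.Generator().manual_seed(0))
print(a.shape, b.shape)
```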

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780
Approved by: https://github.com/Chillee, https://github.com/ezyang
2025-01-15 21:16:52 +00:00
d62b3979da cpp_wrapper: Move #includes to per-device header files (#143909)
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909
Approved by: https://github.com/desertfire
2025-01-15 21:14:02 +00:00
05095a45f2 Fix the wrong artifact in remaining workflows (#144812)
I missed them in https://github.com/pytorch/pytorch/pull/144694 as they weren't run often.  But they are still failing nonetheless, i.e. https://github.com/pytorch/pytorch/actions/runs/12762640334/job/35578870178

The issue was from https://github.com/pytorch/pytorch/pull/125401 where it added `use-gha: ${{ inputs.use-gha }}` to linux_test workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144812
Approved by: https://github.com/clee2000
2025-01-15 20:36:40 +00:00
b88dcb4835 dynamo: Don't crash when tracing a missing attr on a constant. (#144593)
dynamo: Don't crash when tracing a missing attr on a constant.

This throws an InternalTorchDynamoError: AttributeError: 'NoneType' object has no attribute 'max'
instead of just skipping the bad call when tracing and throwing a
normal AttributeError.

There are two questions that I would love reviewer comment on.
1) Is throwing unimplemented the right thing here? or should I throw
   something like ObservedAttributeError
2) Do we need to worry about performance with this code? In particular,
   should we just catch the exception? Or maybe cache the lookup result?
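
A hypothetical repro sketch (not the PR's actual test) of the shape of user code that hits this:

```python
import torch

@torch.compile(backend="eager")
def f(x):
    cfg = None
    # Attribute access on a constant: this should now surface as a plain
    # AttributeError (or a graph break), not an InternalTorchDynamoError.
    return x + cfg.max

# f(torch.ones(2))  # expected to raise AttributeError
```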

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144593
Approved by: https://github.com/jansel
2025-01-15 20:23:43 +00:00
d812fdd490 fix as_bool serde (#144791)
Differential Revision: [D68167701](https://our.internmc.facebook.com/intern/diff/D68167701/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144791
Approved by: https://github.com/pianpwk
2025-01-15 20:22:26 +00:00
904641769e [MPSInductor] Implement pow() (#144827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144827
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-15 20:11:34 +00:00
b410378d93 Register nonzero for meta device for FBLSim (#144727)
Summary:
Fix `nonzero is not registered to meta` issue:
```
"NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered".
```

Reviewed By: ezyang

Differential Revision: D66525640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727
Approved by: https://github.com/ezyang
2025-01-15 19:40:42 +00:00
834086c023 [export] Load side info about pos/kw argument kind for serialization. (#144686)
Summary:
Fixing issue of nodes like
```
torch.ops.aten.linear.default(x, w, b)
```
being deserialized as
```
torch.ops.aten.linear.default(x, w, bias=b)
```
which breaks roundtripping.

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r TestDeserialize
buck test mode/opt caffe2/test:test_export -- -r TestSerialize

Differential Revision: D67991410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686
Approved by: https://github.com/angelayi
2025-01-15 19:08:38 +00:00
898a90c6bb [dynamo][hop] Introduce FlexAttentionBackwardHighOrderVariable (#144533)
FIXES https://github.com/pytorch/pytorch/issues/143180

This PR adds a new variable mapping to SourcelessBuilder to represent the flex attention intermediates. The variable proxies a call to the HOP and carries over the graph state (subgraphs represented as UnspecializedNNModuleVariable) to the dynamo output graph. This is safe to do because the nn modules used in flex attention have either been speculated on before, or are outputs of make_fx of the forward.

tlparse of `TestCompiledAutograd.test_flex_attention`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpiWendk/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

```python
class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_ : list):
         ...
         # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
        ...
        fw_graph0_0 = self.fw_graph0_0
        joint_graph0_0 = self.joint_graph0_0
        mask_graph0_0 = self.mask_graph0_0
        flex_attention_backward = torch.ops.higher_order.flex_attention_backward(aot0_primals_1, aot0_primals_1, aot0_primals_1, aot0_detach_3, aot0_detach_5, aot0_expand_5, aot0_zeros_1, fw_graph0_0, joint_graph0_0, (1, 1, aot0_ones, aot0_zeros, None, None, aot0__to_copy_1, aot0__to_copy_2, None, None, 1073741824, 1073741824, mask_graph0_0), 0.125, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True}, (), ());  aot0_primals_1 = aot0_detach_3 = aot0_detach_5 = aot0_expand_5 = aot0_zeros_1 = fw_graph0_0 = joint_graph0_0 = aot0_ones = aot0_zeros = aot0__to_copy_1 = aot0__to_copy_2 = mask_graph0_0 = None
        aot0_getitem_4: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[0]
        aot0_getitem_5: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[1]
        aot0_getitem_6: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[2];  flex_attention_backward = None
        ...

    class fw_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0"):
            return arg0_1

    class joint_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0", arg5_1: "bf16[][]cuda:0"):
            return [arg5_1, None, None, None, None]

    class mask_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "i32[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0"):
             # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
            new_ones: "b8[][]cuda:0" = torch.ops.aten.new_ones.default(arg0_1, [], dtype = torch.bool, device = device(type='cuda', index=0), pin_memory = False);  arg0_1 = None
            return new_ones

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144533
Approved by: https://github.com/zou3519
2025-01-15 18:40:57 +00:00
eqy
a6763b7b81 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-15 18:37:55 +00:00
6ac0616504 [ROCm] hipblaslt rowwise f8 gemm (#144432)
hipblaslt added rowwise f8 gemm support.  Integrate with scaled_mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432
Approved by: https://github.com/drisspg
2025-01-15 18:23:44 +00:00
069419569d [PagedAttention] Support different input position for each batch index (#144693)
In LLM inference, each request usually has a different prefill length, leading to a different input position for each batch index. This PR adds such support to paged attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144693
Approved by: https://github.com/drisspg
2025-01-15 18:03:52 +00:00
7e80758efc [CUDAGraph][Docs] add cuda to torch.randn (#144793)
The previous doc example created the `torch.randn` tensor on CPU, so CUDA graph capture was skipped.

Fixes #144386
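
A minimal sketch of the corrected pattern (requires a CUDA device; shapes are illustrative):

```python
import torch

# The static tensors must live on the GPU; a CPU tensor makes capture a no-op.
static_input = torch.randn(8, 8, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = static_input * 2

static_input.copy_(torch.randn(8, 8, device="cuda"))
g.replay()
print(static_output)
```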

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793
Approved by: https://github.com/eellison
2025-01-15 18:02:10 +00:00
ee8f833d13 Undo leading underscore on ctx for breakpoint (#144864)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144864
Approved by: https://github.com/Skylion007
2025-01-15 18:00:58 +00:00
443de667b1 Revert "Enable s8s8s8 for qlinear with mkl-dnn (#139887)"
This reverts commit dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d.

Reverted https://github.com/pytorch/pytorch/pull/139887 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to have broken trunk. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/12788709683/job/35651699934) [HUD commit link](dc8692b0eb) ([comment](https://github.com/pytorch/pytorch/pull/139887#issuecomment-2593597977))
2025-01-15 17:58:33 +00:00
d065e8a9de [ez] add lint commits to .git-blame-ignore-revs (#144576)
Test Plan: Ran git blame on .lintrunner.toml and github's linter (+ manual testing) shows all commits exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576
Approved by: https://github.com/janeyx99
2025-01-15 17:39:29 +00:00
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR updates all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument.
As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument, but passing `device` is discouraged. If not given, the default `device` is still 'cuda' for BC.
Additionally, based on device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective. If not given, the default `device` will be the current accelerator.
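
A short sketch of the preferred call pattern after this change (an available accelerator is assumed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

t = torch.randn(4)
pinned = t.pin_memory()        # no device argument needed
print(pinned.is_pinned())

# For DataLoader, prefer pin_memory=True alone; pin_memory_device is discouraged.
loader = DataLoader(TensorDataset(torch.randn(8, 2)), pin_memory=True)
```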

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
0dca756832 Revert "Upload METADATA file with whl binaries (#143677)" (#144706)
This reverts commit 3eb3f4ed5580010a7961d996ccc6ee19c7ccbb5e.

Also reverts https://github.com/pytorch/pytorch/pull/144164

Manual revert because the above causes merge conflicts

Reverting in favor of https://github.com/pytorch/test-infra/pull/6159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144706
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-01-15 17:20:21 +00:00
d782e46a36 [BE] typing for decorators - library (#138969)
Test Plan: unit tests

Differential Revision: D62302678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138969
Approved by: https://github.com/zou3519
2025-01-15 17:08:55 +00:00
c7a9599100 Handle meta tensors in FX quantization (#144726)
Summary:
D66895899 got reverted in D67565250 because of pytorch OSS linter failure.
Adding back with the format the linter suggested
https://github.com/pytorch/pytorch/actions/runs/12443655335/job/34743090791

Test Plan: buck run fbcode//mode/dev-nosan fbcode//torchrec/fb/quant/tests:test_embedding_modules

Reviewed By: emlin

Differential Revision: D68132568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144726
Approved by: https://github.com/iamzainhuda, https://github.com/janeyx99
2025-01-15 16:49:43 +00:00
2bc18a9055 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-15 16:34:25 +00:00
154185dcd0 Revert "Removed unused _RequiredParameter (#144771)"
This reverts commit 6a5f895e549665a6895c84881a35736677071048.

Reverted https://github.com/pytorch/pytorch/pull/144771 on behalf of https://github.com/malfet due to It broke number of cpuinductor tests ([comment](https://github.com/pytorch/pytorch/pull/144771#issuecomment-2593293542))
2025-01-15 15:51:33 +00:00
7c52c97a65 Expose several APIs to public (torch python APIs) (#144525)
Fixes #144302
Tries to expose several APIs publicly for the PrivateUse1 scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-15 14:34:45 +00:00
dc8692b0eb Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/ng-05, https://github.com/digantdesai
2025-01-15 12:51:21 +00:00
7e1c1e65eb Graph freezing preparation for non-Inductor backends (#139902)
Enable preparing module named parameters and buffers in tracing context for non-Inductor backends to implement graph freezing.

Fixes #139272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139902
Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/gujinghui
2025-01-15 11:25:04 +00:00
62ce3e6e84 refresh benchmarks results after recent recent regressions (#143075)
refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075
Approved by: https://github.com/bobrenjc93, https://github.com/huydhn
2025-01-15 09:11:57 +00:00
e263f0af23 [BE] Make a SymbolInfo NamedTuple (#144745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144745
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
2025-01-15 08:59:27 +00:00
d9d7cca009 make eval_frame safe (#141357)
Fixes #108942

This PR converts eval_frame.c's static extension types to heap types, making it thread- and sub-interpreter-safe.

The current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR.

todo / suggestions:

1. uplift `eval_frame_callback_key` to module state
2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init`
3. define configurations for module allowing sub-interpreters or not

```c
static int module_exec(PyObject *m) { return 0; }

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_exec, module_exec},
    {0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT,
     ....
    .m_slots = module_slots
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357
Approved by: https://github.com/jansel

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2025-01-15 07:37:50 +00:00
6ba53a5f1c [AMD] De-noise tf32 warnings (#144797)
Summary: This is way too noisy especially during unit tests. So just log once.

Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ).

Differential Revision: D68167633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144797
Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu
2025-01-15 07:10:38 +00:00
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overloads are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `#include <ostream>` in there at the moment anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
b801210035 Restore support for other types of async_compile pools (spawn, fork) (#144491)
Summary: https://github.com/pytorch/pytorch/pull/142001 removed support for process pools other than "subprocess", but some OSS users still find it useful; put it back.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144491
Approved by: https://github.com/jansel, https://github.com/haifeng-jin
2025-01-15 06:04:49 +00:00
326c7cae28 Fix global namespace pollution in ATen/Dispatch.h (#138626)
Summary:

Was it a typo? Since we already have `at::detail::record_kernel_function_dtype()` in `ATen/Dispatch.h`

Test Plan: just build

Differential Revision: D64642080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138626
Approved by: https://github.com/malfet
2025-01-15 05:43:54 +00:00
7d71ddbe5d Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802)
Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/)

This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return).
A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as:
- It's functional (i.e. does not access global state);
- or its value is constant folded away (and guarded against by dynamo)

The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache.

That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache.

The list of safe handlers there either:
- Don't access global state;
- Guard on global state; or
- Always returns a constant that never changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802
Approved by: https://github.com/bdhirsh
2025-01-15 05:41:36 +00:00
79312ddb73 [PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702)
There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation it makes sense to check that the number of stages should be >= number of microbatches otherwise there will be an even larger bubble.

This can be removed when we have the single stage schedules to use an IR and updated to run with schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702
Approved by: https://github.com/kwen2501
2025-01-15 05:35:29 +00:00
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling reference in `CudaEventCache`, so we want to fix it.
1. We add a unit test to repro the issue mentioned in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent gets deleted. (Thanks for the suggestion from @kwen2501.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
9cd6f46130 [ca] raise error message on AOT Autograd caching (#144595)
FIXES https://github.com/pytorch/pytorch/issues/144175, bandaid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144595
Approved by: https://github.com/bdhirsh
2025-01-15 05:09:42 +00:00
e0bbff6019 [c10d][ez] Add comments to the end of Macro for better readability (#144789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789
Approved by: https://github.com/c-p-i-o
2025-01-15 05:06:41 +00:00
d2ca8163c0 [MPSInductor] Support abs in MetalPrintExpr (#144826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144826
Approved by: https://github.com/dcci
ghstack dependencies: #144509, #144798, #144795, #144796
2025-01-15 05:01:25 +00:00
9610a22e94 Fix FakeTensor device creation for MPS (#144796)
By promoting torch.device("mps") to `torch.device("mps:0")`, but skipping the `is_initialized` check, as MPS does not really support multi-GPU right now

This fixes `GPUTests.test_remove_no_ops_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144796
Approved by: https://github.com/ezyang
ghstack dependencies: #144509, #144798, #144795
2025-01-15 05:01:25 +00:00
18786c65e5 [BE] Extend test_remove_no_ops (#144795)
----

- Use `is_dtype_supported` to skip dtype promotions portion of the test on unsupported device
- Extend it to use `torch.float16` so promotions could be checked there
- Implement `CpuInterface.is_bfloat16_supported` that returns true (which looks like the case, even if it's supported via emulation)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795
Approved by: https://github.com/Skylion007
ghstack dependencies: #144509, #144798
2025-01-15 05:00:26 +00:00
48f7e7c378 [torch][ao][EASY] Change print to log in numeric debugger to avoid large output (#144790)
Summary:
This print statement was spewing a bunch of data in logs by default, but it should
be silenceable.
Use `log.debug` instead.

Differential Revision: D68166823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144790
Approved by: https://github.com/tarun292
2025-01-15 04:58:56 +00:00
6a5f895e54 Removed unused _RequiredParameter (#144771)
As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used.

The `required` object was initially introduced in this [PR](4db6667923) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`.

Everything not included in this PR is Not a Contribution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771
Approved by: https://github.com/janeyx99
2025-01-15 04:11:17 +00:00
cyy
d87aad6877 [5/N] Apply Ruff fixes and pyupgrade to Python 3.9 (#144205)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144205
Approved by: https://github.com/albanD
2025-01-15 04:00:47 +00:00
db787181b5 Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738)
Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729

Test Plan: sand castle

Differential Revision: D68137326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738
Approved by: https://github.com/huydhn
2025-01-15 02:57:14 +00:00
e2251fffbb [MPSInductor] Add min/max to MetalExprPrinter (#144798)
After that `GPUTests::test_avg_pool2d8_mps` and `GPUTests::test_avg_pool2d5_mps` passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144798
Approved by: https://github.com/dcci
ghstack dependencies: #144509
2025-01-15 01:43:42 +00:00
9199c79a9c [Quant][Inductor][X86] Separate unary post op fusion and lowering for qconv (#144312)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves unary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144312
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224
2025-01-15 00:50:54 +00:00
825fe15024 EZ fix to make sure local pytest run succeeds in export (#144764)
Previously run_tests() was protected under the IS_FBCODE flag so that the following works:
```
python test/export/test_export_legacy.py
```

But it fails on:
```
pytest test/export/test_export_legacy.py
```

This is because pytest doesn't seem to get triggered through run_tests().

Differential Revision: [D68152737](https://our.internmc.facebook.com/intern/diff/D68152737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144764
Approved by: https://github.com/avikchaudhuri
2025-01-15 00:43:40 +00:00
8c2aa0c533 [cutlass backend] cexpr the arg before writing to cpp file (#144714)
Summary: The problem is that for certain shapes (see the unit test), one of the dimensions is an expression like `s0 // 2`. With the cutlass backend, this means writing that expression to a C++ file, which would lead to a C++ compilation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144714
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78, https://github.com/desertfire
2025-01-14 23:09:44 +00:00
8ad37ed710 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-14 22:32:51 +00:00
ea3395e4f2 [ROCm] Improvements for vectorized elementwise kernels (#143269)
*  Make the io_size calculation the minimum of the input size and output size, rather than the summation of all sizes
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better for half dtypes on AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16 respectively

Co-author: @akadutta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2025-01-14 22:09:21 +00:00
c000214826 Allow GradientEdge as torch.autograd.backward outputs (#144744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144744
Approved by: https://github.com/albanD
2025-01-14 21:31:44 +00:00
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now supports many PrivateUse1-based backend names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend.

However, users that implement `PrivateUse1` functionality may rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`. In that case, the `PrivateUse1` backend string would no longer be found when we call other functions related to it. For example, if we use `torch.library` to register some custom functions for our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", and the following error would be thrown:
```
could not parse dispatch key 'my_backend'
```

So, this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)` to double-check whether the `PrivateUse1` backend name has been modified; if so, it changes `k` to the new backend name and looks it up again.
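
A hedged sketch of the scenario this fixes (the op and kernel below are purely illustrative; the point is only that the renamed backend string is now accepted as a dispatch key):

```python
import torch

torch.utils.rename_privateuse1_backend("my_backend")

lib = torch.library.Library("aten", "IMPL")

def my_abs(x):
    return x  # placeholder kernel, for illustration only

# Previously only the literal "PrivateUse1" string parsed; with this change the
# renamed backend name should be accepted as well.
lib.impl("abs", my_abs, "my_backend")
```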

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00
130452dad6 [Pipelining] fix test_schedule.py (missing destroy_process_group (#144734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144734
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352, #144596
2025-01-14 21:16:09 +00:00
aa57f0c663 [Pipelining] Refactor common utils from test_pp_dp (#144596)
Split test_pp_dp into pp_ddp and pp_fsdp so it's a bit more
concise and easier to add CP to the FSDP one.

Realized that the 'use_new_runtime' parametrization was not even being used;
removing it saves a bunch of test time. We should migrate schedules to
the new runtime and have them be covered that way. (And
test_schedule*.py is testing the new runtime too.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
2025-01-14 20:13:17 +00:00
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches.

Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses.
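
Conceptually, the scaling amounts to something like the following sketch (not the actual helper added by this PR):

```python
import torch

def scale_grads_by_microbatches(module: torch.nn.Module, num_microbatches: int) -> None:
    # Backward over each microbatch accumulates into .grad, so dividing by the
    # number of microbatches recovers the gradient of the averaged loss.
    for p in module.parameters():
        if p.grad is not None:
            p.grad.div_(num_microbatches)
```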

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
9157a748a6 [MPSInductor] Add dummy properties (#144509)
For compute capability (which is an empty string, same as CPU).
And for multicore count, return 8, as this is the smallest number of GPU cores on Apple silicon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509
Approved by: https://github.com/jansel
2025-01-14 20:12:38 +00:00
bdd942efd7 Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)"
This reverts commit 6cfc08167595e27ee9a5701c6426a7a8a7e387ef.

Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))
2025-01-14 19:04:12 +00:00
b4b4e57469 [CD] Enable profiling for XPU Windows nightly wheels (#144316)
PR https://github.com/pytorch/pytorch/pull/144034 added profiling support for torch XPU Windows binary, enable it in PyTorch XPU Windows CD
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144316
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-14 19:01:27 +00:00
2683691237 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-14 18:47:42 +00:00
e2891d43a8 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +1 (#144783)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144783
Approved by: https://github.com/albanD, https://github.com/malfet
2025-01-14 18:34:54 +00:00
ec1c3ab3b2 [inductor][triton] skip test_data_type_propagation if triton (#142054)
Non-cpp inductor backends don't have a `DataTypePropagation` pass on the scheduler nodes, so skip the test. CUDA only passes because the device is currently not changed to "cuda" in the test body.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142054
Approved by: https://github.com/eellison
2025-01-14 18:03:00 +00:00
e666807653 [Fix]: Enable support for Arm Neon & SVE support for FP32 Gemm Wrapper (#144327)
**Performance Improvements**:
Linear Layer [ 1x512 * 512x512 ] ->  2x - 4x
Linear Layer [ 3x512 * 512x512 ] -> 2x - 4x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144327
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/cfRod, https://github.com/malfet

Co-authored-by: Crefeda Rodrigues <crefeda.Rodrigues@arm.com>
2025-01-14 17:52:00 +00:00
eee7a47e94 Support FunctionalTensor subclass in is_fake and maybe_get_fake_mode (#144719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144719
Approved by: https://github.com/bdhirsh
2025-01-14 17:49:11 +00:00
d21738f24a Revert "Fix torch.normal ignores default_device (#144070)"
This reverts commit 184549b2d7e59acfc6e47d121e9ebb50648945b3.

Reverted https://github.com/pytorch/pytorch/pull/144070 on behalf of https://github.com/ezyang due to broken a specific use case ([comment](https://github.com/pytorch/pytorch/pull/144070#issuecomment-2590681953))
2025-01-14 17:41:58 +00:00
7977a3638e [executorch hash update] update the pinned executorch hash (#140769)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140769
Approved by: https://github.com/pytorchbot
2025-01-14 17:38:07 +00:00
f2975717f3 [CD] Fix slim-wheel nvjit-link import problem (#141063)
When another toolkit (say CUDA-12.3) is installed and `LD_LIBRARY_PATH` points to it, `import torch` will fail with
```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
It could not be worked around by tweaking rpath, as it also depends on the library load order, which is not guaranteed by any linker. Instead, solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following
```python
        if version.cuda in ["12.4", "12.6"]:
            with open("/proc/self/maps") as f:
                _maps = f.read()
            # libtorch_global_deps.so always depends in cudart, check if its installed via wheel
            if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
                # If all abovementioned conditions are met, preload nvjitlink
                _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```

Fixes https://github.com/pytorch/pytorch/issues/140797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063
Approved by: https://github.com/kit1980

Co-authored-by: Sergii Dymchenko <sdym@meta.com>
2025-01-14 17:33:07 +00:00
5c727d5679 [minifier] Fix config generator for callables (#144518)
Summary:
When the config contains callables, the currently generated configs cannot be run:

```
torch._dynamo.config.reorderable_logging_functions = {<built-in function print>, <function warning at 0x7f774c595630>, <function log at 0x7f774c595870>, <function error at 0x7f774c595510>, <function info at 0x7f774c595750>, <built-in function warn>, <function exception at 0x7f774c5955a0>, <function debug at 0x7f774c5957e0>, <function critical at 0x7f774c5953f0>}
```

We fix the config generator to emit the right string, so the config is runnable, like below:

```
import logging
import warnings
torch._dynamo.config.reorderable_logging_functions = { warnings.warn, logging.warn, print }
```

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67998703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144518
Approved by: https://github.com/desertfire
2025-01-14 17:18:13 +00:00
cbb1ed2966 [1/N] OpenReg: Replace open_registration_extension.cpp with openreg (#141815)
As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg.

The current `open_registration_extension.cpp` contains two parts:
1. Implementations to support the `PrivateUse1` backend.
2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`.

For the first part, we'll replace it with openreg. For the second part, we'll migrate the helpers to UT files step by step.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815
Approved by: https://github.com/albanD
2025-01-14 15:59:00 +00:00
347a74b8f5 Mark CUDA-12.6 as experimental for 2.6 release (#144769)
Because that's the first time we are trying to release it, and it also is the first release to use manylinux2_28
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144769
Approved by: https://github.com/atalman
2025-01-14 15:30:00 +00:00
60d2e32fa4 [BE] Remove lambda from str (#144743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144743
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
ghstack dependencies: #144471
2025-01-14 15:10:57 +00:00
ffb3f32693 Add max kwarg to torch._check with alternate size oblivious semantics (#144471)
Fixes https://github.com/pytorch/pytorch/issues/120288 for the static bound case

I had been tying myself in knots in the original issue about the fact that we can't really do symbolic bounds like u0 < s0. But then I realized, "Wait, but the static bounds are easy!" So this makes it so you can also exclude a specific upper bound when doing size oblivious tests, which is enough to solve https://github.com/pytorch/pytorch/issues/123592#issuecomment-2574556708

It's written very dirtily, maybe there's some cleanup. Bikeshed on the public API name also welcome.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144471
Approved by: https://github.com/avikchaudhuri
2025-01-14 15:10:57 +00:00
95b41d2aa4 Tests Generelization for multiple accelerator devices (#139749)
Motivation: Generalize unit tests so that they can be executed on cuda and non-cuda devices.
Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non-cuda devices too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749
Approved by: https://github.com/kwen2501
2025-01-14 08:52:46 +00:00
1800f5f461 Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)
**Motivation:**

- Enable coalescing path on XPU for `batch_isend_irecv`.
- If the XCCL backend is specified, then construct an XPU tensor to ensure `barrier` dispatches to the XCCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735
Approved by: https://github.com/kwen2501
2025-01-14 08:37:48 +00:00
21cbee5d9b Drop unused num_elements variable (#144723)
Summary:
With the recent enforcement of unused variable as an error in D67329035, certain tests like
https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0
can't build citing:
```
Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic))
Remote command returned non-zero exit code 1
Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145`
Stdout: <empty>
Stderr:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable]
 3757 |         size_t num_elements = output.numel();
      |                ^~~~~~~~~~~~
1 error generated.
```
This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock.

Test Plan:
```
buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test
```

https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33

Reviewed By: c-p-i-o

Differential Revision: D68126236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723
Approved by: https://github.com/fduwjj, https://github.com/Skylion007

Co-authored-by: Daulet Askarov <dauleta@meta.com>
2025-01-14 08:29:01 +00:00
80eff6e720 [MPS] fix triangular for >3D tensors (#144545)
The old implementation leads to incorrect output because it only handles 3D tensors (B, M, N) and not other batch shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144545
Approved by: https://github.com/malfet
2025-01-14 08:25:01 +00:00
8436a5c2cb [Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qlinear out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-01-14 06:46:38 +00:00
c031defe0b [RELAND] Generalize at::manual_seed for all accelerators (#144370)
# Additional Context
This is a reland PR originating from eeb57394f93d720bca498c3fa9d167fc7b9cca46

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2025-01-14 06:09:36 +00:00
9d98b66e7b [Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897)
**Summary**
In this PR, we enable the epilogues fusion and code generation for Grouped GEMM. Here are the high-level description of how we implement it.

**Fusion**

- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.

**Code Gen**
We maintain a list of epilogues and codegen it one by one.

- If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the front of the epilogue list.
- If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
2025-01-14 06:07:50 +00:00
25de671ea8 [Inductor][CPP] Enable Grouped GEMM Template (#143796)
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012

- Support flexible number of GEMMs
- Share activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which serves as the shared activation across GEMMs
- Each GEMM can have a unique weight but the same sizes
- Each GEMM can have a unique bias or None
  - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM have its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```

**Example**
Here is the example and generated code
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
dtype = torch.bfloat16
bias = False  # biases are not yet supported by the grouped GEMM template

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=bias).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```

Generated Code:  https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py

**Next Step**

- Support Epilogue fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
2025-01-14 05:59:07 +00:00
35b46a75f1 [mps/inductor] Add support for round() (#144731)
With this change, inductor/test_view_on_aliased passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144731
Approved by: https://github.com/malfet
2025-01-14 05:56:13 +00:00
17e05cde0c ROCm: Skip tests in elastic/utils/distributed_test (#144692)
The tests are failing on ROCm machines due to the error below: "The client socket has timed out after 1000ms while trying to connect to (gpu4f67.jax.cs.cpe.ice.amd.com, 0)".
Disabling the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144692
Approved by: https://github.com/jeffdaily
2025-01-14 03:49:06 +00:00
e58c823ab8 Implement increment and add_to_set for CompileEventLogger (#143427)
This diff implements `increment` and `add_to_set`, which are features of MetricsContext but not of ChromiumEventLogger. This allows us to migrate a bunch of other MetricsContext callsites to CompileEventLogger instead.

Differential Revision: [D67354867](https://our.internmc.facebook.com/intern/diff/D67354867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143427
Approved by: https://github.com/masnesral
2025-01-14 02:42:49 +00:00
6053242890 [CD] Enable python3.13t builds for aarch64 (#144698)
But make sure that the right numpy version is picked (2.0.2 does not support 3.13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697, #144716
2025-01-14 02:29:01 +00:00
b221f88fc1 Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704)
This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214.  After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.

So, let's try @malfet's suggestion and remove the prefix altogether to keep it simple. After this lands, I will circle back to see if there are any improvements. Otherwise, it's still a simple BE change I guess.

Here is the query I'm using to gather build time data for reference:

```
with jobs as (
    select
        id,
        name,
        DATE_DIFF('minute', created_at, completed_at) as duration,
        DATE_TRUNC('week', created_at) as bucket
    from
        workflow_job
    where
        name like '%/ build'
        and html_url like concat('%', {repo: String }, '%')
        and conclusion = 'success'
        and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
    select
        --groupArray(duration) as durations,
        --quantiles(0.9)(duration),
        avg(duration),
        bucket
    from
        jobs
    group by
        bucket
)
select
    *
from
    aggregated_jobs_in_bucket
order by
    bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
2025-01-14 02:19:38 +00:00
6d56277682 [export] Fix torchbind constant folding (#144684)
Summary: `CallTorchBind` should not be folded during constant folding

Test Plan:
```
buck2 run mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding_torchbind
```

Reviewed By: henryoier

Differential Revision: D67721272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144684
Approved by: https://github.com/zhxchen17
2025-01-14 01:58:44 +00:00
eaa8a97b39 [RelEng] Add --ami option to build_aarch64 (#144685)
Which should be mutually-exclusive with OS

For example, one can use the following to alloc one-off instance
```
./build_aarch64_wheel.py --alloc-instance  --instance-type g5.4xlarge --key-name nshulga-key --ami ami-0f51103893c02957c --ebs-size 200
```

TODO:
 - Figure out the EBS volume name depending on the AMI (for `ami-05576a079321f21f8` (al2023) it's `/dev/xvda`, but for `ami-0f51103893c02957c` (deep learning container) it's `/dev/sda1`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144685
Approved by: https://github.com/atalman
2025-01-14 01:48:27 +00:00
de9d6a25d7 [mps/inductor] Add support for ceil (#144715)
inductor/test_index_dynamic_shapes passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144715
Approved by: https://github.com/malfet
2025-01-14 01:16:47 +00:00
64bcf39180 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))
2025-01-14 00:48:28 +00:00
dfe06e555d Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)"
This reverts commit dcc04e9237292de10e9cedd8213253e253b1e91c.

Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/kit1980 due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/144441 ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2588515018))
2025-01-14 00:46:48 +00:00
58302c4eaa [BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716)
As it's incompatible with 3.13t and only used to fetch the branch name, which could be done by running
```
git rev-parse --abbrev-ref HEAD
```

Also, remove yet another reference to long gone `master` branch.

Test plan:
  Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside a docker container, and check its version
```
# pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl
...
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2
root@434f2540345e:/# python
Python 3.11.9 (main, Aug  1 2024, 23:33:10) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.0.dev20250113+cpu'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697
2025-01-14 00:43:46 +00:00
dcc04e9237 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-13 23:19:44 +00:00
c15d6508bd Binary builds Docker images - remove cuda 12.1 (#144575)
Remove cuda 12.1 from manylinux, libtoch and almalinux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144575
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet, https://github.com/Skylion007
2025-01-13 22:44:59 +00:00
4f74864c94 Revert "[AOTI] Add a boxed_run API (#142213)"
This reverts commit 868984c3e324dedeac04cf10e2bbfbf912dac3b1.

Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))
2025-01-13 22:43:47 +00:00
a54a784b82 [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-13 22:24:56 +00:00
0373cd9950 remove allow-untyped-defs from torch/distributed/checkpoint/api.py (#144653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144653
Approved by: https://github.com/Skylion007
2025-01-13 21:57:19 +00:00
1dab79470d c10::string_view -> std::string_view in pytorch (#143591)
Test Plan: Sandcastle

Differential Revision: D67312322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591
Approved by: https://github.com/malfet
2025-01-13 21:44:05 +00:00
5129d6ef51 Fix inductor periodic smoke test wrong artifact (#144694)
I'm not entirely sure why this failure started showing up in periodic since Friday https://github.com/pytorch/pytorch/actions/runs/12716967189/job/35463656803.  The artifact was uploaded to S3, but `use-gha: anything-non-empty-to-use-gh` was set and it was working.  Maybe this is related to https://github.com/pytorch/pytorch/issues/144479

I also clean up the GCP/AWS A100 selection logic as the GCP cluster doesn't exist anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144694
Approved by: https://github.com/clee2000
2025-01-13 21:42:39 +00:00
e15f91337b [inductor] Add unbacked symints binding in ShapeProp (#144605)
Summary: ShapeProp doesn't know how to propagate unbacked symints. Patch it up to propagate them like PropagateUnbackedSymInts does.

Test Plan:
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_shape_prop_unbacked_sym
```

Differential Revision: D68050073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144605
Approved by: https://github.com/guowentian, https://github.com/pianpwk
2025-01-13 21:30:20 +00:00
3c55669b88 Enable grep_linter to use -a (#144589)
Lintrunner can only apply changes (-a) if only one suggestion is made per file.  The grep_linter makes a suggestion for every line it finds incorrect, so it creates multiple suggestions per file if there are multiple lines that it wants to change.

This sets the `line` parameter of the LintMessage to None for all of grep_linter, but I'm not sure if that entry did anything.

I'm not sure if enabling -a is the best idea, since it's currently used for tabs and the tab width might differ each time?  I had one instance where running with -a caused the spacing to change.  On the other hand, -a would have already worked if only one line was bad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144589
Approved by: https://github.com/huydhn
2025-01-13 21:18:24 +00:00
91dbd7b75c [BE]: Improve typing inference with TypeIs (#144682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144682
Approved by: https://github.com/albanD

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-01-13 21:14:31 +00:00
4ceca4d60f [dynamo] Avoid graph break on updates to obj.__dict__ (#144419)
`obj.__dict__` is handled specially in Dynamo, and prior to this patch
we only support read and membership check on that dictionary object.

This patch adds support for writes and some documentation.
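
A minimal sketch of the pattern this enables (the class and names are made up; the only claim taken from this patch is that the `__dict__` write no longer needs a graph break):
```python
import torch

class State:
    def __init__(self):
        self.scale = 2.0
        self.calls = 0

def fn(state, x):
    # read and write through state.__dict__ inside the compiled region
    state.__dict__["calls"] = state.__dict__["calls"] + 1
    return x * state.__dict__["scale"]

compiled = torch.compile(fn, backend="eager")
out = compiled(State(), torch.randn(4))
```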

Fixes #143756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-01-13 21:04:10 +00:00
684d015c2f [AOTI] Support _int_mm (#144571)
Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015

Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571
Approved by: https://github.com/yushangdi
2025-01-13 20:32:29 +00:00
b7f95df65b [Feat]: Add Multithreading support for kleidiai groupwise GEMM kernels (#144074)
The KleidiAI groupwise GEMM kernel was not 2D-blocked. This change adds support for 2D blocking of the GEMM kernel to efficiently split the workload and speed up the kernel across multiple threads.

Performance improvements:
7B model pre-fill speedup from 145 t/s to 175 t/s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144074
Approved by: https://github.com/digantdesai
2025-01-13 20:32:23 +00:00
5a2e8fce9d Fix block pointer test module for triton CPU and add to CI (#144474)
- Fix for BlockPointerTestBase._discontiguous_tensor. It defaults to constructing CUDA tensors, causing a failure if CUDA is not available.
- Add test module to CI to prevent errors like the above from occurring.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144474
Approved by: https://github.com/jansel
2025-01-13 20:25:05 +00:00
80c286cbec remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655
Approved by: https://github.com/StrongerXi
2025-01-13 20:03:25 +00:00
18deff0262 remove allow-untyped-defs from torch/ao/nn/intrinsic/__init__.py (#144652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144652
Approved by: https://github.com/Skylion007
2025-01-13 19:36:08 +00:00
d44c3906b8 [EZ] [CD] Add 3.13 to FULL_PYTHON_VERSIONS (#144697)
Separation was necessary for Conda codegen, but now it's gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144697
Approved by: https://github.com/atalman, https://github.com/izaitsevfb
ghstack dependencies: #144696
2025-01-13 19:12:12 +00:00
d2f905760d [EZ] [CD] Eliminate stale TODO (#144696)
As 3.13 has been enabled across the board, which one can verify by running `./github/regenerate.sh` and observing that none of the configs have changed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144696
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-01-13 19:12:12 +00:00
cd477cdd1d remove allow-untyped-defs from torch/ao/nn/quantized/reference/modules/linear.py (#144656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144656
Approved by: https://github.com/Skylion007
2025-01-13 19:03:05 +00:00
f93d786f73 remove allow-untyped-defs from torch/nn/parameter.pyi (#144654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144654
Approved by: https://github.com/Skylion007
2025-01-13 19:02:31 +00:00
983bf604e5 ReshapeTransform: added missing argument in docstring (#144401)
See https://github.com/pytorch/pytorch/pull/144197#discussion_r1907336339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144401
Approved by: https://github.com/janeyx99, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-13 17:59:59 +00:00
fe8c5c7a2d Update the Triton DeviceInterface in test/inductor/extension_backends/triton/device_interface.py (#144399)
Following the changes to how `DeviceInterface` is used in this [PR](https://github.com/pytorch/pytorch/pull/142033), the `DeviceInterface` in `extension_backend/triton/device_interface.py` should be updated to return `DeviceProperties` instead of raising a NotImplementedError.

This PR mirrors the [changes](https://github.com/pytorch/pytorch/pull/142033/files#diff-06553e25e48e1d60f3030458bc46d52067d3d0c3eef2d5fcea29f7e8126bd7c9L112-R114) made in Dynamo when the PR landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144399
Approved by: https://github.com/jansel
2025-01-13 17:19:58 +00:00
bee84e88f8 [BE][Easy] improve submodule discovery for torch.ao type annotations (#144680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144680
Approved by: https://github.com/Skylion007
2025-01-13 17:16:19 +00:00
c40d917182 [MPSInductor] Fix maximum/minimum for int types (#144665)
`metal::isnan` is only defined for floats, so provide a generic wrapper
that is false for integral types

TODO: Figure out why type propagation is not working (or should it?)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665
Approved by: https://github.com/dcci
2025-01-13 15:14:01 +00:00
8633845090 Support nanj in inductor (#144064)
Fixes https://github.com/pytorch/pytorch/issues/144029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-13 14:29:38 +00:00
417354d953 [mps/inductor] Add support for truncdiv(). (#144666)
Two other inductor tests pass after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144666
Approved by: https://github.com/malfet
2025-01-13 13:39:38 +00:00
7e2239f1f0 [MPSInductor] Better error when kernel fails to compile (#144649)
Now error message looks as follows:
```
% python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps
test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call []
stats [('calls_captured', 6)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)]
ERROR

======================================================================
ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper
    method(*args, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test
    return value(self)
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d
    self.common(
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu
    check_model(
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model
    actual = run(*example_inputs, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module
    return self._compile_to_module()
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module>
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        long x1 = (xindex) / (3);
        auto tmp0 = x1;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 2;
        auto tmp5 = tmp1 < tmp4;
        long x0 = (xindex) % (3);
        auto tmp6 = in_ptr0[x0 + 3*(x1)];
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2 + ks0;
        auto tmp10 = static_cast<long>(tmp9);
        auto tmp11 = tmp1 < tmp10;
        auto tmp12 = 1.0;
        auto tmp13 = tmp8 ? tmp12 : 0.0;
        auto tmp14 = tmp5 ? tmp7 : tmp13;
        long x2 = xindex;
        out_ptr0[x2] = static_cast<float>(tmp14);
    }
 with program_source:18:25: error: use of undeclared identifier 'ks0'
        auto tmp9 = 2 + ks0;
                        ^

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.472s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #144647, #144648
2025-01-13 13:38:03 +00:00
a85d1ee106 Update slow tests (#144670)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144670
Approved by: https://github.com/pytorchbot
2025-01-13 12:06:22 +00:00
6e77d7cac5 Add AOTAutogradCache support for cache hot loading APIs (#144499)
This diff adds AOTAutogradCache support to the mega cache.

Differential Revision: [D67991059](https://our.internmc.facebook.com/intern/diff/D67991059/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67991059/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144499
Approved by: https://github.com/oulgen
2025-01-13 07:07:18 +00:00
a08bd8154e [MPSInductor] Add support for sizevars (#144662)
Just pass them as kernel arguments

After this change, `pytest test/inductor/test_torchinductor.py -v -k _mps` reports 330 failed, 429 passed (vs. 335 failed, 424 passed before)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144662
Approved by: https://github.com/jansel
2025-01-13 06:22:38 +00:00
87843ee9ab [export] Unify single and multiple return for hops (#143227)
Summary: Introduce the `is_hop_single_tensor_return` field on the `Node` class in serialization so that, when deserializing a single return, we know whether it is a tuple containing a single element or a bare single element.

Test Plan:
```
buck2 run @mode/dev-nosan sigmoid/inference/test:e2e_test_cpu -- -r E2ETestCPUCond
buck2 run @mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding2
```

Differential Revision: D66991624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143227
Approved by: https://github.com/zhxchen17
2025-01-13 03:31:14 +00:00
0aa34e9591 Revert "Collect packages with importlib in collect_env (#144616)"
This reverts commit 3541d2a2aaacc4f15ea865c815ce8882577a439c.

Reverted https://github.com/pytorch/pytorch/pull/144616 on behalf of https://github.com/malfet due to Somehow this change causes test_bottleneck_cuda to fail ([comment](https://github.com/pytorch/pytorch/pull/144616#issuecomment-2586095595))
2025-01-13 03:11:04 +00:00
46eeef9130 [MPS][BE] Surface syntax errors shader compilation (#144648)
Before this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/miniconda3/envs/py311/lib/python3.11/site-packages/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed to create metal library, error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
" UserInfo={NSLocalizedDescription=program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
}
```
After this change
```python
>>> import torch
>>> torch.mps._compile_shader('What')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/git/pytorch/pytorch/torch/mps/__init__.py", line 157, in _compile_shader
    return torch._C._mps_compileShader(source)
SyntaxError: program_source:1:1: error: unknown type name 'What'
What
^
program_source:1:5: error: expected unqualified-id
What
    ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144648
Approved by: https://github.com/Skylion007
ghstack dependencies: #144647
2025-01-13 02:03:19 +00:00
9ae35b8bb1 [BE] Introduce c10::SyntaxError (#144647)
Which will be translated into Python's SyntaxError
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647
Approved by: https://github.com/Skylion007
2025-01-12 23:23:54 +00:00
3541d2a2aa Collect packages with importlib in collect_env (#144616)
If pytorch is installed system-wide (via the OS package manager) or by an alternative package manager like `uv`, pip is not available, causing an error in `collect_env`.
However, it is still possible to collect exactly the same list using the `importlib` API, which is always available.
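
A minimal sketch of the `importlib`-based approach (the package-name filter here is an arbitrary illustration, not the exact list `collect_env` uses):
```python
from importlib import metadata

def list_relevant_packages(prefixes=("torch", "numpy", "mypy")):
    pkgs = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefixes):
            pkgs[name] = dist.version
    return pkgs

print(list_relevant_packages())
```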

Fixes #144615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144616
Approved by: https://github.com/malfet
2025-01-12 23:21:08 +00:00
1376116ab1 Config fuzzer (#139736)
This tool makes it easy to search through config state-space with a minimal reproduction or test. It presents a similar interface to the config bisector by taking a test_function that should either raise an Exception or return False upon failure.

It has two entry points: `fuzz_n_tuple`, which tries every combination of n configs, and `bisect`, which randomly flips configs and tries to find the minimal reproduction upon failure. `bisect` is a much more efficient way to search the space, but `fuzz_n_tuple` can give you peace of mind that a new config will compose with every other config.
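
A hedged sketch of the test-function contract described above (only the contract — return False or raise to signal failure — is taken from this description; the actual module path and fuzzer invocation are not shown):
```python
import torch

def check_compile_still_works() -> bool:
    # Fuzzer-style probe: return False (or raise) to flag a failure under
    # whatever config combination has been flipped.
    x = torch.randn(8)
    eager = (x.sin() + 1).sum()
    compiled = torch.compile(lambda t: (t.sin() + 1).sum())(x)
    return torch.allclose(eager, compiled, atol=1e-5)
```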

It's been used to find three bugs so far in the inductor config:
https://github.com/pytorch/pytorch/issues/140220 https://github.com/pytorch/pytorch/issues/140219
https://github.com/pytorch/pytorch/issues/143524

This PR also adds a bunch of missing types to the inductor config to get them to play nice with the fuzzer, so it can be a good forcing function for adding types to config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139736
Approved by: https://github.com/eellison
2025-01-12 22:59:02 +00:00
334ee8ba40 Fix a bug for conj_physical (#144391)
Fixes #141426

Fixes a bug in the previous [PR](https://github.com/pytorch/pytorch/pull/141427), which should not have converted the data type for conj.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144391
Approved by: https://github.com/jansel
2025-01-12 21:18:17 +00:00
cb66146f2b [BE]: Update literal typing for torch/fx/graph nodelist (#144650)
Mentioned in discussion for #144631

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144650
Approved by: https://github.com/jansel
2025-01-12 21:02:13 +00:00
91a65cbd31 [MPSInductor] Implement check_bounds (#144635)
Although at the moment it returns rather than raises an assert, due to https://github.com/pytorch/pytorch/pull/144632

`pytest test/inductor/test_torchinductor.py -v -k _mps` score is `368 failed, 391 passed, 32 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144635
Approved by: https://github.com/jansel
2025-01-12 21:01:20 +00:00
fd382f1269 Micro-optimization in Graph.nodes.__iter__ (#144631)
This generates slightly better code (removing a generator frame) and
drops a redundant assert.

```py
>>> import timeit
>>> def a():
...   yield from range(3)
...
>>> def b():
...   return range(3)
...
>>> timeit.timeit(lambda: [*a()])
0.2714634328149259
>>> timeit.timeit(lambda: [*b()])
0.12076826114207506
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144631
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2025-01-12 17:46:46 +00:00
de04acaca9 Disable scuba logging for autotuning (#144568)
Summary: the compile IDs are currently null, which is confusing. Turn it off until we have a solution.

Test Plan: https://fburl.com/scuba/dynamo_compile/sandbox/g2d2g5xs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144568
Approved by: https://github.com/jamesjwu
2025-01-12 15:47:14 +00:00
1664033e13 [Functorch] Refactor vmapify autograd function: remove cell mutation (#143811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143811
Approved by: https://github.com/zou3519
2025-01-12 10:31:23 +00:00
cec245806e [MPSInductor] Implement bitcasts (#144638)
That will be used to compile something like `torch.rand(32, device='mps').view(dtype=torch.int32)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144638
Approved by: https://github.com/dcci
2025-01-12 06:11:28 +00:00
32a91dedc5 [MPSInductor] Properly generate index expressions (#144632)
Now test_slice_scatter4_mps passes

Before this change, test_torchinductor.py reported 422 failed and 337 passed; after this change, 412 failed and 347 passed.

Fixes https://github.com/pytorch/pytorch/issues/144630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144632
Approved by: https://github.com/dcci
2025-01-12 06:10:05 +00:00
3355103233 [Dynamo] Supports autograd.Function forward returns constant (#144597)
Fixes #144142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144597
Approved by: https://github.com/jansel
2025-01-12 03:53:10 +00:00
e0f67405a1 [mps/inductor] Add support for exp(). (#144606)
inductor/test_silu now passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-12 00:38:11 +00:00
10887fc139 [BE] Enable test_public_bindings on MacOS (#144591)
I've tried it locally and it works. (One more reason to xfail rather than skip.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144591
Approved by: https://github.com/Skylion007
2025-01-12 00:34:47 +00:00
5e858254d2 [mps/inductor] Add support for trunc(). (#144629)
inductor/test_div1 passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144629
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-12 00:11:03 +00:00
f6688ac81d remove allow-untyped-defs from torch/distributed/_shard/sharded_tensor/shard.py (#144623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144623
Approved by: https://github.com/Skylion007
2025-01-12 00:10:42 +00:00
b8aae2773f remove allow-untyped-defs from torch/distributed/_checkpointable.py (#144627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144627
Approved by: https://github.com/Skylion007
2025-01-12 00:07:26 +00:00
b5485c9f41 remove allow-untyped-defs from torch/_functorch/utils.py (#144626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144626
Approved by: https://github.com/Skylion007
2025-01-12 00:07:16 +00:00
ad221269b0 remove allow-untyped-defs from torch/distributions/pareto.py (#144624)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144624
Approved by: https://github.com/Skylion007
2025-01-12 00:06:56 +00:00
80b756ed91 remove allow-untyped-defs from torch/jit/_pickle.py (#144625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144625
Approved by: https://github.com/Skylion007
2025-01-12 00:06:25 +00:00
4f406d22a2 Revert "[mps/inductor] Add support for exp(). (#144606)"
This reverts commit 2ccbacfa24cae724ec1ea3bc7de189e5bf948d46.

Reverted https://github.com/pytorch/pytorch/pull/144606 on behalf of https://github.com/malfet due to It now passes MPS-not-supported test ([comment](https://github.com/pytorch/pytorch/pull/144606#issuecomment-2585482477))
2025-01-11 23:51:35 +00:00
eaa24821f2 Revert "[ez] add lint commits to .git-blame-ignore-revs (#144576)"
This reverts commit 49c1f81be84466d015705b1882320919eecffa82.

Reverted https://github.com/pytorch/pytorch/pull/144576 on behalf of https://github.com/janeyx99 due to need to redo with better testing ([comment](https://github.com/pytorch/pytorch/pull/144576#issuecomment-2585456893))
2025-01-11 21:53:00 +00:00
2ccbacfa24 [mps/inductor] Add support for exp(). (#144606)
inductor/test_silu now passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144606
Approved by: https://github.com/malfet
2025-01-11 18:09:33 +00:00
eqy
63569d9745 [CUDA][TF32] Add some missing TF32 decorators to test_nn.py (#144592)
Original authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144592
Approved by: https://github.com/Skylion007
2025-01-11 16:20:59 +00:00
eqy
388b75edec [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-11 15:30:38 +00:00
2e3b051154 [XPU] Fix TRITON_XPU_BUILD_FROM_SOURCE (#142850)
Fixes #142849

The idea is to remove the redundant 'git' in the TRITON_XPU_BUILD_FROM_SOURCE=1 case (L29) while keeping it for the pre-built whl installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142850
Approved by: https://github.com/chuanqi129, https://github.com/benjaminglass1, https://github.com/EikanWang, https://github.com/atalman
2025-01-11 13:11:55 +00:00
b7bef1ca84 [aarch64] fix TORCH_CUDA_ARCH_LIST for cuda arm build (#144436)
Fixes #144037

The root cause is that the CUDA ARM build did not call `.ci/manywheel/build_cuda.sh`, but calls `.ci/aarch64_linux/aarch64_ci_build.sh` instead. Therefore, https://github.com/pytorch/pytorch/blob/main/.ci/manywheel/build_cuda.sh#L56 was not called for the CUDA ARM build.

Adding the equivalent of the code to `.ci/aarch64_linux/aarch64_ci_build.sh` as a WAR.

In the future, we should aim to integrate the files in .ci/aarch64_linux/aarch64_ci_build.sh back into .ci/manywheel/build_cuda.sh.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144436
Approved by: https://github.com/atalman
2025-01-11 09:00:46 +00:00
e1d0a2ff30 [Inductor] Restrict ND tiling analysis to MemoryDeps (#144497)
# Issue
https://github.com/pytorch/pytorch/pull/137243 introduced a feature where the ND tiling algorithm analyzes memory dependencies. It iterates over all `Dep`'s of the kernel.  However, the analysis is only applicable to `MemoryDep` instances, which are a subclass of `Dep`. In particular, it doesn't work for `StarDep`'s, for the reasons described here: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/simd.py#L1653

# Fix
This PR changes the algorithm to only iterate over `MemoryDep` instances.
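
A toy sketch of the restriction described above (the class names mirror the description, not the actual Inductor types):
```python
from dataclasses import dataclass

@dataclass
class Dep:
    name: str

@dataclass
class MemoryDep(Dep):
    index: tuple

@dataclass
class StarDep(Dep):
    pass

deps = [MemoryDep("buf0", (0, 1)), StarDep("buf1"), MemoryDep("buf2", (1, 0))]
# only MemoryDep instances are analyzed for tiling; StarDep is skipped
analyzable = [d for d in deps if isinstance(d, MemoryDep)]
```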

# Testing
Parameterized an existing test for `torch.bucketize` to also run with ND tiling. This test emits a node with `StarDep`'s. Without this PR, the compiler would crash on this test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144497
Approved by: https://github.com/eellison
2025-01-11 05:16:47 +00:00
e4b2e90e54 Fix broken YAML template after #144574 (#144604)
The YAML syntax is wrong and GitHub complains about it https://github.com/pytorch/pytorch/blob/main/.github/ISSUE_TEMPLATE/pt2-bug-report.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144604
Approved by: https://github.com/wdvr
2025-01-11 05:09:06 +00:00
11082aead3 [Pipelining] Fix FSDP+PP stream sync bug (#144535)
This bug could cause gradient corruption as a race condition exists
between FSDP's reduce-scatter and any operations reading .grad on the
main stream.  The root cause is that pipelining stage .backward implementation
got modified to support zero-bubble and in doing so, invoked .grad()
instead of .backward(), and performed manual gradient accumulation and
manually called into hooks for FSDP.  But one key hook was missed for
FSDP, the '_root_post_backward_final_callback' hook, which is
responsible for syncing the grad reduction ops after the last layer's
backward completes.

Note: this fix applies to both zero-bubble and non-zero-bubble schedules.  This caused some confusion initially, as non-zero-bubble schedules do use torch.autograd.backward() which would have called into fsdp's hooks and synced, unlike zero-bubble which uses .grad() which does not invoke hooks.  However, this difference was already taken into consideration as FSDP's hooks are manually disabled before invoking either type of backward, and then the hooks are manually triggered.

A better fix as a follow up PR would be to invoke .backward() for the
weight grad, so that we never have to disable or manually invoke hooks.

Modified test_pp_dp to intentionally race against FSDP's reduce by
modifying the parameters inplace in a mathematically identical way, and
confirmed it fails intermittently when the FSDP sync is not applied and
passes with the FSDP sync added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144535
Approved by: https://github.com/awgu
ghstack dependencies: #144534
2025-01-11 03:42:15 +00:00
1d3cd7bd09 [Pipelining] Improve test_pp_dp (#144534)
Some refactoring, but important changes include
- initializing the weights properly so there are more nonzero gradients
  flowing, which helped catch the DDP+PP+ZB bug
- make the DDP+ZB+PP bug skip for now and file an issue
- tighten the tolerances to defaults
- use separate targets instead of same inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144534
Approved by: https://github.com/H-Huang
2025-01-11 03:27:16 +00:00
8fa47c9455 [dynamo] log compiler collective duration to tlparse chromium trace (#144372)
To show wall time in tlparse for the synchronous compiler collective. Can eliminate the leading hypothesis from https://fb.workplace.com/groups/1075192433118967/permalink/1578670289437843.

<img width="1296" alt="image" src="https://github.com/user-attachments/assets/b17d4efb-8573-43e5-af58-c51af05acb54" />

sample: https://gist.github.com/xmfan/19eeaa80d55a4e7c168e150355ec7392
rank 0: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10
rank 1: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpr5WNMt/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144372
Approved by: https://github.com/ezyang
2025-01-11 03:10:39 +00:00
0cd9320c7f easy: dynamo_config: sort keys and set values (#143317)
This will create consistent ordering of keys when writing, as well as
sorting sets before serializing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143317
Approved by: https://github.com/masnesral
ghstack dependencies: #143307
2025-01-11 03:08:04 +00:00
074aca3ed2 [user triton] add support for @triton.heuristics after @triton.autotune (#142208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142208
Approved by: https://github.com/zou3519
2025-01-11 02:18:26 +00:00
3753d30273 Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)"
This reverts commit 9f09b719d33c61224ebb85baa369a8364063aa6f.

Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it somehow breaks memory leak checks ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2585004792))
2025-01-11 02:10:16 +00:00
49c1f81be8 [ez] add lint commits to .git-blame-ignore-revs (#144576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576
Approved by: https://github.com/janeyx99
2025-01-11 02:09:46 +00:00
92ddb3d3d3 [MPS] Expose MPSProfiler::start/stopCapture to Python (#144561)
I.e. when the `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to wrap the code with `torch.mps.profiler.metal_capture` to generate a gputrace for shaders invoked inside the context manager.

For example, code below:
```python
import torch
import os

def foo(x):
   return x[:,::2].sin() + x[:, 1::2].cos()

if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")

    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader
<img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
2025-01-11 02:05:36 +00:00
c7dbee5106 [reland][export] don't decompose custom triton op when exporting (#144284)
Summary:
A reland of https://github.com/pytorch/pytorch/pull/142426.

Copying the description over here:

For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.

The alternative:
If we decompose the custom op into a functional HOP and make it a node in the exported program, we need to figure out ways of serializing the HOP and its arguments, which can be triton.jit-ed Python functions and triton dtypes. This is undesirable because:

- it can be tedious to maintain a layer that serializes the jitted function (e.g. as a string) and dtypes.
- changes to triton or to the serialization logic for triton arguments can be BC breaking.
- the exported program would expose implementation details (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.

Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file on the same machine where they call export, which does autotuning and removes the triton dependency, and then to serve the model with the Cubin. This guarantees that triton changes won't break BC.

In the long term, we may export multiple cubins for the triton op directly.

Test Plan: see new tests.

Differential Revision: D67879685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144284
Approved by: https://github.com/zou3519
2025-01-11 01:34:35 +00:00
95d333f52e [distributed] Fix _ReaderView.read() and readinto() to stop reading at the end of the slice (#143357)
_ReaderView doesn't work correctly if the slice ends past the view.

read(-1) would call read(-1) on the base_stream, which would consume the entire underlying stream, even if the view ended before that.
read(n) would read n bytes, even if the view ended before that.

The new implementation clamps the size read to the size of the view.

readinto(b) would read len(b) bytes, even if the view ended before that.

Since the interface depends on the size of b, we use a (potentially) shortened view into b to avoid a copy.  If the view doesn't contain enough data to fill the view, then this will appear as end of stream to the caller, which is the desired behavior.
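
A simplified stand-in illustrating the clamping behavior described above (this is not the actual `_ReaderView` implementation; `readinto` is omitted and seeking is simplified):
```python
import io

class ClampedReaderView:
    def __init__(self, base, offset, length):
        self.base, self.offset, self.length = base, offset, length
        self.pos = 0

    def read(self, size=-1):
        remaining = self.length - self.pos
        if size < 0 or size > remaining:
            size = remaining  # clamp to the end of the view, not the stream
        self.base.seek(self.offset + self.pos)
        data = self.base.read(size)
        self.pos += len(data)
        return data

buf = io.BytesIO(b"0123456789")
view = ClampedReaderView(buf, 2, 4)   # view over bytes 2..5
assert view.read(-1) == b"2345"       # stops at the end of the slice
```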

This fix should not be user facing, since the bug is in an internal helper, and is only visible with new code down the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143357
Approved by: https://github.com/saumishr
2025-01-11 00:22:10 +00:00
c9afa00a85 update sleef for disable libm on Windows [submodule Sleef] (#142245)
This PR is an implementation of the RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `Sleef` to include its PR: https://github.com/shibatch/sleef/pull/603
2. Set `SLEEF_BUILD_WITH_LIBM` to `OFF`, which turns off Sleef's CMake `find_library(libm)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142245
Approved by: https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
2025-01-11 00:11:55 +00:00
cyy
6cfc081675 Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)
To facilitate further possible changes of DeviceIndex to int16_t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144138
Approved by: https://github.com/albanD
2025-01-10 23:53:19 +00:00
b80ecc4457 Revert "Fix poision child process issue when call getAccelerator() (#144368)"
This reverts commit 2583d831d40d6fa64f0b637d5bc7598e484a3283.

Reverted https://github.com/pytorch/pytorch/pull/144368 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
db2a30932a Revert "Generalize at::manual_seed for all accelerators (#144370)"
This reverts commit eeb57394f93d720bca498c3fa9d167fc7b9cca46.

Reverted https://github.com/pytorch/pytorch/pull/144370 on behalf of https://github.com/clee2000 due to broke internal tests D68023262, probably the same problem as noted in the issue this PR is mentioned above ([comment](https://github.com/pytorch/pytorch/pull/144368#issuecomment-2584848568))
2025-01-10 23:36:43 +00:00
9ec8ecea71 Update documentation.yml 2025-01-10 15:27:28 -08:00
1ff8a1c4eb Update documentation.yml to request english 2025-01-10 15:26:43 -08:00
c7f12a4a7b [MPSInductor] Speedup maximum/minumum ops (#144581)
By relying on the fact that if either `a` or `b` is NaN (or both), then `a + b` would also be NaN.

I.e. it replaces
```metal
auto tmp2 = metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp0))) | metal::any(metal::isnan(static_cast<decltype(tmp0+tmp1)>(tmp1))) ? static_cast<decltype(tmp0+tmp1)>(NAN) : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1));
```
with
```metal
auto tmp2 = metal::isnan(tmp0 + tmp1) ? tmp0 + tmp1 : metal::max(static_cast<decltype(tmp0+tmp1)>(tmp0), static_cast<decltype(tmp0+tmp1)>(tmp1));
```

which according to MetalProfiler takes fewer instructions:
<img width="520" alt="image" src="https://github.com/user-attachments/assets/54659392-012b-453e-9c02-c3c5f332074a" />
vs
<img width="1031" alt="image" src="https://github.com/user-attachments/assets/55fcfa78-1ea5-4b0a-8154-d79b3e3cc400" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144581
Approved by: https://github.com/dcci, https://github.com/jhavukainen
2025-01-10 22:58:00 +00:00
a94ec0a9a5 [aoti] Remove example inputs from aoti_compile_and_package (#144520)
Summary: The args were removed in https://github.com/pytorch/pytorch/pull/140991

Test Plan: CI

Differential Revision: D67998954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144520
Approved by: https://github.com/yushangdi
2025-01-10 21:56:23 +00:00
6b902e6e1a Update bug-report.yml to make it not look weird
Seems like https://github.com/pytorch/pytorch/pull/144574 did not format as expected.
2025-01-10 13:53:27 -08:00
4daf007b64 Request English for Issues (#144574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144574
Approved by: https://github.com/albanD
2025-01-10 21:51:15 +00:00
68dad26b95 torch/nn/modules/linear.py: docs: improvements (#138484)
torch/nn/modules/linear.py: docs: improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138484
Approved by: https://github.com/mikaylagawarecki
2025-01-10 20:03:43 +00:00
7a81ba18b9 [export] Add support for serializing symint inputs (#142284)
Fixes https://github.com/pytorch/pytorch/issues/142167
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142284
Approved by: https://github.com/avikchaudhuri
2025-01-10 20:03:26 +00:00
18c1dcb8f3 docs: get rid of copyright year (#144562)
Fixes https://github.com/pytorch/pytorch/pull/144153#pullrequestreview-2540418083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144562
Approved by: https://github.com/albanD
2025-01-10 19:57:25 +00:00
be5afe16a6 Fix deepcopy hooks (#144531)
Summary: As title, fix a bug where a GraphModule doesn't have the _deepcopy_hooks attribute

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//torchmultimodal/tests:tests -- --exact 'torchmultimodal/tests:tests - test_albef.py::test_dequeue_and_enqueue'
```

Differential Revision: D68002767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144531
Approved by: https://github.com/BoyuanFeng
2025-01-10 19:55:22 +00:00
10ff6b8894 [export] Add pickle protocol (#142253)
Fixes https://github.com/pytorch/pytorch/issues/142004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142253
Approved by: https://github.com/avikchaudhuri
2025-01-10 19:49:07 +00:00
396630ed78 Update the accuracy results for moco and llama (#144523)
This has been failing in trunk for some time; let's just update the accuracy results first.  The command I ran is `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py 127f836881e75e0c688619b54a35b018a69d7ee7`.  I also fixed the update script a bit to make it work after https://github.com/pytorch/pytorch/pull/139337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144523
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2025-01-10 19:40:49 +00:00
99600789c3 [ROCm][Inductor][CK] hackfix for segfault in addmm op (#144519)
This snippet used to cause a segfault on the GPU due to incorrect input order when invoking the kernel:

```
import os
import torch
import torch.nn as nn

from torch._inductor import config as inductor_config
from torch._inductor.utils import fresh_inductor_cache

M, N, K = 128, 128, 4096
dtype = torch.float16

X = torch.randn(M, N, dtype=dtype).cuda()
A = torch.randn(M, K, dtype=dtype).cuda()
B = torch.randn(K, N, dtype=dtype).cuda()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, b, x, y):
        return torch.addmm(b, x, y)

import ck4inductor
ck_dir = os.path.dirname(ck4inductor.__file__)

with fresh_inductor_cache():
    with inductor_config.patch(
        {
            "max_autotune_gemm_backends": "CK",
            "autotune_fallback_to_aten": False,
            "compile_threads": 144,
            "rocm.ck_dir": ck_dir,
        }
    ):
        compiled_model = torch.compile(SimpleModel(), mode="max-autotune")
        res = compiled_model(X, A, B)
        res_eager = torch.addmm(X, A, B)
        torch.testing.assert_close(res, res_eager)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144519
Approved by: https://github.com/chenyang78
2025-01-10 19:29:14 +00:00
a37db5ae39 operator benchmark change parsing from regex based to manual (#144297)
The regex-based parser would erroneously split on commas inside nested brackets; for example, it would produce the following incorrect parse:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2']

The new manual parser handles this situation the right way:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
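
A minimal sketch of bracket-aware splitting on top-level commas (illustrative only, not the exact benchmark parser):
```python
def split_top_level(s):
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(s[start:i].strip())
            start = i + 1
    parts.append(s[start:].strip())
    return parts

print(split_top_level("M: [(32, 16), (64, 32)], ZPB: 2"))
# -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
```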

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297
Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily
2025-01-10 19:15:36 +00:00
4f04078aec [CI] Ensure ACL is obtained from GitHub (#141804)
- The GitHub tagged releases is the preferred method to obtain ACL.

Please merge this before https://github.com/pytorch/pytorch/pull/138889 so that PyTorch can take GitHub releases going forward instead of mlplatform.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141804
Approved by: https://github.com/snadampal, https://github.com/ng-05, https://github.com/digantdesai
2025-01-10 19:05:02 +00:00
cyy
4abf554882 Use structure binding (#144524)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144524
Approved by: https://github.com/Skylion007
2025-01-10 18:47:35 +00:00
1ce3524277 use collective_comm activity for hccl traces (#144490)
Summary: Use existing collective_comm (currently used for nccl traces) for hccl traces as well. Only init the nccl profiler when KINETO_HAS_NCCL_PROFILER is defined so as to not init it when the build is for MTIA/HCCL

Test Plan: CIs

Differential Revision: D67285333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144490
Approved by: https://github.com/sraikund16
2025-01-10 18:39:35 +00:00
868984c3e3 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-10 18:27:00 +00:00
b46d00c1b7 Shard RegisterDispatchKey (#144364)
Should fix https://github.com/pytorch/pytorch/issues/143952 .

Testing: built PyTorch on Raspberry Pi 5; this seemed to alleviate high peak memory requirement. (I did increase shard counts for other generated files along the way, but I need to go back and figure out how much of that was strictly necessary vs. needing to use -j1 or -j2.)

Differential Revision: [D67925496](https://our.internmc.facebook.com/intern/diff/D67925496/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144364
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #144363
2025-01-10 18:21:19 +00:00
4143312e67 S390x ci periodic tests (#125401)
Periodically run testsuite for s390x

**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The reason for the update is the build on s390x: pypi doesn't provide a binary build of z3-solver for version 4.12.2.0 or 4.12.6.0 on s390x. Unfortunately, version 4.12.2.0 fails to build with the newer gcc used on s390x builders, but those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.

```
# pip3 install z3-solver==4.12.2.0
...
      In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
         29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
            |                                                              ^~~~~~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
         82 |         m_curr_ptr = ALIGN(char *, new_curr_ptr);
            |                      ^~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
         56 | #include "util/page.h"
        +++ |+#include <cstdint>
         57 |
```

**Python paths update**
On AlmaLinux 8 s390x, old paths:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```

Total result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`

New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```

```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```

`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, and later it's not found by cmake with old paths:

```
CMake Error at CMakeLists.txt:9 (find_package):
  By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Torch", but
  CMake did not find one.
```

https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401

**Builders availability**
Build took 60 minutes
Tests took: 150, 110, 65, 55, 115, 85, 50, 70, 105, 110 minutes (split into 10 shards)

60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double it. It would be 1950 minutes.

We have 20 machines * 24 hours = 20 * 24 * 60 = 20 * 1440 = 28800 minutes

We currently run 5 nightly binaries builds, each on average 90 minutes build, 15 minutes test, 5 minutes upload, 110 minutes total for each, 550 minutes total. Doubling would be 1100 minutes.

That leaves 28800 - 1100 = 27700 minutes total. Periodic tests would use 1950 minutes, leaving 25750 minutes.

Nightly binaries build + nightly tests = 3050 minutes.

25750 / 3050 = 8.44. So we could do both 8 more times for additional CI runs for any reason. And that is with pretty good safety margin.

**Skip test_tensorexpr**
On s390x, pytorch is built without llvm.
Even if it were built with llvm, llvm currently doesn't support the features used on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**

Quantization is not fully supported on s390x in pytorch yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-10 18:21:07 +00:00
603e1c0b02 torchgen: move dispatch_helpers out of RegisterDispatchDefinitions.ini (#144363)
The dispatch_helpers should be generated once, not once per kernel namespace.

Differential Revision: [D67925497](https://our.internmc.facebook.com/intern/diff/D67925497/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144363
Approved by: https://github.com/bdhirsh
2025-01-10 18:13:06 +00:00
7a93a58b3c fix typo: "assumbed" (#144543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144543
Approved by: https://github.com/Skylion007
2025-01-10 17:16:01 +00:00
fdc4f9dde2 Avoid running helper functions as test (#144544)
Pytest considers all symbols starting with `test_` as a test case/function and runs them.
`test_compiled_fsdp` is a decorator, but because of the import it is discovered by pytest.
Rename it to avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144544
Approved by: https://github.com/Skylion007
2025-01-10 17:15:50 +00:00
8dba1ce73b [MPS] Make MPSProfiler usable from C++ (#144560)
By moving `buildTensorString` implementation away from the header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144560
Approved by: https://github.com/Skylion007
ghstack dependencies: #144559
2025-01-10 17:13:34 +00:00
f604338e31 [MPS] Make sure that MPSStream is usable from C++ (#144559)
It's intended to be, but this was never tested.
This change introduces no new functionality, just properly isolates ObjC implementation details from the potential C++ caller
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144559
Approved by: https://github.com/Skylion007
2025-01-10 17:13:34 +00:00
473b745cb9 Revert "[dynamo] Avoid graph break on updates to obj.__dict__ (#144419)"
This reverts commit c8595ba7d02fea9a5642ebbb60a810d18dc60666.

Reverted https://github.com/pytorch/pytorch/pull/144419 on behalf of https://github.com/clee2000 due to newly added test fails internally D68004708 ([comment](https://github.com/pytorch/pytorch/pull/144419#issuecomment-2583265412))
2025-01-10 16:59:38 +00:00
e6b9e67465 [BE][Opinfo] Delete redundant dtypesIfCUDA (#144512)
If they are the same as CPU, no need to have that extra line

Discovered while reviewing https://github.com/pytorch/pytorch/pull/143833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144512
Approved by: https://github.com/Skylion007
2025-01-10 15:15:38 +00:00
a222029f4e retracing in strict doesn't like dataclass registration (#144487)
Retracing in strict doesn't seem to like dataclass registration. Just refactoring some tests to make this explicit (whereas other export testing variants work fine).

Differential Revision: [D67985149](https://our.internmc.facebook.com/intern/diff/D67985149/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144487
Approved by: https://github.com/angelayi
2025-01-10 12:31:53 +00:00
b2fde28283 [Profiler] Fix device setting error of other backends in torch.profiler (#144237)
In the earlier implementation, if `self.use_device != "cuda"` and `device is None`, we would get `device = "cpu"` from line 401, which is not as expected.
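A minimal sketch of the intended fallback, assuming a hypothetical helper name (the real fix lives in the profiler code and the exact implementation differs):

```python
def _resolve_device(self, device):
    # Hypothetical sketch: when no device is given, fall back to the profiler's
    # configured backend instead of unconditionally returning "cpu".
    if device is not None:
        return device
    return self.use_device if self.use_device is not None else "cpu"
```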

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144237
Approved by: https://github.com/sraikund16
2025-01-10 10:41:11 +00:00
eeb57394f9 Generalize at::manual_seed for all accelerators (#144370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
ghstack dependencies: #144368
2025-01-10 09:28:28 +00:00
2583d831d4 Fix poision child process issue when call getAccelerator() (#144368)
# Motivation
fix https://github.com/pytorch/pytorch/issues/144152

# Solution

- Align `at::globalContext()::hasXXX` to determine if accelerator XXX is built with PyTorch or an extension already registered to PyTorch.
- Define `at::hasXXX` to determine if accelerator XXX is available at runtime.
- Use `at::globalContext()::hasXXX` in `getAccelerator` rather than `at::hasXXX` to avoid initializing the XXX runtime (which can poison child processes) while detecting the current accelerator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144368
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/gujinghui
2025-01-10 09:28:27 +00:00
08be9ec312 Migrate from Tuple -> tuple in torch/distributed (#144258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258
Approved by: https://github.com/aorenste
2025-01-10 08:34:54 +00:00
184549b2d7 Fix torch.normal ignores default_device (#144070)
Fixes #122886

1. Enable `torch.normal` to work with `DeviceContext` so it picks up the default device set via `set_default_device`.
2. Add a hint in the `set_default_device` doc suggesting the use of the `torch.Tensor.to` method to move to the desired device explicitly.

**Test Result**
1. **Doc Preview**
![image](https://github.com/user-attachments/assets/eb69c334-be2b-4dc5-bdce-567da21e1635)

2. **Local Test**
```python
>>> import torch
>>> torch.normal(0.,1., (10,10)).device
device(type='cpu')
>>> torch.set_default_device('cuda')
>>> torch.normal(0.,1., (10,10)).device
device(type='cuda', index=0)
```

```bash
pytest test/test_tensor_creation_ops.py
```

![image](https://github.com/user-attachments/assets/8b466b55-f162-4b83-8b20-71de2c1d0914)

```bash
lintrunner
```
![image](https://github.com/user-attachments/assets/5b269c50-da57-47ed-8500-4edf2c2295e4)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144070
Approved by: https://github.com/ezyang
2025-01-10 08:19:55 +00:00
1fe3af2c68 Migrate from Tuple -> tuple in torch/_dynamo (#144261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144261
Approved by: https://github.com/aorenste, https://github.com/zou3519
2025-01-10 07:45:57 +00:00
f295eff512 [Profiler] Hide Kineto Step Tracker Behind Env Var (#144494)
Summary:
To support iteration-based on-demand we have step tracker hooks for both the scheduler and the optimizer to control Kineto's backend FSM. We already hide the optimizer step tracker behind an ENV_VAR to prevent any extra overhead from the frontend profiler down to the kineto backend, but we don't do any such thing for the profiler step tracker. It also seems to cause occasional errors in the FSM when both auto-trace and on-demand occur at the same time.

To remedy this issue, let's put in a patch to guard the step incrementer for the frontend step function. This will bypass all of the on-demand logic, which shouldn't occur in auto-trace.

Test Plan:
Ran
`buck run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test -- --enable_profiling --trace_handler=auto_trace --with_stack` and added prints in on-demand functions (performLoopStep and collectTrace) and saw that neither were called even though they were called on main.

Also got following healthy traces:

Auto-Trace (schedule-based):
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jan_09_12_43_37.1122140.pt.trace.json.gz&bucket=gpu_traces

Timing Based On-demand:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456722/localhost/libkineto_activities_1286261.json.gz&bucket=gpu_traces

Iteration Based On-demand:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1736456889/localhost/libkineto_activities_1304781.json.gz&bucket=gpu_traces

Differential Revision: D67990080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144494
Approved by: https://github.com/ngimel
2025-01-10 07:00:56 +00:00
8cc8989b26 [Inductor UT] Generalize newly introduced device-bias hard code in (#144456)
Re-land #143975. Fix "cuda" hard code in test_pattern_matcher.py introduced by #139321
Fix #143974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144456
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel
ghstack dependencies: #144457
2025-01-10 06:55:44 +00:00
e5111d0430 [Inductor UT] Add expected failure for newly added case on XPU, align CUDA. (#144457)
The newly added case `test_randint_distribution` from #143787 was set as an expected failure for CUDA but not for XPU.
We add the expected failure here because it fails for the same reason as CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144457
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/jansel, https://github.com/liangan1
2025-01-10 06:55:44 +00:00
eddf83559e [Intel GPU][Inductor] Convert Conv1D to 2D in inductor (#144140)
Layout optimization in inductor does not apply to Conv1D. We convert Conv1D to channels-last Conv2D for better performance on Intel GPU. For example, demucs fp16 inference in torchbench improves from 149ms to 91ms on Max 1100.
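The conversion relies on the fact that a Conv1D is equivalent to a Conv2D with a singleton trailing spatial dimension; a minimal illustration of that equivalence (not the inductor pass itself):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 16)  # (N, C, L)
w = torch.randn(8, 4, 3)   # (O, C, K)
y1 = F.conv1d(x, w, padding=1)
# Same convolution expressed as a 2D conv over an (L, 1) spatial grid
y2 = F.conv2d(x.unsqueeze(-1), w.unsqueeze(-1), padding=(1, 0)).squeeze(-1)
print(torch.allclose(y1, y2, atol=1e-5))  # True
```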

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144140
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2025-01-10 06:50:46 +00:00
fbad833538 Migrate from Tuple -> tuple in test/distributed/_composable (#144254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144254
Approved by: https://github.com/aorenste
2025-01-10 06:38:05 +00:00
3b6b306b71 Migrate from Tuple -> tuple in torch/testing (#144256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256
Approved by: https://github.com/aorenste
2025-01-10 06:37:55 +00:00
493a52cb72 Refine torch.xpu.get_device_properties API error message (#144379)
# Motivation
Remove the redundant error message.

Without this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
    return get_device_properties(device).name
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 258, in get_device_properties
    raise AssertionError("Invalid device index")
AssertionError: Invalid device index
```

With this PR:
```python
>>> import torch
>>> torch.xpu.get_device_name(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 215, in get_device_name
    return get_device_properties(device).name
  File "/home/guangyey/repos/stock-pytorch/torch/xpu/__init__.py", line 257, in get_device_properties
    return _get_device_properties(device)  # type: ignore[name-defined]  # noqa: F821
RuntimeError: The device index is out of range. It must be in [0, 1), but got 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144379
Approved by: https://github.com/EikanWang
2025-01-10 06:27:51 +00:00
4375c2c534 Cleanup gpt_fast benchmark (#144517)
This is an exact copy of https://github.com/pytorch/pytorch/pull/144484, I bricked the last PR running ghstack land :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144517
Approved by: https://github.com/davidberard98, https://github.com/huydhn
2025-01-10 05:22:13 +00:00
c8595ba7d0 [dynamo] Avoid graph break on updates to obj.__dict__ (#144419)
`obj.__dict__` is handled specially in Dynamo, and prior to this patch we only supported reads and membership checks on that dictionary object.

This patch adds support for writes and some documentation.
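A minimal repro sketch based on the description above (the module and the dictionary key are illustrative, not taken from the linked issue):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Reads and membership checks on __dict__ were already supported;
        # the writes below previously caused a graph break.
        if "counter" not in self.__dict__:
            self.__dict__["counter"] = 0
        self.__dict__["counter"] = self.__dict__["counter"] + 1
        return x + self.counter

m = M()
print(torch.compile(m)(torch.ones(2)))
```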

Fixes #143756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-01-10 05:22:04 +00:00
d100a92d33 [CPU][Brgemm] add support for int8 brgemm (#143384)
For INT8 SDPA kernel usage, we add support for INT8 Brgemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143384
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/ezyang
2025-01-10 04:20:26 +00:00
0529908f13 Remove is_reduced_floating_point from namespace std (#144502)
Partial fix for #144495. Avoiding BC-break using existing practice of removing only if FBCODE_CAFFE2 and C10_NODEPRECATED are not defined.

Differential Revision: [D67992342](https://our.internmc.facebook.com/intern/diff/D67992342/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144502
Approved by: https://github.com/malfet
2025-01-10 03:24:10 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
9f09b719d3 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-10 02:31:43 +00:00
898fcb4590 Simplify vec128 bfloat16/half fmadds (#144486)
I was being silly when I wrote these; it doesn't make sense to do four conversions and two FMAs when we could do a multiply and an add.

Differential Revision: [D67985074](https://our.internmc.facebook.com/intern/diff/D67985074/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144486
Approved by: https://github.com/malfet
2025-01-10 02:25:57 +00:00
d1b64ec326 [export] Fix sym_bool serialization (#144295)
Summary:
When there is a `torch._check()` that checks if a sym_int is equal to some constant, it will generate 3 nodes in the graph with targets `operator.ge`, `operator.le` and `operator.eq`. These operators belong to `_SYM_BOOL_OPS`, but the `meta_val` of these nodes is `bool` instead of `torch.SymBool`.

Similar things can happen to `torch.SymInt`, where a `node.target` belongs to `_SYM_INT_OPS` but `node.meta["val"]` is an `int` instead of `torch.SymInt`.

Therefore, we need to check both `meta_val` type and `node.target` type during serialization.
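A hedged sketch of what "check both" can look like on the serializer side (the op set below is an illustrative subset, not the real `_SYM_BOOL_OPS` list):

```python
import operator
import torch

_SYM_BOOL_OPS = {operator.eq, operator.le, operator.ge}  # illustrative subset

def is_sym_bool_result(node):
    # node.meta["val"] may be a plain bool even when the target is a
    # symbolic-bool op, so the target must be consulted as well.
    val = node.meta.get("val")
    return isinstance(val, (bool, torch.SymBool)) and node.target in _SYM_BOOL_OPS
```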

Test Plan:
```
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_bool_torch_check_equal
buck2 run @mode/dev-nosan caffe2/test:test_export -- -r test_sym_int_torch_check_equal
```

Differential Revision: D67883754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144295
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2025-01-10 02:07:54 +00:00
6de110b862 Support with statement on torch.Stream (#140138)
# Motivation
We propose to support the Python with statement on `torch.Stream`. This benefits all accelerators when writing device-agnostic code. Device-specific streams will also be supported because they are generally derived from `torch.Stream`.

With this PR, we can do the following:
```python
s1= torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside with statement, we set s2 to the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
2025-01-10 02:05:19 +00:00
04cb19d225 Add instantiation level to CutlassArgs (#144506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144506
Approved by: https://github.com/huydhn
2025-01-10 02:01:40 +00:00
87c1f76e63 Revert "Migrate from Tuple -> tuple in torch/_decomp (#144260)"
This reverts commit 8db67e03193dd1dbf7ca80cf0eb2f904e18e25ec.

Reverted https://github.com/pytorch/pytorch/pull/144260 on behalf of https://github.com/kit1980 due to Lots of inductor failures ([comment](https://github.com/pytorch/pytorch/pull/144260#issuecomment-2581572235))
2025-01-10 01:47:29 +00:00
bf6dd955cd Fix max(map(...)) (#142443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142443
Approved by: https://github.com/zou3519
2025-01-10 01:44:37 +00:00
1dd1d532ba [BE] Fix extra-semi warnings in int4mm_kernel.cpp (#144510)
Fixes
```
In file included from /Users/nshulga/git/pytorch/pytorch/build/aten/src/ATen/native/cpu/int4mm_kernel.cpp.DEFAULT.cpp:1:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cpu/int4mm_kernel.cpp:998:2: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
};
 ^

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144510
Approved by: https://github.com/kit1980
2025-01-10 01:17:31 +00:00
bd1f5d1c32 update xnnpack for disable libm on Windows [submodule XNNPACK] (#141943)
This PR is implement of RFC: https://github.com/pytorch/pytorch/issues/141946
Changes:
1. Update `XNNPACK` to contains it's PRS: https://github.com/google/XNNPACK/pull/7456, https://github.com/google/XNNPACK/pull/7535 and other build fixing PRs.
2. Set `XNNPACK_BUILD_WITH_LIBM` to `OFF`, it is turn off CMake find_library(libm) of `XNNPACK`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141943
Approved by: https://github.com/atalman
2025-01-10 00:47:41 +00:00
8db67e0319 Migrate from Tuple -> tuple in torch/_decomp (#144260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144260
Approved by: https://github.com/aorenste
2025-01-10 00:13:15 +00:00
3607ff2c1d Migrate from Tuple -> tuple in benchmarks/instruction_counts/core (#144253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144253
Approved by: https://github.com/aorenste
2025-01-10 00:12:23 +00:00
a55977f763 Migrate from Tuple -> tuple in torch/ao (#144265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144265
Approved by: https://github.com/aorenste
2025-01-10 00:12:06 +00:00
08eaaa61ea Inductor dashboard benchmarks: swap unused freeze_autotune_cudagraphs workflow for cppwrapper workflow (#144427)
GitHub limits us to 10 inputs per workflow_dispatch job, so this PR swaps out an input that is no longer used for the cppwrapper input. See [the HUD](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2002%20Jan%202025%2016%3A30%3A07%20GMT&stopTime=Thu%2C%2009%20Jan%202025%2016%3A30%3A07%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/53/orig&lCommit=4c3d3ad3c7886cbda9705b41c6db5fa7da0d6fe9&rBranch=main&rCommit=00df63f09f07546bacec734f37132edc58ccf574) for an example showing that it works and displays sane output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144427
Approved by: https://github.com/desertfire, https://github.com/huydhn
2025-01-09 23:56:00 +00:00
66ce13b497 Revert D67299312: Multisect successfully blamed "D67299312: [AoTI Minifier] UX Improvement" for one test failure (#144475)
Summary:
This diff partially reverts D67299312
D67299312: [AoTI Minifier] UX Improvement by yushangdi causes the following test failure:

Differential Revision: D67963019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144475
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2025-01-09 23:27:55 +00:00
91cbeb7db9 [MPSInductor] Fix masked/where for inf values (#144500)
Move constant to value logic to `value_to_metal` function (similar to `value_to_cpp`)

Call it from `constant` as well as `where` ops (which is in turn being called from `masked` op

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144500
Approved by: https://github.com/dcci
2025-01-09 23:11:06 +00:00
b1c2c3967a [dtensor] deprecate _shard_tensor to use src_data_rank=None (#144171)
as titled, we can achieve no-comm sharding for the inference case with
src_data_rank=None, so deprecate the private API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144171
Approved by: https://github.com/awgu
2025-01-09 22:26:45 +00:00
379b54603a [Inductor] [bc-breaking] Node Level provenance tracking (#144277)
Summary:

- use GraphTransformObserver + replace_node hooks to track node sources when they are replaced
- add pre_grad_graph tracking to tlparse
- add the node provenance information to post_grad_graph tlparse. This is for the frontend to create a mapping between pre_grad and post_grad graph. See an example frontend (this is just a prototype) here:  https://drive.google.com/file/d/1cMHH_0y4FJUSS9tATwGQvA72O0Lth8eh/view?usp=sharing
- change "action" of NodeSource from a single action to a list of actions.

- It's BC-Breaking because we removed `GraphTransformObserver`'s class methods `on_node_erase` and `on_node_erase` .

https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?tab=t.0

The front-end code that takes in the tlparse result is in https://github.com/yushangdi/compiler_explorer.
ghstack-source-id: 260390519

Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r node_source
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance
```

Front-end example screenshots on a real model, 93% coverage rate between pre_grad_graph and post_grad_graph

 {F1973584210}{F1973584209}

```
buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=7 TORCH_LOGS="+inductor,+schedule,output_code,graph_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/ec86b05dd59e84db/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --local-model /home/bahuang/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune':
True}"

buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:auto_functionalize
```

Differential Revision: D65006709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144277
Approved by: https://github.com/desertfire
2025-01-09 22:06:51 +00:00
28b1960d49 [CUDA] parse arch-conditional compute-capability when building extensions (#144446)
don't choke on arch-conditional compute capabilities e.g., `sm_90a`: #144037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144446
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-01-09 22:05:18 +00:00
206a932f23 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 21:56:53 +00:00
3e7e435bb1 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144371
Approved by: https://github.com/Skylion007
2025-01-09 21:49:17 +00:00
f71688f30d Revert "[Submodule] Upgrade to Cutlass 3.6 (#144180)"
This reverts commit f2c103317814eecf2b622e322e4d0877c16af943.

Reverted https://github.com/pytorch/pytorch/pull/144180 on behalf of https://github.com/huydhn due to Ops, this fails some slow tests.  Please help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/144180#issuecomment-2581302233))
2025-01-09 21:45:39 +00:00
127f836881 S390x cancelled jobs cleanup (#144149)
Sometimes a job is cancelled during nested docker container creation.
This leads to the nested docker container not being stopped and the worker hanging forever in the job.
Improve nested docker container cleanup for these cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144149
Approved by: https://github.com/seemethere
2025-01-09 20:45:19 +00:00
40305dd37e [onnx] Fix bug for exporting torch.cdist into onnx and support 'compute_mode' (#144213)
### Fix bug for exporting torch.cdist and support 'compute_mode'
In [cdist,](https://github.com/pytorch/pytorch/blob/main/torch/onnx/symbolic_opset9.py#L6181) the 'compute_mode' was ignored, which leads to a big difference in the computation flow between the original torch.cdist and the exported onnx file when computing Euclidean distance (p=2). For Euclidean distance, running the exported onnx model is about 10x slower than running torch.cdist directly, and it is also very likely to cause CUDA OOM for larger matrices unnecessarily.

This change exports the same onnx computation flow as the forward of torch.cdist defined at [forward implementation](9225f149eb/aten/src/ATen/native/Distance.cpp (L66-L149).) under every compute_mode.

Fixes #144212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144213
Approved by: https://github.com/justinchuby
2025-01-09 20:07:20 +00:00
2b241a8206 Amazon Linux 2023: Preload cusparseLt.so (#144477)
Fixes https://github.com/pytorch/pytorch/issues/144433

Test with some debug statements added:

```
>>> import torch
trying to load libcublas.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12']
trying to load libcublas.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cublas/lib/libcublas.so.12
trying to load libcudnn.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9']
trying to load libcudnn.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cudnn/lib/libcudnn.so.9
trying to load libnvrtc.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12']
trying to load libnvrtc.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_nvrtc/lib/libnvrtc.so.12
trying to load libcudart.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12']
trying to load libcudart.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_runtime/lib/libcudart.so.12
trying to load libcupti.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12']
trying to load libcupti.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cuda_cupti/lib/libcupti.so.12
trying to load libcufft.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11']
trying to load libcufft.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cufft/lib/libcufft.so.11
trying to load libcurand.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10']
trying to load libcurand.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
trying to load libnvJitLink.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12']
trying to load libnvJitLink.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nvjitlink/lib/libnvJitLink.so.12
trying to load libcusparse.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12']
trying to load libcusparse.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusparse/lib/libcusparse.so.12
trying to load libcusparseLt.so.*[0-9] from []
trying to load libcusparseLt.so.*[0-9] from /usr/local/lib/python3.9/site-packages/cusparselt/lib/libcusparseLt.so.0
trying to load libcusolver.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11']
trying to load libcusolver.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/cusolver/lib/libcusolver.so.11
trying to load libnccl.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2']
trying to load libnccl.so.*[0-9] from /usr/local/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2
trying to load libnvToolsExt.so.*[0-9] from ['/usr/local/lib/python3.9/site-packages/nvidia/nvtx/lib/libnvToolsExt.so.1']
trying to load libnvToolsExt.so.*[0-9] from /usr/local/lib/python3.9/site-
packages/nvidia/nvtx/lib/libnvToolsExt.so.1
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:275: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> exit()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144477
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia
2025-01-09 20:04:11 +00:00
6bc17b0725 Update #graph breaks for moco benchmark (#144266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144266
Approved by: https://github.com/zou3519
2025-01-09 18:51:13 +00:00
0e02e6f95f [BE]: Remove redundant contiguous copy in torch/_decomp/decompositions (#144472)
Removes a redundant extra copy by calling contiguous. Instead, just add a memory_format flag to the dtype cast.
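Illustrative before/after of the pattern being removed (assuming a non-contiguous float tensor `x` and a target `dtype`):

```python
import torch

x = torch.randn(4, 4).t()   # non-contiguous input
dtype = torch.float16

y_before = x.to(dtype).contiguous()                            # two copies
y_after = x.to(dtype, memory_format=torch.contiguous_format)   # one fused copy
print(torch.equal(y_before, y_after))  # True
```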
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144472
Approved by: https://github.com/awgu, https://github.com/cyyever, https://github.com/malfet
2025-01-09 18:50:00 +00:00
307ca094c9 [BE]: Remove redundant contiguous copy in flex attention (#144467)
Removes a redundant potential copy, instead use memory_format kwarg to fuse both operations into a single copy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144467
Approved by: https://github.com/awgu
2025-01-09 18:30:09 +00:00
bbec35f028 [BE]: Replace clone detach with detach clone to be more efficient (#144469)
Follow up to #144270 and fix some vulkan code
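Illustrative difference (both produce the same values; the ordering only changes whether the clone is recorded by autograd):

```python
import torch

x = torch.randn(3, requires_grad=True)

a = x.clone().detach()   # clone participates in autograd, then is detached
b = x.detach().clone()   # detach first, so the clone records no autograd history
print(torch.equal(a, b), a.requires_grad, b.requires_grad)  # True False False
```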
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469
Approved by: https://github.com/awgu
2025-01-09 18:28:39 +00:00
73278e6a5d easy: sort dictionary keys for inductor config when publishing (#143307)
This means we should get consistent logging strings for the same
config on different ranks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143307
Approved by: https://github.com/xmfan
2025-01-09 18:01:20 +00:00
84443bd61a feature_use: Remove JK from naming for feature use. (#143529)
See discussion in https://github.com/pytorch/pytorch/pull/142819 but
TL;DR, since we're logging use but not direct JK reads, it's less
confusing to use the logging

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143529
Approved by: https://github.com/ezyang
2025-01-09 17:58:22 +00:00
b8f383107e Link to transformer tutorial in transformer docs (#144425)
<img width="1045" alt="Screenshot 2025-01-08 at 4 50 20 PM" src="https://github.com/user-attachments/assets/05adfecb-8a23-4c48-9a2c-50c5b3f886b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144425
Approved by: https://github.com/albanD
2025-01-09 17:42:09 +00:00
f2c1033178 [Submodule] Upgrade to Cutlass 3.6 (#144180)
Differential Revision: [D67866269](https://our.internmc.facebook.com/intern/diff/D67866269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144180
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-09 17:29:58 +00:00
1365ae859c [ROCm][CI] upgrade CI to ROCm 6.3 (#142152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142152
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-09 17:14:16 +00:00
cyy
b0be30dd79 [19/N] Fix extra warnings brought by clang-tidy-17 (#144448)
Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448
Approved by: https://github.com/albanD
2025-01-09 15:58:05 +00:00
1353f3beb4 [mps/inductor] Add support for fmod(). (#144449)
397 -> 395 tests failing. The `static_cast<>` is needed because there are several overloads of `fmod()` that are otherwise ambiguous. I wonder if we should take NaN propagation into account (maybe it's not tested).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144449
Approved by: https://github.com/malfet
2025-01-09 15:47:41 +00:00
9631d1a021 [pipelining] throw error with ZB and compile (#143599)
Zero bubble will SIGSEGV when operating on a `torch.compile`'d model, so raising this error while I am still investigating the cause / design for a fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143599
Approved by: https://github.com/wconstab
2025-01-09 06:53:25 +00:00
3797143e06 Revert "[Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)"
This reverts commit fabf2ea12e18bad3297e2810b77417d71c2a360b.

Reverted https://github.com/pytorch/pytorch/pull/144224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems that some ARM tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/144224#issuecomment-2579260377))
2025-01-09 06:20:31 +00:00
6f28e466f3 [mps/inductor] Add support for tanh(). (#144443)
Fixes test_tanh() in the inductor testsuite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144443
Approved by: https://github.com/malfet
2025-01-09 06:14:03 +00:00
7f1946aa9b [aot] don't dce aten rng nodes (#144319)
FIXES https://github.com/pytorch/pytorch/issues/143431

For aot_eager backend, we dce twice in aot. The first dce errs on the side of caution and provides a restrictive dce function: 2e1ea8598f/torch/fx/experimental/proxy_tensor.py (L1173)

The second one is more aggressive: 2e1ea8598f/torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py (L185)
But this deviates from eager accuracy when rand ops are dce'd

The repro doesn't work for inductor, but that's a separate issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144319
Approved by: https://github.com/jansel
2025-01-09 05:27:49 +00:00
d4871750d9 [ROCm] Enable post-merge trunk workflow on MI300 runners; skip and fix MI300 related failed tests (#143673)
This PR
* makes changes to the workflow files and scripts so we can run CI workflows on the MI300 runners
* skips and fixes several tests that failed on MI300, as observed in https://github.com/pytorch/pytorch/pull/140989

Skipped due to unsupported Float8_e4m3fn data type on MI300 (need to update test code to use datatypes supported by MI300):
- distributed.tensor.parallel.test_micro_pipeline_tp.py::MicroPipelineTPTest::test_fuse_all_gather_scaled_matmul_A_dims_\*_gather_dim_\* (24 tests across inductor/distributed configs)
- distributed.tensor.parallel.test_micro_pipeline_tp.py::test_fuse_scaled_matmul_reduce_scatter_A_dims_\*_scatter_dim_\* (12 tests across inductor/distributed configs))
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_cast_and_t
- inductor.test_loop_ordering::LoopOrderingTest::test_fp8_pattern_2

Skipped due to AssertionError on MI300:
- inductor.test_mkldnn_pattern_matcher.py::test_qconv2d_int8_mixed_bf16
- distributed._tools.test_sac_ilp::TestSACILP::test_sac_ilp_case1

Skipped:
- test_cuda.py::TestCudaMallocAsync::test_clock_speed
- test_cuda.py::TestCudaMallocAsync::test_power_draw
- test_torch.py::TestTorchDeviceTypeCUDA::test_deterministic_cumsum_cuda

Skipped flaky tests on MI300:
- distributed.test_c10d_gloo.py::ProcessGroupGlooTest::test_gather_stress_cuda
- inductor.test_cpu_repro::CPUReproTests::test_lstm_packed_unbatched_False* (256 tests)

Fixed:
- test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_basics_cuda

Features:
- inductor/test_fp8.py - declare a new function to convert FP8 datatypes to ROCm supported FP8 datatypes. It keeps test names for CUDA and ROCm and allows to enable Inductor FP8 tests on CPU
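A hypothetical sketch of such a conversion helper (the function name and the exact dtype mapping are assumptions based on the bullet above):

```python
import torch

def rocm_fp8_dtype(dtype: torch.dtype) -> torch.dtype:
    # On ROCm, map the CUDA-style FP8 formats to the fnuz variants; pass through otherwise.
    if torch.version.hip is None:
        return dtype
    return {
        torch.float8_e4m3fn: torch.float8_e4m3fnuz,
        torch.float8_e5m2: torch.float8_e5m2fnuz,
    }.get(dtype, dtype)
```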

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143673
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/pruthvistony

Co-authored-by: saienduri <saimanas.enduri@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-09 05:18:57 +00:00
0d08084f1a [Inductor] Add convolution output size checking to the meta function (#144225)
Fixes #144013

Adding a size check to the meta function, similar to which in the CUDA/CPU aten op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144225
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-01-09 04:20:06 +00:00
fabf2ea12e [Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is one of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves binary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #143903
2025-01-09 03:27:09 +00:00
bc576355a2 Let aotriton.cmake detect the best binary package to use, and deprecate aotriton_version.txt (#137443)
We do not need `install_aotriton.sh` and `aotriton_version.txt` any more since `aotriton.cmake` now installs the best binary release package as the default option when building pytorch.

This should resolve the issue of needing a pre-installed aotriton package when building PyTorch for ROCm from source, which is not feasible if building PyTorch *outside* a CI docker image. With this change, a user can have a pre-installed AOTriton in their environment, if desired, and have the build pick it up by specifying the `AOTRITON_INSTALLED_PREFIX` env var, or have the build automatically detect and install the compatible version. As a third option, the user can also force AOTriton to build from source instead, using the `AOTRITON_INSTALL_FROM_SOURCE` env var.

Also, with the changes in this PR, the cmake build process handles the tasks of copying aotriton .so and images directory from `torch/lib` to the installation path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137443
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-09 00:00:02 +00:00
8ac005ddb8 [DTensor] Add aten.view.dtype op support (#144404)
Fixes https://github.com/pytorch/pytorch/issues/144286

Viewing a tensor to a different dtype does not require any redistribution and can use the default strategy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144404
Approved by: https://github.com/wanchaol
2025-01-08 23:11:22 +00:00
dcc3cf7066 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
The fixes are generated by:

```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
a742859fc2 [ONNX] Update images and APIs to onnx_dynamo.rst (#144358)
Update the result image of exporting, and delete the functions/class that belongs to `torch.onnx.dynamo_export`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144358
Approved by: https://github.com/justinchuby, https://github.com/malfet
2025-01-08 21:44:43 +00:00
a5164a2b18 [BE] Clean up ExecuTorch Export Docstring (#141490)
Summary: I noticed when looking at the docs for [`torch.export.load`](https://pytorch.org/docs/stable/_modules/torch/export.html#load) that there was a copy-and-paste error from the `save` command docstring: `ep` is not an actual parameter for `load`, yet its docstring says "The exported program to save." This diff removes it from the docstring.

Test Plan: Automated Testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141490
Approved by: https://github.com/JacobSzwejbka
2025-01-08 21:28:58 +00:00
8c5d992772 [Pipelining] Refactor pp composability test to use faster MPCT (#144345)
* Using MultiProcessContinuousTest base class is faster (60s vs 279s for
  the full run of `test_manual_with_data_parallel` and all its
  parametrizations)
* Have to move to a new file to use MPCT since it requires a different
  launcher style in `__main__`
* Propose to reorganize the composability tests anyway, since
  `test/_composable/test_composability/test_pp_composability` is an
  annoyingly long path
* rename `test_manual_with_data_parallel` to `test_pp_dp` for
  simplicity/consistency with newer test names.  (manual refers to not
  using tracer frontend, but that's not so important to call out in the
  test name)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144345
Approved by: https://github.com/H-Huang, https://github.com/mori360
2025-01-08 20:50:12 +00:00
c194e5c986 Remove extra copy torch/_prims (#144407)
updated _reshape_aten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144407
Approved by: https://github.com/awgu
2025-01-08 20:14:48 +00:00
628acc4ace Dirichlet.mode: use dim= instead of axis= (#144402)
`axis=` is undocumented and will raise typing errors when #144197 is merged.

See: https://github.com/pytorch/pytorch/pull/144197#pullrequestreview-2537398866
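A small illustration (values are arbitrary): both keywords currently work, but only `dim=` is documented and typed:

```python
import torch

concentration = torch.tensor([2.0, 3.0, 5.0])
print(concentration.sum(dim=-1))   # documented keyword
print(concentration.sum(axis=-1))  # numpy-style alias; undocumented and untyped
```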

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144402
Approved by: https://github.com/Skylion007
2025-01-08 20:14:01 +00:00
ab1f627aa4 fix randint distribution for large max (#143787)
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via % and it doesn't provide a uniform distribution. Here we limit the max skew to approx 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with a significant perf penalty, especially for cuda, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always produced correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify them better, but it's only 20-30 GB/s, so the big difference is still unexplained.
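For intuition, the non-uniformity comes from plain modulo bias; a toy illustration with a 3-bit generator and max=3 (the numbers are illustrative, not PyTorch's actual generator width):

```python
from collections import Counter

# Mapping 8 equally likely raw values through % 3 over-represents 0 and 1.
print(Counter(v % 3 for v in range(8)))  # Counter({0: 3, 1: 3, 2: 2})
```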

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2025-01-08 18:51:48 +00:00
0e1675a89b Relax aten.to restriction (#142420)
Summary: if we have a.to(b), and b has a different dtype from a, then it must be a copy. In this case, we do not need to freeze the tensor. Instead, we use torch.ops.aten._assert_tensor_metadata.default to ensure that a does not have the same dtype as b.

Fixes https://github.com/pytorch/pytorch/issues/139718

Update executorch pin to include https://github.com/pytorch/executorch/pull/7277.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_float_conversion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_device_to_mutation_float
```

Differential Revision: D66988295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142420
Approved by: https://github.com/bdhirsh
2025-01-08 18:11:31 +00:00
768d73f692 use torch.special.xlogy to implement x_log_x (#144220)
Fixes #144279

Using `x * x.log()` does not produce the correct value when `x=0`.
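A quick illustration of the difference at `x=0`:

```python
import torch

x = torch.tensor([0.0, 0.5])
print(x * x.log())                # tensor([nan, -0.3466]) -- 0 * (-inf) is nan
print(torch.special.xlogy(x, x))  # tensor([0.0000, -0.3466]) -- defines 0 * log(0) = 0
```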

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144220
Approved by: https://github.com/Skylion007
2025-01-08 17:41:55 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
373541fbf4 [BE]: Remove unnecessary copy of gradients in util (#144329)
No need to copy gradients to CPU too

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144329
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-01-08 16:52:15 +00:00
e14c36d3f4 Set maximum supported version of Python as 3.13 (#144396)
Same as https://github.com/pytorch/pytorch/pull/119743 Required for Release 2.6.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144396
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2025-01-08 16:16:10 +00:00
3068ce0337 ROCm SDPA: Ensure attn_mask has the same dtype with q (#143242)
This is required by the current AOTriton backend.

Fixes NaN when calling SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch

Corresponding CUDA check seems to be here:
708ce3c008/aten/src/ATen/native/transformers/cuda/attention.cu (L1331-L1336)
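On the user side, the requirement amounts to casting the mask before the call; a minimal sketch (tensor shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
attn_mask = torch.zeros(1, 8, 128, 128, device="cuda")  # float32 mask

attn_mask = attn_mask.to(q.dtype)  # match q's dtype, as required by the AOTriton backend
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```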

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143242
Approved by: https://github.com/jeffdaily
2025-01-08 15:20:26 +00:00
708ce3c008 Add is_dtype_supported predicate to DeviceInterface (#144355)
By default, the predicate returns true unless the dtype is bf16.

For the MPS device it returns false if the dtype is double.

Check that it works by refactoring `test_inf`, which should now expect a TypeError to be raised when invoked with an unsupported dtype.
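A hypothetical usage sketch; the import path and the exact method signature are assumptions based on the description above:

```python
import torch
from torch._dynamo.device_interface import get_interface_for_device

di = get_interface_for_device("mps")
if not di.is_dtype_supported(torch.float64):  # assumed signature
    raise TypeError("float64 is not supported on this device")
```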

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144355
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-08 13:59:46 +00:00
8fc0ffe54b [mps/inductor] Add support for rsqrt(). (#144374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144374
Approved by: https://github.com/malfet
2025-01-08 13:58:05 +00:00
f700035090 [3.13t] use sysconfig to check for Python nogil builds (#144361)
`sys._is_gil_enabled()` wasn't working in certain cases, according to @atalman
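A minimal sketch of the sysconfig-based check (`Py_GIL_DISABLED` is the standard free-threaded build marker; whether the PR uses exactly this spelling is an assumption):

```python
import sysconfig

# Non-zero on free-threaded (nogil) CPython builds, 0/None otherwise.
is_nogil_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
print(is_nogil_build)
```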

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144361
Approved by: https://github.com/atalman
2025-01-08 13:00:32 +00:00
a5051a9521 Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
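Continuing the example above, the fix concept is simply to upcast before summing (a sketch of the idea, not the exact patch):

```python
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64))  # 2
total = torch.sum(a, dtype=torch.int32)                    # 2 (no clamping to True)
print(total / count)                                       # tensor(1.)
```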

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-08 10:35:19 +00:00
60a505022f [AMD] SDPA internal changes (#144320)
Summary: All the internal changes needed to enable flash attention w/ SDPA in fbcode.

Test Plan:
```
TORCH_ROCM_FA_PREFER_CK=1  buck run -m rocm621  mode/opt-amd-gpu scripts/xdwang/example:sdpa

+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           455.552 |          7748.76 |              513.449 |        301.698 |       17.7369 |           267.678 |                17.0096 |                   15.0916 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           329.971 |          4741.11 |              386.049 |        416.519 |       28.9888 |           356.014 |                14.3683 |                   12.2811 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |          1455.76  |         31869.6  |             1665.49  |        377.642 |       17.2501 |           330.087 |                21.8921 |                   19.1353 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |          1265.77  |         18972.8  |             1479.48  |        434.325 |       28.976  |           371.588 |                14.9891 |                   12.824  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          5732.99  |        121861    |             6816.77  |        383.573 |       18.0453 |           322.59  |                21.2562 |                   17.8767 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          4749.69  |         73776.4  |             5404.03  |        462.982 |       29.8066 |           406.923 |                15.5329 |                   13.6521 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|   Batch Size |   Sequence Length |   Heads |   Head Dim |   Flash Time (µs) |   Math Time (µs) |   xformers Time (µs) |   Flash TFlops |   Math TFlops |   xformers TFlops |   Speedup (Flash/Math) |   Speedup (xformers/Math) | xformers trace_url   | Flash trace_url   |
+==============+===================+=========+============+===================+==================+======================+================+===============+===================+========================+===========================+======================+===================+
|            1 |              4096 |      32 |         64 |           1615.41 |          8342.67 |              1822.72 |        212.7   |       41.1855 |           188.508 |                5.16443 |                   4.57705 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              4096 |      16 |        128 |           1357.97 |          5943.53 |              1432.34 |        253.022 |       57.8104 |           239.886 |                4.37676 |                   4.14953 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      32 |         64 |           5556.5  |         31726.7  |              6502.17 |        247.348 |       43.3197 |           211.374 |                5.70984 |                   4.8794  |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |              8192 |      16 |        128 |           5186    |         22529.4  |              5590.36 |        265.019 |       61.0044 |           245.85  |                4.34427 |                   4.03004 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      32 |         64 |          22527.7  |        130413    |             26527.6  |        244.035 |       42.155  |           207.239 |                5.789   |                   4.91613 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+
|            1 |             16384 |      16 |        128 |          18347.9  |         87553.2  |             20358    |        299.628 |       62.791  |           270.044 |                4.77184 |                   4.30068 |                      |                   |
+--------------+-------------------+---------+------------+-------------------+------------------+----------------------+----------------+---------------+-------------------+------------------------+---------------------------+----------------------+-------------------+

```

Reviewed By: leitian, feikou, yoyoyocmu, sijiac

Differential Revision: D67262726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144320
Approved by: https://github.com/jianyuh, https://github.com/eqy, https://github.com/leitian
2025-01-08 09:29:28 +00:00
7d9f26de05 Revert "Unskipped multiple inductor tests for ROCm (#143581)"
This reverts commit e05d67790ee4a53c310322829631c000f0ac2985.

Reverted https://github.com/pytorch/pytorch/pull/143581 on behalf of https://github.com/huydhn due to There is some tests failing on ROCm jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143581#issuecomment-2577163274))
2025-01-08 09:15:14 +00:00
aaf56152ea [cpu/sorting] Throw an error when trying to sort complex numbers. (#144113)
It doesn't really make sense to sort complex numbers as they are not comparable.
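
A quick illustration of the new behavior (a minimal sketch; the exact error message is not reproduced here):

```python
import torch

z = torch.tensor([1 + 2j, 0 + 1j])
# After this change, sorting a complex tensor raises an error instead of
# silently producing an arbitrary ordering.
torch.sort(z)
```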

Fixes #129296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144113
Approved by: https://github.com/malfet
2025-01-08 05:15:36 +00:00
78eded8e00 [ONNX] Use torch.export.Dim.AUTO in dynamo_export (#144356)
Align to the changes in https://github.com/pytorch/pytorch/pull/143158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144356
Approved by: https://github.com/justinchuby
2025-01-08 05:00:16 +00:00
90e81a157a Migrate from Tuple -> tuple in torch/utils/data (#144255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144255
Approved by: https://github.com/andrewkho
2025-01-08 04:09:45 +00:00
8ccf3f6f3f [dynamo][easy] Move dict tests to test_dicts.py (#144165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144165
Approved by: https://github.com/jansel
ghstack dependencies: #143997
2025-01-08 03:56:33 +00:00
2ac41404a8 [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
2025-01-08 03:56:33 +00:00
e05d67790e Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-08 03:55:33 +00:00
28b4992e7a Set prop_kind to forward_inference when grad is not needed for mkldnn_convolution_pointwise (#142855)
`prop_kind` of MKLDNN convolution is always `dnnl_forward`, i.e., `dnnl_forward_training`, regardless of whether grad is needed. Setting `prop_kind` to `dnnl_forward_inference` for mkldnn_convolution_pointwise could give better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142855
Approved by: https://github.com/jgong5
2025-01-08 02:22:06 +00:00
f8fcb9e7d3 [Quant][Inductor][X86] Separate unary post op fusion and lowering for qlinear (#143903)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

This PR is the first of a series of PRs which separate post op fusion and lowering for quantized linear and convolution. It moves unary post op fusion of qlinear out of the lowering pass.
This PR moves the fusion pass from the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143903
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-01-08 01:55:53 +00:00
094ca3154d Fix torch._refs.tensor error with empty list (#143461)
Fixes #143216

**Test Result**

**Before**

```python
>>> import torch
>>> torch._refs.tensor([])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6614, in tensor
    new_tensor = _internal_new_from_data(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6596, in _internal_new_from_data
    tensor = _recursive_build(inferred_scalar_type, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_refs/__init__.py", line 6545, in _recursive_build
    return torch.stack([_recursive_build(scalarType, item) for item in seq])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList

```

**After**

```python
>>> torch._refs.tensor([])
tensor([])
>>> torch._refs.tensor([], device='cuda')
tensor([], device='cuda:0')
```

```bash
$ pytest test/test_tensor_creation_ops.py -k test_refs_tensor
```

![image](https://github.com/user-attachments/assets/5be4c17a-bea6-4b7b-bec1-b4fcb417a8cd)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/e8f88f41-78ac-4337-b53f-2e524de2bec0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143461
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-01-08 01:29:00 +00:00
9e6a6389ce [functorch] clean up asserts in test_dims.py (#144276)
For better debuggability of issues encountered in, e.g., #141730 when trying to migrate to Python 3.12/3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144276
Approved by: https://github.com/Skylion007
2025-01-08 01:21:40 +00:00
013c796b1e Eliminate c10::optional usage in PyTorch (#144346)
Differential Revision: D67907427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144346
Approved by: https://github.com/hl475
2025-01-08 01:14:04 +00:00
f002825e1e added __add__ and __mul__ hints to torch.Size (#144322)
Fixes #144218

`Size` returns `Size`, whereas `tuple` returns `tuple`: 9f28171658/stdlib/builtins.pyi (L985-L988)

- Use `SupportIndex` instead of `int` in `__getitem__` (supported at runtime)
- `Size.__add__` overrides `tuple.__add__`; the latter supports adding tuples of non-integral types.
- Added typing unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144322
Approved by: https://github.com/Skylion007
2025-01-08 01:02:11 +00:00
06ea81336f [Inductor UT] Remove excepted failure for aoti test_fft_c2c (#144238)
Since #143223 enabled runtime dispatch for fft_c2c in AOTI mode, for XPU we can now fall back fft_c2c, which has no XPU implementation, to CPU and pass the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144238
Approved by: https://github.com/jansel
2025-01-08 00:49:32 +00:00
96f4abba17 [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-08 00:32:33 +00:00
7c9cf287c2 [ONNX] Handle list values as 0d inputs (#144343)
Handle list values as 0d inputs instead of 1d, as the `SymInt`s are expected to be 0d tensors in ONNX.

This PR reshapes int64 values into 1D tensors in a list, assuming they are 0D tensors initially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144343
Approved by: https://github.com/gramalingam, https://github.com/titaiwangms
2025-01-08 00:15:50 +00:00
9ee242213b [RFC] Introduce cache hot loading APIs (a.k.a. "Mega-cache") (#143341)
This PR essentially introduces two new APIs
* torch.compiler.save_cache_artifacts
* torch.compiler.load_cache_artifacts

which aim to create a mega cache experience where the user can start collecting cache artifacts, and later call the save API to fetch them. In the next attempt, the user can "hot load" the cache artifacts via the load function.

This bundling approach reduces the need to rely on porting individual files one by one, or relying on many network requests.

Note that these APIs CANNOT log to structured logging as these functions will be called before and after compilation, as opposed to during compilation. Due to this limitation, the API returns a struct that the user can log with.
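
A minimal usage sketch, assuming `save_cache_artifacts()` returns a serialized blob plus the loggable struct mentioned above (the exact return shape is an assumption); `write_to_storage`/`read_from_storage` are hypothetical helpers:

```python
import torch

# First run: compile as usual, letting compilation populate the caches.
model = torch.nn.Linear(8, 8)
compiled = torch.compile(model)
compiled(torch.randn(2, 8))

# Collect everything that was cached during this process.
artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    blob, info = artifacts                      # assumed: serialized bytes + loggable struct
    write_to_storage("mega_cache.bin", blob)    # hypothetical helper

# Later, in a fresh process, hot-load the artifacts before compiling again:
blob = read_from_storage("mega_cache.bin")      # hypothetical helper
torch.compiler.load_cache_artifacts(blob)
compiled = torch.compile(model)                 # should hit the restored caches
```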

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143341
Approved by: https://github.com/jansel
2025-01-07 23:13:24 +00:00
c2c50d5f00 Fixed doc where more than one device specified since only one device is used (#17553) (#144043)
Fixes #17553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144043
Approved by: https://github.com/soulitzer
2025-01-07 23:06:52 +00:00
430d54ee20 [Dynamo] Add functorch C++ bindings as in graph functions (#144309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144309
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307, #144308
2025-01-07 22:25:01 +00:00
d146763f6f [Dynamo] Inline functions in torch._ops (#144308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144308
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306, #144307
2025-01-07 22:25:01 +00:00
242a4a3f83 [Dynamo] Inline functions in torch._functorch.pyfunctorch (#144307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144307
Approved by: https://github.com/williamwen42
ghstack dependencies: #144306
2025-01-07 22:24:53 +00:00
4417be65e5 [Dynamo] Inline functions in torch._functorch.autograd_function (#144306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144306
Approved by: https://github.com/williamwen42
2025-01-07 22:24:46 +00:00
3beb7006dd c10::optional -> std::optional in a few places (#144340)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144340
Approved by: https://github.com/malfet
2025-01-07 21:09:39 +00:00
f4969c8235 fix torch.compile + ddp + non-reentrant AC pack hook firing count (#144271)
FIXES https://github.com/pytorch/pytorch/issues/144035

In order to preserve hook firing semantics, we disabled pack/unpack hooks for torch.compile: https://github.com/pytorch/pytorch/pull/123196. In DDP under torch.compile, there's this other callsite that we need to disable hooks for

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144271
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2025-01-07 21:08:52 +00:00
861b65fe74 [Easy] Fix linalg.norm hint message typo (#144323)
Fixes #136454

**Test Result**

**Before**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it mut be of length 1 or 2. Got [0, 1, 2]

```

**After**

```python
>>> import torch
>>> from torch import linalg
>>>
>>> my_tensor = torch.tensor([[[8., -3., 0., 1.]]])
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='fro', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]
>>>                            # ↓ ↓ ↓ ↓ ↓
>>> linalg.norm(input=my_tensor, ord='nuc', dim=(0, 1, 2)) # Error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linalg.norm: If dim is specified, it must be of length 1 or 2. Got [0, 1, 2]

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144323
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-01-07 20:34:16 +00:00
d38af6e8bc [ca] dedup node names when AOT bwd graph is reused multiple times (#144202)
This error started popping up in HUD CA benchmarks:
```python
 File "/data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py", line 371, in dce
    self.fx_tracer.graph.eliminate_dead_code(is_impure)
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1862, in eliminate_dead_code
    self.lint()
  File "/data/users/xmfan/core/b/pytorch/torch/fx/graph.py", line 1753, in lint
    raise RuntimeError(f"Node redefined name {node.name}!")
RuntimeError: Node redefined name aot0_expand!
```

We added CA initial capture's renaming (https://github.com/pytorch/pytorch/pull/133148) to help debug issues with AOT backward, but it errors out when we have multiple instances of the same AOT backward. This likely only showed up now because of increased hierarchical graph reuse. I fix it by adding a postfix counter to the node name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144202
Approved by: https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 20:23:09 +00:00
72e8f34715 [AoTI Minifier] UX Improvement (#143330)
Summary:
- When a user specifies the `TORCHINDUCTOR_MAX_AUTOTUNE=1` env variable, we add `config.max_autotune=True` to the generated minifier_launcher
- We should do this to other inductor configs as well in a followup Diff

Currently in the dynamo and aoti minifiers, if a config is overwritten by an env variable, the config will not show up in the config list in the minifier_launcher.py file. As a result, when running the minifier_launcher, users need to re-apply the same env variable.
This is:
1) not convenient for users
2) if they copy-paste the minifier_launcher.py to us without including the env variable, we could be confused and unable to reproduce the error.

Underlying implementation change:

- Add `env_default` parameter to `codegen_config()`. If set, configs overriden by the env are not considered default.

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67299312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143330
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-07 20:04:19 +00:00
096cb874d3 remove allow-untyped-defs from torch/_prims/executor.py (#144233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144233
Approved by: https://github.com/Skylion007
2025-01-07 19:40:40 +00:00
0aa74d0ab9 Skip L1 cache for single-use buffers (#143115)
### 1. Synopsis

Adds `cache_modifier='.cg'` optional argument into `tl.load` instructions in the inductor-generated triton code for selected buffers.

It makes the `tl.load` instruction skip the L1 cache for short-lived / non-reused data.
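
For illustration, a hand-written Triton kernel showing the kind of load the generated code emits when the feature is on (a sketch only, not the Inductor-generated code):

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # cache_modifier=".cg" caches at L2 and bypasses L1 -- appropriate for
    # single-use data that would otherwise pollute the L1 cache.
    x = tl.load(x_ptr + offs, mask=mask, cache_modifier=".cg")
    y = tl.load(y_ptr + offs, mask=mask, cache_modifier=".cg")
    tl.store(out_ptr + offs, x + y, mask=mask)
```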

### 2. Using the feature

This feature is experimental and disabled by default. It can be enabled by setting the environment variable `TORCHINDUCTOR_SKIP_L1` to `1`.

### 3. Results

For a simple pointwise addition kernel:
```python
@torch.compile
def add_dummy(x: torch.Tensor, y: torch.Tensor):
    return x+y
```
we get (bandwidth is in GB/s):

(a) feature DISABLED:
![image](https://github.com/user-attachments/assets/6caaf775-f083-4943-a61f-8a1bcb154387)

(b) feature ENABLED:
![image](https://github.com/user-attachments/assets/9286be7d-c6ff-4a33-a023-77cb5cc87ff6)

### 4. Caveats

The feature boost is only available when using
```python
torch._dynamo.config.cache_size_limit = 64 # or any other sufficiently big number..
torch._dynamo.config.automatic_dynamic_shapes = False   # use static shapes
```
When using (the default) dynamic shapes, only 1-2 triton kernels are generated with non-optimal block-sizes for
*all* the cases (vector sizes), hiding any perf benefit from skipping the L1 cache.

In the static case, as an optimal block size is generated for each vector size, the perf benefit of skipping the L1 cache becomes visible.

This block-size optimization issue is a larger problem in pytorch inductor and is outside the scope of this feature.

### 5. References

- [tl.load](https://triton-lang.org/main/python-api/generated/triton.language.load.html#triton.language.load)
- [cache operators](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143115
Approved by: https://github.com/jansel
2025-01-07 19:35:40 +00:00
355b0bc7e3 [typing] Add type hints to @property and @lazy_property in torch.distributions. (#144110)
Fixes #76772, #144196
Extends #144106

- added type annotations to `lazy_property`.
- added type annotation to all `@property` and `@lazy_property` inside `torch.distributions` module.
- added simply type-check unit test to ensure type inference is working.
- replaced deprecated annotations like `typing.List` with the corresponding counterpart.
- simplified `torch.Tensor` hints with plain `Tensor`, otherwise signatures can become very verbose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144110
Approved by: https://github.com/Skylion007
2025-01-07 19:27:36 +00:00
aa69d73e6b [ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007)
Fixes #136291

This PR fixes the `invalid configuration argument` problem that happens on ROCm when the input to `torch.layer_norm` is a large tensor.

```
 File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm
    return torch.layer_norm
RuntimeError: HIP error: invalid configuration argument
```

After investigation, I found the reason this error happens: the AMD compute language runtime checks whether `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()`. If it is, it errors out with the "invalid configuration argument" message.

The fix is to split the whole task into several chunks so that each chunk does not trigger the failure condition. This ensures correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`.
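
As a rough illustration of the failing condition, here is the back-of-the-envelope arithmetic; the launch parameters below are assumptions, and only the tensor shape comes from issue #136291:

```python
# Hypothetical launch: one block per normalized row, 512 threads per block.
UINT32_MAX = 2**32 - 1
rows = 16 * 3000 * 3000          # leading dims of [16, 3000, 3000, 16]
threads_per_block = 512          # assumed blockDim.x
print(rows * threads_per_block > UINT32_MAX)   # True -> "invalid configuration argument"

# The fix launches the kernel over row chunks, each sized so that
# gridDim.x * blockDim.x stays within the uint32_t limit.
```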

Also added a large-tensor layer_norm unit test, `test_layer_norm_large_tensor`, with the same shape `[16, 3000, 3000, 16]` as the one used in PyTorch issue #136291, so that the unit test can check the expected output value to ensure correctness.

The future work may include performance optimization of layer_norm and CK layer_norm integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007
Approved by: https://github.com/eqy
2025-01-07 19:17:02 +00:00
6c54963f75 Revert "[dtensor] move all tests to distribute/tensor folder (#144166)"
This reverts commit 2e1ea8598f477322965c28fb52e6e5f53876d8dd.

Reverted https://github.com/pytorch/pytorch/pull/144166 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but inductor/test_compiled_autograd needs to be updated ([comment](https://github.com/pytorch/pytorch/pull/144166#issuecomment-2575969871))
2025-01-07 18:31:36 +00:00
e4a05dec0f [BE][Ez]: Fix docs recommending inefficient tensor op order (#144270)
`detach().clone()` is faster than `.clone().detach()` since the gradients are not cloned. Let's update all the documentation and tests so that users do not use the inefficient op ordering.
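
A tiny example of the recommended ordering:

```python
import torch

x = torch.randn(3, requires_grad=True)

y_fast = x.detach().clone()   # detach first: the copy never enters the autograd graph
y_slow = x.clone().detach()   # clone first: the copy is recorded in the graph, then detached
# Both produce the same values and neither requires grad; only the cost differs.
```
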
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144270
Approved by: https://github.com/awgu, https://github.com/XuehaiPan
2025-01-07 17:31:32 +00:00
8d35333498 [CD] Aarch64 builds should not override OVERRIDE_PACKAGE_VERSION envvar (#144285)
Currently our nightly aarch64 binaries have the correct suffixes (+cpu or +cu126), but release binaries are missing these suffixes. To correct this and make sure our nightly and release binaries are consistent, I propose this change.

I see that override is already set correctly in release workflow:
https://github.com/pytorch/pytorch/actions/runs/12383179841/job/34565381200

For CPU:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cpu"
```

For CUDA:
```
OVERRIDE_PACKAGE_VERSION="2.6.0+cu126"
```

The removed code would set OVERRIDE_PACKAGE_VERSION="2.6.0" for both CUDA and CPU release binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144285
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-01-07 12:50:54 +00:00
12fdb93ebd fix non-strict placeholder naming with kwargs (#144278)
Fixes https://github.com/pytorch/pytorch/issues/143732

Differential Revision: [D67872055](https://our.internmc.facebook.com/intern/diff/D67872055/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144278
Approved by: https://github.com/yushangdi, https://github.com/pianpwk
2025-01-07 11:22:09 +00:00
c3b28491c8 [caffe2] Add AVX512 support for box_cox operator (#143627)
Summary:
Reuse the templatized implementation of the box_cox caffe2 operator.
* Duplicate .cc file of AVX2
* change intrinsics functions to use AVX512 instructions
* override templates
* extend the caller to use new methods
* guard AVX512 with a gflag to allow smooth transition

Differential Revision: D67433457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143627
Approved by: https://github.com/hl475
2025-01-07 09:54:39 +00:00
bf7747e935 Tests Generelization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed for CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a #135242 for these changes, which was closed due to incorrect commits. I have incorporated the changes as suggested in the comments.
@kwen2501 @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-01-07 09:04:38 +00:00
2e1ea8598f [dtensor] move all tests to distribute/tensor folder (#144166)
as titled, mainly moving files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144166
Approved by: https://github.com/Skylion007
2025-01-07 06:45:14 +00:00
d0f5df83a5 [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-07 05:16:14 +00:00
fcf9dc3b11 Migrate from Tuple -> tuple in benchmarks (#144259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144259
Approved by: https://github.com/yanboliang
2025-01-07 04:09:52 +00:00
2e42be0595 Use random64 in Fischer-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/malfet
2025-01-07 03:48:56 +00:00
551f104153 [mps/inductor] Add support for sign(). (#144298)
Drive-by fix of a test name while I was at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144298
Approved by: https://github.com/malfet
2025-01-07 03:33:26 +00:00
a3ab27b8e0 Migrate from Tuple -> tuple in torch/_inductor (#144264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144264
Approved by: https://github.com/eellison
2025-01-07 03:27:27 +00:00
778d953951 Revert "[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)"
This reverts commit 24ac87392bc4e0060a90483643f7df5611988ae5.

Reverted https://github.com/pytorch/pytorch/pull/144011 on behalf of https://github.com/malfet due to Not sure what is going on, but lots of builds are failing ([comment](https://github.com/pytorch/pytorch/pull/144011#issuecomment-2574317669))
2025-01-07 03:24:01 +00:00
f4e9aebbcc Revert "Update torch.masked.mean to upcast dtype for bool tensors (#139999)"
This reverts commit 0742b2366e7ba65e0437a17b09a3bb0804ae51ea.

Reverted https://github.com/pytorch/pytorch/pull/139999 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a landrace and fails a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/139999#issuecomment-2574283986))
2025-01-07 02:42:55 +00:00
168c2cb3f3 remove allow-untyped-defs from torch/nn/utils/_deprecation_utils.py (#144231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144231
Approved by: https://github.com/albanD
2025-01-07 02:22:22 +00:00
24ac87392b [AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-07 02:15:42 +00:00
73a6a40346 [Inductor][CPP] Fix outer loop fusion buffer removed (#144243)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/144186. For the test case reported in the issue, we saw nodes with the following `LoopNest`s:

-  `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=1, parallel=0, simd_omp=False, simd_vec=False, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc724426680>)`

- `LoopNest(loops=[LoopLevel(var=x0, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=False), LoopLevel(var=x1, size=8, offset=0, tiled_size=0, steps=16, parallel=0, simd_omp=False, simd_vec=True, collapsed=False, is_reduction=True)], kernel=<torch._inductor.codegen.cpp.CppKernelProxy object at 0x7fc75c2cae60>)`

Although these two `LoopNest`s have the same `range` and `var`, they have different `steps` (1 and 16), so they fail to be merged with outer loops. And since we removed the global buffers when localizing the buffer, we need to restore the status of `V.graph.removed_buffers` before falling back to codegen without outer loop fusion.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_outer_loop_fusion_buffer_remove
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144243
Approved by: https://github.com/jgong5
2025-01-07 01:17:46 +00:00
2f6f13562f [BE] Actually suppress vmap warning from gradcheck (#144287)
This is the much safer change compared to https://github.com/pytorch/pytorch/pull/144283

Before:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
/data/users/janeyx/pytorch/torch/autograd/gradcheck.py:1156: FutureWarning: Please use torch.vmap instead of torch._vmap_internals.vmap.
  result = vmap(vjp)(torch.stack(grad_outputs))
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

(the env vars aren't necessary)

After:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/test_optim.py -k TestDifferentiableOptimizer.test_sgd
.
----------------------------------------------------------------------
Ran 1 test in 0.028s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144287
Approved by: https://github.com/cyyever, https://github.com/soulitzer
2025-01-07 01:11:41 +00:00
61c0a3d1cb Fix lint in test_provenance_tracing.py (#144296)
Regression introduced by https://github.com/pytorch/pytorch/pull/143684/ that somehow did not surface on PR CI

IMO this also makes two branches of the test(compile vs aoti) more readable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144296
Approved by: https://github.com/xw285cornell, https://github.com/huydhn
2025-01-07 01:11:38 +00:00
48153c72b2 [Intel XPU] enable kineto for XPU Windows. (#144034)
This PR turns on `kineto` for the Windows XPU wheel build.

For `kineto` on Windows XPU, the build time dependencies list:
1. Intel PTI, it contained by oneAPI 2025+.
2. Level zero SDK: https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip

**Note:**
We need to manually set up the Level Zero SDK at build time, so we turn off the kineto build on Windows XPU by default in order to avoid build issues for developers.
After adding the Level Zero SDK include path to the `INCLUDE` env var, we can set the env var `XPU_ENABLE_KINETO` to turn it on.

For runtime dependency:
1. Intel-pti PyPI package. @chuanqi129 will follow up in a further PR.

Local tested the nightly binary:

<img width="1909" alt="image" src="https://github.com/user-attachments/assets/7dfaa7bc-e8ed-40b8-bc71-f91a3df3b95f" />

TODO: @chuanqi129 will submit a follow-up PR to add `intel-pti` as a dependency and turn on the `XPU_ENABLE_KINETO` env var for the nightly build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144034
Approved by: https://github.com/chuanqi129, https://github.com/zejun-chen, https://github.com/EikanWang, https://github.com/sraikund16
2025-01-07 01:11:25 +00:00
0742b2366e Update torch.masked.mean to upcast dtype for bool tensors (#139999)
When calling `torch.masked.mean(...)` with a boolean tensor, the dtype is inferred to be bool. When the mean is being computed, the sum operator is used. When the sum operator is used with dtype=torch.bool, the result is clamped to True (1) leading to an incorrect mean being calculated.

The below example shows how the incorrect result occurs:
```
a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64)) # 2
total = torch.sum(a, dtype=torch.bool) # True (1)
mean = total / count # 0.5
```

This PR upcasts the dtype used for the summation to int32 in the case of bool tensors, allowing the correct result to be computed.
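
A minimal sketch of the upcast idea, reusing the example above (not the library code itself):

```python
import torch

a = torch.tensor([True, True])
count = torch.sum(torch.ones(a.shape, dtype=torch.int64))   # 2
total = torch.sum(a, dtype=torch.int32)                     # 2 -- no longer clamped to 1
mean = total / count                                        # 1.0, the correct mean
```
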
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139999
Approved by: https://github.com/cpuhrsch
2025-01-07 00:26:59 +00:00
f013cfee38 [TreeSpec] Support enum in defaultdict (#144235)
Summary: Followup from D66269157, add support for enum in defaultdict.

Test Plan: Added unit test

Differential Revision: D67832100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144235
Approved by: https://github.com/henrylhtsang, https://github.com/houseroad
2025-01-07 00:10:46 +00:00
c68c38c673 Support getattr for tensor subclasses in pre-dispatch export via patching tensor.getattr (#143946)
Previous discussion: https://github.com/pytorch/pytorch/pull/143671#issuecomment-2560112499 and https://github.com/pytorch/pytorch/pull/143671

Differential Revision: [D67693609](https://our.internmc.facebook.com/intern/diff/D67693609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143946
Approved by: https://github.com/bdhirsh
2025-01-06 23:55:50 +00:00
66059f80d2 Migrate from Tuple -> tuple in torch/profiler (#144257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144257
Approved by: https://github.com/sraikund16
2025-01-06 23:34:14 +00:00
5ccbfffd11 update expected results (#144274)
This PR (f6488d85a0) made it +1.3%, which is < 1.5%.
Once we have the API from dev infra and change the test, this won't be happening.

<img width="364" alt="Screenshot 2025-01-06 at 11 01 15 AM" src="https://github.com/user-attachments/assets/401b2d11-e400-49d6-b6f9-8e10ca141cb0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144274
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2025-01-06 23:18:21 +00:00
f879a6982d Enhance provenance tracing unit test to cover torch.compile() (#143684)
Summary: Follow up as title.

Test Plan:
```
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_to_post_grad_tracing
```

Differential Revision: D67543556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143684
Approved by: https://github.com/yushangdi
2025-01-06 22:58:04 +00:00
301b9c8a90 Fix PythonMod printing (#144078)
Fixes #144075
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144078
Approved by: https://github.com/anijain2305
2025-01-06 22:52:34 +00:00
edbda2fad8 remove allow-untyped-defs from torch/export/_remove_auto_functionalized_pass.py (#144230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144230
Approved by: https://github.com/Skylion007
2025-01-06 22:23:19 +00:00
d75ffccd0a Migrate from Tuple -> tuple in torch/_export (#144262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144262
Approved by: https://github.com/avikchaudhuri
2025-01-06 22:20:26 +00:00
00c18c8882 Make all-reduce input contiguous in distributed.nn.all_reduce (#144267)
Fixes https://github.com/pytorch/pytorch/issues/144060

I confirmed that the unit test fails without the `.contiguous()` fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144267
Approved by: https://github.com/wz337, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-06 22:20:04 +00:00
16c1b1048b [MPSInductor] Add nan constant generation (#144281)
If val is not equal to self, it's a nan (which is spelled as `NAN` in Metal)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144281
Approved by: https://github.com/atalman, https://github.com/dcci
2025-01-06 22:13:23 +00:00
7d5249dbc2 [EZ][BE] Fix E226 flake8 violation (#144282)
Not sure why CI did not complain about it, but in my local runs it clearly says
```
Advice (FLAKE8) E226
    missing whitespace around arithmetic operator
    See https://www.flake8rules.com/rules/E226.html

        268  |            with code.indent():
        269  |                if len(idx_var_names) > 1:
        270  |                    for idx, name in enumerate(idx_var_names):
    >>> 271  |                        code.writeline(f"auto {name} = thread_pos.{chr(120+idx)};")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144282
Approved by: https://github.com/Skylion007
2025-01-06 22:12:21 +00:00
5d88002af6 [inductor] Avoid specializing over symbolic value during constant folding (#144176)
Fixes #143667. See more context in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144176
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-06 21:50:17 +00:00
729b7c0a84 [TGIF][Easy] Slightly improve the logging for tgif split pass (#143771)
Summary:
1. Added more details for some of the assert statements.
2. Moved assert statements to use tgif_assert

Test Plan: all unit tests should pass

Reviewed By: jingsh

Differential Revision: D67608251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143771
Approved by: https://github.com/jingsh
2025-01-06 21:00:15 +00:00
b5cf8e2460 [BE]: Remove redundant copy in torch chunk shard (#144269)
Fixes an issue noticed in a recent all_gather PR. Some parts of the codebase do a double copy with `clone().contiguous()`, which could be fused into a single copy op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144269
Approved by: https://github.com/awgu
2025-01-06 20:52:49 +00:00
1b8a943011 remove allow-untyped-defs from ao/nn/sparse/quantized/utils.py (#144232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144232
Approved by: https://github.com/Skylion007
2025-01-06 19:54:27 +00:00
6d445bef0c [ROCm][NFC] Fix condition for small tensor tuning (#144087)
Fix condition for small tensor tuning to not impact non-ROCm compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144087
Approved by: https://github.com/jeffdaily
2025-01-06 19:40:22 +00:00
c62873a09a Fix incorrect python expression (#143675)
Summary:
This expression would return True always, causing the input to be deleted
on error, even for non-write modes:

```
>>> bool("w" or "+" or "a" in "rb")
True
```
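
For illustration, the buggy expression versus a correct membership test (the actual fix in this PR may be written differently):

```python
mode = "rb"

# Buggy: `or` short-circuits on the truthy string "w", so this is always True.
bool("w" or "+" or "a" in mode)                 # True

# Correct: test each character against the mode string individually.
any(c in mode for c in ("w", "+", "a"))         # False for "rb"
```
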
Test Plan: new test in test_fsspec.py

Differential Revision: D67537234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143675
Approved by: https://github.com/mayankgarg1990, https://github.com/huydhn
2025-01-06 19:04:26 +00:00
e3aac7f8a0 detect fake mode in proxy_tensor creation in make_fx (#144168)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/143742

A FakeTensorMode may already exist when we are setting the "val" meta of a proxy tensor. We should detect existing FakeTensorMode before creating a new one.

Otherwise, we could cause an error when using `detect_fake_mode` later, because there are now multiple FakeTensorModes existing.

Test Plan: The error in https://github.com/pytorch/pytorch/issues/143742

Differential Revision: D67813111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144168
Approved by: https://github.com/BoyuanFeng, https://github.com/tugsbayasgalan
2025-01-06 19:02:08 +00:00
e56768f030 [MPS] Fix bitwise shifts for uint8 (#144251)
Previously all bitwise operations were aliased to the same type, but this is wrong for shift ops.

Rather than building overly complex logic, let's just instantiate using the shared `scalarToMetalTypeString` helper function.

Fixes https://github.com/pytorch/pytorch/issues/144190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144251
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249, #144250
2025-01-06 18:27:16 +00:00
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34e5e96ca47ea69b565bc4fd9c8f34c1.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
ebeb433e73 [BE] Fix + parametrize test_min_max_nan_propagation (#144250)
- `dtype` was not passed as argument to `torch.rand` before
- Condition bfloat16 testing on MacOS14+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144250
Approved by: https://github.com/Skylion007
ghstack dependencies: #144249
2025-01-06 17:49:41 +00:00
11a0663eeb [BE] Parametrize test_min_max (#144249)
It's better to have one unit test per dtype rather a combined one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144249
Approved by: https://github.com/Skylion007
2025-01-06 17:49:41 +00:00
d65a50ef34 Fix subclass unwrapping bug (#143945)
I noticed a small bug in tensor subclass unwrapping logic. cc @IvanKobzarev
It seems easier to just implement it recursively so that it is easier to map the inner attrs to their corresponding plain tensors; both aot_autograd and fake_tensor already implement subclass unwrapping recursively.

Differential Revision: [D67693610](https://our.internmc.facebook.com/intern/diff/D67693610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143945
Approved by: https://github.com/IvanKobzarev
2025-01-06 17:38:19 +00:00
5c783bf410 [BE][Ez]: Update CUDNN Frontend submodule to 1.9.0 (#144200)
* Update CUDNN Frontend to 1.9.0, which includes some API improvements, new features, and bugfixes. This is a header-only lib, so the fix should be pretty straightforward.
* Nicest feature is that it now logs / prints warnings when the compiled CUDNN version does not match the dynamically loaded one
* Fixes corrupted / truncated log lines from being printed by CUDNN Frontend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144200
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-06 17:33:38 +00:00
c8713e659a fix memleak, detach instead of clone to not drag around graph (#144154)
Thanks @clee2000 for bringing the memleak to my attention: https://github.com/pytorch/pytorch/actions/runs/12549765082/job/34996244798.

This memleak in the test was caused by the differentiable flavors. Because we had param.clone() and param persisted outside the for loop, the autograd graph would continue growing for each optimizer.step instead of being deleted after the optim input was used up.

To clarify, I had still expected (and still do expect) the test to fully clean everything up once the test is over, but I didn't get the chance to look into why that's not the case. This change would preliminarily unblock this particular test from failing the memleak CI.

Use detach instead of clone, which is cheaper anyway :D since, as I've learned from @soulitzer, a detach is a view with requires_grad=False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144154
Approved by: https://github.com/clee2000, https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD
2025-01-06 17:09:00 +00:00
e222dd5d25 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2025-01-06 16:56:22 +00:00
4c8d661348 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2025-01-06 16:56:22 +00:00
defbf0d339 [DTensor] Add strategy for _scaled_mm (#143760)
This is done by copying the one for a regular mm, and enforcing that the scales have the same sharding scheme as their respective operands. This works because scales are 2-d tensors that must "broadcast" to the operands. This broadcasting is trivial when scales have dimensions of 1 or N, which is the only options we currently support.

Note, however, that after this PR scales will be allowed to have the mesh's world size as a dimension (in certain cases). This works because, when mapped to the local shard, it becomes a dimension of 1, which can be handled by the operator. Note that when using row-wise _scaled_mm for tensor (sequence) parallelism, this situation arises naturally!

Because of these specificities, the test is rather complex, as it specifically tests all these behaviors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143760
Approved by: https://github.com/tianyu-l
2025-01-06 16:35:47 +00:00
d4609af1ca Propagate callable parameter types using ParamSpec (#142306) (#144047)
Fixes #142306

This PR includes typing improvements and refactoring for the following files:
- __init__.py
- decorators.py
- _ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144047
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-06 16:16:18 +00:00
cyy
9225f149eb Enable clang-analyzer checks of Clang-tidy (#144222)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144222
Approved by: https://github.com/Skylion007
2025-01-06 15:44:45 +00:00
bba672e117 [docs/export] update dynamic_shapes docs (#142510)
https://pytorch.org/docs/stable/export.html dynamic_shapes section formatting is messed up, fix & update documentation to be more user-friendly.

Happy to accept nits :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142510
Approved by: https://github.com/yushangdi
2025-01-06 14:12:34 +00:00
d85ae4be73 Update slow tests (#144236)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144236
Approved by: https://github.com/pytorchbot
2025-01-06 11:19:09 +00:00
a8e97d9d4d fix torch.acos and torch.asin for torch.complex datatypes on CPU (#134838)
Fix https://github.com/pytorch/pytorch/issues/134487, https://github.com/pytorch/pytorch/issues/138327.

These two issues are caused by the lack of special handling of the cases where the real/imaginary part is 0/Inf/NaN in the vectorized implementation of `asin`. For correctness, I temporarily fall back the implementation of `asin` to the scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134838
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-06 06:17:39 +00:00
e1622dca7a Fix duplicate pattern error (#139321)
vllm had an error when we were incorrectly stating that two patterns are duplicates. See the comment inline:

For a particular generated pattern repr, store all the equivalent graphs that used to generate them. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish if two equivalent searches are actually different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2025-01-06 05:04:59 +00:00
cb5fa17e44 Revert "[ca] add test_dtensor_compile.py to compiled autograd tests (#144107)"
This reverts commit 67f85ccdcf56894d653b4d37cd7651eefa0ddf8c.

Reverted https://github.com/pytorch/pytorch/pull/144107 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/144107#issuecomment-2572209717))
2025-01-06 03:30:22 +00:00
c9ef98478a [mps/BE] Enable a test that now passes. (#144198)
After the implementation of floordiv in 464b50dbd7 landed, this now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144198
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-06 03:14:21 +00:00
23e2953cd3 [mps/inductor] Add support for floor(). (#144195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144195
Approved by: https://github.com/jansel
2025-01-06 02:07:17 +00:00
d71f111109 [Inductor][CPP] Fix Inductor integer avg pool (#144059)
Fixes #143738. Currently the scale factor for averaging is rounded to 0 if the dtype is an integer, resulting in an all-zero output. This fix uses `truediv` instead for integer cases.

## Test
```bash
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool1d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool2d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_avg_pool3d_cpu_int64
pytest -vs ./test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_nn_functional_local_response_norm_cpu_int64
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144059
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-01-06 01:26:53 +00:00
3d3a07963f [reland][attempt2][AMD] Turn on TF32 for aten::mm (#144145)
Summary:
https://github.com/pytorch/pytorch/pull/143549 was reverted due to some
internal/oss tooling issue. Relanding.

hipblaslt supports TF32, so adding the support.
Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67785496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144145
Approved by: https://github.com/jianyuh
2025-01-06 00:37:01 +00:00
9f94710e48 Update core.py to fix typo (#144201)
dype -> dtype

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144201
Approved by: https://github.com/Skylion007
2025-01-05 18:20:52 +00:00
51a37a42e0 [inductor][cpu] Fix bmm b_index for dynamic expressions in inductor autotuner (#143141)
Fixes #143102

Addresses 2 problems relating to dynamic batch size in BMM autotuner:
1. With dynamic batch size, the input can be a sympy Mult expression such as `s0*8`, which occurs in many dynamo benchmark models. We address this by using `size_hints` to solve for any expressions (see the sketch after this list). This is safe since this section of the code is only called to generate inputs for benchmarking.
2. Some epilogue nodes may use the dynamic batch size as part of the codegen, for example when an input to the epilogue node is transposed and has dynamic batch size in the stride of other dimensions. When these epilogue nodes exist, if the sizevar is not already present in the `kernel.args`, it will create a new sizevar with a name. It is possible that subsequent calls to `def_kernel` could overwrite this variable name, so to avoid this we pass all the sizevars as `extra_sizevars` to the calls to `def_kernel` for the GEMM functions, so no variable renaming happens later in the BMM definition.
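
A minimal illustration of resolving a symbolic batch size to a concrete value for benchmarking; the names and the substitution step below are illustrative, with Inductor's `size_hints` handling this internally:

```python
import sympy

s0 = sympy.Symbol("s0")
batch_expr = s0 * 8            # a dynamic batch size expression like the one in the issue

# size_hints effectively substitute the recorded example value for the symbol:
hint = batch_expr.subs({s0: 4})
print(hint)                    # 32 -- a concrete size usable for benchmark inputs
```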

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143141
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-01-05 18:02:37 +00:00
f6488d85a0 [dynamo][user-defined] Remove __getattribute__ checks and add getsetdescriptor (#144173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144173
Approved by: https://github.com/jansel
2025-01-05 13:48:15 +00:00
b01556bd8a Revert "[dynamo][dicts] Guarding lazily on dict keys (#143997)"
This reverts commit f5df082fabfe81639e25b8e01dae107548389c5e.

Reverted https://github.com/pytorch/pytorch/pull/143997 on behalf of https://github.com/jeanschmidt due to Seems to have introduced internal ci redness in some tests, D67828366 ([comment](https://github.com/pytorch/pytorch/pull/143997#issuecomment-2571587599))
2025-01-05 11:09:45 +00:00
1e881ceecf Update torch-xpu-ops commit pin (#143984)
Update the torch-xpu-ops commit to [28cfac20ec662abdb0ac98faf122450013e8f520](28cfac20ec), includes:

- Disable batch_norm vectorization path to fix accuracy issues.
- Fix the LSRM/RNN implementation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143984
Approved by: https://github.com/EikanWang, https://github.com/ruidazeng, https://github.com/desertfire, https://github.com/jansel
2025-01-05 09:01:36 +00:00
157c185afe [inductor] Add types to compile_tasks.py and runtime_utils.py (#144004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144004
Approved by: https://github.com/yanboliang
2025-01-05 08:47:49 +00:00
67f85ccdcf [ca] add test_dtensor_compile.py to compiled autograd tests (#144107)
more than half the tests use autograd, pass rate 19/26

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144107
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/jansel
2025-01-05 02:11:48 +00:00
f2d6cfa677 Introduce CompileEventLogger, replace usages of metrics_context and chromium_event with it (#143420)
**Problem statement**: I want to be able to centralize and simplify the process by which people add columns/data to existing spans. We have MetricsContext and ChromiumEventLogger, and there are various choices you can make to decide where and when to log different levels of observability for your events. To resolve this, I want a central API for "adding to events under dynamo_timed".

**CompileEventLogger** is intended as a frontend for MetricsContext and ChromiumEventLogger so we can use the same class for handling everything.

CompileEventLogger is intended be used within a `dynamo_timed()` context. Its purpose is to 1. log to existing events that are in progress (i.e. within dynamo_timed), and 2. log instant events to chromium that are independent of any specific span.

CompileEventLogger has three log levels:

- CHROMIUM: Log only to chromium events, visible via tlparse.
- PT2_COMPILE: Log to chromium_events + pt2_compile_events
- COMPILATION_METRIC: Log to compilation metrics in addition to the toplevel chromium and pt2_compile_event.

In addition, we have a function CompileEventLogger.add() that automagically chooses the correct log level. For now, it is conservative, and will never automagically choose to log CompilationMetrics (though I could imagine it figuring out the metadata are all keys in CompilationMetric and therefore loggable there).

The goal here is to make one single interface to log stuff for observability reasons, and make it as easy as possible.
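
A hedged usage sketch: only `CompileEventLogger.add()` and the three log levels are named above, so the import path and argument style below are assumptions for illustration, not the verified API surface.

```python
from torch._dynamo.utils import dynamo_timed, CompileEventLogger  # assumed location

def compile_backend(gm, inputs):
    with dynamo_timed("backend_compile"):            # an in-progress span
        # Attach metadata to the current span; add() picks a conservative
        # level (chromium / pt2_compile_events), never CompilationMetrics.
        CompileEventLogger.add(num_nodes=len(gm.graph.nodes), cache_hit=False)
        ...
```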

Not included in this diff:
- V1 of this diff will not have implementations of `increment` and `add_to_set` which MetricsContext has, so those usages are not replaced yet. But I'll add those in a followup.

- We don't handle `RuntimeMetricsContext`. It's unclear if I want that to be part of this, because under RuntimeMetricsContext there might not be a toplevel event to log to, so chromium events doesn't make sense in that context. So I might leave that separate for now.

Differential Revision: [D67346203](https://our.internmc.facebook.com/intern/diff/D67346203/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143420
Approved by: https://github.com/aorenste
2025-01-04 22:40:34 +00:00
68d30c6a25 Add check for unsupported sprase layout to resolve false INTERNAL ASSERT FAILED (#139198)
Fixes #131319. Implemented the check on layout as described in the original issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139198
Approved by: https://github.com/pearu, https://github.com/amjames, https://github.com/cpuhrsch

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Pearu Peterson <pearu.peterson@gmail.com>
2025-01-04 21:40:36 +00:00
b1bc880f26 [EZ][BE] Cleanup test_mps_basic (#144194)
- Sort imported tests alphabetically
- Run `add` tests with `check_lowp=False` as it is tested explicitly by parametrization
- Do not hardcode device, but rather use `self.device` property

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144194
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-01-04 21:36:40 +00:00
0dc1e6be19 [mps/BE] Fix linter warning/advice. (#144199)
Two spaces before an inline comment according to E261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144199
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 20:15:41 +00:00
e458b39fc4 c10::string_view -> std::string_view in Device.cpp (#144178)
Test Plan: Sandcastle

Differential Revision: D67817163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144178
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-04 18:51:33 +00:00
811c714911 Fix nan propagation for minimum() and maximum() in MPS (#144086)
Fixes #143976

- Moves minimum and maximum operations to use the NaN propagating call into MPSGraph instead of the default one.
 - Adds test for the NaN propagating case to `test_mps.py`.
- Adjusts the inductor metal backend implementation for minimum and maximum to also respect the nan propagation.

Additions by @malfet:
 - Introduce MPSGraph+PyTorchFixups interface following the [Customizing existing classes](https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/ProgrammingWithObjectiveC/CustomizingExistingClasses/CustomizingExistingClasses.html) tutorial and implement `minimumWithNaNPropagationAndIntFallbackWithPrimaryTensor:`, as `minimumWithNaNPropagationWithPrimaryTensor:` segfaults when called for integral types
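
CPU reference for the NaN-propagating semantics the MPS path now matches (a minimal sketch, not the test added in this PR):

```python
import torch

a = torch.tensor([1.0, float("nan")])
b = torch.tensor([2.0, 2.0])
print(torch.maximum(a, b))   # tensor([2., nan]) -- NaN propagates
print(torch.minimum(a, b))   # tensor([1., nan])
```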

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144086
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-01-04 18:48:24 +00:00
60de73c3c7 Update nightly PyTorch version to 2.7.0
Same as https://github.com/pytorch/pytorch/pull/135916
2025-01-04 13:24:48 -05:00
f5df082fab [dynamo][dicts] Guarding lazily on dict keys (#143997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143997
Approved by: https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163, #144160
2025-01-04 18:13:00 +00:00
005a4b9537 [Submodule] Bump Cutlass to 3.5.1 OSS PR (#144000)
## Summary
Follow up PR to https://github.com/pytorch/pytorch/pull/143515. That PR added a bunch of macro switches to ensure both 3.4 and 3.5.1 built succesfully. This PR actual bumps the cutlass pin to 3.5.1.

I am going to do a stack on top to add an conditional gates for 3.6 hijacking the 3.4 switches. We will leap frog our way to the top :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144000
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet
2025-01-04 18:04:03 +00:00
93633d0e80 [ROCm][Windows] Fix export macros (#144098)
For correct import and export of functions when dynamic linkage is used for HIP libraries on Windows, the appropriate export/import macros need to be put in place. This Pull Request reuses the existing CUDA import/export macros by converting them to the corresponding HIP macros during the hipification process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144098
Approved by: https://github.com/jeffdaily
2025-01-04 17:12:46 +00:00
45ef3309e3 [BE] typing for decorators (#144161)
Summary:
Untyped decorators strip annotations from the decorated items.

- _compile
- _inductor/fx_passes/post_grad
- _inductor/lowering
- _library/custom_ops
- _meta_registrations
- _ops
- _refs/nn/functional
- ao/quantization/quantizer/xnnpack_quantizer_utils
- distributed/_composable/contract
- fx/experimental/graph_gradual_typechecker
- fx/experimental/migrate_gradual_types/constraint_generator
- optim/optimizer
- signal/windows/windows
- testing/_internal/common_device_type
- torch/_inductor/decomposition
- utils/flop_counter

Test Plan: unit tests

Differential Revision: D62302684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144161
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-04 16:40:09 +00:00
79cbda3ab0 [ROCm] Get rid of extra rpath-link that was needed for libtinfo. (#143348)
Fixes #137858

Due to the extra rpath-link line inserted by these CMake lines, it is possible to unintentionally pick up other libraries that are incompatible with the version of ROCm in ${ROCM_PATH}.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143348
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/pruthvistony
2025-01-04 15:42:30 +00:00
6f2451c2e9 [MPS] Add aten::angle (#143449)
This adds an MPS backend implementation for `aten::angle` and `aten::angle_out` (mentioned in issue #77764), following the example #78408.
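
A quick usage sketch, assuming an MPS device is available (for real inputs, `angle` returns 0 for non-negative values and pi for negative ones):

```python
import torch

x = torch.tensor([1.0, -2.0, 0.0], device="mps")
print(torch.angle(x))  # expected: [0., pi, 0.]
```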

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143449
Approved by: https://github.com/malfet
2025-01-04 15:38:40 +00:00
301c457032 [MPS] Fix nllnd_loss_backward crash with different dtypes (#144170)
Otherwise, invoking it with torch.half inputs but float weights will result in
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.divide' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %16 = "mps.divide"(%15, %arg2) : (tensor<5x5xf16>, tensor<1xf32>) -> tensor<*xf32>
2025-01-03 14:13:18.747151-0800 python[87772:4027380] /AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm, line 975: error 'original module failed verification'
/AppleInternal/Library/BuildRoots/b11baf73-9ee0-11ef-b7b4-7aebe1f78c73/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:975: failed assertion `original module failed verification'
```
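
A minimal repro sketch of the dtype mismatch above, assuming an MPS device (half log-probabilities with float class weights):

```python
import torch
import torch.nn.functional as F

x = torch.randn(5, 5, device="mps", dtype=torch.half, requires_grad=True)
target = torch.randint(0, 5, (5,), device="mps")
weight = torch.rand(5, device="mps", dtype=torch.float)  # float, unlike the half inputs

loss = F.nll_loss(F.log_softmax(x, dim=1), target, weight=weight)
loss.backward()  # previously hit the 'mps.divide' element-type assertion above
```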

Test plan: `python -mpytest test/inductor/test_torchinductor.py -k test_nll_loss_backward_mps` should not crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144170
Approved by: https://github.com/kit1980, https://github.com/Skylion007
ghstack dependencies: #144167, #144162, #144083, #144084
2025-01-04 15:24:55 +00:00
99f2491af9 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 45411d1fc9a2b6d2f891b6ab0ae16409719e09fc.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/jeanschmidt due to Breaking internal CI, @albanD please help get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2571316444))
2025-01-04 14:17:20 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
a881954b0c [PTD] Dump rcclexp proxy trace in pytorch (#143678)
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog timeout or aborts.

Added
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console.

This is the changes of the PTD.

Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
 {F1971449692}

Reviewed By: c-p-i-o

Differential Revision: D67036093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
2025-01-04 10:20:47 +00:00
aa7d01ea22 Use sccache 0.9.0 on ROCm build job (#144125)
TSIA, sccache 0.9.0 seems to work fine with ROCm build job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144125
Approved by: https://github.com/jithunnair-amd, https://github.com/wdvr, https://github.com/jeffdaily
2025-01-04 08:56:48 +00:00
636a2c7e0f [Inductor][lowering] support out_dtype for dequant lowering (#143845)
In lowering, support the parameter `out_dtype` for `dequant_per_tensor` and `dequant_per_channel`.

Fix the following runtime error issue found in https://github.com/pytorch/ao/pull/1372:

```
File "/home/liaoxuan/pytorch_ao/torch/_inductor/lowering.py", line 452, in wrapped
    out = decomp_fn(*args, **kwargs)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
LoweringException: TypeError: quantized_decomposed_dequantize_per_tensor_default() got an unexpected keyword argument 'out_dtype'
  target: quantized_decomposed.dequantize_per_tensor.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[1, 7, 7, 9], stride=[441, 63, 9, 1]))
  ))
  args[1]: 0.01
  args[2]: 100
  args[3]: 0
  args[4]: 255
  args[5]: torch.uint8
  kwargs: {'out_dtype': torch.bfloat16}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143845
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-01-04 08:48:41 +00:00
417d9c3522 [Inductor/Triton] Upcast FP16/BF16 math reductions to FP32 (#141052)
Summary:
The Triton compiler does not automatically promote fp16/bf16 reductions to fp32 accumulation. This can result in significant accuracy issues.

This diff will upcast the input to FP32 for all math reductions `["welford_reduce", "welford_combine", "prod", "sum", "xor_sum"]`

Test Plan:
CI
```
python test/inductor/test_torchinductor.py TritonCodeGenTests.test_low_precision_reduction
```

Differential Revision: D65965032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141052
Approved by: https://github.com/blaine-rister
2025-01-04 07:57:10 +00:00
816328fa51 [dynamo][lazy] LazyVT utils to get original value/source and is_hashable (#144160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144160
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158, #144163
2025-01-04 06:23:05 +00:00
b5b1e9456a [MPSInductor] Add masked implementation (#144084)
More or less borrowed from
22580f160e/torch/_inductor/codegen/halide.py (L549-L563)

`pytest test/inductor/test_torchinductor.py -k _mps` score is 408 failed, 347 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144084
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144167, #144162, #144083
2025-01-04 04:30:07 +00:00
f15af077fb Fix get_source_partitions when weights are tied (#142446)
Summary:
Fix https://github.com/pytorch/pytorch/issues/142035 and  https://github.com/pytorch/pytorch/issues/143621

When Linear module params are tied to another parameter, like this:

```
class SimpleLinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLinearModel, self).__init__()
        # Define a linear layer
        self.linear = nn.Linear(input_size, output_size)
        self.tied_weight = self.linear.weight

    def forward(self, x):
        # Forward pass through the linear layer
        b = self.tied_weight + 1
        return self.linear(x), b
```

We get a graph like below:

```
graph():
    %p_tied_weight : [num_users=0] = placeholder[target=p_tied_weight]
    %p_linear_weight : [num_users=2] = placeholder[target=p_linear_weight]
    %p_linear_bias : [num_users=1] = placeholder[target=p_linear_bias]
    %x : [num_users=1] = placeholder[target=x]
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%p_linear_weight, 1), kwargs = {})
    %linear : [num_users=1] = call_function[target=torch.ops.aten.linear.default](args = (%x, %p_linear_weight, %p_linear_bias), kwargs = {})
    return (linear, add)
```

Notice that ` %p_linear_weight : [num_users=2]`.

When we get source partitions, we should exclude attributes nodes like `p_linear_weight` from outputs.

A real world example where people do something like this is in https://github.com/pytorch/pytorch/issues/142035.

Test Plan:
```
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r test_module_partitioner_weight_tied
```

Differential Revision: D66998592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142446
Approved by: https://github.com/angelayi
2025-01-04 04:28:20 +00:00
cyy
f9bf9057ef Fix ruff warnings in caffe2 and functorch (#144182)
In preparation for upgrading ruff config to py3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144182
Approved by: https://github.com/malfet
2025-01-04 04:15:01 +00:00
ec1f56fdcf [user triton] add support for prune_configs_by in @triton.autotune (#142207)
This PR adds support for prune_configs_by in the @triton.autotune decorator [docs](https://triton-lang.org/main/python-api/generated/triton.autotune.html#triton.autotune). Supporting this lets users reduce autotuning time by running user-supplied code (early_config_prune, perf_model) to prune the provided list of configs.

We implement this by realizing args/kwargs in call_triton_kernel(...), and then calling kernel.prune_configs(...).
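
A usage sketch of the decorator feature, following the signatures described in the Triton docs linked above; the kernel, the config values, and the pruning policy are illustrative only:

```python
import triton
import triton.language as tl

def early_config_prune(configs, named_args, **kwargs):
    # Illustrative policy: drop the largest blocks before benchmarking.
    return [c for c in configs if c.kwargs["BLOCK"] <= 1024]

@triton.autotune(
    configs=[triton.Config({"BLOCK": b}) for b in (256, 512, 1024, 2048)],
    key=["n_elements"],
    prune_configs_by={"early_config_prune": early_config_prune},
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs,
             tl.load(x_ptr + offs, mask=mask) + tl.load(y_ptr + offs, mask=mask),
             mask=mask)
```

With this change, kernels decorated this way can be handled when invoked under torch.compile instead of only in eager Triton.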

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142207
Approved by: https://github.com/zou3519, https://github.com/aakhundov
2025-01-04 03:50:28 +00:00
479d6f2199 [mps/inductor] Add support for log(). (#144169)
Tested via:

```
 % pytest test/inductor/test_mps_basic.py
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144169
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-04 03:07:56 +00:00
087c625261 [dynamo] Trace torch.typename (#144163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144163
Approved by: https://github.com/yanboliang, https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141, #144158
2025-01-04 02:52:58 +00:00
3292220c43 [dynamo][easy] Move symnode helpers to utils (#144158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144158
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #144129, #144130, #144141
2025-01-04 02:52:58 +00:00
98949df7a4 Fix torch.distributed._functional_collectives.AsyncCollectiveTensor for aten.to. (#134661)
Fixes #133421

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134661
Approved by: https://github.com/bdhirsh
2025-01-04 02:33:38 +00:00
eqy
7e3cd0e488 [CUDA] Check size calculation in ilpReduce for softmax (#144009)
For #143644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009
Approved by: https://github.com/Skylion007
2025-01-04 02:31:15 +00:00
eqy
dbdda654af [64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923)
#141831
Block/thread math requires a cast...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923
Approved by: https://github.com/ngimel
2025-01-04 02:30:38 +00:00
1d091e47d6 [Inductor UT] Generalize device-bias code in test_torchinductor.py introduced by #143884. (#144057)
Fix #144056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144057
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-01-04 02:24:33 +00:00
22580f160e Multinomial sampling fix on mps for non contiguous tensors (#141515)
Fixes #141457

As for the tests: I looked in `test/test_mps.py` but saw that the `test_multinomial` function is disabled. Happy to add a test where needed if there is a place where the multinomial function is tested on Metal.
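
A minimal sketch of the non-contiguous case, assuming an MPS device:

```python
import torch

probs = torch.rand(4, 6, device="mps").t()  # transpose -> non-contiguous
assert not probs.is_contiguous()

# Previously the MPS kernel read the buffer as if it were contiguous,
# sampling from the wrong probabilities.
idx = torch.multinomial(probs, num_samples=3)
print(idx.shape)  # (6, 3)
```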
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141515
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-04 01:21:37 +00:00
464b50dbd7 [MPSInductor] Add floor_div and index_expr implementation (#144083)
Simply copy-n-pasted from CPPInductor

`pytest test/inductor/test_torchinductor.py -k _mps` score is 418 failed, 337 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144083
Approved by: https://github.com/jansel
ghstack dependencies: #144167, #144162
2025-01-04 01:10:01 +00:00
6d25938540 [MPSInductor] Add remainder op (#144162)
For it to return the correct result for half-precision types, the operands must be upcast to float

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144162
Approved by: https://github.com/jansel
ghstack dependencies: #144167
2025-01-04 00:47:40 +00:00
f8e1eacf2f [MPSInductor] Extend constant to bool type (#144167)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144167
Approved by: https://github.com/jansel
2025-01-04 00:47:40 +00:00
d41134f7e5 [Inductor] Fix torch.polygamma() when n == 0 (#144058)
Fixes #143648

aten:

dec1a6d0f0/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L436-L447)

compiled kernel code:

```
cpp_fused_polygamma_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_devuser/tmpi1d9ksww/db/cdb7hyptwxpzukwd42x4ajfjlgrpum4a4htdd6lhb65apclsmno4.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                auto tmp1 = static_cast<float>(0.0);
                auto tmp2 = tmp1 == 0 ? calc_digamma(tmp0) : calc_polygamma(tmp0, tmp1);
                out_ptr0[static_cast<int64_t>(0L)] = tmp2;
            }
        }
    }
}
''')
```
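
A small check of the `n == 0` case (a sketch; `torch.special.digamma` serves as the eager reference):

```python
import torch

x = torch.tensor([0.5, 1.0, 2.5])
f = torch.compile(lambda t: torch.polygamma(0, t))
torch.testing.assert_close(f(x), torch.special.digamma(x))
```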

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144058
Approved by: https://github.com/jansel
2025-01-04 00:22:10 +00:00
52742b07c5 remove allow-untyped-defs from nn/utils/_deprecation_utils.py (#144136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144136
Approved by: https://github.com/aorenste
2025-01-03 23:44:14 +00:00
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
3251171ae8 Make whl metadata public readable (#144164)
After https://github.com/pytorch/pytorch/pull/143677/files#r1902138480 lands, the new nightly wheel metadata is not publicly readable, causing pip install to fail, for example https://github.com/pytorch/pytorch/actions/runs/12603415308/job/35128414909.

FBGEMM folks are also noticed this failure on their end (cc @q10)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144164
Approved by: https://github.com/clee2000
2025-01-03 21:08:49 +00:00
9bf2a9a616 [ScaledMM] Fix NaNs in test for garbage input data (#144042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144042
Approved by: https://github.com/janeyx99
2025-01-03 21:02:25 +00:00
b75f32b848 Update TorchDynamo-based ONNX Exporter memory usage example code. (#144139)
Addresses related review comments raised earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144139
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-01-03 20:41:36 +00:00
64bffb3124 remove allow-untyped-defs onnx/_internal/exporter/_fx_passes.py (#144134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144134
Approved by: https://github.com/Skylion007
2025-01-03 20:18:40 +00:00
64b197b603 remove allow-untyped-defs from export/_remove_auto_functionalized_pass.py (#144135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144135
Approved by: https://github.com/Skylion007
2025-01-03 20:08:11 +00:00
9b8a4e7141 remove allow-untyped-defs from torch/onnx/operators.py (#144133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144133
Approved by: https://github.com/Skylion007
2025-01-03 20:07:56 +00:00
6e09d32c00 remove allow-untyped-defs from torch/jit/_passes/_property_propagation.py (#144132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144132
Approved by: https://github.com/Skylion007
2025-01-03 20:07:37 +00:00
eb7a303d21 [dtensor] expose the __create_chunk_list__ in the doc (#144100)
As titled, this PR exposes this dunder method as a public API in the doc, so that different checkpoint implementations can leverage this protocol instead of exposing a separate API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144100
Approved by: https://github.com/awgu
ghstack dependencies: #144099
2025-01-03 20:06:23 +00:00
45411d1fc9 Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2025-01-03 20:03:40 +00:00
e9e18a9617 remove allow-untyped-defs from _export/db/logging.py (#144093)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144093
Approved by: https://github.com/Skylion007
2025-01-03 19:36:14 +00:00
ad09395674 [MPSInductor] Fix multi rangevar kernel invocation (#144050)
By changing the `thread_position_in_grid` type to uint{n} and passing dimensions during the kernel call

`pytest test/inductor/test_torchinductor.py -k _mps` score is 445 failed, 309 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144050
Approved by: https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105, #144156
2025-01-03 19:32:43 +00:00
52e107a7ca [MPSInductor] Add constant, isinf and isnan ops (#144156)
Per Table 6.5 of [Metal Language Specification](https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf) infinity is `HUGE_VALF`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144156
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #144055, #144051, #144122, #144105
2025-01-03 19:32:43 +00:00
383ff4011c [ez] Use strip for arg sanitization in upload_metadata_file to improve readability (#144155)
Minor thing that improves readability.  I didn't realize you could specify characters for strip when I wrote this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144155
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-03 19:25:30 +00:00
8b3479e361 remove allow-untyped-defs from torch/distributed/fsdp/_dynamo_utils.py (#144131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144131
Approved by: https://github.com/Skylion007
2025-01-03 19:07:21 +00:00
7b69f7b449 Clarify what we mean by decoupled weight decay in the *AdamWs (#144101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144101
Approved by: https://github.com/albanD
2025-01-03 19:06:00 +00:00
c36f94b373 [while_loop][dynamo] auto-unspecialize int input and output to unbacked symints (#143106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143106
Approved by: https://github.com/zou3519
ghstack dependencies: #143105, #143545
2025-01-03 19:01:07 +00:00
5660709856 [hop][BE] unify meta checking with check_meta_consistency (#143545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143545
Approved by: https://github.com/zou3519
ghstack dependencies: #143105
2025-01-03 19:01:07 +00:00
6e8dca9ff3 [while_loop][aot] auto-unspecialize int input and output to unbacked symints (#143105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143105
Approved by: https://github.com/zou3519
2025-01-03 19:01:07 +00:00
56f6289f6a [mps/inductor] Add support for atanh(). (#144121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144121
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-03 18:55:05 +00:00
a7b61c5b49 [MPSInductor] Add signbit op support (#144105)
By mapping it to `metal::signbit`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144105
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #144055, #144051, #144122
2025-01-03 18:34:46 +00:00
8d63a4a409 Revert "Set enable_trace_contextlib_contextmanager flag to True (#140604)"
This reverts commit 1c817fe6714cec510ccc6022b2c3e66146c3ad59.

Reverted https://github.com/pytorch/pytorch/pull/140604 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/140604#issuecomment-2569640837))
2025-01-03 18:23:53 +00:00
c5c897c3a1 [dynamo][easy] Miscellaneous fixes (#144141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144141
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129, #144130
2025-01-03 18:22:56 +00:00
732359c633 [dynamo][easy] Minor fixes in guards.cpp (#144130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129
2025-01-03 18:22:56 +00:00
a450e177fd [dynamo] remove inline inbuilt tests as flag is enabled by default (#144129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144129
Approved by: https://github.com/williamwen42
2025-01-03 18:22:56 +00:00
2409b49a33 Revert "Rewrite _reparametrize_module to use contextmanager (#138203)"
This reverts commit 7bf3b7cdc5631f9991eebcdd8ec09095339a9973.

Reverted https://github.com/pytorch/pytorch/pull/138203 on behalf of https://github.com/guilhermeleobas due to breaking one of the benchmarks (moco) ([comment](https://github.com/pytorch/pytorch/pull/138203#issuecomment-2569634001))
2025-01-03 18:17:32 +00:00
60fe8a65af [Inductor] Generalize tiling algorithm to handle fused reductions (#144041)
# Issue

This PR cleans up an edge case that wasn't handled by https://github.com/pytorch/pytorch/pull/137243. The existing tiling code assumes that `node.get_ranges()` is a reliable source of pointwise and reduction numels. This is true for pointwise kernels, but the situation is more complicated with reductions. Since reductions change the number of elements in a tensor, not all ops within a reduction kernel will have the same number of iterations. For example, `var_mean` fuses pointwise division with the output of reduction sum, and the division lacks the corresponding reduction ranges.

# Fix

Instead of getting numels from `node.get_ranges()`, explicitly pass the global pointwise and reduction numels to the relevant tiling functions. In `SIMDKernel.complete_partial_tiling`, we solve for the missing numel by dividing the global numel by the partial tiling's numel. This ensures all tilings have the correct global numel.

Also, in `SIMDKernel.is_compatible`, add the global reduction numel to node ranges that are missing it. For example, `{"x": 8, "r0_": 8}` is compatible with a node of ranges `([8], [])` when we have `reduction_numel=8`.

Finally, this PR generalizes some of the existing codegen to handle multiple reduction dims. We already had code to ignore reduction splits for pointwise kernels, but it only worked for 1D reductions. Now it can handle ND.

# Test plan

This PR parametrizes the existing CI test for `var_mean` to also run with tiled reductions. It also adds a new test checking that `var_mean` generates 2D tilings (with tiled reduction enabled). These new tests would fail on the current main branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144041
Approved by: https://github.com/jansel
2025-01-03 18:16:27 +00:00
e93f625d00 [AOTI] don't codegen autotune_at_compile_time for non-Triton kernels (#143990)
`autotune_at_compile_time` is a separate codegen file specifically for autotuning Triton kernels. We can skip it for non-Triton kernels (like CUTLASS).

This test (test_aoti_workspace_ptr) checks that `workspace_0.data_ptr()` is codegen-ed correctly in AOTI.

```
// in AOTI codegen
kernels.cuda_fused_0(
  (const half*)arg0_1.data_ptr(), (const half*)arg1_1.data_ptr(), (half*)buf0.data_ptr(),
  (int)200, (int)5216, (int)10432, (int)10432, (int)5216, (int)0, (int)5216,
  (size_t*)nullptr, (uint8_t*)workspace_0.data_ptr(), stream);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143990
Approved by: https://github.com/henrylhtsang, https://github.com/chenyang78, https://github.com/desertfire
2025-01-03 18:01:12 +00:00
f3968373c1 Migrate the rest of CUDA 12.1 jobs to 12.4 (#144118)
CUDA 12.4 is the default now and we don't build nightly 12.1 anymore, so it's time to move the rest of CI jobs to 12.4.  I also clean up some redundant CI jobs on periodic and inductor-periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144118
Approved by: https://github.com/atalman
2025-01-03 17:45:41 +00:00
cbdc70ae07 Use the build environment as sccache prefix instead of workflow name (#144112)
This is an attempt to improve cache usage for jobs in non-pull workflows like periodic, slow, or inductor as we are seeing build timeout there from time to time, for example https://github.com/pytorch/pytorch/actions/runs/12553928804.  The build timeout never happens in pull or trunk AFAICT because they are more up to date with the cache content coming from the PR itself.

Logically, the same build should use the same cache regardless of the workflows.  We have many examples where the same build, for example [linux-focal-cuda12.4-py3.10-gcc9-sm86](https://github.com/search?q=repo%3Apytorch%2Fpytorch+linux-focal-cuda12.4-py3.10-gcc9-sm86&type=code), is split between different workflows and, thus, uses different caches.

I could gather some sccache stats from CH in the meantime to try to prove the improvement before and after this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144112
Approved by: https://github.com/malfet
2025-01-03 17:33:03 +00:00
b9fbd65dfd AOTI fallback ops: remove ops that were never codegen'ed (#143421)
Removes 4 fallback ops that are currently not possible to codegen, which does not break ABI-compatibility.

1. `_cudnn_rnn_backward` and `_histogramdd_bin_edges` both return `Tensor[]`, which we cannot codegen with the current design.
2. `_sparse_coo_tensor_with_dims_and_tensors` only supplies a Sparse operator, which we don't support.
3. `zeros.names` requires a `Dimname` input, which we can't currently codegen.

Removing these ops from the list will improve test performance, since the fallback op generation will use the Python proxy executor instead of calling non-existent C functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143421
Approved by: https://github.com/desertfire
ghstack dependencies: #141371, #143223
2025-01-03 16:05:38 +00:00
b5b419d627 cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223)
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.

This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.
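
A sketch of the kind of graph that now takes the runtime-dispatched fallback, assuming cpp_wrapper mode is enabled via compile options (`conj()` sets the Conjugate dispatch key mentioned above):

```python
import torch

@torch.compile(options={"cpp_wrapper": True})
def f(a, b):
    return a.conj() @ b

a = torch.randn(4, 4, dtype=torch.complex64)
b = torch.randn(4, 4, dtype=torch.complex64)
print(f(a, b))
```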

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
2025-01-03 16:05:38 +00:00
e88d06f54e ir.ExternKernel: correctly handle kwarg default arguments (#141371)
Additionally, enable torchinductor opinfo tests exercising all
previously fixed bugs in this stack.

Note: I've manually sharded the cpp_wrapper CI checks into 2 shards.
Once all OpInfo tests are enabled we should switch back to automatic
sharding, but until then the pipeline doesn't have appropriate timing
stats.  More shards would be helpful given the compilation slowdown
associated with cpp_wrapper, but 2 will do for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141371
Approved by: https://github.com/desertfire
2025-01-03 16:05:31 +00:00
f7644efa79 [MPSInductor][EZ] Fix logical_[or|and] ops (#144122)
For boolean operands it does not really matter whether `&` or `&&` is used, but if one were ever to rely on operator precedence, bitwise ops have higher precedence than logical ones

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144122
Approved by: https://github.com/huydhn
ghstack dependencies: #144055, #144051
2025-01-03 15:28:07 +00:00
b336d72dae [MPSInductor] Preserve dtype during load (#144051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144051
Approved by: https://github.com/Skylion007
ghstack dependencies: #144055
2025-01-03 15:17:33 +00:00
a1ae8fadc7 [cpu][vec] support reduce ops for add and max (#144065)
### Description

While adding INT8 SDPA support in https://github.com/pytorch/ao/pull/1372, we found that `at::vec::vec_reduce_all<int32_t>` would go into the slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.

### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for fallback path in vec base;
- Add the operator `reduce` in vec base, in order to simplify the codes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144065
Approved by: https://github.com/mingfeima
2025-01-03 13:01:52 +00:00
55dc61dd52 Dataloader distribute tasks to workers when in_order is False (#142324)
Fixes #105203 and is a follow up PR to #141833

When `in_order` is True (the default), tasks are given out to workers in a round robin fashion. When `in_order` is False this is no longer needed, as we give up guarantees of reproducibility, and instead tasks should be given to workers that are able to perform work.
In this PR I've added tracking of the number of outstanding tasks for each worker (updated when tasks are added to their queue, and when data is returned to the main thread). When finding the next queue to add a task to, if `in_order` is False it will only add the task to a worker's queue if that worker has fewer than `_prefetch_factor` tasks outstanding.
The current default behaviour is left as is.
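
A usage sketch, under the assumption that the `in_order` flag from the preceding PR is available in the build:

```python
from torch.utils.data import DataLoader, Dataset

class Squares(Dataset):
    def __len__(self):
        return 100

    def __getitem__(self, i):
        return i * i

if __name__ == "__main__":
    # With in_order=False, results may arrive out of order and, after this
    # change, tasks go to whichever worker has fewer than prefetch_factor
    # items outstanding instead of strict round-robin.
    loader = DataLoader(Squares(), num_workers=4, prefetch_factor=2, in_order=False)
    for batch in loader:
        pass
```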

Tests are also updated to assert on the worker IDs for each sample of data returned.
I've run the following to confirm they aren't flaky
```bash
for i in {1..20}; do python test/test_dataloader.py TestOutOfOrderDataLoader; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142324
Approved by: https://github.com/andrewkho
2025-01-03 12:57:04 +00:00
c09bf71bd6 [Inductor][CPU] Fix C++ compile error of torch.max on bool type (#143848)
Fix https://github.com/pytorch/pytorch/issues/143568
Before:
![image](https://github.com/user-attachments/assets/3e1e869e-7ae7-45c0-a334-8a663028e003)
After:
![image](https://github.com/user-attachments/assets/91f72920-64bd-449a-a6c6-6048409c1450)
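
A repro sketch for the CPU case (the compiled C++ for a bool reduction previously failed to build):

```python
import torch

x = torch.tensor([True, False, True])
f = torch.compile(lambda t: torch.max(t))
print(f(x))  # tensor(True)
```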

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143848
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-01-03 09:00:43 +00:00
d9507548d8 [dynamo][BE] move zip_longest polyfill to submodule polyfills.itertools (#144067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144067
Approved by: https://github.com/yanboliang
ghstack dependencies: #144066
2025-01-03 08:08:31 +00:00
fb1beb31d2 [dynamo][BE] move dropwhile polyfill to submodule polyfills.itertools (#144066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144066
Approved by: https://github.com/jansel
2025-01-03 08:08:31 +00:00
00df63f09f [ROCm] Fix for ld failed to convert GOTPCREL relocation in PyTorch build (#143986)
I experienced an error while doing a DEBUG build of pytorch on rocm:
```
additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
```
Based on discussions on similar issue #138427, I fixed it after adding the `--offload-compress` to the HIP_HIPCC_FLAGS to successfully build DEBUG mode on my node.

Further updated the PR to enable the flag for non-DEBUG builds as well due to the size reduction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143986
Approved by: https://github.com/jeffdaily
2025-01-03 06:53:08 +00:00
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
48a05ee773 [dtensor] improve doc of the DTensor class (#144099)
As titled: explicitly list all public members to make sure the public API stays consistent, and use groupwise member order to make the doc look better.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144099
Approved by: https://github.com/awgu
2025-01-03 05:35:44 +00:00
41b5c600df [ReduceOps] Add dimension checking for cummin()/cummax(). (#143920)
Summary: cum{min,max} didn't guard against 0-d vectors and allowed an arbitrary dimension to be passed.

Test Plan: torch_test.py

Fixes #71477
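
A sketch of the newly rejected call (the exact error type is an assumption; out-of-range dims generally raise an IndexError):

```python
import torch

x = torch.arange(6.0)
values, indices = torch.cummax(x, dim=0)  # valid

try:
    torch.cummax(x, dim=3)  # out of range for a 1-D tensor; now checked
except (IndexError, RuntimeError) as e:
    print(e)
```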

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143920
Approved by: https://github.com/malfet
2025-01-03 04:14:33 +00:00
c5b75f8db1 [AOTI] Remove more AOTI_TORCH_EXPORT (#144081)
Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a windows build issue.

Differential Revision: D67762069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081
Approved by: https://github.com/yushangdi
2025-01-03 02:17:38 +00:00
c31912666e [ROCm] Print amdgpu info on bare metal for CI runners (#144038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144038
Approved by: https://github.com/jeffdaily
2025-01-03 02:00:40 +00:00
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, the roctracer is not available for Windows. This PR disables any mentions of its usage on Windows, and creates dummy functions to keep compatibility with existing code, which instead warn the user that the feature is unavailable on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
891a86d1ad remove allow-untyped-defs from ao/quantization/experimental/fake_quantize.py (#144091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144091
Approved by: https://github.com/aorenste
2025-01-03 01:26:36 +00:00
377e29745f remove allow-untyped-defs from distributed/elastic/utils/data/cycling_iterator.py (#144090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144090
Approved by: https://github.com/aorenste
2025-01-03 01:22:50 +00:00
0d6db839a7 remove allow-untyped-defs from utils/data/datapipes/iter/streamreader.py (#144088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144088
Approved by: https://github.com/aorenste
2025-01-03 01:21:44 +00:00
bdfb40ed29 remove allow-untyped-defs from utils/_import_utils.py (#144089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144089
Approved by: https://github.com/aorenste
2025-01-03 01:21:13 +00:00
28a74fe3aa remove allow-untyped-defs from torch/mps/event.py (#144092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144092
Approved by: https://github.com/aorenste
2025-01-03 01:20:17 +00:00
496fc90965 [CI] Multigpu 1 -> 2 shards (#143992)
Fixes #ISSUE_NUMBER
It's been timing out https://github.com/pytorch/pytorch/actions/runs/12544191739/job/34977636276

They're still somewhat uneven but they're both under the limit now.  It would probably be better to use run_test.py's sharding to do this, maybe in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143992
Approved by: https://github.com/huydhn
2025-01-03 00:33:16 +00:00
3eb3f4ed55 Upload METADATA file with whl binaries (#143677)
Upload the metadata file for wheels for pep658 https://peps.python.org/pep-0658/
Using a python script but using bash might be easier...

--

Testing

Example run https://github.com/pytorch/pytorch/actions/runs/12550595201/job/34994883276 without actual upload, just dry run

Lightly tested the script to make sure it uploads to s3, but integration with the bash script + workflow is untested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143677
Approved by: https://github.com/seemethere
2025-01-03 00:32:05 +00:00
bb5e439f2d Add networkx as bazel dep to fix CI failure (#143995)
Add networkx as a dependency for test_bazel

Example failure: https://github.com/pytorch/pytorch/actions/runs/12551752021/job/34996706301

```

INFO: From Testing //:test_bazel:
==================== Test output for //:test_bazel:
Traceback (most recent call last):
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 33, in <module>
    test_simple_compile_eager()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/test/_test_bazel.py", line 27, in test_simple_compile_eager
    opt_foo1 = torch.compile(foo, backend="eager")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2533, in compile
    backend = _TorchCompileWrapper(backend, mode, options, dynamic)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/__init__.py", line 2342, in __init__
    self.compiler_fn = lookup_backend(backend)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 66, in lookup_backend
    _lazy_import()
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/registry.py", line 102, in _lazy_import
    import_submodule(backends)
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/utils.py", line 2797, in import_submodule
    importlib.import_module(f"{mod.__name__}.{filename[:-3]}")
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/execroot/pytorch/external/python3_10_x86_64-unknown-linux-gnu/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_dynamo/backends/common.py", line 12, in <module>
    from torch._functorch.aot_autograd import (
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/aot_autograd.py", line 147, in <module>
    from .partitioners import default_partition
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/partitioners.py", line 31, in <module>
    from ._activation_checkpointing.graph_info_provider import GraphInfoProvider
  File "/var/lib/jenkins/.cache/bazel/_bazel_jenkins/fdf6d09bf4b4f04a71e2a7dfceb40620/sandbox/processwrapper-sandbox/6504/execroot/pytorch/bazel-out/k8-fastbuild/bin/test_bazel.runfiles/pytorch/torch/_functorch/_activation_checkpointing/graph_info_provider.py", line 3, in <module>
    import networkx as nx
ModuleNotFoundError: No module named 'networkx'
```

No periodic runs on this PR or its main branch commit, but I'm pretty sure it started with https://togithub.com/pytorch/pytorch/pull/143539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143995
Approved by: https://github.com/huydhn
2025-01-02 19:42:18 +00:00
a8c98ce175 [cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515)
# Summary:

This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.

I was not able to get AsyncMM.cu to build (still trying); Yfiu suggested that I just skip it for now

Test Plan:
Have run various build commands to try and expose errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-02 18:45:11 +00:00
8506a2af9a remove allow-untyped-defs from _export/pass_infra/proxy_value.py (#143944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143944
Approved by: https://github.com/aorenste
ghstack dependencies: #143943
2025-01-02 18:17:03 +00:00
8f3eb84373 ROCm: Enable 4 gpu tests for distributed config (#140319)
Change the label to make sure the jobs land on a
node which has >= 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501
2025-01-02 17:22:11 +00:00
916b510ff5 Enable mkldnn pattern matcher tests for BF16 on AArch64 (#144030)
Fixes #143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144030
Approved by: https://github.com/malfet
2025-01-02 17:13:38 +00:00
a93e75d1e2 [MPS] Handle implicit cpu-scalar-to-gpu transfer (#144055)
Followup after https://github.com/pytorch/pytorch/pull/143934, this check is no longer necessary and fixes a subset of inductor tests

Before: `pytest test/inductor/test_torchinductor.py -k _mps` reported 463 failed, 291 passed, 32 skipped; after: 456 failed, 298 passed, 32 skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144055
Approved by: https://github.com/Skylion007
2025-01-02 17:12:39 +00:00
0431d47eaa [tp] propagate src_data_rank kwarg in TP API (#144005)
As titled, this PR propagates src_data_rank in the TP API, so that module-level APIs can leverage the flexibility to choose src_data_rank and avoid the communication when it is not needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144005
Approved by: https://github.com/tianyu-l
ghstack dependencies: #143883
2025-01-02 05:35:52 +00:00
f242dbb76f [dtensor] add src_data_rank to distribute_tensor API (#143883)
As titled, this PR adds a kwarg src_data_rank to the distribute_tensor API, to allow the user to specify a specific rank as the source of the full tensor data. Previously we specified group_rank=0 by default as the source of truth for single-device semantics. This new option (see the usage sketch below):

* gives advanced users the flexibility to choose the source data rank
* allows the user to specify None explicitly, which means we will skip the communications needed (scatter/broadcast) for cases that do not care about single-device semantics (i.e. loading from a checkpoint)
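
A usage sketch, assuming a 4-rank job launched with torchrun and a CUDA device per rank:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
full = torch.randn(8, 8)

# Default: rank 0's data is scattered/broadcast as the source of truth.
dt0 = distribute_tensor(full, mesh, [Shard(0)])

# src_data_rank=None skips that communication entirely; each rank keeps
# its own local shard (useful when loading from a checkpoint).
dt1 = distribute_tensor(full, mesh, [Shard(0)], src_data_rank=None)
```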

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143883
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2025-01-02 05:35:52 +00:00
dec1a6d0f0 [dynamo] Separate out GetItemSource and DictGetItemSource (#143926)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143926
Approved by: https://github.com/jansel
2025-01-01 02:39:41 +00:00
8d9ff9c8a4 Fix a bug for wrong stride in fake tensor (#141427)
Fixes #141426

Please see details in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141427
Approved by: https://github.com/jansel
2024-12-31 23:45:32 +00:00
e7ed660233 [inductor] Add missing py312 xfail (#144006)
See #144006

```py
__________________________________________ CudaReproTests.test_repeated_masked_load __________________________________________
RuntimeError: First class dim doesn't work with python 3.12

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/jansel/conda/envs/pytorch/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/jansel/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
  File "/home/jansel/pytorch/test/inductor/test_cuda_repro.py", line 1678, in test_repeated_masked_load
    from functorch.einops import rearrange
  File "/home/jansel/pytorch/functorch/einops/__init__.py", line 1, in <module>
    from .rearrange import rearrange
  File "/home/jansel/pytorch/functorch/einops/rearrange.py", line 7, in <module>
    from functorch._C import dim as _C
ImportError: initialization failed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144006
Approved by: https://github.com/Skylion007
2024-12-31 23:37:05 +00:00
a174ee2255 Revert "Fix duplicate pattern error (#139321)"
This reverts commit 9e8d84f8631317ce61de4f0f9731fc1b1ccc3d2b.

Reverted https://github.com/pytorch/pytorch/pull/139321 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/139321#issuecomment-2566620095))
2024-12-31 17:44:02 +00:00
d8a2796fb6 Revert "[Inductor UT] Generalize newly introduced device-bias hard code in (#143975)"
This reverts commit 7c1c0730beed9bb05a16ba678a8f32b29fdd0a29.

Reverted https://github.com/pytorch/pytorch/pull/143975 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/139321 feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/143975#issuecomment-2566619312))
2024-12-31 17:41:06 +00:00
eec30916e7 Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 135a2d44830b2de1ed6714f52cc6a612406adb6d.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/jeanschmidt due to breaking internal signals ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2566615835))
2024-12-31 17:35:32 +00:00
5ef0de7615 [MPSInductor] Fix multiple kernel generation (#143998)
At the moment by generating multiple MetalLibraries

`pytest test/inductor/test_torchinductor.py -k _mps` score is 434 failed, 317 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143998
Approved by: https://github.com/jansel, https://github.com/ruidazeng
ghstack dependencies: #143948, #143949, #143973, #143977
2024-12-31 13:51:50 +00:00
f0f09bb3c2 [MPSInductor] Implement minimum and maximum ops (#143977)
By calling `metal::min` and `metal::max` respectively, with arguments typecast to a common type to avoid ambiguous-call errors

TODO: Implement NaN propagation for both eager and compile, see https://github.com/pytorch/pytorch/issues/143976

`pytest test/inductor/test_torchinductor.py -k _mps` score is 460 failed, 291 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143977
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949, #143973
2024-12-31 13:51:50 +00:00
09e47ab7ab Refine CUDA Stream priority (#143849)
# Motivation
As mentioned in https://github.com/pytorch/pytorch/pull/141119#discussion_r1897480515, we properly handle the priority value if it is outside of the priority range.

# Additional Context
If the value falls outside of the allowed priority range, it will automatically be mapped to the nearest valid priority (either lowest or highest).
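
A sketch of the described behavior (the specific priority values are arbitrary and well outside any device's real range):

```python
import torch

# Out-of-range values are clamped rather than rejected.
s_urgent = torch.cuda.Stream(priority=-100)  # mapped to the highest priority
s_lazy = torch.cuda.Stream(priority=100)     # mapped to the lowest priority

with torch.cuda.stream(s_urgent):
    torch.ones(4, device="cuda").sum()
```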

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143849
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123, #143799
2024-12-31 11:15:59 +00:00
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
8f6c4d1732 Add get_stream_from_external API for XPU backend (#141123)
# Motivation
This PR aims to introduce `torch.xpu.ExternalStream`, which can be used to wrap a SYCL queue created in other libraries for use in PyTorch.

# Additional Context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119
2024-12-31 11:15:52 +00:00
a68c0ca497 Add low priority XPU Stream (#141119)
# Motivation
Due to the potential for the external SYCL queue to have a low priority, we need to support the low-priority SYCL queue for native XPU Streams to maintain consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141119
Approved by: https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #142347
2024-12-31 11:15:45 +00:00
39450ae655 Refine XPU external Stream (#142347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142347
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-12-31 11:15:38 +00:00
16a57e232c removed dead code for dynamo flag dead_code_elimination (#140938)
Fixes #136862

1.  removed dead code from torch/_dynamo/convert_frame.py
2.  ran `lintrunner -a` and all the tests passed.
3. ran the unit tests and everything seems to be in order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140938
Approved by: https://github.com/zou3519
2024-12-31 09:27:43 +00:00
01034e963c [AOTI] Not use AOTI_TORCH_CHECK in non AOTI mode. (#143970)
Fix #143967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143970
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-31 06:28:32 +00:00
a2753e376b [Inductor] Support tiling reduction dimensions (#137243)
Fixes #134277 and https://github.com/pytorch/pytorch/issues/142317.

Sub-PRs containing refactors from this one:
 - https://github.com/pytorch/pytorch/pull/141733
 - https://github.com/pytorch/pytorch/pull/141738
 - https://github.com/pytorch/pytorch/pull/141751 (based off the former)
 - https://github.com/pytorch/pytorch/pull/142249
 - https://github.com/pytorch/pytorch/pull/142020
 - https://github.com/pytorch/pytorch/pull/143135

 These refactor PRs should land before the main one.

# Feature

*Note: to minimize risk, multi-dimensional reductions are gated by the flag `config.triton.tile_reductions`, which defaults to False.*

Instead of having a single reduction dimension called `"r"`, we can now support 2D reductions with `"r0_"` and `"r1_"` dimensions. 2D reductions generate two nested loops, with different block pointer advancements in each loop body. Most of the implementation is generic to ND reductions, but for now the tiling algorithm sets a hard limit at 2D.

Here's an example of a 2D persistent reduction kernel:
```
@triton.jit
def triton_per_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr):
    xnumel = 1
    r0_numel = 15
    R0_BLOCK: tl.constexpr = 16
    r1_numel = 15
    R1_BLOCK: tl.constexpr = 16
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], True, tl.int1)
    r0_index = tl.arange(0, R0_BLOCK)[None, :, None]
    r0_offset = 0
    r0_mask = r0_index < r0_numel
    r1_index = tl.arange(0, R1_BLOCK)[None, None, :]
    r1_offset = 0
    r1_mask = r1_index < r1_numel
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    roffset = r1_offset + (r0_offset*r1_numel)
    rindex = r1_index + (r0_index*r1_numel)
    r0_0 = r0_index
    r1_1 = r1_index
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[15, 15], strides=[30, 1], block_shape=[R0_BLOCK, R1_BLOCK], order=[1, 0], offsets=[r0_offset, r1_offset]), boundary_check=[0, 1], padding_option='zero')[None, :, :]
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
    tmp3 = tl.where(r0_mask & r1_mask, tmp1, 0)
    tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])
    tmp5 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(out_ptr0 + (tl.full([XBLOCK, 1, 1], 0, tl.int32)), tmp5, None)
''', device_str='cuda')
```

There are a few main differences between this kernel and what Inductor would generate without this PR.
 - Instead of an `r`/`RBLOCK` dimension, we have two reduction dimensions: `r0_`/`R0_BLOCK` and `r1_`/`R1_BLOCK`.
 - There are special size and indexing variables for reductions, which don't directly correspond to any kernel dimension. (`rindex`, `rnumel`, `RBLOCK`, and `roffset`.) These collapse N-D reduction sizes and indices into 1D. This simplifies the codegen for reductions, which sometimes want to access linear indices instead of N-dimensional ones. Doing things this way allows us to generate N-D loads and stores, but access this data as if it were 1D, minimizing the blast radius of this PR. Although this makes the code more verbose, it shouldn't have a perf impact because the triton compiler eliminates dead code.
 - We generate the line `tmp4 = tl.reshape(tmp3, [XBLOCK, RBLOCK])` before performing the actual reduction. This reshapes N reduction dimensions into 1D. This allows us to reduce over all N dimensions at once, simplifying the codegen and allowing the Triton compiler to decide the order of processing under the hood.

Here's an example of a looped reduction:
```
@triton.jit
def triton_red_fused_sum_0(in_ptr0, out_ptr0, xnumel, r0_numel, r1_numel, XBLOCK : tl.constexpr, R0_BLOCK : tl.constexpr, R1_BLOCK : tl.constexpr):
    xnumel = 3
    r0_numel = 43
    r1_numel = 129
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    r0_base = tl.arange(0, R0_BLOCK)[None, :, None]
    r1_base = tl.arange(0, R1_BLOCK)[None, None, :]
    rnumel = r0_numel * r1_numel
    RBLOCK: tl.constexpr = R0_BLOCK*R1_BLOCK
    rbase = r1_base + (r0_base*r1_numel)
    x0 = xindex
    block_ptr0 = tl.make_block_ptr(in_ptr0, shape=[3, 43, 129], strides=[11094, 258, 1], block_shape=[XBLOCK, R0_BLOCK, R1_BLOCK], order=[2, 1, 0], offsets=[xoffset, 0, 0])
    _tmp2 = tl.full([XBLOCK, R0_BLOCK, R1_BLOCK], 0, tl.float32)
    for r0_offset in range(0, r0_numel, R0_BLOCK):
        r0_index = r0_offset + r0_base
        r0_mask = r0_index < r0_numel
        for r1_offset in range(0, r1_numel, R1_BLOCK):
            r1_index = r1_offset + r1_base
            r1_mask = r1_index < r1_numel
            roffset = r1_offset + (r0_offset*r1_numel)
            rindex = r1_index + (r0_index*r1_numel)
            r0_1 = r0_index
            r1_2 = r1_index
            tmp0 = tl.load(block_ptr0, boundary_check=[0, 1, 2], padding_option='zero', eviction_policy='evict_first')
            tmp1 = tl.broadcast_to(tmp0, [XBLOCK, R0_BLOCK, R1_BLOCK])
            tmp3 = _tmp2 + tmp1
            _tmp2 = tl.where(r0_mask & r1_mask & xmask, tmp3, _tmp2)
            block_ptr0 = tl.advance(block_ptr0, [0, 0, R1_BLOCK])
        block_ptr0 = tl.advance(block_ptr0, [0, R0_BLOCK, (-1)*R1_BLOCK*((128 + R1_BLOCK) // R1_BLOCK)])
    tmp4 = tl.reshape(_tmp2, [XBLOCK, RBLOCK])
    tmp2 = tl.sum(tmp4, 1)[:, None, None]
    tl.store(tl.make_block_ptr(out_ptr0, shape=[3], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.reshape(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

In addition to the aforementioned changes to the persistent reduction, multidimensional looped reductions have a few more lines of code:
 - They calculate indices inside the loop using `r0_base` and `r1_base`. For compatibility with existing codegen, these are collapsed to the 1D variant `rbase`.
 - Block pointer advancements are more nuanced for multidimensional loops. At the end of each loop body, we emit a `tl.advance` line which not only increments the pointer in its own dimension, but also undoes the cumulative increments of the previous loop level. This is equivalent to the usual practice in nested loops of starting with a fresh iteration variable at each level. Implementing this required refactoring the way we generate pointer advancements into a new `self.pointer_advancements` field of the kernel, which categorizes advancements by dimension.

The biggest difficulty in implementing this feature was that we represented tiling with a tuple like `(5,2)`. In the existing codebase, the compiler can infer that the reduction dimension of `(5,2)` is `2`, since reductions are always the last dimension. This became cumbersome now that we have to support multiple reduction dimensions, so I refactored tiling into a dict like `{"x": 5, "r0_": 2, "r1_": 4}`. This required quite a few code changes, but I don't think it makes the underlying logic much more complex. This will also make it easier to eventually support simultaneous pointwise and reduction tiling, like `{"x": 5, "y": 5, "r0_": 2, "r1_": 4}`. (This is not supported today, but we might want to do it eventually.)

The existing tiling algorithm generalized naturally to support reductions. For pointwise kernels, we tile the pointwise dimensions (`"x"`, `"y"`) as is. For reduction kernels, we never tile the `"x"` dimension, and only tile the reduction dimensions (`"r0_"`, `"r1_"`). Thus we only ever tile pointwise OR reduction dimensions, but not both. In principle it seems possible to support both, but it would likely require changes to the kernel fusion and autotuning logic. I thought it best to keep this PR as minimal as possible since it already touched a lot of different files.

Unfortunately, these changes weren't enough to get block pointers in some seemingly simple test cases. In some tests for `argmax` and `var_mean`, we already collapse reduction dimensions into 1D and generate modular indexing expressions, prior to tiling. So it's not trivial to figure out how to expand the collapsed reduction dimension back to a shape that would simplify the indexing.

To address these cases, this PR adds a new feature to the `config.prefer_nd_tiling` option, which analyzes reads and writes in the kernel, using the same mod-div pattern matching logic that generates block pointers later on. By matching this pattern, we can solve for the tiling splits which *would* simplify the indexing expression, and then use that tiling to eliminate the modular indexing and emit a block pointer. This tiling mode is still off by default, but it's important for certain applications where we need to get as many block pointers as possible.
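As a small usage sketch (the exact config attribute paths are assumptions based on the flag names mentioned above and may differ by version):

```python
import torch
import torch._inductor.config as inductor_config

# Assumed flag locations, based on the names in this PR description.
inductor_config.triton.prefer_nd_tiling = True
inductor_config.triton.tile_reductions = True

def f(x):
    # Reduce over two trailing dims, matching the looped-reduction example above.
    return x.sum(dim=(1, 2))

compiled = torch.compile(f)
out = compiled(torch.randn(3, 43, 129, device="cuda"))
```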

# Test plan

This touches pretty much anything that uses the Triton and Halide backends, so the existing CI provides good coverage. However, 2D reductions are gated behind a few feature flags like `config.prefer_nd_tiling` and `config.tile_reductions`, so this really only checks that the PR doesn't break 1D reductions.

In addition to existing CI tests, this PR also adds some new tests that specifically stress 2D reductions:

- `test_2d_reduction_odd_shapes`: test 2D reductions with a variety of ops and sizes. This covers the typical persistent and looped reductions.
-  `test_2d_reduce_no_x_dim`: test 2D reductions with no x dimension.
-  `test_2d_welford_reduction`: test 2D welford reductions with block pointers.
- `test_welford_non_block_pointer`: test a 2D welford reduction when block pointer analysis fails.
- `test_reduction_multiple_discontiguous_dims`: test reducing over more than one discontiguous dimension. We won't get a block pointer for this case, since that would require 3D tiling, but we're currently limited to 2D.
- `test_2d_reduction_multi_kernel`: test multi kernel autotuning on a 2D softmax kernel.
- `test_enable_tiled_reductions`: test that `config.triton.tile_reductions` enables/disables this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137243
Approved by: https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-31 05:06:46 +00:00
f3e5078c27 [Inductor] Relax size constraints for re-inplacing (#143884)
Current reinplacing requires that the input buffer and output buffer have exactly the same storage size. However, matmul padding may increase the tensor size slightly for better performance, which prevents reinplacing.

This PR changes the size constraints to be:
- input and output buffer have exactly the same symbolic expression for storage size (i.e., sympy str).
- it's statically known that 0.99 * input_size <= output_size <= input_size
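A toy sketch of the relaxed bound in the second bullet above (illustrative only; how the two conditions are combined in the actual pass isn't spelled out here):

```python
def sizes_ok_for_reinplace(input_numel: int, output_numel: int) -> bool:
    # statically known: 0.99 * input_size <= output_size <= input_size
    return 0.99 * input_numel <= output_numel <= input_numel

# Matmul padding grew the reused buffer from 1000 to 1008 elements;
# the slightly smaller output can still be written in place.
print(sizes_ok_for_reinplace(1008, 1000))  # True
print(sizes_ok_for_reinplace(2000, 1000))  # False
```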

### Apply on llm.c
See the reuse of `buf1`.
Before relaxing size requirements on re-inplacing: ([P1703512078](https://www.internalfb.com/phabricator/paste/view/P1703512078))
![1](https://github.com/user-attachments/assets/1472f550-6eb8-4d5c-9965-49bbb20d81a9)

After relaxing size requirements on re-inplacing: ([P1703513053](https://www.internalfb.com/phabricator/paste/view/P1703513053))
![2](https://github.com/user-attachments/assets/416294dd-30eb-4e12-a36c-1aebf9af530b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143884
Approved by: https://github.com/eellison
2024-12-31 03:52:47 +00:00
cyy
8df99b6a6e Remove unneeded std::make_optional (#143575)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143575
Approved by: https://github.com/Skylion007
2024-12-31 03:08:47 +00:00
11bb94b7ea [MPSInductor] Fix index generation for transpose (#143973)
Alas, PythonPrinter would not work here, nor would CppPrinter, so start building MetalPrinter.

`pytest test/inductor/test_torchinductor.py -k _mps` score is 474 failed, 277 passed, 32 skipped
Before this change:
`pytest test/inductor/test_torchinductor.py -k _mps` reported 506 failed, 245 passed, 32 skipped

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143973
Approved by: https://github.com/jansel
ghstack dependencies: #143948, #143949
2024-12-31 02:04:50 +00:00
cb24013b5b Fix assertion failure in pytorch profiler (#143940)
Summary:
Attempt to fix the following exception which occurred when profiling a Pytorch model ( Meta-internal LLM ) that also involved a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause of this issue seems to be that a thread's call stack can be empty, while the code asserts that it is not.

I fixed this with some minimal changes to profiler_python.cpp

Approach:
 * Ensuring that the stack in question is not empty before trying to pop from it.

Test Plan:
* Tested manually on a reproducible scenario where the assertion failure was otherwise triggered (repro too large to include here). The assertion failure disappears.
* CI

Differential Revision: D67691558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2024-12-31 01:43:04 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
934eaa503f [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR aims to add functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-30 23:51:17 +00:00
d9a6ffb63c [FSDP] Add workaround to fix buffer_dtype without root parameters (#143989)
Fixes https://github.com/pytorch/pytorch/issues/143900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143989
Approved by: https://github.com/H-Huang
2024-12-30 23:42:24 +00:00
2da7fb5320 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not get cache hits with each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-30 23:35:11 +00:00
d88a8c41d5 Fix flaky "Upload test stats" job (#143991)
Test stat uploading was intermittently failing due to certain XML strings being opportunistically converted to numbers, when string output was expected. This PR makes the conversion behavior optional, which should fix the stat uploads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143991
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-12-30 21:40:01 +00:00
d260bc4476 cpp_wrapper: minimize pybind11 dependency (#143772)
Only include the parts of `pybind11` that handle GIL management within `cpp_wrapper`. This dramatically improves compilation times by reducing the number of headers we compile. Improvements on my local system are on the order of 2x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143772
Approved by: https://github.com/Skylion007
2024-12-30 20:41:02 +00:00
baee623691 [BE][Ez]: Update fmtlib submodule to 1.11.1 (#143937)
* Exactly the same as the previous fmtlib except it fixes an edge case that could affect ABI compatibility between fmtlib versions.
* Seems safe to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143937
Approved by: https://github.com/albanD
2024-12-30 19:46:27 +00:00
9d026000de change import relative paths due to internal build failures (#143968)
Internal builds were failing due to #143355; this changes the imports to be relative, similar to other imports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143968
Approved by: https://github.com/albanD
2024-12-30 17:19:49 +00:00
c27c788e35 [MPS] Fix torch.add(x,y, alpha=2) crash (#143949)
TODO: as followup PR replace this weird logic with shaders

Fixes https://github.com/pytorch/pytorch/issues/143932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143949
Approved by: https://github.com/Skylion007
ghstack dependencies: #143948
2024-12-30 17:16:29 +00:00
beb6c2dea5 [MPS] Fix crash when mm is invoked with mixed dtypes (#143948)
Simply by copy-n-pasting check from
a7915c56f6/aten/src/ATen/native/cuda/Blas.cpp (L254-L257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143948
Approved by: https://github.com/Skylion007
2024-12-30 17:13:34 +00:00
7c1c0730be [Inductor UT] Generalize newly introduced device-bias hard code in (#143975)
test_pattern_matcher.py
Fix #143974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143975
Approved by: https://github.com/malfet
2024-12-30 16:47:19 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful for simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
438698b20b [CD] Remove redundant triton dependency for xpu wheels (#143839)
Because XPU CD wheels enabled PyPI dependencies in https://github.com/pytorch/pytorch/pull/141135, PYTORCH_EXTRA_INSTALL_REQUIREMENTS already has a value for the XPU CD wheel build.
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Fixes #143838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143839
Approved by: https://github.com/huydhn
2024-12-30 13:39:06 +00:00
2fa09853cb Update slow tests (#143745)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143745
Approved by: https://github.com/pytorchbot
2024-12-30 11:51:49 +00:00
2ed4d65af0 Update torch-xpu-ops commit pin (#143853)
Update the torch-xpu-ops commit to [214f33](214f33b9d9), which includes:

- Fix building issue for transformer related operators
- Improve XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143853
Approved by: https://github.com/EikanWang
2024-12-30 02:38:16 +00:00
1b0d19a2cb Revert "[inductor] Make generated kernels deterministic (#143951)"
This reverts commit 79b354ee37b7d8a06a48ca8cc4e19a3fd006b433.

Reverted https://github.com/pytorch/pytorch/pull/143951 on behalf of https://github.com/wdvr due to failing tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/143951#issuecomment-2564952267))
2024-12-30 02:06:38 +00:00
cf89127137 [Torch.package] Add support for UntypedStorage tensors (#143930)
Summary: fp8 uses untyped storage. Add support for torch.package by using the same logic as in serialization.py

Differential Revision: D67684033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143930
Approved by: https://github.com/henrylhtsang
2024-12-30 02:03:52 +00:00
92d8965082 Adding support for differentiable lr, weight_decay, and betas in Adam/AdamW (#143726)
Third PR in a series of PRs to broaden differentiable optimizer support w/ @janeyx99 (sorry for pinging over the holidays! I just wanted to put this one out but I am definitely not asking for review or anything like that rn)

This is also going to probably be my last PR before the holidays!

Note: This is a branch of #143710 -- I've never worked on a branch of a branch before so I wasn't sure about the protocol so I thought I'd just made the PR and wait until that one gets merged.

This is adding support for differentiable lr, weight_decay, and betas to Adam and AdamW (but after refactoring AdamW into an Adam subclass, it's really just changing code in torch/optim/adam.py)
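A hedged usage sketch of what this enables (assuming the hyperparameters are passed as tensors with `requires_grad=True` together with `differentiable=True`; the exact supported combinations may vary):

```python
import torch

params = [torch.nn.Parameter(torch.randn(3))]
lr = torch.tensor(1e-2, requires_grad=True)
betas = (torch.tensor(0.9, requires_grad=True), torch.tensor(0.999, requires_grad=True))
weight_decay = torch.tensor(0.01, requires_grad=True)

opt = torch.optim.Adam(params, lr=lr, betas=betas,
                       weight_decay=weight_decay, differentiable=True)

loss = (params[0] ** 2).sum()
loss.backward(create_graph=True)  # keep the graph so the optimizer step stays differentiable
opt.step()
```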

I had one main thing I was wondering about, which is that adam already has a differentiable flag built in, so I have code like this
```py
if differentiable and isinstance(beta2, Tensor):
    if beta2.requires_grad:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
    else:
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```
That I could definitely simplify to just
```py
if differentiable and isinstance(beta2, Tensor):
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
else:
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)
```

It would definitely be a little slower in the case that it's differentiable but doesn't need a grad for beta2, but the code would also be a lot more clear and I'm debating speed vs future code usability.

Also the line in the above example:
```py
exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj().mul(1 - beta2))
```
was concerning to me because it is considerably more expensive than `value=1 - beta2`, but I couldn't think of a better way to do it.

Further work on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143726
Approved by: https://github.com/janeyx99
2024-12-30 01:11:57 +00:00
a7915c56f6 Propagate callable parameter types using ParamSpec (#142306) (#143797)
The codebase has a few locations where callable parameter type information is lost when the unpackings *args and **kwargs are typed as Any. Refactor these instances to retain type information using typing_extensions.ParamSpec.

Also, in these functions, enforce return type with TypeVar.
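A minimal, self-contained illustration of the pattern (the `log_calls` decorator is hypothetical, not code from this PR):

```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

# Without ParamSpec, typing *args/**kwargs as Any would lose the wrapped
# callable's parameter types for callers and type checkers.
def log_calls(fn: Callable[P, R]) -> Callable[P, R]:
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        print(f"calling {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

@log_calls
def scale(x: float, factor: float = 2.0) -> float:
    return x * factor

scale(3.0)  # type checkers still see (x: float, factor: float = 2.0) -> float
```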

Addresses #142306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143797
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2024-12-29 23:03:14 +00:00
79b354ee37 [inductor] Make generated kernels deterministic (#143951)
`"compile_id"` had slipped into our generated Triton code (in the
metadata), which will defeat caching because the same kernels generated
in a different order would not get cache hits with each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143951
Approved by: https://github.com/oulgen
2024-12-29 19:53:33 +00:00
b6bdb67f82 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-29 17:23:13 +00:00
7101b8ca35 remove allow-untyped-defs from onnx/_internal/_lazy_import.py (#143943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143943
Approved by: https://github.com/justinchuby
2024-12-29 10:29:43 +00:00
cf0b72c4ab remove allow-untyped-defs from _inductor/compile_worker/watchdog.py (#143941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143941
Approved by: https://github.com/Skylion007
2024-12-29 01:05:09 +00:00
3ba6fcd3ff remove allow-untyped-defs from torch/_size_docs.py (#143942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143942
Approved by: https://github.com/Skylion007
2024-12-29 01:00:46 +00:00
85f348578b [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#143929)
Differential Revision: D67682313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143929
Approved by: https://github.com/hl475
2024-12-28 23:39:21 +00:00
e1abbe155e remove allow-untyped-defs from ao/nn/qat/dynamic/modules/linear.py (#143919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143919
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-28 20:50:48 +00:00
3054aae493 [MPS] Fix fmin/fmax for scalar argument (#143934)
CPU scalar promotion to GPU is allowed for CUDA and should be allowed for MPS as well (at the very least it should not crash)

Fixes https://github.com/pytorch/pytorch/issues/143933 https://github.com/pytorch/pytorch/issues/142203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143934
Approved by: https://github.com/Skylion007
2024-12-28 17:07:19 +00:00
45a709d9ec Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit cbc4cf3043a7316c1f6e86b1e22d96042a59489c.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
8cccc46e33 Revert "Add AOT inductor support for _scaled_mm for CPU (#141961)"
This reverts commit 3fabd10c40c632104e420ae8e3721f33176e8640.

Reverted https://github.com/pytorch/pytorch/pull/141961 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
d8c3900d80 [Inductor] Implement primitive Metal compiler (#143893)
Still a work in progress; it only works for element-wise operations. The current implementation can turn something like
```python
def f(x):
  return x[:,::2].sin() + x[:, 1::2].cos()
```
into the following shader
```python
# Topologically Sorted Source Nodes: [sin, cos, add], Original ATen: [aten.sin, aten.cos, aten.add]
# Source node to ATen node mapping:
#   add => add
#   cos => cos
#   sin => sin
# Graph fragment:
#   %sin : [num_users=1] = call_function[target=torch.ops.aten.sin.default](args = (%slice_2,), kwargs = {})
#   %cos : [num_users=1] = call_function[target=torch.ops.aten.cos.default](args = (%slice_4,), kwargs = {})
#   %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%sin, %cos), kwargs = {})
mps_lib = torch.mps._compile_shader("""
    kernel void kernel_0(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        int x0 = xindex;
        auto tmp0 = in_ptr0[2*x0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp2 = in_ptr0[2*x0 + 1];
        auto tmp3 = metal::precise::cos(tmp2);
        auto tmp4 = tmp1 + tmp3;
        out_ptr0[x0] = static_cast<float>(tmp4);
    }
""")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143893
Approved by: https://github.com/jansel
ghstack dependencies: #143891, #143892
2024-12-28 06:58:32 +00:00
74028cfd0c [Inductor][CPP] Fix Data Type issue of frexp (#143746)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/143729. `frexp` has one input but two output tensors with different data types; the current `deduce_dtype_for_cpp_cse_variable` can't deduce the data type for each output correctly because the output index is missing. In this PR, we set the data type of the cse variable in the codegen of `frexp` and prevent it from being overridden later in the flow.
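For reference, this is the dtype mismatch in question (a small eager-mode illustration, not the Inductor codegen itself):

```python
import torch

x = torch.randn(4)
mantissa, exponent = torch.frexp(x)   # one input, two outputs
print(mantissa.dtype)                 # torch.float32
print(exponent.dtype)                 # torch.int32
# x reconstructs exactly as mantissa * 2**exponent
torch.testing.assert_close(mantissa * (2.0 ** exponent), x)
```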

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_frexp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143746
Approved by: https://github.com/jgong5
2024-12-28 06:00:13 +00:00
01980cac38 [dynamo] Make ConstDictKeySource a subclass of ChainedSource (#143924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143924
Approved by: https://github.com/jansel
2024-12-28 05:59:45 +00:00
3fabd10c40 Add AOT inductor support for _scaled_mm for CPU (#141961)
This PR is to add AOT inductor support for _scaled_mm for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961
Approved by: https://github.com/malfet
ghstack dependencies: #139975
2024-12-28 05:57:35 +00:00
cbc4cf3043 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2024-12-28 05:49:06 +00:00
d3e9133ab2 Fix separate in process bisector cache, cleanup on exit (#143661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143661
Approved by: https://github.com/ezyang
ghstack dependencies: #143657
2024-12-28 03:20:37 +00:00
1e246ef05b [CUDA][CUDA graphs][RNG] Skip replay prologue if wholegraph_increment is 0 (#143777)
#143572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143777
Approved by: https://github.com/ngimel, https://github.com/eellison
2024-12-28 02:31:26 +00:00
4a7cf0dbff [Inductor] Add MPS device op overrides (#143892)
Mostly dummy interface as MPS backend currently limited to a single device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143892
Approved by: https://github.com/jansel
ghstack dependencies: #143891
2024-12-28 02:11:45 +00:00
ad78edee8e Add support for list, tuple and dict in numeric debugger (#143882)
Summary:
Previously the numeric debugger only supported torch.Tensor; this PR adds support for list, tuple and dict as well

Test Plan:
python test/test_quantization.py -k test_extract_results_from_loggers_list_output

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D67660049](https://our.internmc.facebook.com/intern/diff/D67660049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143882
Approved by: https://github.com/dulinriley
2024-12-28 02:10:31 +00:00
c3c27aef34 [dynamo] Remove HFPretrained config hack (#143698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143698
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143888
2024-12-28 02:03:13 +00:00
7c343a9d68 Fix emulate low precision bool inp (#143657)
Fix for https://github.com/pytorch/pytorch/issues/143502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143657
Approved by: https://github.com/BoyuanFeng
2024-12-28 01:51:28 +00:00
88ccf2fa5e remove allow-untyped-defs from distributed/elastic/multiprocessing/subprocess_handler/handlers.py (#143917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143917
Approved by: https://github.com/Skylion007
2024-12-28 00:13:05 +00:00
e3fefdfbf0 [CUTLASS] fix addmm (#143537)
We would previously get a CUDA IMA (illegal memory access) because we pass Bias in for X, so we need to re-order the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143537
Approved by: https://github.com/chenyang78
ghstack dependencies: #143528
2024-12-27 23:47:55 +00:00
b54620f40f [CUTLASS] fix bugs: extra data_ptr() call, wrong size symbol name, bias symbol not added (#143528)
A few small things in this PR:
- fixed a bug where `workspace.data_ptr().data_ptr()` showed up
- for SM80 CUTLASS kernels, the symbol size for W.size(1) was never created
- for addmm kernels, the ldc bias symbol never showed up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143528
Approved by: https://github.com/henrylhtsang
2024-12-27 23:38:18 +00:00
c17d767686 remove allow-untyped-defs from _inductor/codegen/rocm/rocm_template_buffer.py (#143870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143870
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-12-27 23:28:51 +00:00
63d6e1f743 remove allow-untyped-defs from _inductor/codegen/aoti_hipify_utils.py (#143916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143916
Approved by: https://github.com/Skylion007
2024-12-27 23:25:37 +00:00
928e01545c restore 'unused' variable to fix test_cuda_device_memory_allocated (#143885)
This PR fix `test_cuda_multigpu.py::TestCudaMultiGPU::test_cuda_device_memory_allocated`
by restoring a deleted 'unused' variable from commit d8c8ba2440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143885
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-27 23:18:13 +00:00
0de661dc27 Add support for differentiable weight decay (#143679)
(Actual) second PR in a larger project to broaden support for differentiable optimizers with @janeyx99!

In this PR, I did a lot of pattern matching from the previous PR to add support for differentiable weight_decay.

And also added a single new line on line 359 (previously line 352) to make the code from the last PR a little easier to read

Continuation of progress on #141832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143679
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-27 23:14:43 +00:00
c0c7f881da remove allow-untyped-defs from distributed/pipelining/_unflatten.py (#143915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143915
Approved by: https://github.com/aorenste, https://github.com/Skylion007, https://github.com/malfet
2024-12-27 22:21:28 +00:00
af823bd526 remove allow-untyped-defs from utils/tensorboard/_convert_np.py (#143918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143918
Approved by: https://github.com/Skylion007
2024-12-27 22:19:33 +00:00
fe398de769 [EZ] Update sympy to 1.13.3 (#143908)
And remove python>=3.9 check as it currently covers all supported python versions

Fixes https://github.com/pytorch/pytorch/issues/143907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143908
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-12-27 21:32:55 +00:00
b5042cfa58 Revert "remove allow-untyped-defs from torch/ao/__init__.py (#143604)"
This reverts commit 1598d458797e69376a9a148bd37fb6e8167d22e3.

Reverted https://github.com/pytorch/pytorch/pull/143604 on behalf of https://github.com/wdvr due to failing typing checks in torchao ([comment](https://github.com/pytorch/pytorch/pull/143604#issuecomment-2564043233))
2024-12-27 21:30:02 +00:00
7a13bfa1ad [EZ] Update jinja2 to 3.1.5 (#143923)
To make Dependabot happy about https://cwe.mitre.org/data/definitions/150.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143923
Approved by: https://github.com/Skylion007
2024-12-27 21:10:21 +00:00
228b228449 Fix batch-specific attention mod for NJT + Flex (#143866)
Fixes #143788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143866
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-12-27 20:51:41 +00:00
1e65dec2b9 [Dynamo] Add MPSDevice interface (#143891)
It simply checks whether the device is available and whether or not it supports bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143891
Approved by: https://github.com/jansel
2024-12-27 20:31:44 +00:00
d2f769476f [Easy] add quotes to shell activation commands (#143902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143902
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-27 19:17:46 +00:00
a87cd5283b [dynamo] Trace through overridden __getattribute__ method (#143888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143888
Approved by: https://github.com/jansel
2024-12-27 18:10:00 +00:00
fda9048ca8 remove allow-untyped-defs from distributed/elastic/multiprocessing/errors/handlers.py (#143869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143869
Approved by: https://github.com/Skylion007
2024-12-27 15:49:19 +00:00
a20765a9c1 subgraph rewriter supports matched pattern with no users (#143842)
Fixes #143841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143842
Approved by: https://github.com/yushangdi
2024-12-27 12:45:39 +00:00
9e8d84f863 Fix duplicate pattern error (#139321)
vllm hit an error because we were incorrectly stating that two patterns are duplicates. See the comment inline:

For a particular generated pattern repr, store all the equivalent graphs that were used to generate it. Because we ignore certain patterns in searching, but not in matching, use the graph to distinguish whether two equivalent searches are actually different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139321
Approved by: https://github.com/shunting314
2024-12-27 11:10:46 +00:00
3571476739 Revert "fix randint distribution for large max (#143787)"
This reverts commit 8059d56ec364feb554f3fb90012a0fc2d2104e7f.

Reverted https://github.com/pytorch/pytorch/pull/143787 on behalf of https://github.com/wdvr due to failing internal tests, to be fixed first ([comment](https://github.com/pytorch/pytorch/pull/143787#issuecomment-2563493323))
2024-12-27 09:16:36 +00:00
f6801ba4b3 Revert "Use random64 in Fischer-Yates algorithm for large N (#143682)"
This reverts commit 7013be0094e8d3ded2ba2f948082f98d63e622bb.

Reverted https://github.com/pytorch/pytorch/pull/143682 on behalf of https://github.com/wdvr due to failing Meta internal tests that need to be updated ([comment](https://github.com/pytorch/pytorch/pull/143682#issuecomment-2563487675))
2024-12-27 09:09:33 +00:00
ba5cacbc17 [Codemod][AddExplicitStrictExportArg] caffe2/test (#143688)
Reviewed By: avikchaudhuri

Differential Revision: D67530154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143688
Approved by: https://github.com/tugsbayasgalan
2024-12-27 07:58:44 +00:00
969415885d [inductor][invoke_subgraph] Support None/int as input/output of invoke_subgraph (#139373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139373
Approved by: https://github.com/eellison
2024-12-27 06:46:09 +00:00
cyy
379bbef23c Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-27 05:46:57 +00:00
fca457b5db Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 3f80632c802f1d9fafd0c303d45ba2376b9c1e11.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))
2024-12-27 05:25:17 +00:00
0f474a960b [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-27 04:51:35 +00:00
e296bab614 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
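A small illustration of why `dict.keys(d)` is the safe call here (hypothetical subclass, not code from this PR):

```python
class MyDict(dict):
    def keys(self):
        # an overridden keys() that Dynamo previously needed a special guard for
        return ["not", "the", "real", "keys"]

d = MyDict(a=1, b=2)
print(list(d.keys()))      # ['not', 'the', 'real', 'keys'] -- goes through the override
print(list(dict.keys(d)))  # ['a', 'b'] -- bypasses the override, like PyDict_Next in C++
```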

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-27 04:51:35 +00:00
d60282c177 remove allow-untyped-defs from _inductor/codegen/cpu_device_op_overrides.py (#143881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143881
Approved by: https://github.com/aorenste
2024-12-27 04:10:47 +00:00
43853691bc [Quantization] add an option keep_original_weights in _lower_to_native_backend (#141049)
Differential Revision: D66153809

This diff adds an option, keep_original_weights, so we can trace back to the original weight and bias after performing prepare_fx and convert_fx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141049
Approved by: https://github.com/jerryzh168
2024-12-27 04:02:07 +00:00
809106a93f [fr][c10d] fix flaky test (#143878)
Summary:
The test erroneously assumed that input/output sizes are the same and that all
states are matchable.

Fixes issue #143798

Test Plan:
Test passes

Reviewers
Test passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143878
Approved by: https://github.com/fduwjj
ghstack dependencies: #143865
2024-12-27 03:13:15 +00:00
1cd70e7e23 [fr][c10d] log trace capture enabled or not in flight recorder (#143865)
Summary:
Refactor logging for flight recorder so we can log whether the capture was done
with or without stack trace capture enabled.
We introduce a new column ('trace_enabled') in the logger.

Test Plan:
Tested on local job and noted that correct output was produced.
Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865
Approved by: https://github.com/fduwjj
2024-12-27 03:07:55 +00:00
6bdf2addc5 [inductor] Simplify get_launch_args_* handling (#143835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143835
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #143813, #143814, #143815, #143817, #143818
2024-12-27 02:02:11 +00:00
138efb3002 [inductor] Move GPUTarget backwards compat to triton_compat.py (#143818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143818
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815, #143817
2024-12-27 02:02:11 +00:00
be1936804b [inductor] Drop support for pre-ASTSource Triton (#143817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143817
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814, #143815
2024-12-27 02:02:11 +00:00
f3d0f67039 [inductor] Minor refactor of hip compile_meta (#143815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143815
Approved by: https://github.com/eellison
ghstack dependencies: #143813, #143814
2024-12-27 02:02:11 +00:00
29841b9414 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143871
Approved by: https://github.com/Skylion007
2024-12-27 01:20:26 +00:00
373dba35f9 remove allow-untyped-defs from fx/experimental/refinement_types.py (#143868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143868
Approved by: https://github.com/Skylion007
2024-12-27 01:00:45 +00:00
c4bff71854 [Easy] Add ROCm support to nightly pull tool (#141282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141282
Approved by: https://github.com/malfet
ghstack dependencies: #143263
2024-12-27 00:07:38 +00:00
8059d56ec3 fix randint distribution for large max (#143787)
Fixes #ISSUE_NUMBER
Similar to #143682, for large maximum values we were sampling integers via `%`, which does not produce a uniform distribution. Here we limit the maximum skew to approximately 1% (random32 is used for max values `<= 2**32 / 128`).
This comes with a significant perf penalty, especially for CUDA, but it's a pretty bad bug, so we'll have to figure out what can be done to improve it.
`torch.compile` has always produced correct results for this, and its performance is also significantly better than current eager (eager is ~660 GB/s on H100, torch.compile 1200 GB/s), so we have to figure out why torch.compile is better.
`__launch_bounds__` slightly regresses perf, so perhaps we can figure out how to specify it better, but it's only 20-30 GB/s, so the big difference is still unexplained.
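A quick illustration of the modulo skew being fixed (plain Python, not the PyTorch kernel):

```python
import collections
import random

n = 3 * 2**30                        # large max, a sizable fraction of 2**32
counts = collections.Counter()
for _ in range(200_000):
    r = random.getrandbits(32)       # uniform 32-bit sample
    counts[(r % n) < 2**30] += 1     # residues below 2**30 are reachable from two ranges
# Roughly 50% of samples fall below 2**30, versus ~33% for a truly uniform
# draw over [0, n), i.e. small values are noticeably over-represented.
print(counts)
```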

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143787
Approved by: https://github.com/eqy
2024-12-26 23:54:03 +00:00
1598d45879 remove allow-untyped-defs from torch/ao/__init__.py (#143604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143604
Approved by: https://github.com/aorenste
2024-12-26 23:27:16 +00:00
3f80632c80 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
2024-12-26 22:22:42 +00:00
26364428f5 Revert "[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)"
This reverts commit fe95cbe018218d159ba0a0269045b31ff72f1a20.

Reverted https://github.com/pytorch/pytorch/pull/143722 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:36 +00:00
ee25daef5a Revert "[dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)"
This reverts commit 7d1c6661397f9bff93c1ea389506c8a163d7a2ab.

Reverted https://github.com/pytorch/pytorch/pull/143699 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:35 +00:00
2966fb3708 [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143775)
The resources directory lets ET observer dump any additional data like Triton kernels while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc.

We also added a few ways to enable ET and ET Resources through the OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.

Differential Revision: [D67610542](https://our.internmc.facebook.com/intern/diff/D67610542/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D67610542/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143775
Approved by: https://github.com/shengfukevin, https://github.com/wdvr
2024-12-26 21:15:39 +00:00
96e9a5aeec [CI] Disable sccache for xpu test (#143851)
WA for https://github.com/pytorch/pytorch/issues/143585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143851
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-26 19:45:04 +00:00
3df12d38cf dynamo tracing perf: cache cleaned_instructions: 33.7 -> 30.0 (#143070)
See #143056 for overall docs.

This PR: Cache the interesting/expensive bits of `cleaned_instructions()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143070
Approved by: https://github.com/jansel
2024-12-26 19:02:08 +00:00
51a7ecde80 [Easy] Bump CUDA nightly version to 11.8 / 12.4 / 12.6 in nightly pull tool (#143263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143263
Approved by: https://github.com/malfet
2024-12-26 19:01:38 +00:00
78502a58ba Enable FSDP2 on XPU device (#143737)
**Motivation:**  Enabling FSDP2 on XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143737
Approved by: https://github.com/awgu
2024-12-26 18:34:11 +00:00
475656fd9c Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 2293fe1024812d6349f6e2b3b7de82c6b73f11e4.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/malfet due to failing internal ROCM builds with error: ModuleNotFoundError: No module named hipify ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2562973920))
2024-12-26 17:32:23 +00:00
cc4e70b7c3 Revert "Use absolute path path.resolve() -> path.absolute() (#129409)"
This reverts commit 135c7db99d646b8bd9603bf969d47d3dec5987b1.

Reverted https://github.com/pytorch/pytorch/pull/129409 on behalf of https://github.com/malfet due to need to revert to as dependency of https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/129409#issuecomment-2562969825))
2024-12-26 17:26:06 +00:00
9255ffc841 Revert "Enable more C++ warnings (#143355)"
This reverts commit daa3ffe0ebff58577b8db964447b6abc6de53a25.

Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet due to It fails internal build system as it kind of breaks separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))
2024-12-26 17:13:10 +00:00
cf76c05b4d [inductor] Refactor conditional triton imports into triton_compat.py (#143814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143814
Approved by: https://github.com/Skylion007
ghstack dependencies: #143813
2024-12-26 09:14:06 +00:00
efac5ed81b [inductor] Reorder imports in codecache.py (#143813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143813
Approved by: https://github.com/Skylion007
2024-12-26 09:14:06 +00:00
bf8da4c145 Bump jinja2 from 3.1.4 to 3.1.5 in /.ci/docker (#143844)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.4 to 3.1.5.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/releases">jinja2's releases</a>.</em></p>
<blockquote>
<h2>3.1.5</h2>
<p>This is the Jinja 3.1.5 security fix release, which fixes security issues and bugs but does not otherwise change behavior and should not result in breaking changes compared to the latest feature release.</p>
<p>PyPI: <a href="https://pypi.org/project/Jinja2/3.1.5/">https://pypi.org/project/Jinja2/3.1.5/</a>
Changes: <a href="https://jinja.palletsprojects.com/changes/#version-3-1-5">https://jinja.palletsprojects.com/changes/#version-3-1-5</a>
Milestone: <a href="https://github.com/pallets/jinja/milestone/16?closed=1">https://github.com/pallets/jinja/milestone/16?closed=1</a></p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as by passing a stored reference to a filter that calls its argument. <a href="https://github.com/pallets/jinja/security/advisories/GHSA-q2x7-8rv6-6q7h">GHSA-q2x7-8rv6-6q7h</a></li>
<li>Escape template name before formatting it into error messages, to avoid issues with names that contain f-string syntax. <a href="https://redirect.github.com/pallets/jinja/issues/1792">#1792</a>, <a href="https://github.com/pallets/jinja/security/advisories/GHSA-gmj6-6f8f-6699">GHSA-gmj6-6f8f-6699</a></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence types. <a href="https://redirect.github.com/pallets/jinja/issues/2032">#2032</a></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1952">#1952</a></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in <code>Template.generate_async</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends. <a href="https://redirect.github.com/pallets/jinja/issues/1960">#1960</a></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment when calling block references. <a href="https://redirect.github.com/pallets/jinja/issues/1701">#1701</a></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another async-aware filter. <a href="https://redirect.github.com/pallets/jinja/issues/1781">#1781</a></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation. <a href="https://redirect.github.com/pallets/jinja/issues/1921">#1921</a></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code> call. <a href="https://redirect.github.com/pallets/jinja/issues/2021">#2021</a></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code> objects. <a href="https://redirect.github.com/pallets/jinja/issues/2025">#2025</a></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object. <a href="https://redirect.github.com/pallets/jinja/issues/2027">#2027</a></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. <a href="https://redirect.github.com/pallets/jinja/issues/2061">#2061</a></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were searched. <a href="https://redirect.github.com/pallets/jinja/issues/1661">#1661</a></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not contain the templates directory. <a href="https://redirect.github.com/pallets/jinja/issues/1705">#1705</a></li>
<li>Improve annotations for methods returning copies. <a href="https://redirect.github.com/pallets/jinja/issues/1880">#1880</a></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. <a href="https://redirect.github.com/pallets/jinja/issues/1870">#1870</a></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. <a href="https://redirect.github.com/pallets/jinja/issues/1624">#1624</a></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the target is a namespace attribute. <a href="https://redirect.github.com/pallets/jinja/issues/1413">#1413</a></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks does not cause the variable to be considered initially undefined. <a href="https://redirect.github.com/pallets/jinja/issues/1253">#1253</a></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pallets/jinja/blob/main/CHANGES.rst">jinja2's changelog</a>.</em></p>
<blockquote>
<h2>Version 3.1.5</h2>
<p>Released 2024-12-21</p>
<ul>
<li>The sandboxed environment handles indirect calls to <code>str.format</code>, such as
by passing a stored reference to a filter that calls its argument.
:ghsa:<code>q2x7-8rv6-6q7h</code></li>
<li>Escape template name before formatting it into error messages, to avoid
issues with names that contain f-string syntax.
:issue:<code>1792</code>, :ghsa:<code>gmj6-6f8f-6699</code></li>
<li>Sandbox does not allow <code>clear</code> and <code>pop</code> on known mutable sequence
types. :issue:<code>2032</code></li>
<li>Calling sync <code>render</code> for an async template uses <code>asyncio.run</code>.
:pr:<code>1952</code></li>
<li>Avoid unclosed <code>auto_aiter</code> warnings. :pr:<code>1960</code></li>
<li>Return an <code>aclose</code>-able <code>AsyncGenerator</code> from
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving <code>root_render_func()</code> unclosed in
<code>Template.generate_async</code>. :pr:<code>1960</code></li>
<li>Avoid leaving async generators unclosed in blocks, includes and extends.
:pr:<code>1960</code></li>
<li>The runtime uses the correct <code>concat</code> function for the current environment
when calling block references. :issue:<code>1701</code></li>
<li>Make <code>|unique</code> async-aware, allowing it to be used after another
async-aware filter. :issue:<code>1781</code></li>
<li><code>|int</code> filter handles <code>OverflowError</code> from scientific notation.
:issue:<code>1921</code></li>
<li>Make compiling deterministic for tuple unpacking in a <code>{% set ... %}</code>
call. :issue:<code>2021</code></li>
<li>Fix dunder protocol (<code>copy</code>/<code>pickle</code>/etc) interaction with <code>Undefined</code>
objects. :issue:<code>2025</code></li>
<li>Fix <code>copy</code>/<code>pickle</code> support for the internal <code>missing</code> object.
:issue:<code>2027</code></li>
<li><code>Environment.overlay(enable_async)</code> is applied correctly. :pr:<code>2061</code></li>
<li>The error message from <code>FileSystemLoader</code> includes the paths that were
searched. :issue:<code>1661</code></li>
<li><code>PackageLoader</code> shows a clearer error message when the package does not
contain the templates directory. :issue:<code>1705</code></li>
<li>Improve annotations for methods returning copies. :pr:<code>1880</code></li>
<li><code>urlize</code> does not add <code>mailto:</code> to values like <code>@a@b</code>. :pr:<code>1870</code></li>
<li>Tests decorated with <code>@pass_context</code> can be used with the <code>|select</code> filter. :issue:<code>1624</code></li>
<li>Using <code>set</code> for multiple assignment (<code>a, b = 1, 2</code>) does not fail when the
target is a namespace attribute. :issue:<code>1413</code></li>
<li>Using <code>set</code> in all branches of <code>{% if %}{% elif %}{% else %}</code> blocks
does not cause the variable to be considered initially undefined.
:issue:<code>1253</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="877f6e51be"><code>877f6e5</code></a> release version 3.1.5</li>
<li><a href="8d58859265"><code>8d58859</code></a> remove test pypi</li>
<li><a href="eda8fe86fd"><code>eda8fe8</code></a> update dev dependencies</li>
<li><a href="c8fdce1e03"><code>c8fdce1</code></a> Fix bug involving calling set on a template parameter within all branches of ...</li>
<li><a href="66587ce989"><code>66587ce</code></a> Fix bug where set would sometimes fail within if</li>
<li><a href="fbc3a696c7"><code>fbc3a69</code></a> Add support for namespaces in tuple parsing (<a href="https://redirect.github.com/pallets/jinja/issues/1664">#1664</a>)</li>
<li><a href="b8f4831d41"><code>b8f4831</code></a> more comments about nsref assignment</li>
<li><a href="ee832194cd"><code>ee83219</code></a> Add support for namespaces in tuple assignment</li>
<li><a href="1d55cddbb2"><code>1d55cdd</code></a> Triple quotes in docs (<a href="https://redirect.github.com/pallets/jinja/issues/2064">#2064</a>)</li>
<li><a href="8a8eafc6b9"><code>8a8eafc</code></a> edit block assignment section</li>
<li>Additional commits viewable in <a href="https://github.com/pallets/jinja/compare/3.1.4...3.1.5">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=jinja2&package-manager=pip&previous-version=3.1.4&new-version=3.1.5)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143844
Approved by: https://github.com/Skylion007

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-26 05:20:06 +00:00
cyy
e05bfb8ee3 [Submodule] Bump libfmt to 11.1.0 (#143843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143843
Approved by: https://github.com/Skylion007
2024-12-26 04:49:11 +00:00
4bacfd6e11 Sort requirements.txt (#143778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143778
Approved by: https://github.com/albanD
2024-12-26 00:51:52 +00:00
cyy
f42cff4e29 [17/N] Fix extra warnings brought by clang-tidy-17 (#143804)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143804
Approved by: https://github.com/Skylion007
2024-12-25 19:54:42 +00:00
a8ac3a6b20 [inductor] fix the adaptive_avg_pool on processing int64 (#143802)
Fixes #143801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143802
Approved by: https://github.com/jansel
2024-12-25 09:08:43 +00:00
c0d710634f Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-25 02:37:11 +00:00
7013be0094 Use random64 in Fisher-Yates algorithm for large N (#143682)
Fixes bug in randperm https://nbsanity.com/static/a4774194938414dedcec7d6e99727d31/Shuffling_20in_20torch_20vs_20numpy-public.html
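For illustration, a minimal Python sketch of a Fisher-Yates shuffle whose swap index comes from a caller-supplied random source (the `rand64` parameter is a hypothetical helper; the actual fix widens the random draw inside the CUDA `randperm` kernel, not in Python):

```python
import random

def fisher_yates(a, rand64):
    # rand64(k) is assumed to return a uniform int in [0, k); if the underlying draw
    # only has 32 bits of randomness, shuffles of very large arrays become biased,
    # which is why the kernel switches to a 64-bit draw for large N.
    for i in range(len(a) - 1, 0, -1):
        j = rand64(i + 1)
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates(list(range(10)), random.randrange))
```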

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143682
Approved by: https://github.com/eqy, https://github.com/albanD
2024-12-25 01:19:19 +00:00
27b0d41f0a [ROCm] Add miopen_batch_norm to meta_registrations to fix AOTI issue (#143569)
Currently the upstream example for AOTI usage breaks on ROCm (https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html)

```
File "/root/upstream/torch/_dynamo/exc.py", line 317, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: unsupported operator: aten.miopen_batch_norm.default (see https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit#heading=h.64r4npvq0w0 for how to fix)

from user code:
   File "/root/vision/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/root/vision/torchvision/models/resnet.py", line 269, in _forward_impl
    x = self.bn1(x)
```

This PR adds a meta_registration for miopen_batch_norm to resolve this issue
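Roughly, such a registration (in `torch/_meta_registrations.py`) only propagates shapes and dtypes so fake-tensor tracing can proceed without running the MIOpen kernel. A hedged sketch with an approximate argument list and illustrative output shapes (not copied from the PR):

```python
import torch
from torch._meta_registrations import register_meta

aten = torch.ops.aten

@register_meta(aten.miopen_batch_norm.default)
def meta_miopen_batch_norm(input, weight, bias, running_mean, running_var,
                           training, exponential_average_factor, epsilon):
    # Output matches the input's metadata; the two extra outputs hold per-channel
    # statistics, sketched here as weight-shaped tensors.
    return (
        input.new_empty(input.shape),
        weight.new_empty(weight.shape),
        weight.new_empty(weight.shape),
    )
```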

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143569
Approved by: https://github.com/jeffdaily
2024-12-24 23:43:11 +00:00
9035fb5a7b [dynamo] Add types to exc.py (#143626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143626
Approved by: https://github.com/yanboliang
ghstack dependencies: #143552, #143610
2024-12-24 21:48:32 +00:00
3e7f9e2cc4 [inductor] Shorten tracebacks for errors inside inductor (by skipping AOTAutograd frames) (#143610)
Before #143552
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1381, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1165, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 547, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 987, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 715, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 750, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 231, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 662, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1101, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1382, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1432, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1483, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1462, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

Before this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1484, in _call_user_compiler
    raise BackendCompilerFailed(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

After this PR
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1138, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1053, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1912, in codegen
    self.scheduler = Scheduler(self.operations)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1880, in __init__
    self._init(nodes)
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 1955, in _init
    self.nodes = self.fuse_nodes(self.nodes)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2461, in fuse_nodes
    nodes = self.fuse_nodes_once(nodes)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 2773, in fuse_nodes_once
    assert False, "a fake error during fusion"
           ^^^^^
torch._inductor.exc.InductorError: AssertionError: a fake error during fusion

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```

A large number of frames are removed between:
```py
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143610
Approved by: https://github.com/eellison
ghstack dependencies: #143552
2024-12-24 21:48:32 +00:00
9e5f3fdfc7 [dynamo] Shorten tracebacks for backend compiler errors (#143552)
Fixes #143406

After this PR the error for missing Triton is:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Setting `TORCHDYNAMO_VERBOSE=1` yields something like the old error:
```py
Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 51, in <module>
    fp32_compiled = optimized_model(low_input)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/eval_frame.py", line 576, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1383, in __call__
    return self._torchdynamo_orig_callable(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 1167, in __call__
    result = self._inner_convert(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 548, in __call__
    return _compile(
           ^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 988, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 716, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_utils_internal.py", line 95, in wrapper_function
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 751, in _compile_inner
    out_code = transform_code_object(code, transform)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/bytecode_transformation.py", line 1361, in transform_code_object
    transformations(instructions, code_options)
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 232, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/convert_frame.py", line 663, in transform
    tracer.run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 2870, in run
    super().run()
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 1053, in run
    while self.step():
          ^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 963, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3050, in RETURN_VALUE
    self._return(inst)
  File "/home/jansel/pytorch/torch/_dynamo/symbolic_convert.py", line 3035, in _return
    self.output.compile_subgraph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1102, in compile_subgraph
    self.compile_and_call_fx_graph(
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1383, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1433, in call_user_compiler
    return self._call_user_compiler(gm)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/output_graph.py", line 1463, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/__init__.py", line 2314, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1880, in compile_fx
    return aot_autograd(
           ^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/backends/common.py", line 83, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1145, in aot_module_simplified
    compiled_fn = AOTAutogradCache.load(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/autograd_cache.py", line 754, in load
    compiled_fn = dispatch_and_compile()
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 1131, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 580, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 830, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
                               ^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 676, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_functorch/aot_autograd.py", line 489, in __call__
    return self.compiler_fn(gm, example_inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1758, in fw_compiler_base
    return inner_compile(
           ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 572, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_dynamo/repro/after_aot.py", line 102, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 686, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1129, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/compile_fx.py", line 1044, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1975, in compile_to_module
    return self._compile_to_module()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1981, in _compile_to_module
    self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
                                                             ^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/graph.py", line 1916, in codegen
    self.scheduler.codegen()
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3667, in codegen
    return self._codegen()
           ^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3761, in _codegen
    if device is not None and self.get_backend(device).ready_to_flush():
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3631, in get_backend
    self.backends[device] = self.create_backend(device)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_inductor/scheduler.py", line 3624, in create_backend
    raise TritonMissing(inspect.currentframe())
torch._dynamo.exc.TritonMissing: Cannot find a working triton installation. Either the package is not installed or it is too old. More information on installing Triton can be found at: https://github.com/triton-lang/triton

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

This PR also strips dynamo stack frames from other types of backend compile errors.
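A standalone sketch of the general frame-stripping technique (the `strip_frames` helper and the `keep` predicate are made up for illustration; dynamo's actual implementation lives in `remove_dynamo_frames` and related machinery):

```python
import types

def strip_frames(exc: BaseException, keep) -> BaseException:
    """Rebuild exc.__traceback__, keeping only frames whose filename passes `keep`."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frames.append(tb)
        tb = tb.tb_next
    new_tb = None
    for tb in reversed(frames):  # rebuild the chain from the innermost frame outward
        if keep(tb.tb_frame.f_code.co_filename):
            new_tb = types.TracebackType(new_tb, tb.tb_frame, tb.tb_lasti, tb.tb_lineno)
    return exc.with_traceback(new_tb)

try:
    try:
        raise AssertionError("a fake error during fusion")
    except AssertionError as e:
        # drop frames coming from files we consider "internal"
        raise strip_frames(e, lambda f: "_dynamo" not in f) from None
except AssertionError as shortened:
    pass  # the re-raised exception now carries a shortened traceback
```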

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143552
Approved by: https://github.com/yanboliang
2024-12-24 21:48:23 +00:00
844e6108f6 Revert "[Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)"
This reverts commit ad750ae32079020f51f9b7d01237f3ecfa83b6ff.

Reverted https://github.com/pytorch/pytorch/pull/143266 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/143266#issuecomment-2561303786))
2024-12-24 17:22:57 +00:00
6c32ef4c5b Remove builder repo from workflows and scripts (#143776)
Part of https://github.com/pytorch/builder/issues/2054
The builder repo is no longer used; hence, remove any references to it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143776
Approved by: https://github.com/huydhn
2024-12-24 14:11:51 +00:00
aec3b46274 [DTensor] Add aten.amin/amax to linear_reduction_strategy (#143747)
In the same vein as https://github.com/pytorch/pytorch/pull/134206, these two ops still seemed missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143747
Approved by: https://github.com/kwen2501
2024-12-24 13:36:40 +00:00
b77406a9ec [BE][CI] bump ruff to 0.8.4 (#143753)
Changes:

1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-string
3. Change arguments with the `__`-prefix to positional-only arguments with the `/` separator in function signatures (see the sketch after this list).
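A before/after illustration of items 2 and 3 above (toy functions, not taken from the PR):

```python
name = "inductor"
msg_old = "compiling %s" % name   # before: %-formatting
msg_new = f"compiling {name}"     # after: f-string

def clamp_old(__x, lo=0):         # before: `__`-prefix hints at a positional-only argument
    return max(__x, lo)

def clamp_new(x, /, lo=0):        # after: explicit positional-only marker
    return max(x, lo)

assert msg_old == msg_new and clamp_old(-1) == clamp_new(-1)
```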

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
2024-12-24 12:24:10 +00:00
dbbc81cb34 Enabled force_shape_pad for test_pad_mm and test_slice_mm_bandwidth_computation (#141768)
Some tests fail for ROCm build on navi arch because of this check: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L211)

There is no need to determine if mm is compute bound for most of the padding tests since they don't specifically test compute bound behavior. We don't have enough empirical data to fine tune this check for AMD gpus yet. I propose to force the shape padding for the tests that we had trouble with to avoid this unnecessary logic path.

Please correct me if I didn't add other tests that can potentially fail with this issue or if I added a test that is dependent on logic below the `force_shape_pad` check here: f83361b274/torch/_inductor/fx_passes/pad_mm.py (L444)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141768
Approved by: https://github.com/jeffdaily
2024-12-24 11:03:39 +00:00
783065637e Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-24 10:00:23 +00:00
060ee14753 [inductor] Make adaptive_max_pool2d error on int64 (#143762)
Fixes #143752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143762
Approved by: https://github.com/yanboliang
2024-12-24 08:33:59 +00:00
135c7db99d Use absolute path path.resolve() -> path.absolute() (#129409)
Changes:

1. Always explicit `.absolute()`: `Path(__file__)` -> `Path(__file__).absolute()`
2. Replace `path.resolve()` with `path.absolute()` if the code is resolving the PyTorch repo root directory (illustrated in the sketch after this list).
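The practical difference between the two calls (paths are illustrative):

```python
from pathlib import Path

p = Path("subdir/../setup.py")
print(p.absolute())  # prefixes the current working directory; keeps ".." and symlinks as-is
print(p.resolve())   # also collapses ".." and follows symlinks, which can point outside a symlinked checkout
```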

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129409
Approved by: https://github.com/albanD
2024-12-24 08:33:08 +00:00
362ecad9bb [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769)
* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future: see next point)
* Continue to use `linux.rocm.gpu` label for any job that doesn't need more than 1-GPU eg. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
2024-12-24 08:04:00 +00:00
1963fc83a1 [micro_pipeline_tp] don't pass return_A to fused_all_gather_scaled_matmul (#143782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143782
Approved by: https://github.com/tianyu-l
2024-12-24 07:25:38 +00:00
ad750ae320 [Inductor XPU] Support max-autotune on XPU and reuse the corresponding Inductor UT. (#143266)
This PR adds functional support for max-autotune on XPU. The current Triton templates and configurations are not well optimized for XPU, so the performance is not ready yet. Also, the `mm_plus_mm` template has accuracy issues in some cases. We will address these issues in the next PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143266
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-24 05:42:36 +00:00
b0c3f48a40 [inductor] Improve error message for assert_size_stride (#143765)
```
>>> torch._C._dynamo.guards.assert_size_stride(torch.randn(10), (10,), (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: expected size 10==10, stride 1==2 at dim=0
This error most often comes from an incorrect meta function for a custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143765
Approved by: https://github.com/zou3519
2024-12-24 05:26:05 +00:00
ace645a017 Add support for prototype affine quantization in pt2e flow (#141421)
Summary:
Duplicated affine quantization functionality, including the observer (https://github.com/pytorch/ao/blob/main/torchao/quantization/observer.py) and some quant_primitive ops (7c3c51fd0d/torchao/quantization/quant_primitives.py (L26-L30)), to allow for a per-group quantization min/max observer in the pt2e flow.

Next: We can follow up to add moving average min max observer

Test Plan:
python test/test_quantization.py -k test_channel_group_quantization

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141421
Approved by: https://github.com/cccclai
2024-12-24 04:22:18 +00:00
60a0d53c13 [dynamo] Add test for #143697 (#143764)
The issue from #143697 seems to already be fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143764
Approved by: https://github.com/Skylion007
2024-12-24 03:50:15 +00:00
01d60bcf32 [Easy] Fix todo by enable tests for cuda (#143637)
Fix TODO in `test_tensor_creation_ops.py` file:

```python
# TODO: update to work on CUDA, too
```

**Test Result**

```bash
$ pytest test/test_tensor_creation_ops.py
```

![image](https://github.com/user-attachments/assets/ef829541-668e-446d-a9ab-b26b9d73085f)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d6a46eee-1f60-48e6-898a-a8d9620eb54a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143637
Approved by: https://github.com/albanD
2024-12-24 03:47:43 +00:00
b90a3b7281 [cumsum][CUDA][64-bit indexing] Add 64-bit indexing path for cumsum (#143696)
For #143486

Interestingly enough, changing the indexing type seems to degrade performance when a larger width is not needed, even on small sizes, so this is made a template parameter rather than forcing all cases to 64-bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-24 03:45:28 +00:00
dec4286b2d [inductor] Fix for extract_target with dots (#143766)
Fixes #143650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143766
Approved by: https://github.com/yanboliang
2024-12-24 03:42:15 +00:00
cyy
1feae27ed6 [16/N] Fix extra warnings brought by clang-tidy-17 (#143714)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 03:29:38 +00:00
49fdc52fd2 Revert "Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)"
This reverts commit bc78b6ea4f88d673426d6de17671b82facf50beb.

Reverted https://github.com/pytorch/pytorch/pull/143261 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint, plz help fix and reland this ([comment](https://github.com/pytorch/pytorch/pull/143261#issuecomment-2560583332))
2024-12-24 03:15:38 +00:00
cyy
d6a066ead6 Simplify host_softmax (#143251)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143251
Approved by: https://github.com/albanD
2024-12-24 02:27:51 +00:00
da21fabf34 [BE] Only print MKL version on x86 platforms (#143763)
As it will obviously be missing on ARM/S390, etc

Test plan: run `python3 -c "import torch;print(torch.__config__.parallel_info())"` on both x86 and non-x86 system
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143763
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 02:04:26 +00:00
7d1c666139 [dynamo] Remove dead code after introducing UserDefinedDictVariable (#143699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143699
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #143722
2024-12-24 02:00:18 +00:00
fe95cbe018 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hindsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call the overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
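A small illustration of the distinction (toy subclass, not dynamo code):

```python
class LoudDict(dict):
    def keys(self):
        print("overridden keys() called")
        return super().keys()

d = LoudDict(a=1, b=2)

list(d.keys())      # goes through the Python-level override
list(dict.keys(d))  # calls the unbound dict method, bypassing the override entirely
```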

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-24 02:00:18 +00:00
67355a1289 [Easy] Add torch.range, torch.arange params optional description (#143731)
Fixes #129333

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/c5873690-7de7-4a14-9423-a150d17d137e)

![image](https://github.com/user-attachments/assets/ff4ee545-f27a-403b-bf92-51f9571022a3)

**After**

![image](https://github.com/user-attachments/assets/34e2c41f-8b54-417d-bb10-7ca6f679206a)

![image](https://github.com/user-attachments/assets/b54bcebd-70e9-4a1a-8a22-1ab815e17827)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143731
Approved by: https://github.com/janeyx99
2024-12-24 01:29:24 +00:00
0ca6a47872 Update tag_regex in filter_test_configs.py for workflows such as inductor-rocm (#143768)
This helps to make `continue-through-error`/`keep-going` work as expected on `inductor-rocm` workflow jobs.

Without this, the code here doesn't enter the `if` condition: 6ccb8ed186/.github/scripts/filter_test_configs.py (L577)

Tested via [this PR](https://github.com/pytorch/pytorch/pull/140989):
Without this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=8232e18957f987d99c946efc0cf6da9be9b52067: https://github.com/pytorch/pytorch/actions/runs/12164558045/job/34192442187#step:13:144

With this change: https://hud.pytorch.org/pytorch/pytorch/pull/140989?sha=763179c5e421791ee05c8e2a600379b29a1c8c33: https://github.com/pytorch/pytorch/actions/runs/12261943684/job/34213300153#step:13:145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143768
Approved by: https://github.com/huydhn
2024-12-24 00:50:14 +00:00
bc78b6ea4f Add a warning when a tensor with requires_grad=True is converted to a scalar (#143261)
Fixes #143071

Operations performed on tensors with `requires_grad=True` such as
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.

While an operation using `numpy` like
```python
import numpy as np

x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.

However, an operation that uses `math` like
```python
import math

x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!

This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.

To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```

Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
2024-12-24 00:22:18 +00:00
6ccb8ed186 Refactor AdamW into Adam (heavily inspired by tfsingh) (#143710)
Fixes #104899

Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test asserting that the added parameter `decoupled_weight_decay` is True in AdamW, and updates test_defaults_changed_to_foreach to account for the difference in module location for AdamW.
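A simplified sketch of the subclassing shape described here (toy classes; the real optimizers live in `torch.optim` and take many more arguments):

```python
class Adam:
    def __init__(self, params, lr=1e-3, weight_decay=0.0, decoupled_weight_decay=False):
        self.params = list(params)
        self.lr = lr
        self.weight_decay = weight_decay
        self.decoupled_weight_decay = decoupled_weight_decay

class AdamW(Adam):
    def __init__(self, params, lr=1e-3, weight_decay=1e-2):
        # AdamW is Adam with weight decay decoupled from the gradient-based update
        super().__init__(params, lr=lr, weight_decay=weight_decay,
                         decoupled_weight_decay=True)

assert AdamW([0.0]).decoupled_weight_decay is True  # mirrors the added test's assertion
```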

Heavily heavily inspired by #118857 by @tfsingh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
2024-12-23 23:27:28 +00:00
4271a95590 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-23 23:10:11 +00:00
2ab698e708 allow profiling on all threads via experimentalConfig (#143659)
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
2024-12-23 20:41:27 +00:00
00831f9b22 [BE]: Properly forward raise pickle exception with from (#143761)
Properly re-raises the pickle exception with `from`. This provides a more informative stack trace and forwards information about the exception that led to the current one.
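The pattern being applied, in standalone form (the wrapper below is made up for illustration; it is not the code changed in this PR):

```python
import pickle

def safe_loads(buf: bytes):
    try:
        return pickle.loads(buf)
    except Exception as e:
        # `from e` records the original error in __cause__, so the traceback shows
        # both exceptions instead of silently dropping the root cause
        raise RuntimeError("failed to deserialize payload") from e
```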
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143761
Approved by: https://github.com/XuehaiPan, https://github.com/albanD
2024-12-23 20:21:30 +00:00
75e1f8a227 [ROCm] upgrade nightly wheels to rocm6.3 - 2 of 2 (binaries) (#143613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143613
Approved by: https://github.com/jeffdaily
2024-12-23 19:47:30 +00:00
0ebc6388cf Revert "Exclude py 31.3t triton package from PyTorch 3.13t wheel (#143218)"
This reverts commit 3bfdf6f0633e6feb067e032009256c740a2a2665.

Reverted https://github.com/pytorch/pytorch/pull/143218 on behalf of https://github.com/atalman due to this constrain is ignored see https://github.com/pytorch/pytorch/issues/143654 ([comment](https://github.com/pytorch/pytorch/pull/143218#issuecomment-2560208992))
2024-12-23 19:37:35 +00:00
727ee853b4 Apply TorchFix TOR203 fixes (#143691)
Codemodded via `torchfix . --select=TOR203 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143691
Approved by: https://github.com/malfet
2024-12-23 18:21:03 +00:00
c042c8a475 Use default_collate from public API (#143616)
Codemodded via `torchfix . --select=TOR104 --fix`.
This is a step to unblock https://github.com/pytorch/pytorch/pull/141076
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143616
Approved by: https://github.com/malfet
2024-12-23 17:38:43 +00:00
a70191da41 Add torch.topk indices vary description (#143736)
Fixes #133542

**Test Result**

**Before**

![image](https://github.com/user-attachments/assets/65227efb-02af-45e7-804c-35588dff360d)

**After**

![image](https://github.com/user-attachments/assets/91f1f53f-008c-4784-82fe-013404e273cb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143736
Approved by: https://github.com/zou3519
2024-12-23 17:16:31 +00:00
1519a9e30b Revert "Add FP8 support for eye (#139974)"
This reverts commit 01890526b9068ae20b38b2a33e8f11a6331d7d4b.

Reverted https://github.com/pytorch/pytorch/pull/139974 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some slow tests ([comment](https://github.com/pytorch/pytorch/pull/139974#issuecomment-2560046399))
2024-12-23 17:12:39 +00:00
12662901aa [BE] Move Mac BB test to its own step (#143513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143513
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere
ghstack dependencies: #143395, #143511, #143512
2024-12-23 14:05:10 +00:00
5c4545f857 [BE][Easy] enable PYFMT for torch/[a-s]*/ (#138447)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138447
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138447
Approved by: https://github.com/ezyang
2024-12-23 14:04:00 +00:00
7314cf44ae torch/accelerator: fix device type comparison (#143541)
This was failing without the fix:
```
python -c 'import torch; d=torch.device("xpu:0"); torch.accelerator.current_stream(d)'
```
with:
```
ValueError: xpu doesn't match the current accelerator xpu.
```

CC: @guangyey, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143541
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-23 10:54:53 +00:00
434e0c2104 Inductor Cutlass backend: Eliminate unused code. (#143723)
Summary: Eliminates an unused file and some smaller unused code fragments from the inductor cutlass codebase.

Test Plan: CI

Differential Revision: D67579837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143723
Approved by: https://github.com/ColinPeppler
2024-12-23 09:35:03 +00:00
01890526b9 Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-23 06:47:49 +00:00
448c16ac87 Revert "[reland][AMD] Turn on TF32 for aten::mm (#143549)"
This reverts commit 41cdc7f73552cc8a0dbf2d3cb55440c0d6b548ea.

Reverted https://github.com/pytorch/pytorch/pull/143549 on behalf of https://github.com/malfet due to It breaks ROCM testing, see 06b4b96b34/1 ([comment](https://github.com/pytorch/pytorch/pull/143549#issuecomment-2559016960))
2024-12-23 06:47:36 +00:00
06b4b96b34 dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069)
See #143056 for overall docs.

This PR: Avoid use of python re and move valid varname check in
`GuardBuilder.arg_ref()` into C++
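For reference, one way to express such a validity check in pure Python without `re` (illustrative only; the PR performs an equivalent test in C++):

```python
import keyword

def is_valid_varname(name: str) -> bool:
    return name.isidentifier() and not keyword.iskeyword(name)

assert is_valid_varname("L_x_")
assert not is_valid_varname("1bad")
```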

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069
Approved by: https://github.com/jansel
2024-12-23 05:32:09 +00:00
07fa6e2c8b Fix torch.accelerator api abort when passing invaild device (#143550)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543

# Solution
We should raise a Python exception instead of aborting.

# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
2024-12-23 03:44:22 +00:00
eebc93d41e Better fix for f-strings in set_linter for py3.12 (#143725)
#143628 didn't handle a few cases right, for example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/scheduler.py
torch/_inductor/scheduler.py:261:24: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                               ^
  262 |
  263 |     def log_details(self) -> None:

torch/_inductor/scheduler.py:261:33: Builtin `set` is deprecated
  259 |                 multiline=False,
  260 |             )
  261 |         return f"{self}{data_str}"
                                        ^
  262 |
  263 |     def log_details(self) -> None:
```
It also mishandled multi-line f-strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143725
Approved by: https://github.com/yanboliang
2024-12-22 22:51:27 +00:00
41cdc7f735 [reland][AMD] Turn on TF32 for aten::mm (#143549)
Summary:
hipBLASLt supports TF32, so this adds support for it.

Original PR https://github.com/pytorch/pytorch/pull/139869

Test Plan: CI

Differential Revision: D67431681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143549
Approved by: https://github.com/eqy
2024-12-22 21:05:05 +00:00
6425f0779d [BE] Update triton repo link (#143429)
It should be https://github.com/triton-lang/triton rather than https://github.com/openai/triton, shouldn't it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143429
Approved by: https://github.com/jansel
2024-12-22 18:38:35 +00:00
a316a4581d Add mps to GPU_TYPES (#143634)
Because it is a GPU, but don't require Triton for it, as it does not need one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143634
Approved by: https://github.com/jansel
2024-12-22 18:37:35 +00:00
cyy
09c950cc87 Remove unused <ATen/core/Array.h> inclusion (#143701)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143701
Approved by: https://github.com/albanD
2024-12-22 14:30:11 +00:00
dc55704b48 Rename cache limit to recompile limit in configs (#143709)
This PR renames every cache_limit to recompile_limit via sed.

Old config options are maintained via Config(alias='xyz')
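
A usage sketch of what the alias buys you; the old attribute name below (`cache_size_limit`) is an assumption about the prior naming, so treat both attribute paths as illustrative:

```python
import torch._dynamo.config as dynamo_config

# Set the limit under its new, canonical name.
dynamo_config.recompile_limit = 16

# The legacy name is kept as an alias, so code that still reads or writes
# the old option should observe the same value.
print(dynamo_config.cache_size_limit)  # expected: 16
```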

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143709
Approved by: https://github.com/jansel
2024-12-22 10:03:57 +00:00
9bf4b1c2e9 dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063)
See #143056 for overall docs.

This PR: Convert `strip_function_call()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063
Approved by: https://github.com/jansel
ghstack dependencies: #143057, #143062
2024-12-22 06:38:46 +00:00
3ec04d30d5 dynamo tracing perf: kill import: 50.36 -> 49.12 (#143062)
See #143056 for overall docs.

This PR: Stop importing in the body of `BuiltinVariable.call_getattr()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143062
Approved by: https://github.com/jansel
ghstack dependencies: #143057
2024-12-22 06:38:46 +00:00
f2b744b9ca dynamo tracing perf: import_module: 59.92 -> 52.9 (#143057)
See #143056 for overall docs.

This PR: Using `importlib.import_module()` within the hot path of
symbolic_convert is slow. Memoize it.
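
Not the PR's literal code, just a minimal sketch of the memoization idea using `functools.lru_cache`:

```python
import functools
import importlib

@functools.lru_cache(maxsize=None)
def cached_import_module(name: str):
    # importlib.import_module already consults sys.modules, but the call
    # overhead still shows up on a hot path; caching turns repeat lookups
    # for the same module name into a plain dict hit.
    return importlib.import_module(name)

assert cached_import_module("math") is cached_import_module("math")
```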

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143057
Approved by: https://github.com/jansel
2024-12-22 06:38:38 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
2293fe1024 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes by apply order:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
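
A small hand-made illustration of the overall effect of steps 1-4 above (the path is made up and POSIX-style):

```python
import os.path
from pathlib import Path

file = "/repo/tools/scripts/run.py"

# Before: dirname plus ".." juggling to reach the repo root.
old_root = os.path.abspath(os.path.join(os.path.dirname(file), "..", ".."))

# After: resolve the path first, then walk up with .parents.
new_root = str(Path(file).absolute().parents[2])

assert old_root == new_root == "/repo"
```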

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-12-21 22:08:01 +00:00
197954e14b Revert "Handle meta tensors in FX quantization (#142262)"
This reverts commit e97b97af56204230f1030bd297dda9bc6b053a4c.

Reverted https://github.com/pytorch/pytorch/pull/142262 on behalf of https://github.com/janeyx99 due to this PR broke lint  ([comment](https://github.com/pytorch/pytorch/pull/142262#issuecomment-2558233022))
2024-12-21 20:34:09 +00:00
0666347fc4 [Codemod][AddExplicitStrictExportArg] caffe2/benchmarks/dynamo (#143686)
Reviewed By: avikchaudhuri

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143686
Approved by: https://github.com/tugsbayasgalan
2024-12-21 19:56:56 +00:00
e97b97af56 Handle meta tensors in FX quantization (#142262)
Summary:
If the module being quantized contains some meta tensors and some tensors on an actual device, we should not fail quantization.

Quantization should also not fail if new quantized module is created on a meta device.

Differential Revision: D66895899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262
Approved by: https://github.com/iamzainhuda
2024-12-21 13:19:30 +00:00
cyy
daa3ffe0eb Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-21 09:19:02 +00:00
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4aaef7a8d9fb1f9f8b80f4f5f4ef1f4f.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
51eacea8c4 graph module retracing without preserving MCS (#143676)
Retracing while preserving module call signatures used to be a problem because graph modules don't have submodules at given paths. This led to a number of failing retraceability tests. By not trying to wrap modules with export tracepoints we can pass most of these tests; the only exception is where you do module swapping on retraced programs, which is still not possible.

Differential Revision: [D67539304](https://our.internmc.facebook.com/intern/diff/D67539304/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143676
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
ghstack dependencies: #143664
2024-12-21 07:57:43 +00:00
cyy
d7e59c2f85 Fix cppcoreguidelines-pro-type-member-init (#141787)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141787
Approved by: https://github.com/albanD
2024-12-21 07:51:30 +00:00
7b2af25f80 [1/n] Support Dynamic Memory Budget in Auto AC (#143539)
# Summary:
Full Context: https://docs.google.com/document/d/1-j5KSbfGFJQcH4sYh7BIeJXso3zYzl5G5yFQqXdKx_o/edit?usp=sharing

tl;dr

This change introduces classes which help determine a dynamic memory budget. This will mostly be helpful for models with many implicit graph breaks.

---

New Classes:

*GraphInfoProvider*
* Takes the joint_graph as well as the input memories and runtimes and parses the graph + values into usable forms for the SolverEvaluator.

*KnapsackEvaluator*
* Provides a function: Given all of the four inputs (solver function as a callable, max_dynamic_memory_budget, min_dynamic_memory_budget, dynamic_memory_budget_pareto_granularity) it returns an approximation of the knee point of the pareto distribution.
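
These helpers are internal, so the snippet below is only a rough, generic sketch of the knee-point idea, with names and a distance-to-corner heuristic of my own choosing rather than the PR's implementation:

```python
def approximate_knee(points):
    """points: list of (memory_budget, runtime) pairs on a Pareto frontier.
    Returns the point closest to the normalized (min memory, min runtime) corner."""
    mems = [m for m, _ in points]
    runs = [r for _, r in points]
    m_lo, m_hi = min(mems), max(mems)
    r_lo, r_hi = min(runs), max(runs)

    def norm(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo)

    # Distance to the ideal corner (0, 0) in normalized space.
    return min(
        points,
        key=lambda p: norm(p[0], m_lo, m_hi) ** 2 + norm(p[1], r_lo, r_hi) ** 2,
    )

# Toy Pareto frontier: runtime drops quickly at first, then flattens out.
frontier = [(0.1, 100.0), (0.2, 60.0), (0.3, 40.0), (0.5, 35.0), (0.9, 33.0)]
print(approximate_knee(frontier))  # (0.3, 40.0) -- the "knee"
```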

# Test Plan:

### LintRunner

LintRunner Output: P1700445547

### Unit Tests

```
$ buck test @mode/opt //caffe2/test/functorch:test_ac_knapsack
`@mode/opt` was specified, but not found. Using file at `//mode/opt`.
This behavior is being deprecated. Please use `"@//mode/opt"` instead
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpB6PmDS
File changed: fbsource//xplat/caffe2/test/functorch/test_ac_knapsack.py
File changed: fbcode//caffe2/.ruff_cache/0.7.4/.tmpyjCiPn
20 additional file change events
Buck UI: https://www.internalfb.com/buck2/414ead46-9ede-4192-8e1a-5d3c52bdb9cc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6473924710342830
Network: Up: 0B  Down: 0B  (reSessionID-159794b9-9d61-477e-8e63-9bdeaa537dca)
Analyzing targets. Remaining     0/214
Executing actions. Remaining     0/6933                                                                                                                                                                                  0.1s exec time total
Command: test.     Finished 1 local
Time elapsed: 18.5s
Tests finished: Pass 15. Fail 0. Fatal 0. Skip 0. Build failure 0
```

### Test Run

Updated the config:

```
      activation_memory_budget_solver: DYNAMIC_MEMORY_BUDGET_DP
```

Confirming proper execution via: [aps-fb_fm_v4_768_01_dynamic-2a792ba8af](https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-fb_fm_v4_768_01_dynamic-2a792ba8af?job_attempt=0&version=0&env=PRODUCTION)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143539
Approved by: https://github.com/jansel
2024-12-21 07:38:52 +00:00
bee47b0663 Revert "[pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)"
This reverts commit 33dd4f187dd3b54d65182d56998feae235ee48c7.

Reverted https://github.com/pytorch/pytorch/pull/143430 on behalf of https://github.com/huydhn due to The internal diff D58707846 has been backed out ([comment](https://github.com/pytorch/pytorch/pull/143430#issuecomment-2558033930))
2024-12-21 07:26:34 +00:00
47c4e01e71 [audio hash update] update the pinned audio hash (#143694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143694
Approved by: https://github.com/pytorchbot
2024-12-21 05:42:34 +00:00
9f3c291bc3 Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 05:31:56 +00:00
518b5050c0 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/cyyever
2024-12-21 05:27:38 +00:00
f44310097c Reuse partial reductions (#143600)
Reuse partial reductions for complete reductions. We could expand this to cover more types of reductions, although we'd have to be a bit more careful about keeping the intermediate partial reduction in higher precision.

For now, just doing the ops which do not depend on a higher compute_dtype_precision, to cover the relevant use case.

Fix for https://github.com/pytorch/pytorch/issues/136267. Longer term, we should make sure cooperative reductions fuse partial and complete reductions.
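
A hand-written illustration of the idea (not Inductor's generated code): when a partial, row-wise reduction is already available, the complete reduction can be computed from it instead of re-reading the whole tensor:

```python
import torch

x = torch.randn(1024, 512)

# Without reuse: two full passes over x.
row_max = x.amax(dim=1)
global_max = x.amax()

# With reuse: the complete reduction only reads the 1024 partial results.
# (For sum/mean-style reductions this needs care about accumulation
# precision, as noted above; max is exact.)
global_max_reused = row_max.amax()

assert torch.equal(global_max, global_max_reused)
```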

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143600
Approved by: https://github.com/vkuzo
2024-12-21 04:44:07 +00:00
97990f476d Revert "Fix unused-variable issues in caffe2 (#143639)"
This reverts commit 23ca7c2515dd1f601926c4fd0e65513308c135a9.

Reverted https://github.com/pytorch/pytorch/pull/143639 on behalf of https://github.com/huydhn due to This is failing OSS tests ([comment](https://github.com/pytorch/pytorch/pull/143639#issuecomment-2557991297))
2024-12-21 04:30:48 +00:00
b89bfe0bac Revert "Fix issue with setAttribute and int8_t vs int32_t variables (#143693)"
This reverts commit ae3d385fcba0f91f35b2848b852d4c75f88cbd62.

Reverted https://github.com/pytorch/pytorch/pull/143693 on behalf of https://github.com/huydhn due to Sorry for reverting this change but it has a conflict with https://github.com/pytorch/pytorch/pull/143639 that is breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/143693#issuecomment-2557990508))
2024-12-21 04:27:18 +00:00
a8953c36f5 [compiled autograd] log compilation time to perfetto (#140964)
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmprli4iy/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```
[
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886868992655.8
  },
  {
    "args": {
      "compile_id": "0/-/-",
      "graph_id": 0
    },
    "cat": "dynamo_timed",
    "name": "compiled_autograd",
    "ph": "E",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869130681.0
  },
  {
    "args": {
      "compile_id": "0/0/0"
    },
    "cat": "dynamo_timed",
    "name": "dynamo",
    "ph": "B",
    "pid": 0,
    "tid": 0,
    "ts": 1733886869134350.5
  },
  {
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140964
Approved by: https://github.com/masnesral
ghstack dependencies: #141907, #143175
2024-12-21 04:23:25 +00:00
c7d7eff798 Revert "[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)"
This reverts commit efe21ee59dfdd6642cc693e69e07aa9d8be13eb9.

Reverted https://github.com/pytorch/pytorch/pull/143347 on behalf of https://github.com/huydhn due to D67118173 has been backed out internally ([comment](https://github.com/pytorch/pytorch/pull/143347#issuecomment-2557983266))
2024-12-21 04:04:16 +00:00
dabc9566c4 Revert "(MTIA) Move "empty_cache" API (#143402)"
This reverts commit c7d9f298072a3f59b39517e367c7d3d2ea30e6d9.

Reverted https://github.com/pytorch/pytorch/pull/143402 on behalf of https://github.com/huydhn due to The internal diff D67148738 has been reverted ([comment](https://github.com/pytorch/pytorch/pull/143402#issuecomment-2557982597))
2024-12-21 04:01:23 +00:00
fecf03fa3f [AOTI][reland] Emit a CMakeLists.txt when package_cpp_only (#143680)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping AOTI generated .pt2 package file, user can manually build the generated model code in their local environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143680
Approved by: https://github.com/huydhn
2024-12-21 03:48:40 +00:00
b5e159270a [AOTI XPU] Replace intel compiler with g++ to build inductor CPP wrapper in runtime. (#142322)
This PR removes the dependency on the Intel compiler at Inductor runtime. Now we only need SYCL_HOME at runtime to find the SYCL headers and libs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142322
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/albanD
ghstack dependencies: #143491
2024-12-21 02:27:04 +00:00
af0e159740 [Inductor XPU] Add XPU check for is_big_gpu(). (#143491)
Fix #143472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143491
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-21 02:27:04 +00:00
0da004f3dd [dynamo] Remove transformers ModelOutput hack (#143567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143567
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143548
2024-12-21 01:46:14 +00:00
4627cfd1f9 [dynamo] Support user defined dicts (#143548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143548
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/williamwen42
2024-12-21 01:46:14 +00:00
9cb743d1f9 [easy] Set feature use for aot autograd remote cache (#143674)
Use set_feature_use for logging aot autograd cache so that dynamo_compile has this data as well as PT2 Compile Events.

Differential Revision: [D67536293](https://our.internmc.facebook.com/intern/diff/D67536293/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143674
Approved by: https://github.com/bobrenjc93
2024-12-21 01:40:18 +00:00
ffd1b53f26 [aot] refactor dynamo source and cudagraphs static idx logic (#141748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141748
Approved by: https://github.com/ezyang
2024-12-21 01:20:53 +00:00
ae3d385fcb Fix issue with setAttribute and int8_t vs int32_t variables (#143693)
Test Plan: Sandcastle

Differential Revision: D67549758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143693
Approved by: https://github.com/huydhn
2024-12-21 01:19:29 +00:00
bdeee82822 unflatten isinstance (#143664)
When we unflatten, the submodules we generate (`InterpreterModule` or `InterpreterModuleDispatcher`) are not related by type to the original submodules `N`. This makes `isinstance(mod, N)` checks fail. Since we do not have the original types after export, the best we can do is expose a `type_name()` method that carries the original type name, which we do carry in `nn_module_stack` entries.
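
As a toy, self-contained sketch of the resulting check (the stub class below stands in for the real unflattened module type; only the `type_name()` idea comes from this change):

```python
class InterpreterModuleStub:
    """Stand-in (not the real torch.export class) just to show the check."""
    def __init__(self, original_type_name: str):
        self._original_type_name = original_type_name

    def type_name(self) -> str:
        return self._original_type_name


class MyBlock:  # the original eager submodule type
    pass


unflattened_submodule = InterpreterModuleStub("MyBlock")

# isinstance(unflattened_submodule, MyBlock) is False after unflattening,
# so compare against the recorded original type name instead.
assert not isinstance(unflattened_submodule, MyBlock)
assert unflattened_submodule.type_name() == "MyBlock"
```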

Differential Revision: [D67526542](https://our.internmc.facebook.com/intern/diff/D67526542/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143664
Approved by: https://github.com/tugsbayasgalan
2024-12-21 01:07:10 +00:00
d88ebbf822 cleanup chromium event log on dynamo exit rather than on entry (#143175)
clearing at dynamo start is an issue because it throws away events from compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143175
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
ghstack dependencies: #141907
2024-12-21 00:41:24 +00:00
4ee166b82f [ca] add compiled autograd to CompileId (#141907)
tlparse PR: https://github.com/ezyang/tlparse/pull/83

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141907
Approved by: https://github.com/ezyang
2024-12-21 00:41:24 +00:00
0ce233b8ca Support tensor subclass unwrapping (#141941)
This PR adds support for export to unwrap/wrap subclasses AOT so that we can trace through subclass parameters. This will resolve the UX issue in torchao where users had to manually unwrap their subclasses before calling export.

Differential Revision: [D67531057](https://our.internmc.facebook.com/intern/diff/D67531057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141941
Approved by: https://github.com/bdhirsh
2024-12-21 00:29:31 +00:00
553031fb9a [BE] Remove gcc-5 workaround for unused args (#143685)
ditto

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143685
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/atalman
2024-12-21 00:18:15 +00:00
ad7ab5ef84 Revert "[logging] A few fixes/updates to record_compilation_metrics (#143332)"
This reverts commit a9c753bbc88bfdc0e77f66956b3a11e405235d0f.

Reverted https://github.com/pytorch/pytorch/pull/143332 on behalf of https://github.com/malfet due to Surprisingly failure is caused by this PR ([comment](https://github.com/pytorch/pytorch/pull/143332#issuecomment-2557899120))
2024-12-21 00:06:44 +00:00
bf7009d839 [rpc] Fix unit test after c10::nullopt removal (#143690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143690
Approved by: https://github.com/yifuwang, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-12-20 23:36:07 +00:00
eqy
912d6a2867 [CUDA] Bump tolerances in test_svd_lowrank_cuda_float64 (#143049)
Pre-emptive bump for an apparently noisy failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143049
Approved by: https://github.com/Skylion007, https://github.com/lezcano, https://github.com/nikitaved
2024-12-20 23:25:21 +00:00
8960cb5809 Add support for bfloat16 atomic adds in fbcode (#143629)
Reland https://github.com/pytorch/pytorch/pull/141857 and fall back on A100, which doesn't have bfloat16 atomic add instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143629
Approved by: https://github.com/eellison
2024-12-20 23:05:13 +00:00
a3b04d473e [ROCm] Update setup-rocm for almalinux-based images (#143590)
Needed for https://github.com/pytorch/test-infra/pull/6003 and https://github.com/pytorch/ao/pull/999

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143590
Approved by: https://github.com/atalman

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-20 22:48:54 +00:00
23ca7c2515 Fix unused-variable issues in caffe2 (#143639)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143639
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-20 22:30:58 +00:00
6e58c37542 c10d: no call_guard in init (#143598)
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.

For more details see #143593

https://github.com/pybind/pybind11/issues/5473

Test plan:

```
python setup.py develop
```

CI

```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist

def run():
    store = dist.TCPStore(
        host_name="localhost",
        port=0,
        is_master=True,
        wait_for_workers=False,
    )

    # this sleep is required to trigger the crash
    time.sleep(0.1)
    del store

futures = []
with ThreadPoolExecutor(
    max_workers=100,
) as executor:
    for i in range(100000):
        print(i)
        futures.append(executor.submit(run))
        if len(futures) > 100:
            futures.pop(0).result()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
2024-12-20 22:23:36 +00:00
a9c753bbc8 [logging] A few fixes/updates to record_compilation_metrics (#143332)
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics

Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
2024-12-20 21:42:32 +00:00
372b023eb1 Fix test_serialization_zipfile_actually_jit when weights_only is not default (#143668)
Fails in fbcode where weights_only isn't default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143668
Approved by: https://github.com/awgu
ghstack dependencies: #143326, #143403
2024-12-20 21:25:10 +00:00
33dd4f187d [pytorch/et] Allow ET to save additional resources for completing a trace like generated kernels and index tensor data (#143430)
The resources directory lets the ET observer dump any additional data, like Triton kernels, while capturing the ET.

This allows us to use the ET trace to replay PT2 workloads and get visibility into data like generated kernels and their usage in a model, index tensor data etc.

We also added a few ways to enable ET and ET Resources through the OS environment variables.

Setting `ENABLE_PYTORCH_EXECUTION_TRACE` will enable default Execution Tracing in PyTorch.

Additionally setting `ENABLE_PYTORCH_EXECUTION_TRACE_EXTRAS` will enable ET to collect extra resources from the ET run like Triton Kernels.

Differential Revision: [D58707846](https://our.internmc.facebook.com/intern/diff/D58707846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143430
Approved by: https://github.com/shengfukevin, https://github.com/sraikund16
2024-12-20 21:20:32 +00:00
cee06e74ee Apply clang-format for ATen/core/dispatch headers (#143620)
Code change: add a path config in the `.lintrunner.toml` file and run:

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143620
Approved by: https://github.com/malfet
2024-12-20 21:16:23 +00:00
8e483654cb Add config.save.use_pinned_memory_for_d2h to serialization config (#143342)
This was benchmarked with two separate scripts on my A100:
(A) Save state_dict of llama3-style model on CUDA to disk with ``torch.save``
(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save`
Timings are an average of 5 runs and benchmark scripts + results are attached

Under both scenarios, we see **~2x speedup in ``torch.save`` time with (``compute_crc32=False`` and ``use_pinned_memory_for_d2h=True``)** compared to the baseline of the current defaults (``compute_crc32=True`` and ``use_pinned_memory_for_d2h=False``).

(A)  Save state_dict of llama3-style model on CUDA to disk with ``torch.save`` [[script](https://gist.github.com/mikaylagawarecki/d3a86ea1bb08045d1a839976808d7432)][[results](https://gist.github.com/mikaylagawarecki/f61a4714e5cff703146a1fcb7e0c755c)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 28.54s | 20.76s |
| `compute_crc_32 = False` | 22.57s |  **14.51s** |

(B) Save `ModuleList` of 10 `nn.Linear(10,000, 10,000)` on CUDA to disk with `torch.save` [[script](https://gist.github.com/mikaylagawarecki/ecbc505436bdd4b5190ef1b3430c12b6)][[results](https://gist.github.com/mikaylagawarecki/4e686bcf030b57de8c3ca74d8f5a88f7)]

|                                                                                 |  use_pinned_memory_for_d2h=False (Default) |  use_pinned_memory_for_d2h=True |
|-|-|-|
| `compute_crc_32= True`  (Default)| 8.38s | 5.53s |
| `compute_crc_32 = False` | 6.94s |  **3.99s** |

Trace of (A) with `use_pinned_memory_for_d2h=True`, `compute_crc32=False`
<img width="1745" alt="Screenshot 2024-12-16 at 7 32 33 PM" src="https://github.com/user-attachments/assets/80b87a8c-5a70-4eb9-ad66-7abc4aa7cc25" />

Baseline trace of (A) with `use_pinned_memory_for_d2h=False`, `compute_crc32=True`
<img width="1799" alt="Screenshot 2024-12-16 at 7 38 20 PM" src="https://github.com/user-attachments/assets/13fa12d1-8f5f-424c-adc4-275b67012927" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143342
Approved by: https://github.com/albanD
ghstack dependencies: #143324
2024-12-20 21:01:18 +00:00
3f63b742e6 Refactor serialization getter/setters into torch.utils.serialization.config (#143324)
Consolidate
- get/set_default_load_endianness
- get/set_default_mmap_options
- get/set_crc32_options

into one global dynamo-style config + allow global setting of mmap. The existing APIs are not removed and will get/set from the config (as they can't be removed for BC)

In #143459 I add the local (argument style) config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143324
Approved by: https://github.com/albanD
2024-12-20 21:01:17 +00:00
629de988df Fix old-compiler-unfriendly zero init of bfloat16_t array (#143504)
clang versions before 17 don't like to assign 0 to a bfloat16_t. gcc versions before 13 also won't assign 0.0 to a bfloat16_t. (Citation: https://godbolt.org/z/Gzs5ebdej)

Differential Revision: [D67396740](https://our.internmc.facebook.com/intern/diff/D67396740/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143504
Approved by: https://github.com/malfet
2024-12-20 20:49:51 +00:00
485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder traces are now dumped automatically by default upon
   timeout or exception. Users don't need to opt in.
2. Change the default dump location to the `.cache` folder in the running
   user's home directory.

Test Plan:
1. Tested locally by running the crash program from flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Noted that flight recorder files were correctly created.
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
a94f259a69 pgo: Log feature use (#142819)
This will cause dynamo_compile to populate the feature column if we have
a hit for PGO.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142819
Approved by: https://github.com/ezyang
2024-12-20 20:22:20 +00:00
8ce0bc282a dynamo tracing perf: bytecode_transform improvements: 34.86 -> 33.9 (#143068)
See #143056 for overall docs.

This PR: Use slots on InstructionExnTabEntry and Instruction. Stop doing Python
version checks in the middle of `convert_instruction()` and
`inst_has_op_bits()`.
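
For context, a generic sketch of the slots change (the dataclass names below are illustrative, not the actual dynamo classes): adding slots removes the per-instance `__dict__`, which saves memory and speeds up attribute access for objects created in bulk during bytecode transformation:

```python
import sys
from dataclasses import dataclass

@dataclass
class InstructionNoSlots:
    opcode: int
    opname: str
    arg: int | None
    argval: object

@dataclass(slots=True)  # requires Python 3.10+
class InstructionSlots:
    opcode: int
    opname: str
    arg: int | None
    argval: object

a = InstructionNoSlots(1, "POP_TOP", None, None)
b = InstructionSlots(1, "POP_TOP", None, None)
print(hasattr(a, "__dict__"), hasattr(b, "__dict__"))  # True False
print(sys.getsizeof(a) + sys.getsizeof(a.__dict__), sys.getsizeof(b))
```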

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143068
Approved by: https://github.com/jansel
ghstack dependencies: #143065, #143067
2024-12-20 20:06:42 +00:00
5feb2d7b41 dynamo tracing perf: don't call expensive _set_guard_export_info if it's a duplicate guard: 37.66 -> 34.86 (#143067)
See #143056 for overall docs.

This PR: Move the call to `_set_guard_export_info()` after the duplicate guard
check in `GuardBuilder.DUPLICATE_INPUT()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143067
Approved by: https://github.com/jansel
ghstack dependencies: #143065
2024-12-20 20:06:42 +00:00
7d4e7fbfc1 dynamo tracing perf: no import on hot path: 47.62 -> 47.26 (#143065)
See #143056 for overall docs.

This PR: Removed another `import` in the body of the hot path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143065
Approved by: https://github.com/jansel
2024-12-20 20:06:42 +00:00
792e6184c5 [GPT-fast] Support running a specific model or micro-benchmark (#143607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143607
Approved by: https://github.com/BoyuanFeng, https://github.com/jerryzh168, https://github.com/huydhn
2024-12-20 19:58:07 +00:00
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model perf:
7B Transformer model:
  Prefill: 340 t/s
  Decode:  40 t/s
2B Transformer model:
  Prefill: 747 t/s
  Decode:  80 t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
b5475d334e [inductor] Fix an unused variable in cpu_vec_isa.py (#138473)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138473
Approved by: https://github.com/EikanWang, https://github.com/albanD, https://github.com/xuhancn
2024-12-20 18:50:19 +00:00
5a69c2a649 [BE][Sparse] Get rid of gcc-5 workaround (#143653)
Discovered those comments while looking at https://github.com/pytorch/pytorch/pull/143620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143653
Approved by: https://github.com/albanD
2024-12-20 18:40:45 +00:00
a5ed499f6a FlexAttention Benchmark (#139665)
1. Add alibi, sliding window, tanh softcap, prefixLM, and document_mask from attn_gym to the benchmark.

2. Add comparison to different SDPA backends & FAv2, FAv3, FAKV.

Dependent on https://github.com/pytorch/pytorch/pull/139639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139665
Approved by: https://github.com/drisspg
2024-12-20 17:52:24 +00:00
c7d9f29807 (MTIA) Move "empty_cache" API (#143402)
Summary: This diff moves one of the memory-related APIs to the consolidated location, `mtia/memory.py`.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api
```

https://www.internalfb.com/intern/testinfra/testrun/13510798943184259

Reviewed By: nautsimon

Differential Revision: D67148738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143402
Approved by: https://github.com/nautsimon
2024-12-20 17:39:06 +00:00
d79fbf6b6d test/dynamo/test_utils: logging - Stop testing for impossible things. (#143535)
We don't support assigning to objects or numeric constants at the top level in
config modules, so there is no need to test for them.

(This specifically breaks later sorting refactoring, since it requires <
to be implemented).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143535
Approved by: https://github.com/ppanchalia
2024-12-20 17:21:49 +00:00
f5af87c23c Make Inductor cpp backend enable_floating_point_contract_flag take a string (#143450)
Differential Revision: D66269001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143450
Approved by: https://github.com/desertfire
2024-12-20 16:28:54 +00:00
7ab880bc5e fix typo in autocast header (#143625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625
Approved by: https://github.com/mlazos
ghstack dependencies: #143592
2024-12-20 16:17:15 +00:00
4f8b7c4272 Revert "refactor tensorify restart logic to use sources (#141517)" (#143623)
This reverts commit 30d8b30db7eaaa254d97077ac6515cdc4568fd6d.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143623
Approved by: https://github.com/mlazos
2024-12-20 15:38:34 +00:00
607884c9af [Inductor][CPP] Fix bitwise shift with corner inputs (#143635)
**Summary**
Fix issues https://github.com/pytorch/pytorch/issues/143555 and https://github.com/pytorch/pytorch/issues/143566 by aligning the implementation with eager (29b586bbad/aten/src/ATen/native/cpu/BinaryOpsKernel.cpp (L501)) for these corner inputs.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_bitwise_shift_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143635
Approved by: https://github.com/jgong5
2024-12-20 13:47:40 +00:00
7bf3b7cdc5 Rewrite _reparametrize_module to use contextmanager (#138203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138203
Approved by: https://github.com/zou3519
ghstack dependencies: #136033, #140604
2024-12-20 12:02:27 +00:00
1c817fe671 Set enable_trace_contextlib_contextmanager flag to True (#140604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140604
Approved by: https://github.com/zou3519
ghstack dependencies: #136033
2024-12-20 12:02:27 +00:00
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
04b26ee1e8 Fix false positive from f-strings in set_linter (#143628)
This linter was going crazy in python 3.12, example:
```py
$ python3 tools/linter/adapters/set_linter.py torch/_inductor/runtime/triton_heuristics.py
torch/_inductor/runtime/triton_heuristics.py:192:25: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:27: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                  ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:29: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                    ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:192:31: Builtin `set` is deprecated
  190 |     args_str += ", ".join(call_args)
  191 |     for k, v in call_kwargs.items():
  192 |         args_str += f", {k}={v}"
                                      ^
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])

torch/_inductor/runtime/triton_heuristics.py:195:17: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                        ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:195:26: Builtin `set` is deprecated
  193 |
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
                                 ^
  196 |         f.write(f"{kernel_name} | {args_str}\n")
  197 |

torch/_inductor/runtime/triton_heuristics.py:196:19: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:31: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                      ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:35: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                          ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:196:44: Builtin `set` is deprecated
  194 |     abs_path = os.path.abspath(sys.argv[0])
  195 |     with open(f"{abs_path}.launch_params", "a") as f:
  196 |         f.write(f"{kernel_name} | {args_str}\n")
                                                   ^
  197 |
  198 |

torch/_inductor/runtime/triton_heuristics.py:729:26: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                 ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:729:46: Builtin `set` is deprecated
  727 |         exec(
  728 |             f"""
  729 |             def launcher({', '.join(def_args)}, grid, stream):
                                                     ^
  730 |                 if callable(grid):
  731 |                     grid_0, grid_1, grid_2 = grid(grid_meta)

torch/_inductor/runtime/triton_heuristics.py:735:24: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                               ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:735:45: Builtin `set` is deprecated
  733 |                     grid_0, grid_1, grid_2 = grid
  734 |
  735 |                 args = {', '.join(call_args)},
                                                    ^
  736 |                 launch_args = get_launch_args(
  737 |                     grid, grid_0, grid_1, grid_2, stream, function,

torch/_inductor/runtime/triton_heuristics.py:1144:20: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                           ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1144:29: Builtin `set` is deprecated
 1142 |     cur_file = inspect.stack()[1].filename
 1143 |     summary_str = (
 1144 |         f"SUMMARY ({cur_file})\n"
                                    ^
 1145 |         f"{overall_time:.2f}ms   \t {overall_gb:.2f} GB\t {overall_gb / (overall_time / 1e3):.2f}GB/s"
 1146 |     )

torch/_inductor/runtime/triton_heuristics.py:1162:61: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                    ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1162:70: Builtin `set` is deprecated
 1160 |                 )
 1161 |                 file.write("====================\n")
 1162 |                 file.write(f"TRITON KERNELS BANDWIDTH INFO ({cur_file})\n")
                                                                             ^
 1163 |                 for ms, num_gb, gb_per_s, kernel_name in sorted_calls:
 1164 |                     # also display the runtime percentage for each kernel

torch/_inductor/runtime/triton_heuristics.py:1166:36: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:47: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                      ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:52: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                           ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1166:64: Builtin `set` is deprecated
 1164 |                     # also display the runtime percentage for each kernel
 1165 |                     percentage = f"{ms / overall_time * 100:.2f}%"
 1166 |                     suffix = f" \t {percentage} \t {kernel_name}"
                                                                       ^
 1167 |                     bw_info_str = create_bandwidth_info_str(
 1168 |                         ms,

torch/_inductor/runtime/triton_heuristics.py:1175:30: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                     ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1175:42: Builtin `set` is deprecated
 1173 |                     )
 1174 |                     file.write(bw_info_str + "\n")
 1175 |                 file.write(f"{summary_str}\n\n")
                                                 ^
 1176 |         except Exception as e:
 1177 |             log.warning(

torch/_inductor/runtime/triton_heuristics.py:1205:29: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                    ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1205:58: Builtin `set` is deprecated
 1203 |         else:
 1204 |             possible_names = _find_names(self)
 1205 |             kernel_name = f"{max(possible_names, key=len)}"
                                                                 ^
 1206 |             if not re.match(self.regex_filter, kernel_name):
 1207 |                 return

torch/_inductor/runtime/triton_heuristics.py:1241:60: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                   ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1241:72: Builtin `set` is deprecated
 1239 |                     "%s",
 1240 |                     create_bandwidth_info_str(
 1241 |                         ms, num_gb, gb_per_s, suffix=f" \t {kernel_name}"
                                                                               ^
 1242 |                     ),
 1243 |                 )

torch/_inductor/runtime/triton_heuristics.py:1256:15: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                      ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:42: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:44: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:58: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                 ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:60: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                   ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1256:75: Builtin `set` is deprecated
 1254 |     for cfg in configs:
 1255 |         hasher.update(
 1256 |             f"{sorted(cfg.kwargs.items())} {cfg.num_warps} {cfg.num_stages}\n".encode()
                                                                                  ^
 1257 |         )
 1258 |     return hasher.hexdigest()

torch/_inductor/runtime/triton_heuristics.py:1377:23: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                              ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1377:29: Builtin `set` is deprecated
 1375 |         if numel is None:
 1376 |             continue
 1377 |         block = cfg[f"{label}BLOCK"]
                                    ^
 1378 |         if numel == 1:
 1379 |             assert block == 1, (

torch/_inductor/runtime/triton_heuristics.py:1381:24: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:38: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                             ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:46: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                     ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:52: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:58: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                 ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:64: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                       ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:71: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                              ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:77: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                    ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:84: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                           ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1381:88: Builtin `set` is deprecated
 1379 |             assert block == 1, (
 1380 |                 f"TritonKernel.indexing assumes numel == 1 => BLOCK == 1"
 1381 |                 f" but {label.lower()}numel=={numel} and {label}BLOCK={block} (cfg={cfg})."
                                                                                               ^
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]

torch/_inductor/runtime/triton_heuristics.py:1384:52: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                           ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1384:58: Builtin `set` is deprecated
 1382 |             )
 1383 |         max_block = TRITON_MAX_BLOCK[label]
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
                                                                 ^
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"

torch/_inductor/runtime/triton_heuristics.py:1386:45: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                    ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:51: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                          ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:66: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                         ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1386:80: Builtin `set` is deprecated
 1384 |         max_block_str = f'config.triton.max_block["{label}"]'
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
                                                                                       ^
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
 1388 |         )

torch/_inductor/runtime/triton_heuristics.py:1387:20: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                           ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:26: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                 ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:33: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                        ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:39: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:45: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:59: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                  ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:61: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                    ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:71: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                              ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:78: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                     ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1387:82: Builtin `set` is deprecated
 1385 |         assert max_block % block == 0, (
 1386 |             f"TritonKernel.indexing assumes {label}BLOCK divides {max_block_str}"
 1387 |             f" but {label}BLOCK={block} and {max_block_str}={max_block} (cfg={cfg})."
                                                                                         ^
 1388 |         )
 1389 |

torch/_inductor/runtime/triton_heuristics.py:1402:19: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:23: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:46: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                     ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:56: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                               ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:67: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                          ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1402:71: Builtin `set` is deprecated
 1400 |             assert (
 1401 |                 val <= max_block
 1402 |             ), f"'{var}' too large. Maximum: {max_block}. Actual: {val}."
                                                                              ^
 1403 |
 1404 |

torch/_inductor/runtime/triton_heuristics.py:1551:21: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                            ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1551:25: Builtin `set` is deprecated
 1549 |     rnumels = {}
 1550 |     for idx in range(num_reduction_dims - 1, -1, -1):
 1551 |         prefix = f"r{idx}_"
                                ^
 1552 |         max_size = min(size_hints[prefix], TRITON_MAX_BLOCK[prefix.upper()])
 1553 |         dim = min(max_size, remaining)

torch/_inductor/runtime/triton_heuristics.py:1556:34: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                         ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:38: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                             ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:67: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                          ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1556:77: Builtin `set` is deprecated
 1554 |         assert (
 1555 |             remaining % dim == 0
 1556 |         ), f"Expected dimension '{dim}' to divide remaining size '{remaining}'"
                                                                                    ^
 1557 |         rnumels[prefix] = dim
 1558 |         remaining //= dim

torch/_inductor/runtime/triton_heuristics.py:1564:38: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                             ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:46: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                     ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:57: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1564:59: Builtin `set` is deprecated
 1562 |     assert (
 1563 |         r == final_numel
 1564 |     ), f"Expected ND reduction size ({rnumels}) to have {r} elements."
                                                                  ^
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:37: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                            ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:45: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                    ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:49: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                        ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1567:60: Builtin `set` is deprecated
 1565 |     assert all(
 1566 |         rnumels[prefix] <= size_hints[prefix] for prefix in rnumels
 1567 |     ), f"rnumels exceed size_hints. {rnumels} > {size_hints}"
                                                                   ^
 1568 |
 1569 |     return rnumels

torch/_inductor/runtime/triton_heuristics.py:1746:49: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1746:60: Builtin `set` is deprecated
 1744 |
 1745 |     if not configs:
 1746 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1747 |     return cached_autotune(
 1748 |         size_hints,

torch/_inductor/runtime/triton_heuristics.py:1928:32: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                       ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1928:47: Builtin `set` is deprecated
 1926 |         for prefix in size_hints:
 1927 |             if prefix_is_reduction(prefix):
 1928 |                 c.kwargs.pop(f"{prefix.upper()}BLOCK")
                                                      ^
 1929 |
 1930 |     if disable_pointwise_autotuning(inductor_meta):

torch/_inductor/runtime/triton_heuristics.py:1975:49: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                        ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:1975:60: Builtin `set` is deprecated
 1973 |     assert triton_meta is not None
 1974 |     if len(size_hints) != 2:
 1975 |         raise NotImplementedError(f"size_hints: {size_hints}")
                                                                   ^
 1976 |
 1977 |     configs = _reduction_configs(size_hints=size_hints, inductor_meta=inductor_meta)

torch/_inductor/runtime/triton_heuristics.py:2082:56: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                               ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2082:68: Builtin `set` is deprecated
 2080 |         xnumel, ynumel, znumel = numels[2], numels[1], numels[0]
 2081 |     else:
 2082 |         raise AssertionError(f"invalid size for numels {len(numels)}")
                                                                           ^
 2083 |
 2084 |     def get_grid_dim(numel, block):

torch/_inductor/runtime/triton_heuristics.py:2104:57: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2104:64: Builtin `set` is deprecated
 2102 |             torch._check(
 2103 |                 y_grid <= max_y_grid,
 2104 |                 lambda: f"Generated y grid beyond 2^16 ({y_grid}) not supported with z dimension present. File issue",
                                                                       ^
 2105 |             )
 2106 |

torch/_inductor/runtime/triton_heuristics.py:2113:43: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                  ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2113:50: Builtin `set` is deprecated
 2111 |         )
 2112 |
 2113 |     setattr(grid_fn, "grid_fn_str", f"grid{numels}")  # noqa: B010
                                                         ^
 2114 |
 2115 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:48: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                       ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2122:55: Builtin `set` is deprecated
 2120 |         return (meta["RSPLIT"], ceildiv(xnumel, meta.get("XBLOCK", 1)), 1)
 2121 |
 2122 |     grid_fn_str = f"cooperative_reduction_grid({xnumel})"
                                                              ^
 2123 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2124 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:54: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                             ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2135:61: Builtin `set` is deprecated
 2133 |     coop_grid = cooperative_reduction_grid(xnumel)
 2134 |     normal_grid = grid(xnumel)
 2135 |     grid_fn_str = f"maybe_cooperative_reduction_grid({xnumel})"
                                                                    ^
 2136 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2137 |     return grid_fn

torch/_inductor/runtime/triton_heuristics.py:2145:37: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                            ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:44: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                   ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:47: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                      ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2145:54: Builtin `set` is deprecated
 2143 |         return (ceildiv(rnumel, meta.get("R0_BLOCK", 1)), xnumel, 1)
 2144 |
 2145 |     grid_fn_str = f"split_scan_grid({xnumel}, {rnumel})"
                                                             ^
 2146 |     setattr(grid_fn, "grid_fn_str", grid_fn_str)  # noqa: B010
 2147 |

torch/_inductor/runtime/triton_heuristics.py:2173:42: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                 ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:53: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                            ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:66: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                         ^
 2174 |     else:
 2175 |         # sequential dispatch

torch/_inductor/runtime/triton_heuristics.py:2173:77: Builtin `set` is deprecated
 2171 |             assert (
 2172 |                 min_blocks_d is None or min_blocks == min_blocks_d
 2173 |             ), f"inconsistent min_blocks {min_blocks} vs  x grid {numels[-1]}"
                                                                                    ^
 2174 |     else:
 2175 |         # sequential dispatch
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143628
Approved by: https://github.com/yanboliang, https://github.com/rec
2024-12-20 11:45:26 +00:00
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
b539c61631 [Hierarchical Compile] Update NoneAsConstantBuffer to support graph d… (#143531)
Fixes issues I hit while running graph deduplication with torch tune.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143531
Approved by: https://github.com/eellison
2024-12-20 09:23:12 +00:00
f9f82ca48f [ts converter] use Dim.AUTO for ts -> export converter (#138273)
Switches TS converter to use `Dim.AUTO` by default, exporting models with max dynamism. Adds runtime input tests to `test_converter.py`
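
For context, a minimal sketch of what `Dim.AUTO` looks like on the `torch.export` side (the module and shapes below are illustrative, not taken from the converter tests):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Dim.AUTO lets export infer per-dimension dynamism instead of requiring
# an explicit named Dim for every axis.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim.AUTO}})
```
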
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138273
Approved by: https://github.com/avikchaudhuri
2024-12-20 07:48:24 +00:00
270ad513c8 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-20 07:46:49 +00:00
29b586bbad fix formatting in programming model doc (#143587)
Test Plan: Some of the formatting in https://docs-preview.pytorch.org/pytorch/pytorch/143546/export.programming_model.html is broken.

Differential Revision: D67458972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143587
Approved by: https://github.com/yushangdi
2024-12-20 07:09:19 +00:00
fe0f20615c [DynamoBench] Handle accuracy results in benchmark records (#143611)
I discovered this issue when trying to search for the accuracy results in the database and couldn't find any. It turns out that the results are there in the JSON file, for example `"metric": {"name": "accuracy", "benchmark_values": ["pass_due_to_skip"]}`, but inserting them into the database fails because `benchmark_values` is a list of strings here while the expectation is that it's a list of numbers.

ClickHouse doesn't support mixed types at the moment. It has a Variant type https://clickhouse.com/docs/en/sql-reference/data-types/variant, but this isn't recommended by the CH team themselves. So the remaining option is to store this in the `extra_info` field. This field is a dictionary, so the value can go there.
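
A hypothetical sketch of the adjusted record shape (field names follow the example above; the exact layout is an assumption):

```python
record = {
    "metric": {
        "name": "accuracy",
        "benchmark_values": [],  # numeric-only column stays numeric
        "extra_info": {"accuracy": "pass_due_to_skip"},  # string result moved here
    }
}
```
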

### Testing

https://github.com/pytorch/pytorch/actions/runs/12421747715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143611
Approved by: https://github.com/kit1980
2024-12-20 06:43:38 +00:00
132fcf4e0d [user triton] Raise an exception when encountering nested @triton.autotune decorators or @triton.heuristics (#143519)
We support running a single Autotuner for each Triton kernel. Currently,
if there are multiple autotuning decorators, the subsequent ones will be
silently ignored.

Instead, we should raise an error here to avoid silent incorrectness.
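
A minimal sketch of the pattern that now raises (kernel body and configs are illustrative):

```python
import triton
import triton.language as tl

@triton.autotune(configs=[triton.Config({"BLOCK": 64})], key=["n"])
@triton.autotune(configs=[triton.Config({"BLOCK": 128})], key=["n"])  # second decorator: now rejected under torch.compile
@triton.jit
def add_one(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask) + 1, mask=mask)
```
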

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143519
Approved by: https://github.com/aakhundov
2024-12-20 06:38:45 +00:00
71479a9b9c Revert "[AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)"
This reverts commit 429f4cd1408b11a7b0dd10634b46b3265dc31af1.

Reverted https://github.com/pytorch/pytorch/pull/143352 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/143352#issuecomment-2556365140))
2024-12-20 06:21:31 +00:00
4e29e4aa63 [BE] Add a test to ensure grads are never inplaced into accidentally (#143612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143612
Approved by: https://github.com/soulitzer
2024-12-20 06:15:08 +00:00
2daa666591 update kineto to XPU Windows fixed PR. [submodule kineto] (#143445)
Include XPU Windows Fixed PR: https://github.com/pytorch/kineto/pull/1012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143445
Approved by: https://github.com/sraikund16
2024-12-20 05:57:30 +00:00
217a4ddb04 Add range check embedding_bag on input index >= 0 of cuda device (#140791)
Fixes #89362

**Test Result**

**Before**

```
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
tensor([[0., 0., 0.]], device='cuda:0')
```

**After**

```python
>>> import torch
>>> input = torch.randint(-5, 1, [1, 2], dtype=torch.int64).cuda()
>>> weight = torch.rand([2, 3], dtype=torch.float32).cuda()
>>> print(torch.nn.functional.embedding_bag(input, weight))
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [1,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
/home/zong/code/pytorch/aten/src/ATen/native/cuda/EmbeddingBag.cu:141: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [2,0,0] Assertion `0 <= input_idx && input_idx < numRows` failed.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 357, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/_tensor_str.py", line 146, in __init__
    tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

```

```bash
$ pytest test/nn/test_embedding.py
```
![image](https://github.com/user-attachments/assets/6a5ec759-a3dc-4d51-9e5e-ec79c0aac526)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/2ce4ac24-74fb-4181-9510-18b96a2c2acb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140791
Approved by: https://github.com/eqy
2024-12-20 05:47:26 +00:00
9713a6eeca remove allow-untyped-defs from torch/fx/experimental/refinement_types.py (#143602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143602
Approved by: https://github.com/aorenste
2024-12-20 05:40:52 +00:00
78d294379a remove allow-untyped-defs from torch/_lazy/config.py (#143603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143603
Approved by: https://github.com/aorenste
2024-12-20 05:34:19 +00:00
cb4e9888df remove allow-untyped-defs from torch/ao/quantization/experimental/APoT_tensor.py (#143601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143601
Approved by: https://github.com/aorenste
2024-12-20 05:26:09 +00:00
dd346dbeab remove allow-untyped-defs from torch/distributed/elastic/multiprocessing/errors/handlers.py (#143605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143605
Approved by: https://github.com/aorenste
2024-12-20 05:25:01 +00:00
fd23cf5848 [Dynamo] check node class first for graph dedup (#143609)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143609
Approved by: https://github.com/williamwen42
2024-12-20 04:09:46 +00:00
1c2593f035 [dynamo] guard global autocast state (#143592)
Fixes https://github.com/pytorch/pytorch/issues/112260.
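
A hedged illustration of the situation the new guard covers (shapes and dtype are arbitrary):

```python
import torch

@torch.compile
def f(x):
    return torch.mm(x, x)

x = torch.randn(8, 8)
f(x)  # compiled with autocast disabled
with torch.autocast("cpu", dtype=torch.bfloat16):
    f(x)  # global autocast state changed; the guard should force a recompile
```
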

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592
Approved by: https://github.com/jansel
2024-12-20 03:30:54 +00:00
d339f1506b Add cutlass version guard in prep for upgrade (#143551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143551
Approved by: https://github.com/eqy
2024-12-20 02:40:02 +00:00
75661f2036 try root fix for FP8 tensor (#143248)
Fixes #143194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143248
Approved by: https://github.com/fegin
2024-12-20 01:57:17 +00:00
4462cc6375 Revert "[Inductor] inplace padding (#140249)"
This reverts commit 297ce776363cc4802fa74d210fced2b4128960d5.

Reverted https://github.com/pytorch/pytorch/pull/140249 on behalf of https://github.com/huydhn due to This break an internal test https://fburl.com/test/ppl2we5l ([comment](https://github.com/pytorch/pytorch/pull/140249#issuecomment-2556079406))
2024-12-20 01:30:27 +00:00
e1b4635504 remove allow-untyped-defs from torch/distributed/pipelining/_debug.py (#143606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143606
Approved by: https://github.com/aorenste
2024-12-20 01:26:51 +00:00
a0cff096bc Improve cond error messaging (#143595)
Discovered by @drisspg and me while trying out a simple toy example and being way too confused :')

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143595
Approved by: https://github.com/zou3519, https://github.com/ydwu4
2024-12-20 01:19:20 +00:00
d547fae5b0 [Codemod][AddExplicitStrictExportArg] caffe2/torch/onnx/_internal/exporter (#143542)
Reviewed By: avikchaudhuri

Differential Revision: D67381244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143542
Approved by: https://github.com/ydwu4, https://github.com/titaiwangms
2024-12-20 00:54:52 +00:00
544de4008e [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN reference path. This PR constrains the shape of the other tensor in the Conv/Linear + broadcast add fusion to fix this issue.
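
An illustrative example of the pattern in question (sizes here are hypothetical):

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 16, 16)
other = torch.randn(1, 8, 1, 1)  # broadcast add operand

def fn(x, other):
    return conv(x) + other

# After this change, the Conv + add fusion is only attempted when `other`'s
# shape satisfies the new constraint, avoiding the slow oneDNN reference path.
out = torch.compile(fn)(x, other)
```
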

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-20 00:35:58 +00:00
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011f85f9d1395b451d61e976af844d9b1.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
145fd5bad0 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit a96387a481633389a6b5a5ac7b8406e9216f320e.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/huydhn due to This has been reverted internally D67436053 ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2555942351))
2024-12-19 23:22:44 +00:00
d2b83aa122 add grad_output shape check for fractional_max_pool2d_backward (#141666)
Fix https://github.com/pytorch/pytorch/issues/141102.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141666
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 22:47:02 +00:00
2def1f6f74 [caffe2] Move vectorized templates into a separate file for box_cox operator (#143556)
Summary: No functional changes in this diff, the code is moved into a separate file to be reused by avx512 version in the follow up diff.

Test Plan: buck build //caffe2/caffe2/perfkernels:perfkernels

Differential Revision: D67433115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143556
Approved by: https://github.com/hl475
2024-12-19 22:02:23 +00:00
429f4cd140 [AOTI] Emit a CMakeLists.txt when package_cpp_only (#143352)
Summary: Emit a CMakeLists.txt with compile and link options when package_cpp_only is specified. After unzipping the AOTI-generated .pt2 package file, users can manually build the generated model code in their local environment.

Differential Revision: [D67458526](https://our.internmc.facebook.com/intern/diff/D67458526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143352
Approved by: https://github.com/malfet
2024-12-19 22:01:05 +00:00
e9bd74d763 Revert "[export] don't decompose custom triton op when exporting (#142426)"
This reverts commit 10b9c5944e8d6ff0685e1ef25277a1d3c4c9c5aa.

Reverted https://github.com/pytorch/pytorch/pull/142426 on behalf of https://github.com/huydhn due to This fails one internal MTIA test, checking with the author that we need to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/142426#issuecomment-2555793496))
2024-12-19 21:21:38 +00:00
fc03c62c56 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (#142062)
Related: #125914 (specifically see [comment](https://github.com/pytorch/pytorch/issues/125914#issuecomment-2513044125))

This PR addresses two broken things involving the usage of unbacked SymInts for calls to `slice()` with data-dependent bounds. These issues are encountered in practice for `narrow()` operating on the batch dim with an NJT input, but apply to other subclasses as well. The test in this PR uses a purpose-built subclass.

There are two different issues here, depending on whether `torch.compile()` is called with `dynamic=True`. In practice, these only occur when the unbacked SymInts are created within the torch_dispatch implementation of a subclass, because the unbacked symbols are considered "freshly created" when the output subclass instance is handled in Dynamo.

**Error 1 (dynamic=False):**
```
LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(-Min(22, Max(0, u0)) + Min(22, Max(u0 + u1, Max(0, u0))), 0) (unhinted: Eq(-Min(s0, Max(0, u0)) + Min(s0, Max(u0 + u1, Max(0, u0))), 0)).  (Size-like symbols: u1, u0)
```

The expression comes from the use of `clamp()` logic for `SliceView` in Inductor:
41e59754b4/torch/_inductor/ir.py (L3014)

If the (start, end) bounds for the `slice()` are statically known to be in range for the given dim (e.g. provided via `torch._check()` calls), we can avoid this `clamp()` logic and the error. This PR implements this fix.
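
A hedged sketch of providing such bounds (the function and tensor names are illustrative):

```python
import torch

def narrow_batch(values, offsets, i):
    start = offsets[i].item()              # unbacked SymInt under compile
    length = offsets[i + 1].item() - start
    # Tell the compiler the bounds are in range so Inductor can skip the clamp.
    torch._check(start >= 0)
    torch._check(start + length <= values.size(0))
    return values[start : start + length]
```
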

**Error 2 (dynamic=True):**
```
torch._dynamo.exc.InternalTorchDynamoError: PendingUnbackedSymbolNotFound: Pending unbacked symbols {u0} not in returned outputs NestedTensor(size=(2, s16, s1), offsets=FakeTensor(..., device='cuda:0', size=(3,), dtype=torch.int64), grad_fn=<NarrowBackwardAutogradNestedTensor0 object at 0x7f1f8603cfd0>, contiguous=True) ((s1*s16, s1, 1), s1*u0)
```

The storage offset of the values component of the returned NJT is `s1*u0` where `s1` is known to be an integer. This PR expands the special logic handling the `constant * u0` case to handle SymInts as well:
314e08eb52/torch/fx/experimental/symbolic_shapes.py (L1013-L1031)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142062
Approved by: https://github.com/ezyang
ghstack dependencies: #143526
2024-12-19 21:08:04 +00:00
0b2c47962c Add support for differentiable LR in SGD + test v2.0 (#143510)
Second PR in a larger project to broaden support for differentiable optimizers with @janeyx99! The first one had an issue near the end, so this is the second PR on that subject. See #143122 for the development up until this point.
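
A minimal sketch, assuming the new tensor-lr path together with the existing `differentiable=True` option:

```python
import torch

w = torch.randn(3, requires_grad=True)
lr = torch.tensor(0.1, requires_grad=True)  # learnable learning rate
opt = torch.optim.SGD([w], lr=lr, differentiable=True)

loss = (w ** 2).sum()
loss.backward(create_graph=True)
opt.step()  # the update runs under autograd, so gradients can flow back into `lr`
```
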
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143510
Approved by: https://github.com/janeyx99
2024-12-19 21:04:44 +00:00
629de4da60 [dynamo] Add a lint rule to restrict what 3P library one can import (#143312)
As title, this patch prevents developers from importing third party
libraries to patch things in Dynamo, unless there's no other easy
workaround (in which case one would add the library to the allowlist in
`import_linter.py`, as instructed by the lint error).

For instance, if we remove `einops` from the allowlist, we'd get this
```verbatim
>>> Lint for torch/_dynamo/decorators.py:

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        608  |# Note: this carefully avoids eagerly import einops.
        609  |# TODO: we should delete this whole _allow_in_graph_einops logic by approximately 2024 Q2
        610  |def _allow_in_graph_einops():
    >>> 611  |    import einops
        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0

  Error (IMPORT) Disallowed import

    importing from einops is not allowed, if you believe there's a valid
    reason, please add it to import_linter.py

        612  |
        613  |    try:
        614  |        # requires einops > 0.6.1, torch >= 2.0
    >>> 615  |        from einops._torch_specific import (  # type: ignore[attr-defined]  # noqa: F401
        616  |            _ops_were_registered_in_torchdynamo,
        617  |        )
        618  |
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143312
Approved by: https://github.com/zou3519
2024-12-19 20:59:16 +00:00
8e78345d69 remove allow-untyped-defs from distributed/tensor/experimental/__init__.py (#143583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143583
Approved by: https://github.com/awgu
2024-12-19 20:25:28 +00:00
0a7dba4978 [cond] Change Autograd for cond (#142518)
Instead of returning None for unused variables, a tensor with all-zeros is returned.
Fixes [141301](https://github.com/pytorch/pytorch/issues/141301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142518
Approved by: https://github.com/ydwu4
2024-12-19 20:09:42 +00:00
8850a7b62c add some logging for tensorify (#143391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143391
Approved by: https://github.com/jamesjwu
2024-12-19 20:06:26 +00:00
25172dc075 remove allow-untyped-defs from torch/ao/quantization/experimental/fake_quantize_function.py (#143582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143582
Approved by: https://github.com/XuehaiPan, https://github.com/laithsakka
2024-12-19 20:06:22 +00:00
2d150ad29f [ROCm] Fix unit test: matmul_offline_mgpu_tunableop (#143507)
Fixes #141652

This PR contains:

- Fix for `matmul_offline_mgpu_tunableop`
- Modifications to _checking_tuning_assertions to enable TunableOp if it is disabled. Also moved it into the concurrent futures initializer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143507
Approved by: https://github.com/jeffdaily
2024-12-19 19:48:20 +00:00
66172578f9 [ROCm] Guard triton backend call around cuda.is_available (#143570)
To resolve: https://github.com/pytorch/test-infra/issues/6082

Calling into Triton's get_backend_options will initialise CUDA and break CPU-only environments that may have hip installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143570
Approved by: https://github.com/atalman, https://github.com/jeffdaily
2024-12-19 19:46:13 +00:00
c46cfc245f [Dynamo] Support dict_keys from nested dict object (#143557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143557
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374, #143547
2024-12-19 19:02:55 +00:00
5fa287aa82 [Dynamo] Rename Dict{View/Keys/Values} to Dict{View/Keys/Values}Variable (#143547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143547
Approved by: https://github.com/williamwen42
ghstack dependencies: #143374
2024-12-19 19:02:55 +00:00
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode:  40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode:  80 t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
c5ddf5dd90 Unbacked SymInt fixes for subclasses + data-dependent slice() bounds (non-dynamic) (#143526)
Lifted non-controversial (non-dynamic) fixes from #142062. See description there for context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143526
Approved by: https://github.com/ezyang
2024-12-19 18:46:36 +00:00
2a11472f46 update expected results (#143586)
update results based on a small regression added by
17b71e5d6a

The max regression was 1.25%, for sum_floor_div.
<img width="842" alt="Screenshot 2024-12-19 at 9 04 30 AM" src="https://github.com/user-attachments/assets/6ce913cd-110d-4837-af59-08fb6a0dd12d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143586
Approved by: https://github.com/bobrenjc93
2024-12-19 18:43:27 +00:00
e1e83015d2 [dynamo, 3.13t] raise error if torch.compile is attempted in 3.13t (nogil) (#143404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143404
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-12-19 18:10:01 +00:00
33c27be017 Workaround for gather_out in MPS backend (#135543)
Avoids an underlying issue in reshape op in MPS that gets triggered when the input has multiple dimensions but the shape can be squeezed into 1D. The underlying issue is going to get fixed eventually.

Fixes https://github.com/pytorch/pytorch/issues/135240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135543
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-19 18:01:01 +00:00
1433bad0e4 torch export programming model (#143546)
Differential Revision: [D67429743](https://our.internmc.facebook.com/intern/diff/D67429743/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143546
Approved by: https://github.com/ydwu4
2024-12-19 16:56:13 +00:00
61a835ec53 Corrected description of AMSGrad algorithm (#142351)
Fixes #142323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142351
Approved by: https://github.com/janeyx99
2024-12-19 16:24:19 +00:00
171e6a934f Don't 1 specialize if stride is contiguous (#143365)
Fixes: https://github.com/pytorch/pytorch/issues/142024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143365
Approved by: https://github.com/ezyang
2024-12-19 15:22:47 +00:00
465f282a24 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do, because it will catch the case where a user changes the flag in the same process when compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-19 15:16:10 +00:00
288aa87383 [Inductor][CPU] disable bernoulli_p decomposition (#143460)
Fix https://github.com/pytorch/pytorch/issues/142853
`fallback_random=True` should cause RNG to match between compile and eager (by having compile fall back to eager for RNG ops), but the `bernoulli_p` decomposition is not fully consistent with the eager CPU implementation.
We remove the decomp and keep the version for `fallback_random=False`.
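
An illustrative sketch of the `fallback_random` contract referenced above (using `torch.bernoulli` as a stand-in RNG op):

```python
import torch
import torch._inductor.config as inductor_config

def f(x):
    return torch.bernoulli(torch.full_like(x, 0.5))

torch.manual_seed(0)
eager = f(torch.ones(1024))

inductor_config.fallback_random = True  # compile falls back to eager for RNG ops
torch.manual_seed(0)
compiled = torch.compile(f)(torch.ones(1024))
# with the same seed the two results should now match elementwise
print(torch.equal(eager, compiled))
```
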

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143460
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-19 11:21:35 +00:00
fd8b217fcd Pass allow_rhs_unbacked to the stride test in metadata test too (#143040)
Fixes https://github.com/pytorch/pytorch/issues/142410

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143040
Approved by: https://github.com/bobrenjc93
2024-12-19 09:37:50 +00:00
451c233936 leaking c++ singleton specifically (#143509)
Summary:
fix forward for S477887

leaking c++ singleton specifically

When C++ shuts down, it tries to destruct the singleton and acquire the GIL; at that point the Python runtime has already exited, causing undefined behavior.
We leak the singleton here specifically so that we won't try to destroy it during the shutdown phase.

Test Plan: n/a

Differential Revision: D67400633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509
Approved by: https://github.com/c-p-i-o
2024-12-19 09:27:07 +00:00
da06d47bdb dynamo tracing perf: slight improvement on __instancecheck__: 47.77 -> 47.62 (#143064)
See #143056 for overall docs.

This PR: Switch out an `isinstance()` for an `is` in the very hot
`VariableTrackerMeta.__instancecheck__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143064
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-19 09:19:35 +00:00
a97c6a78a8 Upgrade submodule ideep for bf16f32 matmul changes (#143508)
This change will enable PR #140159 to pick the proper kernels in bf16 mode for the SDPA layer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143508
Approved by: https://github.com/yanbing-j, https://github.com/jgong5
2024-12-19 06:49:16 +00:00
2ffdcab04c [Dynamo] Add DictKeySetVariable to capture dict_keys passed outside of compiled region (#143374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143374
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-19 06:39:27 +00:00
fa1a4a91e9 add batch_size check for max_pool2d_backward (#141657)
Fix https://github.com/pytorch/pytorch/issues/140923.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141657
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-19 06:01:41 +00:00
a7ba562ec8 [state dict] Change _load_model_state_dict to enable cpu_offload, accept 2 device type and optimize memory (#142845)
For the distributed state dict API [migration](https://github.com/pytorch/torchtune/pull/2138), make the following changes:
1. `load_from_full_model_state_dict` in TorchTune calls `set_model_state_dict` with an option controlling whether to use cpu_offload. Add cpu_offload to `_load_model_state_dict` to move processing to CPU when the option is True (see the sketch after this list).
2. Change the device check: lora_finetune might have 2 device types, so accept that as valid.
3. Some changes to optimize the memory performance:
3.1 use `.detach().clone()` instead of a view directly
3.2 if local_state is not meta, copy `full_tensor[slices]` to `ret.to_local()`
4. add related unit tests
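
A hedged sketch of the `cpu_offload` path from the API side (a single-process, non-sharded model is used purely for illustration):

```python
import torch
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)

model = torch.nn.Linear(4, 4)
full_sd = {k: v.clone() for k, v in model.state_dict().items()}

# cpu_offload=True asks the load path to stage tensors on CPU before copying in.
set_model_state_dict(model, full_sd, options=StateDictOptions(cpu_offload=True))
```
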

Memory performance calling from TorchTune with llama2/7B_full:
1. cpu_offload = True
<img width="555" alt="Screenshot 2024-12-18 at 1 36 47 PM" src="https://github.com/user-attachments/assets/429261f5-1107-4592-b295-de3944a2614b" />

2. cpu_offload = False
<img width="555" alt="Screenshot 2024-12-18 at 1 36 52 PM" src="https://github.com/user-attachments/assets/40bf281a-236a-4218-826b-b1192a10c806" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142845
Approved by: https://github.com/fegin
2024-12-19 05:06:41 +00:00
e4301aeaa5 [ODML] Make the ML feature provider thread safe (#143418)
Summary:
This PR is generated from a Meta-internal diff, aiming to resolve a crash caused by a race condition on the dictionary.

Test Plan:

Build and run

Print out the count/name/value of the dictionary and check whether the values are retrieved/set/removed correctly.

Observe the print statement on app start within IG

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418
Approved by: https://github.com/shoumikhin
2024-12-19 04:47:56 +00:00
bf44d5bfb5 [Inductor] move custom pre pass (#143458)
Fixes #143363.

Move `joint_custom_pre` pass after `remove_noop_ops`/`constant_folding`, in order to get the same behavior as `pattern_matcher`.
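
A hedged sketch of wiring up such a pass (the config attribute name is assumed):

```python
import torch
import torch._inductor.config as inductor_config

def my_joint_pre_pass(graph: torch.fx.Graph) -> None:
    # After this change, the pass runs once remove_noop_ops/constant_folding
    # have already cleaned up the joint graph.
    for node in graph.nodes:
        pass  # inspect or rewrite nodes here

inductor_config.joint_custom_pre_pass = my_joint_pre_pass
```
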

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143458
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-12-19 04:41:20 +00:00
deb1da15cc [foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454)
Adds a foreach_map backed Adam to compiled optimizer tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-12-19 03:16:47 +00:00
19d8bbafb2 Update release matrix for 2.6 (#143538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143538
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-19 02:02:04 +00:00
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c28a2c187cbedfd8f60b84a4dfa2d6bf.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
2c48af568a [CUDA][64-bit indexing] Fix some existing problematic int64_t _ = blockIdx.* * blockDim.* code (#142010)
`grep` didn't surface any `blockIdx.z * blockDim.z` cases
```
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.x \* blockDim.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
git grep -l "int64_t.*=.*blockIdx.x \* blockDim.x.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.x + blockIdx.x \* blockDim.x;.*/int64_t \1 = threadIdx.x + ((int64_t) blockIdx.x) * blockDim.x;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = blockIdx.y \* blockDim.y + threadIdx.y;.*/int64_t \1 = ((int64_t) blockIdx.y) * blockDim.y + threadIdx.y;/g'
git grep -l "int64_t.*=.*blockIdx.y \* blockDim.y.*" | xargs sed -i 's/int64_t \(.*\) = threadIdx.y + blockIdx.y \* blockDim.y;.*/int64_t \1 = threadIdx.y + ((int64_t) blockIdx.y) * blockDim.y;/g'
git grep -l "int64_t.*=.*blockDim.x \* blockIdx.x.*" | xargs sed -i 's/int64_t \(.*\) = blockDim.x \* blockIdx.x + threadIdx.x;.*/int64_t \1 = ((int64_t) blockIdx.x) * blockDim.x + threadIdx.x;/g'
```

See also https://github.com/pytorch/pytorch/pull/141922/files#r1868262823 in #141999 and #141922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142010
Approved by: https://github.com/ngimel
2024-12-19 00:55:11 +00:00
b4e0e3bfa3 Backout D66648013 (#143433)
Summary:
backing out https://www.internalfb.com/diff/D66648013 (see comments there for justification)

I will reland and disallow the bfloat16 atomics behavior on A100 because it causes a pretty significant performance regression.

Test Plan: This is a revert

Differential Revision: D67357485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143433
Approved by: https://github.com/davidberard98
2024-12-19 00:53:49 +00:00
5c3996cab2 [Dynamo] topologically sort duplicated graph regions (#143523)
Ensure regions are topologically sorted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143523
Approved by: https://github.com/williamwen42
2024-12-19 00:43:48 +00:00
55092e1ec5 [BE] Delete install sccache step from MacBB (#143512)
To the best of my knowledge, this step never executed, and there have been no MacOS binary builds running on trunk for a while
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143512
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395, #143511
2024-12-19 00:41:28 +00:00
5e172ea004 [BE] Get rid of malfet/checkout@silent-checkout (#143516)
Instead use `actions/checkout@v4` with `show-progress: false`. It's more verbose than the quiet option, but our logs are long anyway...

Partially addresses https://github.com/pytorch/pytorch/issues/143079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143516
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/huydhn
2024-12-19 00:36:36 +00:00
f9da639950 [codemod] Fix a few unused-variable issues in pytorch (#143517)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143517
Approved by: https://github.com/mhorowitz
2024-12-19 00:18:08 +00:00
b23f11c529 [ONNX] Automatically convert dynamic_axes to dynamic_shapes with torch.export.Dim.AUTO (#143158)
With https://github.com/pytorch/pytorch/pull/133620 introducing Dim.AUTO, we can now automatically convert dynamic_axes to dynamic_shapes without specifying min and max. However, exporting could still crash when the same specs are shared between inputs, and there is no guarantee that the axes will be dynamic (see PR description).
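
A rough illustration of that conversion (a sketch of the idea, not the exporter's actual code; the input name and axis labels below are made up):

```python
import torch

dynamic_axes = {"input_ids": {0: "batch", 1: "sequence"}}

# Every axis named in dynamic_axes becomes an automatically inferred dim.
dynamic_shapes = {
    name: {axis: torch.export.Dim.AUTO for axis in axes}
    for name, axes in dynamic_axes.items()
}
```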

~~Therefore, a~~ follow-up PR should create a post-processing ONNX side pass to ~~enable the missed dynamic axes~~ rename the dynamic shapes (s0,  s1, ...) to dynamic_axes (user setting names).

This PR does:
(1) Apply torch.export.Dim.AUTO to dynamic_axes when dynamic_shapes is not provided.
(2) Convert args/kwargs to tuple inputs, which follows the generated dynamic_shapes format to avoid errors during torch.export.export.
(3) Avoid KeyError in the _rename_dynamic_shapes_with_model_inputs function.
(4) Add real world case of a HF model with kv_cache to test on ONNX exporter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143158
Approved by: https://github.com/xadupre, https://github.com/shubhambhokare1
2024-12-18 23:49:01 +00:00
15a7a0c37e Remove deprecated branch after capture_pre_autograd_graph fully migrate to training IR (#143228)
Summary:
as title

#buildall

Test Plan: CI

Differential Revision: D67222286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143228
Approved by: https://github.com/andrewor14
2024-12-18 23:30:45 +00:00
58627fb6bf [BE] Integrate 5 line build script into template (#143511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143511
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
ghstack dependencies: #143395
2024-12-18 23:27:09 +00:00
4eafbe5288 [Dynamo] Flatten slices during graph deduplication (#143522)
I encountered this issue while debugging torchtune - overall we need to make sure to not miss nodes that are slice arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143522
Approved by: https://github.com/williamwen42
2024-12-18 23:12:34 +00:00
5380407af5 [dynamo] Properly model root frame globals during inlining (#143447)
This patch updates `InliningInstructionTranslator.STORE_GLOBAL` to
properly check whether `self.f_globals` is the same as root frame
`f_globals`. See added comments for why this is important.

Fixes #143425.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143447
Approved by: https://github.com/zou3519
2024-12-18 23:04:02 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
d298bd840f [dynamo] add two-point iter test (#143500)
Implements the last checkbox for https://github.com/pytorch/pytorch/issues/112532.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143500
Approved by: https://github.com/StrongerXi
2024-12-18 22:55:46 +00:00
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
4717cd1ce9 Skip test_conv2d_linear_add_broadcast_shapes_cpu on fbcode (#143530)
Summary: The test is added by D67376995 and it is failing on fbcode

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'`

Differential Revision: D67413687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530
Approved by: https://github.com/jansel
2024-12-18 22:08:08 +00:00
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v is larger than 1e16, the formatting is wrong. Example: if v is 1.2e17, the output is 1.2e17.f, which has two '.' characters

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
10b9c5944e [export] don't decompose custom triton op when exporting (#142426)
For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable.

#### The alternative:
If we decompose the custom op to a functional hop and make it a node in the exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jit-ed Python functions and triton dtypes. This is undesirable because:
- it can be tedious to maintain a layer that serializes the jit-ed function (e.g. with a string) and dtypes.
- changes to triton or the serialization logic for triton arguments can be BC breaking
- exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction.

#### Future plans:
After this PR, in the short term, we expect users to have a separate aot_compile stage that compiles the exported program into a Cubin file **on the same machine where users call export**, which does autotuning, removes the triton dependency, and serves the model with the Cubin. This guarantees that triton changes won't break BC.

In the long term, we may export multiple cubins for the triton op directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426
Approved by: https://github.com/zou3519
ghstack dependencies: #142425
2024-12-18 21:36:28 +00:00
1e201422ed [export] add is_exporting flag (#142425)
We added an is_exporting flag under torch.compiler.is_exporting. This comes in handy when we need special logic at the user level and system level (e.g. higher up in the stack); a usage sketch follows the list below.

In increasing-scope:
- `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx.
- `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True.
- `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile.
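
A minimal usage sketch of the new flag (the module here is illustrative, not code from this PR):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Take an export-only path while torch.export is tracing this module.
        if torch.compiler.is_exporting():
            return x + 1
        return x - 1

ep = torch.export.export(M(), (torch.randn(2),))
```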

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425
Approved by: https://github.com/avikchaudhuri
2024-12-18 21:36:28 +00:00
894d47b91b [ROCm] Fix unit test: matmul_offline_tunableop (#143322)
Fixes #137936

The PR contains:
* Fix for `matmul_offline_tunableop`
* Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`)
* Avoid the use of environment variables in `minimum_tuning_iteration_tunableop`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322
Approved by: https://github.com/jeffdaily
2024-12-18 20:14:44 +00:00
255a977494 [1/N] Avoid const_cast (#143169)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143169
Approved by: https://github.com/albanD
2024-12-18 19:48:01 +00:00
f129bcb5a5 [BE] Refactor argument parsing into its own function (#143395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143395
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere
2024-12-18 19:42:49 +00:00
8d4926e30a Fix unused variables in test/torch.py (#143399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143399
Approved by: https://github.com/albanD
2024-12-18 17:57:24 +00:00
863e6e4567 Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670)
Fix https://github.com/pytorch/pytorch/issues/141447.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:46:26 +00:00
b588a78ca3 add grad_output shape check for adaptive_max_pool2d_backward and adaptive_max_pool3d_backward (#141663)
Fix https://github.com/pytorch/pytorch/issues/141099, https://github.com/pytorch/pytorch/issues/141100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141663
Approved by: https://github.com/mingfeima, https://github.com/malfet
2024-12-18 17:44:27 +00:00
93e8e32708 Remove iOS folder (#143398)
This folder is a tutorial, not packaged with PyTorch, that shows how to use the now-deprecated Lite Interpreter

People should be using Executorch instead and there's already good documentation on it all over our tutorials and main homepage

Testing to see what breaks in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398
Approved by: https://github.com/albanD
2024-12-18 17:25:52 +00:00
ed9931e6ee Add tests for non divisible inputs for flex decoding (#143214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143214
Approved by: https://github.com/drisspg
2024-12-18 16:32:45 +00:00
0e8013fc1c [AOTI] Fix a typo in cpp_builder.py (#143351)
Summary: passthough -> passthrough

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143351
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
ghstack dependencies: #143350
2024-12-18 16:28:37 +00:00
a2092665a9 [AOTI] Refactor path operations in AotCodeCompiler (#143350)
Summary: Use safer pathlib operation instead of direct string manipulation; Update some path naming to make them more meaningful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143350
Approved by: https://github.com/yushangdi, https://github.com/chenyang78
2024-12-18 16:28:37 +00:00
24a18d76c8 [MPS] Use metal shaders for all view ops (#143375)
Before this PR, Metal shaders were used to scatter/gather 1-5 dimensional tensors.
This PR introduces generalized ones that can be used for any dimensionality and, as a result, gets rid of 700+ lines of complex and untested code that might not even work as expected.
Generalized gather shader looks as follows
```metal
kernel void gather_kernel_n(uint linear_index           [[thread_position_in_grid]],
                            constant void * src_        [[buffer(0)]],
                            device void * dst_          [[buffer(1)]],
                            constant uint32_t * size    [[buffer(2)]],
                            constant uint32_t * stride  [[buffer(3)]],
                            constant uint32_t & numel   [[buffer(4)]],
                            constant int32_t & ndim     [[buffer(5)]]) {{
    if (linear_index >= numel) return;

    constant {0} * src = (constant {0} *)src_;
    device {1} * dst = (device {1} *)dst_;

    uint64_t src_offs = 0;
    auto src_idx = linear_index;
    for(int dim = ndim - 1; dim >= 0; --dim) {{
      src_offs += stride[dim] * (src_idx % size[dim]);
      src_idx /= size[dim];
    }}

    dst[linear_index] = cast<{1}>(src[src_offs]);
}}
```

Which, according to the following benchmark
```python
from timeit import default_timer

import torch
import torch.utils.cpp_extension
from torch.utils.benchmark import Measurement, Timer

t = Timer(
    stmt=f"y.copy_(x);torch.mps.synchronize()",
    setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)",
    language="python", timer=default_timer
)
print(t.blocked_autorange())
```
Is almost twice as fast as the previous implementation (i.e. on a MacBook M2 Pro it returns 2.9ms for the MPS version vs 1.5ms for the shader one).

On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below
```swift
import Metal
import MetalPerformanceShadersGraph

func gatherComplexMPS(device: MTLDevice,
                inp_buf: MTLBuffer, idx_buf: MTLBuffer,
                out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4)
```

Fixes https://github.com/pytorch/pytorch/issues/143140
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375
Approved by: https://github.com/albanD
2024-12-18 16:15:46 +00:00
f47aac6bc2 Make Context to be Device-agnostic Step by Step (3/N) (#137578)
Detailed Descriptions:
- Using unified Device-agnostic API to create new generator for accelerator.
- Add deprecated info for GeneratorForPrivateuseone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-12-18 15:12:19 +00:00
80a42399bb Various fix for memory leak in test autograd and dataloader (#143323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323
Approved by: https://github.com/andrewkho, https://github.com/soulitzer
ghstack dependencies: #143225
2024-12-18 13:56:59 +00:00
84b91ce4a1 remove allow-untyped-defs for torch/_inductor/test_operators.py (#143436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143436
Approved by: https://github.com/aorenste
2024-12-18 12:54:25 +00:00
d8ea4ce631 [reland] Kill capture_pre_autograd_graph API (#143426)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

Update XLA pin to include https://github.com/pytorch/xla/pull/8398

There's no more call sites to `capture_pre_autograd_graph`.

Except
1) two test cases in coreml, guarded by version guard, PR to remove: https://github.com/apple/coremltools/pull/2400
2) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Differential Revision: D67354440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143426
Approved by: https://github.com/gmagogsfm
2024-12-18 12:07:09 +00:00
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot.

In next diff, we will integrate the mtia backend with profiler python api

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
993b2f0ee0 Fix unused variables in test/test_transformers.py (#143407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407
Approved by: https://github.com/drisspg
2024-12-18 09:59:24 +00:00
8dd380803c remove allow-untyped-defs for torch/_functorch/batch_norm_replacement.py (#143438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143438
Approved by: https://github.com/oulgen
2024-12-18 09:01:06 +00:00
75fe5a3ef7 remove allow-untyped-defs for torch/fx/experimental/debug.py (#143439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143439
Approved by: https://github.com/oulgen
2024-12-18 08:55:46 +00:00
03991798ca remove allow-untyped-defs for torch/nn/parallel/__init__.py (#143437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143437
Approved by: https://github.com/oulgen
2024-12-18 08:50:37 +00:00
a99536480d [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to return NaN early if it is known that the result at this value will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet, https://github.com/eqy
2024-12-18 08:30:08 +00:00
2ea4b56ec8 Record min/max of integral tensor in ET (#143088)
Summary:
In et-replay, random data is used to run the operators. However, this does not work well for ops that use indices to access tensors, for example embedding ops, which use the indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issues.

To fix it, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE". If it is set, ET will capture the min/max values of the flattened integral tensor. Then in et_replay, the min/max is used to generate the random tensor within that range, which fixes the invalid memory access issue.
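
A sketch of the replay-side idea, assuming the environment variable named above was set when the trace was collected; `make_index_tensor` is a made-up helper here, not the et_replay API:

```python
import os
import torch

# Trace-collection side: enable min/max capture for integral tensors.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE"] = "1"

def make_index_tensor(shape, tmin, tmax):
    # Replay side: draw indices only from the recorded [tmin, tmax] range,
    # so lookups (e.g. embedding indices) stay in bounds.
    return torch.randint(tmin, tmax + 1, shape, dtype=torch.int64)

indices = make_index_tensor((8,), tmin=0, tmax=999)  # safe for a 1000-row table
```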

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda

Differential Revision: D66666931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088
Approved by: https://github.com/sanrise
2024-12-18 08:20:35 +00:00
bceedeec2b fix checking non-trivial input constraints (#143442)
A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env.
```
... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ...
```

I've also observed this bug internally.

This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we
1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol
2. check directly if the placeholder node dim is concrete
3. otherwise defer to run time.

Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442
Approved by: https://github.com/tugsbayasgalan
2024-12-18 07:29:08 +00:00
90cc43f270 Support garbage collection after pt2 compilation (#143364)
Summary:
Support garbage collection after pt2 compilation.
Add jk to control the global rollout / rollback of this functionality
Add env var to control individual job's rollout

Test Plan:
Test the model training job with / without these changes

Reviewers:
@yuxihu @ezyang , @Yuzhen11 ,

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364
Approved by: https://github.com/ezyang
2024-12-18 07:25:11 +00:00
9275091d6e [provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055)
Summary:
This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code:
`{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.`

Example paste: P1695235000 verified on the test model.  See "Test Plan":

We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool:
https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab)

https://pxl.cl/66BzK

Note: Currently this only supports mapping for Inductor's `TritonKernel` type. TODO: add support for `ExternKernel` and other Inductor-generated kernel types.

Test Plan:
test_model_coverage.sh:
```
#!/bin/sh
MODEL_ENTITY_ID=644688112
SNAPSHOT_ID=32
MODULE=merge

# buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark

TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 | tee output.txt
```
 {F1973765026}

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)'
```

```
TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor
```

Differential Revision: D66967510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055
Approved by: https://github.com/chenyang78
2024-12-18 06:51:50 +00:00
6829897682 Remove assert from partitioner.py (#143376)
Remove an erroneous assert that assumed a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert.

Tested locally with a failing ExecuTorch Arm test using
```
$ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376
Approved by: https://github.com/tarun292
2024-12-18 06:08:19 +00:00
6715a8858a Triton bump for 3.2 cherry-picks (device context) (#143409)
Summary:
* https://github.com/triton-lang/triton/pull/3731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143409
Approved by: https://github.com/atalman
2024-12-18 05:17:29 +00:00
c17a07ade3 Add float8 support in serde schema (#143343)
Summary:
Fix https://github.com/pytorch/pytorch/issues/141316

Bump up schema minor version.

as title, add float8 support in serde schema

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_serialize_float8
```

Differential Revision: D67307670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343
Approved by: https://github.com/yiming0416
2024-12-18 05:07:21 +00:00
576789197a Add support for CPU scalar in addcmul (#143264)
Step required for performance in #143122

Adds support for CPU scalar for tensor_2 in addcmul. For example:
```
import torch
a = torch.rand(2, 2, device="cuda")
b = torch.tensor(1e-3)

torch.add(a, b)
torch.addcmul(a, a, b)  # used to fail, now works
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-12-18 04:43:29 +00:00
859be14c4e fix a few int64_t index computations, fix complex128 scan that had to… (#143401)
…o few threads
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143401
Approved by: https://github.com/eqy
2024-12-18 04:27:27 +00:00
c947a7d38e Fix unused Python variables in test/nn (#143396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396
Approved by: https://github.com/mikaylagawarecki
2024-12-18 03:30:54 +00:00
17a6d4b882 remove allow-untyped-defs for torch/_export/passes/remove_runtime_assertions.py (#143435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143435
Approved by: https://github.com/oulgen
2024-12-18 03:05:20 +00:00
a9de6a68f4 [CD] Test that all PyTorch wheels support OpenMP (#143394)
Together with https://github.com/pytorch/pytorch/pull/143393 fixes https://github.com/pytorch/pytorch/issues/123225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143394
Approved by: https://github.com/atalman
ghstack dependencies: #143393
2024-12-18 02:27:55 +00:00
2400db115c Use Manylinux 2.28 for nightly build and cxx11-abi (#143423)
As per: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581

Linux Builds: CPU, CUDA 11.8, CUDA 12.4 switched to Manylinux 2.28 and D_GLIBCXX_USE_CXX11_ABI=1 on the week of Dec 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143423
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere
2024-12-18 02:02:58 +00:00
e890d67543 Use process pool for precompilation of triton templates (#142450)
Perf results: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2003%20Dec%202024%2022%3A57%3A51%20GMT&stopTime=Tue%2C%2010%20Dec%202024%2022%3A57%3A51%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/740/head&lCommit=b925256c29ec43e1933e4ede94b16d1f404b595f&rBranch=gh/eellison/740/base&rCommit=a161d6362f7d9db773322d2ce2a3a70aabbecf4b

Training:
<img width="793" alt="image" src="https://github.com/user-attachments/assets/75f5bc0d-8005-4213-ae88-0b94fb187dfc" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142450
Approved by: https://github.com/jansel
2024-12-18 01:48:04 +00:00
c06b5048ba [Inductor] Fix _can_be_inplace function (#143279)
Summary:
Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`.

Fix https://github.com/pytorch/pytorch/issues/143280.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-12-18 00:26:05 +00:00
6cd96f069b Add warning to torch.jit.load (#143403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143403
Approved by: https://github.com/albanD
ghstack dependencies: #143326
2024-12-18 00:17:41 +00:00
ac8342f881 Prevent torch.jit.load path in torch.load when weights_only=True (#143326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326
Approved by: https://github.com/albanD
2024-12-18 00:17:41 +00:00
13a5c15ef5 Fix sample inputs leaked from subtest (#143415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143415
Approved by: https://github.com/jbschlosser
ghstack dependencies: #143333
2024-12-18 00:15:18 +00:00
3f99682fbd NJT linear_backward should not return inner tensor as-is (#143333)
Fixes debug=1 use-count checks https://github.com/pytorch/pytorch/actions/runs/12187808902/job/34002323481#step:22:2521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143333
Approved by: https://github.com/jbschlosser
2024-12-18 00:15:18 +00:00
feb4818bc9 [SJD] adding kill logic for current process when killing a worker (#141060)
Summary:
we have seen cases where some workers don't receive stop signals, meaning watchdog isn't stopped accordingly. this diff introduces logic to kill the current pid alongside the worker pid

Something to note: there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason, which will result in the current pid also not being killed. This seems okay since the watchdog loop will just attempt to kill the worker pid on the next iteration, but I wanted to point this out.

Test Plan: experiment in next diff shows this works

Differential Revision: D65837085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060
Approved by: https://github.com/gag1jain
2024-12-18 00:13:02 +00:00
efe21ee59d [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347)
Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage
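
A hedged usage sketch, assuming the MTIA hook mirrors the CUDA-style memory stats API (the commit only states that `max_memory_allocated` is implemented for MTIA devices):

```python
import torch

if torch.mtia.is_available():
    peak_bytes = torch.mtia.max_memory_allocated()
    print(f"peak device DRAM usage: {peak_bytes} bytes")
```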

Test Plan:
Passed the local unit test
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/8444249544807192

Reviewed By: yuhc, egienvalue

Differential Revision: D67118173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347
Approved by: https://github.com/nautsimon
2024-12-17 23:37:03 +00:00
a040006da7 Force symlink creation when building python on s390x (#143195)
Sometimes it exists already when building on s390x

This change should fix docker image build on s390x.
Example of error can be found here:
https://github.com/pytorch/pytorch/actions/runs/12282230596/job/34365267303
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143195
Approved by: https://github.com/ezyang
2024-12-17 23:01:47 +00:00
2642bbc6dc [CD] Run smoke tests on MacOS wheel (#143393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143393
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-12-17 22:47:07 +00:00
b247f87845 tools: Add a tool to build wheels for multiple python versions (#143361)
Adds a tool to build bdist_wheels sequentially for multiple different
python versions (if specified).

The goal of this tool is to eventually be able to utilize this in our
binary build runs to significantly reduce the amount of time we take to
build packages by utilizing a local ccache from the first build.

Tested locally using the following:
```
$ ccache -C # clear cache
# -p could actually reference any python interpreter
$ python tools/packaging/build_wheel.py \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/bin/python3.12 \
	-p /home/eliuriegas/.local/share/uv/python/cpython-3.13.0-linux-x86_64-gnu/bin/python3.13 \
	-d dist-multi/
...
2024-12-17 10:48:11,365 - INFO - Build time (3.12.7): 571.440689s
2024-12-17 10:48:11,365 - INFO - Build time (3.13.0): 191.147503s
```

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143361
Approved by: https://github.com/malfet, https://github.com/atalman
2024-12-17 21:56:06 +00:00
1e058a8f38 FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread -- in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode. It will wait up to 1s for the server to start, which balances fast error messages while still providing some wiggle room on the server side.
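
A standalone sketch of the retry idea (not the actual `FileTimerClient` code; the error codes and timings here are assumptions):

```python
import errno
import os
import time

def open_fifo_with_retry(path, timeout_s=1.0, interval_s=0.01):
    # Opening the writer side of a FIFO non-blocking fails until a reader exists,
    # so poll briefly while the server thread finishes binding.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError as e:
            if e.errno not in (errno.ENXIO, errno.ENOENT) or time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)
```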

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 21:48:30 +00:00
aabe285aaf Add 2 more APIs to the exposed public torch python APIs (#143380)
These two APIs are being used internally for some projects and need to be exposed as the build for this is done using OSS toolchain.

af8789c056 - this change hid most APIs in torch python, barring the ones explicitly specified, which broke the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143380
Approved by: https://github.com/suo
2024-12-17 21:16:51 +00:00
0bdc173ab6 [fr] recognize all_reduce_barrier as a valid op (#143354)
Summary:
D67068632 introduced a better profiling name for barrier operations to be able to distinguish various ops.

Unfortunately, this broke Flight Recorder Analysis with the following error as reported by dmwu
```
fr_trace -m torchx-param_bench_16g_mi300x-all_to_all -a 0 --mast_job_version 98 -w 16
Traceback (most recent call last):
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.10/runpy.py", line 86, in _run_code
```

Test Plan: Test manually.

Differential Revision: D67305997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143354
Approved by: https://github.com/wconstab
2024-12-17 21:09:18 +00:00
a96387a481 [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-17 20:50:25 +00:00
9283c40ba8 [codemod] Decorate unused variables with [[maybe_unused]] (#143381)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143381
Approved by: https://github.com/malfet
2024-12-17 20:36:03 +00:00
7c25a55c65 clean up type nits on torch/jit/_ir_utils.py (#143371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143371
Approved by: https://github.com/laithsakka
2024-12-17 20:28:07 +00:00
de4a555c82 Run inductor-rocm workflow on ciflow/inductor (#143205)
The paths are almost the same as ciflow/inductor. The only differences I could spot were that ciflow/inductor also has `test/dynamo/**` and `torch/csrc/dynamo/**`

This is to prevent failures like https://github.com/pytorch/pytorch/actions/runs/12304985383/job/34345585535 which fails due to running on a fork, which cannot set the id token.

The other option to prevent this is to stop the job from running when on a fork.

If someone adds both labels, one will be cancelled because they have the same concurrency group

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143205
Approved by: https://github.com/huydhn
2024-12-17 20:09:48 +00:00
b16f020edd Add flex attention kernel parameter tuning options (#139639)
1. Add `num_warps` and `num_stages` to kernel parameters of `flex_attention`. This allows performance tuning when the default parameters of `flex_attention` are suboptimal, for example for `document_masks` (see the sketch after this list).
2. Update how flex decoding splits are assigned to threadblocks. The first split of full blocks are assigned to the first threadblock, and the first split of partial blocks are assigned to the last threadblock.
3. Update `get_split_k` to assign 2 splits per SM before we have runtime workload balancing based on BlockMask.
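
A hedged sketch of passing the new knobs, assuming they flow through `flex_attention`'s `kernel_options` dict (the exact key names and plumbing are an assumption here) and that a CUDA device is available:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)

# kernel_options only affects the generated Triton kernel under torch.compile.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, kernel_options={"num_warps": 8, "num_stages": 2})
```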

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139639
Approved by: https://github.com/drisspg
2024-12-17 19:31:40 +00:00
e3c53fb1bc Increase sharding for debug build (#143327)
It started timing out consistently and takes 3+ hours per shard

I assume it's just that we slowly increase tests over time, since I cannot find a dramatic jump recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143327
Approved by: https://github.com/wdvr, https://github.com/huydhn
2024-12-17 19:27:51 +00:00
5b5d7016c8 Remove stable_partition for ARM AOTI Runtimes (#142394)
Summary: This function call will cause OOM issues on ARM machines with multi-threaded predictors (the reason behind this is still being investigated), so we replace it with the standard partition instead.

Differential Revision: D66904296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394
Approved by: https://github.com/frank-wei
2024-12-17 19:19:04 +00:00
e7704f41ca Simplify _compute_symbolic_stride() (#138844)
Rewrite _compute_symbolic_stride() to make it simpler and faster.

The existing code involves several inner loops in an attempt to process the common case faster - but in reality this effort is actually slower than the simpler code.

Testing:
The initial version of this PR (which passed all tests) ran both the old algorithm and new algorithm and compared the results to make sure that results were substantially the same (they weren't the same simply because the algorithm allocates new dynamic symbols as part of it).

I also measured the timing of both methods and from the cases I checked the simpler algorithm was generally about 30% faster (which was usually the "fast path" of the old algorithm).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138844
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138843
2024-12-17 19:16:53 +00:00
63cb5e4ade Move inner loop of _create_symbolic_sizes_strides_storage_offset into its own method (#138843)
Making the next PR easier to review:
- move the inner loop of  _create_symbolic_sizes_strides_storage_offset() into a separate function
- fix lintrunner lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138843
Approved by: https://github.com/ezyang
2024-12-17 19:16:53 +00:00
f3ec59d44c Fix non-dense inductor effn attn bias (#141905)
Didn't have any luck making a local repro, partially because of https://github.com/pytorch/pytorch/issues/141888, which will be fixed when we update to triton 3.2. But I verified locally that it fixes https://github.com/pytorch/pytorch/issues/139424 with the triton pin update that is landing soon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141905
Approved by: https://github.com/drisspg
ghstack dependencies: #143315
2024-12-17 18:55:50 +00:00
1e9ec51431 Fix unused variables in test_serialize_sym_float (#143389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143389
Approved by: https://github.com/Skylion007
2024-12-17 18:55:14 +00:00
18261e9f39 [dynamo] implement framelocals mapping as c++ object (#140063)
Implements https://github.com/pytorch/pytorch/issues/93753 - move frame local guard accessors to C++.

Before, we used dict accessors on a Python dict representing the frame's fastlocals that we manually build. We move this accessor to C++ and additionally use the fastlocal index whenever possible.

Some implementation notes:
- `FrameLocalsMapping` is now initialized as a C++ vector of `PyObject`s. We do not just use the frame's localsplus/fastlocals buffer because we also unbox cells.
- `FrameLocalsMapping` can still be converted into a Python dict representing the frame's fastlocals, but it is done lazily.
- We update `LeafGuard`, `GuardAccessor`, and `GuardManager`'s `check_nopybind` methods to accept `FrameLocalsMapping`. By default, we convert the `FrameLocalsMapping` to a Python dict and run the original `check_nopybind` on it, but in some cases, conversion is not needed.
- We add a new guard accessor `FrameLocalsGuardAccessor`, which is similar to `DictGetItemGuardAccessor` but has special handling for `FrameLocalsMapping`. We create a separate class to emphasize different use cases, but we could probably combine these two (can do in a follow up)

dynamo_guard_eval.py microbenchmark update:
- 713.2us -> 630.0us (3.10)
- 598.8us -> 530.7us (3.12)

Other followups:
- Add `FrameLocalsMapping` version for `check_verbose_nopybind` in order to match behavior between `check_nopybind` and `check_verbose_nopybind`. This can prevent difficult debugging situations where guards fail (`check_nopybind` returns false) but no guard error message is generated (`check_verbose_nopybind` succeeds).
- Rewrite the `SHAPE_ENV` guard into C++ - it is a fairly common guard that results in `FrameLocalsMapping` needing to convert to a dict

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140063
Approved by: https://github.com/jansel
ghstack dependencies: #142117, #142430
2024-12-17 18:54:27 +00:00
c04f0bb7b9 [dynamo] add benchmark for guard eval (#142430)
Benchmarks:
- 713.2us (3.10)
- 598.8us (3.12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142430
Approved by: https://github.com/jansel
ghstack dependencies: #142117
2024-12-17 18:54:27 +00:00
97ca09f692 [dynamo] format eval_frame.c (#142117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142117
Approved by: https://github.com/jansel
2024-12-17 18:54:27 +00:00
53e4d7b6a2 remove allow-untyped-defs for torch/_lazy/device_context.py (#143367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143367
Approved by: https://github.com/aorenste
ghstack dependencies: #143366
2024-12-17 18:54:03 +00:00
bcc93a1e8e remove nonowninglayout special case in require strides (#143315)
NonOwningLayout is always constructed to a FixedLayout. We should handle it the same way as FixedLayout. Note - this case is very rare, I added an assertion here and no test/model failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143315
Approved by: https://github.com/zou3519
2024-12-17 18:47:38 +00:00
a3688ead4b [AOTI][doc] Update tutorial (#143390)
Summary: Update the cpp inference part to call AOTIModelPackageLoader.run directly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143390
Approved by: https://github.com/yushangdi
2024-12-17 18:35:40 +00:00
fa4db62968 [CI] Unify the XPU Windows CICD installation scripts (#143185)
Follow https://github.com/pytorch/pytorch/pull/142156
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143185
Approved by: https://github.com/atalman
2024-12-17 18:26:19 +00:00
74e66a21b4 remove allow-untyped-defs for torch/_C/_distributed_autograd.pyi (#143369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143369
Approved by: https://github.com/aorenste
2024-12-17 18:09:28 +00:00
37a1b9efcc [export] Serialize all dataclass fields (#142286)
Reverts a change in #121337. All dataclass members must be serialized, even default-valued members, because downstream code often implicitly assumes their presence.

This PR fixes a segfault when running `test_custom_op_all_inputs` from `test/inductor/test_aot_inductor_custom_ops.py`. This segfault was caused by querying for an "index" field for the `Device` type (see `torch/csrc/inductor/aoti_torch/oss_proxy_executor.cpp:136`), which was previously skipped when serializing if the device index was unspecified. A number of other structs which are deserialized in this file also contain optional fields, and presumably could experience the same bug.
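
A generic illustration of why default-valued fields still need to be serialized (a toy dataclass, not the actual export schema types):

```python
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Device:
    type: str
    index: Optional[int] = None

d = Device(type="cuda")
# Skipping default-valued members on serialization would omit "index" entirely,
# and a reader that unconditionally queries it (as the proxy executor does) breaks.
print(asdict(d))  # {'type': 'cuda', 'index': None}
```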

Fixes #138955

Fixes #134793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142286
Approved by: https://github.com/zhxchen17
ghstack dependencies: #142175
2024-12-17 17:21:27 +00:00
bb06fc79fb cpp_builder: handle CUDA lib paths involving "stubs" in more circumstances (#142175)
conda packages for `cuda-driver-dev=12.4.127` use a "stubs" subdirectory to contain `libcuda.so`.  This was previously only handled by cpp_builder in some cases, but now needs to be potentially handled more generally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142175
Approved by: https://github.com/desertfire
2024-12-17 17:21:27 +00:00
e3d754419f Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 1bf983077f9f9c19e20dac178aa764b4620d78e7.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/huydhn due to The diff D66211131 has been commandeered internally and is it not part of the train anymore.  If codev is needed, pls reland this accordingly ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2549092225))
2024-12-17 17:21:14 +00:00
ec02ae4345 remove allow-untyped-defs for torch/utils/benchmark/examples/simple_timeit.py (#143368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143368
Approved by: https://github.com/aorenste
2024-12-17 17:19:11 +00:00
313b9964ae remove allow-untyped-defs for torch/_C/_lazy.pyi (#143370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143370
Approved by: https://github.com/aorenste, https://github.com/desertfire
ghstack dependencies: #143366
2024-12-17 17:18:10 +00:00
487343346e Prevent users from seeing hardcoded print stmt when hypothesis is not installed (#142398)
Fixes: #142357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142398
Approved by: https://github.com/zou3519
2024-12-17 16:59:05 +00:00
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921bcf1619e268196866ddf099a4b94080.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
cd7de1f4fa remove allow-untyped-defs for torch/masked/maskedtensor/creation.py (#143321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143321
Approved by: https://github.com/laithsakka
2024-12-17 16:44:50 +00:00
4d90c487d8 [AOTI] Add is_big_gpu checking to test_conv3d (#143339)
Summary: test_conv3d tests max-autotune, which is only supported for big_gpu.

Differential Revision: D67306331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143339
Approved by: https://github.com/BoyuanFeng
2024-12-17 16:18:45 +00:00
792f1c47e9 No actual change, just remove variable contain Tensors from global scope (#143225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225
Approved by: https://github.com/ezyang
2024-12-17 16:14:25 +00:00
afa313e669 Extend bmm tiling to work up to 2^32 elem in any single output dim (#143095)
The previous tiling implementation worked for up to 2^32 total elements per single batch entry. This extends the functionality to support the dimensions encountered in ComfyUI (output shape: 1,72250,72250).

Fixes #141909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143095
Approved by: https://github.com/kulinseth
2024-12-17 16:03:46 +00:00
340f02c49b make it clearer (in docs) one can double decorate with torch.library.impl_* APIs (#137608)
Fixes #120503. The fix was originally attempted by @soxand16 in PR: https://github.com/pytorch/pytorch/pull/121469. That PR was almost ready to merge but then went stale (over 6 months old). This PR implements the original fix with refactoring for clarity.
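
A hedged sketch of the double-decoration pattern the docs now describe (`mylib::my_op` is a throwaway operator defined here only for illustration):

```python
import torch

torch.library.define("mylib::my_op", "(Tensor x) -> Tensor")

# Register the same Python implementation for two dispatch keys by stacking decorators.
@torch.library.impl("mylib::my_op", "CPU")
@torch.library.impl("mylib::my_op", "CUDA")
def my_op_impl(x):
    return x + 1

y = torch.ops.mylib.my_op(torch.ones(3))  # dispatches to the CPU registration
```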

CC: @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137608
Approved by: https://github.com/zou3519
2024-12-17 15:13:58 +00:00
6bbbb08458 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [10/N] (#142451)
> This is the last one

related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933
- #142451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451
Approved by: https://github.com/bdhirsh
2024-12-17 12:18:29 +00:00
34a0d8b62e [inductor] invalidate pointwise dep cache for LOAF (#141160)
Fixes https://github.com/pytorch/pytorch/issues/141134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141160
Approved by: https://github.com/vkuzo
2024-12-17 09:51:29 +00:00
5160a725c8 [FlexAttention] Fix broken eager tracing (#143344)
Fixes #143331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143344
Approved by: https://github.com/Chillee
ghstack dependencies: #143299
2024-12-17 09:42:36 +00:00
cf46eb3bf5 [inductor] Include types and size hints in MultiKernel cache key (#142349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142349
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-12-17 09:26:38 +00:00
e2d47a133b Disable c10::optional macros (#138912)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138912
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-17 09:22:47 +00:00
c3f3a6e4d2 Back out "Fix undesired specialization on slice after split. (#142372)" (#143356)
Summary:
Original commit changeset: e54ffcc9fd48

Original Phabricator Diff: D67113058

Reviewed By: ezyang

Differential Revision: D67311579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143356
Approved by: https://github.com/oulgen
2024-12-17 09:17:18 +00:00
2531543c5f [user triton cache] Dedup user-defined Triton kernels by config in codecache (#143353)
Previously, the same kernel source with different autotuning configs would generate the same cache key, which can lead to a wrong cache hit and silent incorrectness. Here we add the configs to the cache key in `FxGraphHashDetails`.

Test Plan:

```
python3 test/inductor/test_codecache.py -k test_triton_higher_order_op_different_configs
...
----------------------------------------------------------------------
Ran 2 tests in 3.590s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143353
Approved by: https://github.com/oulgen
2024-12-17 08:41:22 +00:00
6056efc5ff non strict sequential slicing (#143298)
Differential Revision: [D67284841](https://our.internmc.facebook.com/intern/diff/D67284841/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143298
Approved by: https://github.com/zhxchen17
2024-12-17 08:35:20 +00:00
297ce77636 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantics of constant_pad_nd from 'clone' to 'view'. I tried a few tests doing in-place updates; thanks to functionalization, this works fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need to fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature: 178.278ms per batch, 183.9K tokens/s, a 3K tokens/s increase.

Perf test shows a compilation time regression. I'm not sure if that's real; will debug more. The good news is there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7) .

UPDATE: The perf test regression seems not to be real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard has not been very reliable recently due to the AWS migration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel
2024-12-17 06:15:48 +00:00
a42ca5a45b remove allow-untyped-defs for _inductor/codegen/rocm/rocm_template_buffer.py (#143272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143272
Approved by: https://github.com/aorenste
2024-12-17 05:34:22 +00:00
d2ec7f0756 [FlexAttention] Allow num_warps 8 since when block size >=128 (#143299)
# Summary
Fixes #143290

We already strip bad configs here: e0e763e331/torch/_inductor/kernel/flex_attention.py (L2299)
So this shouldn't be needed. Confirming that the 64 x 128 case is valid; otherwise we can just change the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143299
Approved by: https://github.com/yanboliang
2024-12-17 05:32:41 +00:00
e7ec92331e remove allow-untyped-defs for torch/jit/_ir_utils.py (#143366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143366
Approved by: https://github.com/aorenste
2024-12-17 05:15:15 +00:00
bcd3692132 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142474)
Summary:
**Re-land the pr**. The previous one was reverted because of a test failure on SM89. The fix is just removing `xfailIfSM89`.

```
_____________________ LoopOrderingTest.test_fp8_pattern_2 ______________________
Unexpected success
```
------
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

Test Plan:
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
-----
Ran a float8 dynamic scaling training script to verify it e2e

Differential Revision: D67012816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142474
Approved by: https://github.com/eellison, https://github.com/sijiac, https://github.com/shunting314
2024-12-17 04:14:28 +00:00
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.
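A usage sketch based on the description above; it assumes a ROCm build of PyTorch compiled with `USE_FLASH_ATTENTION=1` and `USE_CK_FLASH_ATTENTION=1`:

```python
import torch

# Prefer the composable_kernel (CK) flash-attention path on ROCm; passing
# "aotriton" or "default" keeps the incumbent aotriton implementation.
torch.backends.cuda.preferred_rocm_fa_library("ck")
```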

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is credited as a co-author.

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
c15638d803 Enable swap on all Linux jobs (#143316)
A swapfile on Linux runners has been prepared by https://github.com/pytorch/test-infra/pull/6058. So this PR does 2 things:

* Start using the swapfile on all Linux build and test jobs
* Testing the rollout https://github.com/pytorch-labs/pytorch-gha-infra/pull/582

### Testing

Run `swapon` inside the container and the swapfile shows up correctly:

```
jenkins@259dfb0a314c:~/workspace$ swapon
NAME      TYPE SIZE USED PRIO
/swapfile file   3G 256K   -2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143316
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2024-12-17 02:12:24 +00:00
cb4c614ed6 [foreach-map] Add tests for backward (#143282)
Adds tests for unary and binary foreach_map w/ backwards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143282
Approved by: https://github.com/eellison
2024-12-17 02:08:12 +00:00
533d63f83b Revert "FileTimerClient: add retry logic on connect (#143318)"
This reverts commit b3fb8f8a3a2fe07ca61852b09271382c988629fc.

Reverted https://github.com/pytorch/pytorch/pull/143318 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint jobs in trunk ([comment](https://github.com/pytorch/pytorch/pull/143318#issuecomment-2547342910))
2024-12-17 02:06:52 +00:00
cyy
201cb8834f Enable more C++ warnings (#143099)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143099
Approved by: https://github.com/albanD
2024-12-17 02:03:39 +00:00
af190479c8 [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160)
## Benchmark
M=2048, N=3584, K=8192

baseline (nccl + cublas): 301us
decomp-based async-tp: 354us
comm-aware async-tp: 295us
**multimem_all_gather matmul: 277us**

As M further decreases, the multimem_all_gather approach consistently outperforms the baseline and other approaches (omitted other approaches in the chart as they start to be slower than the baseline):
![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810, #143159
2024-12-17 01:07:27 +00:00
286921b39e [fused_all_gather_matmul] introduce an argument to specify whether the all-gather result needs to be returned (#143159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143159
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810
2024-12-17 01:07:27 +00:00
6fae60a34a [SymmetricMemory] introduce multimem_all_gather (#142810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142810
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283
2024-12-17 01:07:27 +00:00
519d858c31 Revert "Kill capture_pre_autograd_graph API (#143224)"
This reverts commit 4c62275325afe21052f3fd49ed4135e3db3c47eb.

Reverted https://github.com/pytorch/pytorch/pull/143224 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failure is legit ([comment](https://github.com/pytorch/pytorch/pull/143224#issuecomment-2547264675))
2024-12-17 00:47:24 +00:00
9d57a39541 [C10D] Update docs for wait() (#143305)
Clarify that the currently active stream, not the default stream, is the one
that will be blocked by a call to wait(), and also point out that the
CPU is not blocked by the call for CUDA/nccl collectives.
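A small illustration of the clarified semantics, assuming a process group has already been initialized elsewhere:

```python
import torch
import torch.distributed as dist

t = torch.ones(8, device="cuda")
work = dist.all_reduce(t, async_op=True)
work.wait()   # blocks the *currently active* CUDA stream, not the default stream
out = t * 2   # kernels on the current stream are ordered after the collective
# The CPU is not blocked here for CUDA/nccl collectives; the sync is stream-level.
```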
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143305
Approved by: https://github.com/LucasLLC, https://github.com/ngimel
2024-12-17 00:41:11 +00:00
b3fb8f8a3a FileTimerClient: add retry logic on connect (#143318)
Fixes #143188

The fifo server binds from a thread; in rare cases the client connects before the server thread starts. This adds a retry when opening the fifo socket in non-blocking mode, waiting up to 1s for the server to start, which balances fast error messages with some wiggle room on the server side.
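A rough sketch of the retry pattern described above (not the actual torch.distributed.elastic code; names and timings are illustrative):

```python
import os
import time

def open_fifo_with_retry(path: str, timeout_s: float = 1.0, interval_s: float = 0.01) -> int:
    # Opening a FIFO write-end in non-blocking mode fails with ENXIO until a
    # reader (the server thread) has opened it, so retry briefly before giving up.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)
```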

Test plan:

```
pytest --minutes 10 test/distributed/elastic/timer/file_based_local_timer_test.py -k test_watchdog_call_count -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143318
Approved by: https://github.com/fegin
2024-12-17 00:36:10 +00:00
90fb7c36ab [FSDP2] Clamp reduce_dtype in lazy init (#143297)
fixes https://github.com/pytorch/pytorch/issues/143277 by moving the clamp of `reduce_dtype` to `None` to lazy init (same place as where `param_dtype` can be clamped to `None`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143297
Approved by: https://github.com/weifengpy
2024-12-17 00:25:08 +00:00
dd2cd4279e Create build_directory if it does not exist when generating ninja build file (#143328)
Fixes: https://github.com/pytorch/vision/issues/8816
I am observing this failure on Windows, Python 3.13 vision builds:
```
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
error: [Errno 2] No such file or directory: 'C:\\actions-runner\\_work\\vision\\vision\\pytorch\\vision\\build\\temp.win-amd64-cpython-313\\Release\\build.ninja'
ERROR conda.cli.main_run:execute(49): `conda run packaging/windows/internal/vc_env_helper.bat python setup.py bdist_wheel` failed. (See above for error)
```
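A hedged sketch of the kind of fix described here (not the verbatim change to torch.utils.cpp_extension): create the build directory before emitting the ninja file into it.

```python
import os

def ensure_build_directory(build_directory: str) -> None:
    # The ninja build file is written into build_directory, so it must exist first.
    if not os.path.exists(build_directory):
        print(f"Creating build directory {build_directory}")
        os.makedirs(build_directory, exist_ok=True)
```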

Adding the missing directory-creation step fixes it, confirmed by running `python setup.py bdist_wheel`:
```
building 'torchvision._C' extension
Emitting ninja build file C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release\build.ninja...
Creating build directory C:\actions-runner\_work\vision\vision\pytorch\vision\build\temp.win-amd64-cpython-313\Release
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/26] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -Dtorchvision_EXPORTS -IC:\actions-runner\_work\vision\vision\pytorch\vision\torchvision\csrc -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\TH -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Lib\site-packages\torch\include\THC -IC:\actions-runner\_work\_temp\conda_environment_12361066769\include -IC:\actions-runner\_work\_temp\conda_environment_12361066769\Include "-IC:\Pr
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143328
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-12-17 00:20:43 +00:00
467970d683 [AOTI] Relax input alignment assertion (#143236)
Summary: https://github.com/pytorch/pytorch/pull/142136 added a runtime alignment assertion, but the assumption is probably too strict for more flexible use cases of AOTI, e.g. Python deployment; see a recent error torchchat ran into for more details: https://github.com/pytorch/torchchat/actions/runs/12322072267/job/34394851280. This PR relaxes the runtime check and implements copy_misaligned_inputs in cpp instead.

Differential Revision: [D67287922](https://our.internmc.facebook.com/intern/diff/D67287922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143236
Approved by: https://github.com/malfet, https://github.com/chenyang78
2024-12-17 00:17:39 +00:00
c4ab3e6ceb remove allow-untyped-defs for torch/__config__.py (#143320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143320
Approved by: https://github.com/aorenste
ghstack dependencies: #143319
2024-12-17 00:16:09 +00:00
0178e43949 remove allow-untyped-defs for torch/utils/_stats.py (#143319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143319
Approved by: https://github.com/aorenste
2024-12-17 00:16:09 +00:00
ff373171d0 [Profiler] Add Optional Flag to turn off external correlations v2 (#143314)
Summary: The original diff got reverted because its base commit was on a broken version of pytorch that was failing rocm tests. There is no indication that this diff had any effect on rocm. I had trouble rebasing the GH PR after the revert and accidentally closed it, so I am submitting again.

Test Plan: See original PR with same name

Differential Revision: D67293040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143314
Approved by: https://github.com/leitian, https://github.com/aaronenyeshi
2024-12-16 23:49:13 +00:00
10df370a77 Add missing IValue overloads for SymInt lists (#143167)
We should be able to convert Int lists into SymInt lists.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143167
Approved by: https://github.com/ezyang
ghstack dependencies: #143166
2024-12-16 23:18:55 +00:00
557da8014d [gen_autograd_functions] rename some variables (#143166)
This is a follow-up from https://github.com/pytorch/pytorch/pull/141278.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143166
Approved by: https://github.com/soulitzer
2024-12-16 23:18:55 +00:00
4c62275325 Kill capture_pre_autograd_graph API (#143224)
Summary:
Delete the following API:

- capture_pre_autograd_graph()
- capture_pre_autograd_graph_using_training_ir()
- gm_using_training_ir()

There are no more call sites of `capture_pre_autograd_graph`, except:
1) two test cases in coreml, PR to remove: https://github.com/apple/coremltools/pull/2400
2) XLA: one test case in pytorch/xla, PR to remove: https://github.com/pytorch/xla/pull/8398
3) a few call sites guarded by version guard (< 2.5.0)

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64056353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143224
Approved by: https://github.com/tugsbayasgalan
2024-12-16 23:06:22 +00:00
6356690b3d Revert "[BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)"
This reverts commit c86383f956ee86f34d0ffb94bc229c51c6f11dd9.

Reverted https://github.com/pytorch/pytorch/pull/143300 on behalf of https://github.com/atalman due to failing nova workflows with conda: command not found ([comment](https://github.com/pytorch/pytorch/pull/143300#issuecomment-2547030664))
2024-12-16 22:50:08 +00:00
135a2d4483 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-16 21:46:08 +00:00
15aee8e090 update aten bmm CK heuristic (#143294)
Summary: updates heuristic to use new instances based on ck profiling of LLM shapes

Differential Revision: D67280269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143294
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-16 21:44:59 +00:00
c86383f956 [BE] Revert "Add conda to Manylinux Docker images (#139903)" (#143300)
This reverts commit 56a40d4ebb0bcf733f1ea5f6efde805326a7a565.

Having conda in manylinux builder images is not required. It was added so manylinux-builder images could be the only images for CD builds after conda-builder was deprecated; however, we decided to start using ``almalinux-builder`` instead.

We are using almalinux-builder for linux_job_v2 which contains conda: https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml#L114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143300
Approved by: https://github.com/seemethere
2024-12-16 21:40:08 +00:00
4e594f4d12 Triton bump for 3.2 cherry-picks (mmav3 segfault fix, gfx950 support) (#143302)
* https://github.com/triton-lang/triton/pull/5277
* https://github.com/triton-lang/triton/pull/5084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143302
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-12-16 21:22:29 +00:00
401b1498d2 [BE] typing for decorators - distributed/_tensor/ops/utils (#142139)
Test Plan: unit tests

Differential Revision: D62302679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142139
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-12-16 21:19:33 +00:00
159b7ad8aa Improve async workers to handle forking for async compile (#142072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142072
Approved by: https://github.com/masnesral
2024-12-16 21:16:42 +00:00
678f74988d Fix a misspelling [ONNX] (#143301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143301
Approved by: https://github.com/titaiwangms
2024-12-16 20:19:41 +00:00
8ad842cda4 remove allow-untyped-defs for utils/data/datapipes/dataframe/structures.py (#143273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143273
Approved by: https://github.com/aorenste
ghstack dependencies: #143271
2024-12-16 20:07:36 +00:00
54ed13cdce Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit ca973069ed9a08782695d9407605e219008821e2.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it. breaks an internal test ([comment](https://github.com/pytorch/pytorch/pull/142350#issuecomment-2546615951))
2024-12-16 20:05:14 +00:00
e885225eda Add persistent+TMA version of Triton mm and addmm (#142101)
This PR adds persistent+TMA versions (Triton template + the corresponding infra) for the `tuned_mm` and `tuned_addmm` lowerings. The persistent+TMA choices are added to the GEMM autotuning if (checked by the `use_triton_tma_template` helper):

1. The min. hardware and Triton version requirements are met for the TMA support.

2. The GEMM inputs are compatible with the Triton TMA API (i.e., 16-byte aligned and contiguous).

3. The `config.triton.enable_persistent_tma_matmul` is set to `True`.
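A hedged usage sketch of opting in (the flag name is taken from point 3 above; the `max_autotune` setting and tensor shapes are assumptions, not requirements stated here):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.enable_persistent_tma_matmul = True
inductor_config.max_autotune = True  # assumption: let autotuning consider the new templates

@torch.compile
def mm(a, b):
    return a @ b

# Contiguous, 16-byte-aligned inputs so the TMA choices are eligible.
a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
out = mm(a, b)
```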

Additional notes:

1. As added in this PR, the TMA uses are not compatible with prolog / epilogue fusion. To this end, in the new Triton template we currently support: TMA-based loads of A/B, but no prologue fusion; epilogue fusion, but no TMA-based stores of C. TMA + fusion compatibility can be added as a follow-up.

2. The current Triton TMA API (`experimental_device_tensormap_create2d`) does not support strides. Due to this, we limit the applicability of the new Triton template to the cases where the inputs are contiguous.

3. The transposed layouts of A and / or B are supported by passing the constexpr flags to the kernel and adjusting the ordering of the block sizes accordingly in the kernel code (this should have no effect on the kernel perf, as decided at the Triton compilation time).

4. After the next Triton pin update, we can switch to the tensor descriptor API (landed recently in https://github.com/triton-lang/triton/pull/5290) in the new Triton template, which should allow lifting 2 and 3 above.

5. The configs for the new Triton template in `persistent_mm_kernel_configs` are preliminary. We should do more perf exploration and possibly augment the config in a follow-up.

6. This PR is rebased onto and unifies with two related PRs landed previously: https://github.com/pytorch/pytorch/pull/142045 (some infra unification with the persistent+TMA template for _scaled_mm) and https://github.com/pytorch/pytorch/pull/134532 (add possibility to disable prolog fusion for selected choices).

7. The current Triton TMA API only supports 1D and 2D descriptors (even after https://github.com/triton-lang/triton/pull/5290, see [here](9829ce87cc/python/triton/language/core.py (L1957))). For now, this blocks adding persistent+TMA template for `torch.bmm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142101
Approved by: https://github.com/drisspg, https://github.com/eellison
2024-12-16 19:12:12 +00:00
17b71e5d6a Add config alias (#142088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142088
Approved by: https://github.com/c00w
2024-12-16 18:51:17 +00:00
1b6b86fad7 [dynamo] disable eval frame callback around most of _TorchDynamoContext wrapper function (#143211)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1559636954674510/

If the `_fn` returned by `_TorchDynamoContext.__call__` makes an external function call, dynamo is recursively invoked. This can cause issues if there are added calls that are not skipped by Dynamo. So we should disable the eval frame callback as much as possible.

Differential Revision: [D67211749](https://our.internmc.facebook.com/intern/diff/D67211749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143211
Approved by: https://github.com/jansel
2024-12-16 18:38:58 +00:00
1bf983077f [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch the case where the user changes the flag in the same process when compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-16 18:38:32 +00:00
338835d0d2 Add support for other backends in get_preferred_device (#132118)
Currently get_preferred_device supports only cuda and cpu. Add support for other backends using the backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/kwen2501
2024-12-16 18:30:41 +00:00
ccf35af142 [Inductor] Fix the Index Put lowering with same input of self and values (#139366)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/138908, the root-cause is in https://github.com/pytorch/pytorch/issues/138908#issuecomment-2449192447

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_put
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_index_add
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139366
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-12-16 17:07:14 +00:00
7ab3177776 Revert "[AMD] Turn on TF32 for aten::mm (#139869)"
This reverts commit e0bdae7884aed09d9e3f1a3f7a53c095e74a9aff.

Reverted https://github.com/pytorch/pytorch/pull/139869 on behalf of https://github.com/jeffdaily due to causing ROCm CI failures, need to investigate, revert for now ([comment](https://github.com/pytorch/pytorch/pull/139869#issuecomment-2546127069))
2024-12-16 16:46:48 +00:00
a8cc19bb51 [CD] Fix XPU linux CD whl test failure (#143268)
Follow https://github.com/pytorch/pytorch/pull/142482, refer the original fix PR https://github.com/pytorch/pytorch/pull/130742 and new issue in https://github.com/pytorch/pytorch/actions/runs/12323126436/job/34403681230
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143268
Approved by: https://github.com/atalman
2024-12-16 15:00:03 +00:00
e4d2e81086 Update slow tests (#143278)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143278
Approved by: https://github.com/pytorchbot
2024-12-16 12:40:40 +00:00
d745b2b516 remove allow-untyped-defs for distributed/rpc/_testing/__init__.py (#143271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143271
Approved by: https://github.com/aorenste
2024-12-16 02:35:37 +00:00
9706ada369 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #143171, #133572
2024-12-16 02:18:41 +00:00
45ac4ebf15 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #143171
2024-12-16 02:18:41 +00:00
c1d4d9d3cf [MPS] Support torch.accelerator.synchronize() on mps (#143171)
# Motivation
Support `torch.accelerator.synchronize()` on mps. The root cause is that MPS doesn't support lazy initialization, so we must check whether the current accelerator supports device lazy initialization rather than returning early.

# Additional Context
Add a mps UT to test code change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143171
Approved by: https://github.com/albanD
2024-12-16 02:18:32 +00:00
cyy
af8789c056 Hide torch_python symbols (#142214)
Change symbols in torch_python to invisible by default on platforms other than Apple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-16 00:59:26 +00:00
744a303dee [FlexAttention] Optimzing learned bias perf to dq calc (#142281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142281
Approved by: https://github.com/Chillee
2024-12-15 21:44:32 +00:00
e0bdae7884 [AMD] Turn on TF32 for aten::mm (#139869)
Summary: hipblaslt supports TF32, so adding the support.

Test Plan: CI

Differential Revision: D65435392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139869
Approved by: https://github.com/leitian
2024-12-15 10:02:29 +00:00
5273d8fd2a [audio hash update] update the pinned audio hash (#143265)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143265
Approved by: https://github.com/pytorchbot
2024-12-15 03:41:14 +00:00
9ed045eae9 Revert "[Profiler] Add Optional Flag to turn off external correlations (#142516)"
This reverts commit b29fc52f827cc4b4336ecd24cc0a019ec9cf24b6.

Reverted https://github.com/pytorch/pytorch/pull/142516 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the test is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/142516#issuecomment-2543431758))
2024-12-15 03:34:37 +00:00
dd2d360b7d [ca] re-enable disabled tests (#143247)
FIXES https://github.com/pytorch/pytorch/issues/133197

The unspecified floats PR landed while this test was disabled, and it added an analysis restart, which counts towards the backend call counter the test is using.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143247
Approved by: https://github.com/zou3519
2024-12-15 02:11:39 +00:00
cyy
4273e1a059 [5/N] Apply bugprone-unchecked-optional-access (#143111)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143111
Approved by: https://github.com/Skylion007
2024-12-15 01:07:28 +00:00
91bf2e16de [distributed] Remove unused variable in test_composability/test_pp_composability.py (#143191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143191
Approved by: https://github.com/mori360
2024-12-14 12:23:44 +00:00
de484134e4 support slicing with symints in non-strict (#143217)
Differential Revision: [D67215745](https://our.internmc.facebook.com/intern/diff/D67215745/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143217
Approved by: https://github.com/tugsbayasgalan
2024-12-14 10:27:45 +00:00
9933e59c2b [torch][cuda] fix race condition in cuda initialization (#143238)
The access to lazy init callbacks (`_lazy_seed_tracker` and `_queued_calls`) is not synchronized with the initialization lock.

This exposes us to the following race:
1. start `_lazy_init`
2. take `_initialization_lock`
3. flush `_queued_calls` and run them all
4. another thread comes in and uses `_lazy_call` to put something on the queue (in our case, the `manual_seed`)
5. original thread finishes initializing, but never runs that call
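An illustrative sketch of the fix (hypothetical names, not the actual torch.cuda internals): the queue of lazy calls has to be read and written under the same lock that guards initialization, so a call queued between steps 3 and 5 can no longer be dropped.

```python
import threading

_init_lock = threading.Lock()
_queued_calls = []
_initialized = False

def _lazy_call(fn):
    with _init_lock:          # using the initialization lock closes the race window
        if _initialized:
            fn()
        else:
            _queued_calls.append(fn)

def _lazy_init():
    global _initialized
    with _init_lock:
        if _initialized:
            return
        for fn in _queued_calls:   # drained under the lock, so nothing queued is missed
            fn()
        _queued_calls.clear()
        _initialized = True
```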

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143238
Approved by: https://github.com/ngimel
2024-12-14 07:41:24 +00:00
28d8297712 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143229
2024-12-14 07:38:25 +00:00
7c4d29485e Add typechecking indirection for Config (#143229)
When we create a Config[T], we actually dynamically unbox it in the module, so let's have the type checker believe that Config[T] creates a T. This enables proper typechecking support.
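A hedged sketch of the indirection (illustrative names, not the actual config module code): at runtime `Config` stays a box, but under `TYPE_CHECKING` it is declared to return the unboxed value type.

```python
from typing import TYPE_CHECKING, Generic, TypeVar

T = TypeVar("T")

if TYPE_CHECKING:
    # The type checker believes Config(default=x) produces a plain T ...
    def Config(default: T) -> T: ...
else:
    # ... while at runtime it is still a boxed config entry that the module unboxes.
    class Config(Generic[T]):
        def __init__(self, default: T) -> None:
            self.default = default

some_flag = Config(default=False)  # type checkers see: bool
```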

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143229
Approved by: https://github.com/aorenste
2024-12-14 07:38:25 +00:00
be5b342332 [Inductor] Move peak memory pass and overlap pass to be run at the right place (#142822)
This PR moves `decide_global_ordering_of_comms` to run first before all other Inductor scheduler passes, so that downstream passes have the correct dependency tracking info. It also moves peak memory pass and overlap pass to the end of all passes, because they need to be the final decision maker on the node order to achieve the desired peak memory and overlap.

This PR fixes hard-to-debug peak memory pass errors caused by incorrect tracking in `.unmet_dependencies` during the enablement of SimpleFSDP on internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142822
Approved by: https://github.com/eellison
2024-12-14 06:53:02 +00:00
3cc617b6a7 __cuda_array_interface__: Use "<V2" for bfloat16. (#143042)
Rationale: While Numpy doesn't support `bfloat16` and therefore there's no official typestr for `bfloat16` in `__array_interface__` (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html#__array_interface__), JAX/ml_dtypes uses "<V2":

```
>>> from jax import numpy as jnp
>>> jnp.bfloat16.dtype.str
'<V2'
```

Using the same in PyTorch has the upside of making the typestrs returned by `__cuda_array_interface__` identify the torch dtype uniquely.

### Misc notes

(1) JAX itself just refuses to do `__cuda_array_interface__` for `bfloat16`:

```
>>> from jax import numpy as jnp
>>> jnp.arange(10, dtype=jnp.bfloat16).__cuda_array_interface__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
jaxlib.xla_extension.XlaRuntimeError: INVALID_ARGUMENT: __cuda_array_interface__ is not supported for bfloat16 buffers.
```

(2) The "official" description of `__cuda_array_interface__` doesn't mention bfloat16, it just references `__array_interface__`: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html

(3) Ongoing issue for numpy to support bfloat16: https://github.com/numpy/numpy/issues/19808

(4) Tweet that triggered this: https://x.com/HeinrichKuttler/status/1866761979349844211, with @ezyang responding.

(5) "<V2" is kinda weird, as it's a "little-endian void" type. When given to Numpy, it gets turned into endian-agnostic:

```
>>> import numpy as np
>>> import ml_dtypes
>>> np.dtype("bfloat16").str
'<V2'
>>> np.dtype("<V2").str
'|V2'
```

Still, it makes sense to have a unique string for `bfloat16` and since Google chose "<V2" we might as well use that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143042
Approved by: https://github.com/ezyang
2024-12-14 06:27:52 +00:00
c0a39ad35a [ROCm] Fix TunableOp UTs: Rotating Buffer (#143172)
TunableOp's rotating buffer feature cannot be properly tested because the environment variable that controls this feature is sticky. A Python API is introduced to modify this value.

Additional items in this PR:
* UT for rotating buffer API
* Clean up UTs that were setting the rotating buffer via the environment variable
* Align behavior of environment variable and Python API when a negative value (< 0) is set.
* Update documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143172
Approved by: https://github.com/jeffdaily
2024-12-14 06:18:11 +00:00
96c3b2c388 Expose remaining sharedMem cudaDeviceProps to python (#143226)
I was a bit too fast with my earlier PR: `sharedMemPerMultiprocessor` includes some memory that is reserved for the system. The amount a kernel can actually use is limited by `sharedMemPerBlockOptin`.

I also expose `sharedMemPerBlock` for completeness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143226
Approved by: https://github.com/ezyang
2024-12-14 06:13:28 +00:00
cyy
4764303cc6 Use static initialization to avoid once_flag in getCUDAHooks (#143198)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143198
Approved by: https://github.com/albanD
2024-12-14 06:05:41 +00:00
23379e8933 Add torch._compile to uninteresting files (#143209)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143209
Approved by: https://github.com/albanD
2024-12-14 05:40:21 +00:00
ca973069ed Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2024-12-14 03:53:28 +00:00
24f24eebde Get rid of _lazy_import hack (#143213)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143213
Approved by: https://github.com/aorenste, https://github.com/albanD
2024-12-14 03:46:21 +00:00
698eefaddd [audio hash update] update the pinned audio hash (#143245)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143245
Approved by: https://github.com/pytorchbot
2024-12-14 03:37:56 +00:00
cyy
e9f6045e80 [15/N] Fix extra warnings brought by clang-tidy-17 (#143100)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143100
Approved by: https://github.com/Skylion007
2024-12-14 03:24:10 +00:00
33dee721ae Reraise worker errors as runtime errors in more cases when the original exception can't be constructed (#140911)
related to https://github.com/pytorch/pytorch/issues/34130

When PyTorch attempts to re-raise an exception from a worker process (e.g. a multiprocessing dataloader) and it can't reconstruct the original exception due to a TypeError, it instead raises it as a RuntimeError. However, if it can't reconstruct the exception for some other reason, it throws an error with a stacktrace pointing to the `ExceptionWrapper` code rather than the original underlying issue.

One case in which I run into this is with boto3's [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))s. They must be constructed with a keyword argument `error`, but if `error` isn't passed, a `KeyError` is thrown instead of a `TypeError`, due to the particular way it is implemented:

* [HTTPClientError](66dc1f8d52/botocore/exceptions.py (L94))'s constructor accepts variable keyword arguments that it passes to `super` (BotoCoreError)
* [it also defines a field `fmt` with `error`](66dc1f8d52/botocore/exceptions.py (L95))
* BotoCoreError [expects to be able to format that string with the kwargs](66dc1f8d52/botocore/exceptions.py (L41))

So in this case, if a HTTPClientError occurs on a worker process, you simply get a `KeyError: error` with a stacktrace pointing to [this line](3e2f276a14/torch/_utils.py (L710)) which is unhelpful.

Instead, I propose to reraise the error as a `RuntimeError` unconditionally.
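A minimal sketch of the proposed fallback behavior (not the exact torch._utils.ExceptionWrapper code):

```python
def rebuild_worker_exception(exc_type, msg):
    # Try to reconstruct the original exception type from its message; if the
    # constructor rejects that for any reason (TypeError, KeyError, ...), fall
    # back to a RuntimeError instead of masking the real failure.
    try:
        return exc_type(msg)
    except Exception:
        return RuntimeError(f"{exc_type.__name__}: {msg}")
```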
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140911
Approved by: https://github.com/vmoens
2024-12-14 03:11:36 +00:00
cdc03f99b7 [ca] add graph id (#141906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141906
Approved by: https://github.com/jansel
ghstack dependencies: #141919
2024-12-14 03:02:06 +00:00
19f3570000 [EZ] Remove --pre from numpy installation command (#143237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143237
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2024-12-14 02:55:21 +00:00
bf8d4f5b7a [Inductor UT] Generalize device-bias code in test_triton_syntax.py. (#143178)
Fix #143177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143178
Approved by: https://github.com/eellison
2024-12-14 02:08:32 +00:00
86c3370bc3 operator benchmark: write output to a JSON (#142809)
This pull request adds the ability to write the output of operator benchmark to an optionally specified JSON file. The output is still printed in the terminal as before, but the user also has the option of saving it to a JSON file.

The main part of the functionality is implemented in the function _perf_result_to_dict, which outputs a dictionary to be put inside a JSON file. Each dictionary corresponds to a single test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
2024-12-14 01:42:00 +00:00
12098ad242 Add torch.cat tensors type promotion description (#141339)
Fixes #126964

Add a note describing the type promotion behavior of `torch.cat`.
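For instance, mixing int32 and float32 inputs yields a float32 result under the standard promotion rules:

```python
import torch

a = torch.tensor([1, 2], dtype=torch.int32)
b = torch.tensor([1.5, 2.5], dtype=torch.float32)
print(torch.cat([a, b]).dtype)  # torch.float32
```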

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/2449f11b-48ed-406e-b73e-6d00f8eadb00)

**After**
![image](https://github.com/user-attachments/assets/cba99572-e8b1-4b9c-ba95-a963b54859ba)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141339
Approved by: https://github.com/albanD
2024-12-14 01:36:41 +00:00
13233e062d Fix Apple Clang ICE when building with -march=armv8.6a (#142879)
When investigating #142703, I found that the build with -march=armv8.6 on my M1 mac was hitting a clang ICE. When looking at the blame code, I finally noticed that this constructor was nonsense, apparently in a way that the compiler frontend accepted but the backend choked on.

example ICE error message:
```
fatal error: error in backend: Cannot select: 0x12689c260: bf16 = uint_to_fp 0x1258324a0
  0x1258324a0: i32 = AssertZext 0x125822d90, ValueType:ch:i16
    0x125822d90: i32,ch = CopyFromReg 0x1238dddc0, Register:i32 %22
      0x12689c6c0: i32 = Register %22
In function: _ZN2at6native7DEFAULTL12logit_kernelERNS_18TensorIteratorBaseERKN3c106ScalarE
c++: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 16.0.0 (clang-1600.0.26.3)
Target: arm64-apple-darwin24.1.0
Thread model: posix
```

Unbreaks `env CFLAGS=-march=armv8.6-a CXXFLAGS=-march=armv8.6-a python setup.py develop --cmake` on M1 Mac.

Differential Revision: [D67102953](https://our.internmc.facebook.com/intern/diff/D67102953/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142879
Approved by: https://github.com/malfet
2024-12-14 01:07:01 +00:00
063194aa32 add additional CK BMM Instances (2) (#142874)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66985408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142874
Approved by: https://github.com/mxz297
2024-12-14 01:04:34 +00:00
00b0210139 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate `asinh`; the issue happens for an input like `-10000.1`, which makes `x + sqrt(1 + x**2)` close to 0, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
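A small float32 illustration of the cancellation (using NumPy here purely to demonstrate the numerics, not the Inductor codegen):

```python
import numpy as np

x = np.float32(-10000.1)
# In float32, 1 + x*x rounds to x*x, so sqrt(1 + x*x) equals |x| (or nearly so),
# and x + sqrt(1 + x*x) cancels to ~0; log of that is -inf or wildly inaccurate.
naive = np.log(x + np.sqrt(np.float32(1.0) + x * x))
stable = np.arcsinh(x)   # ~ -9.903, the correct value
print(naive, stable)
```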

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-14 00:27:55 +00:00
d53164880f dont attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable: we were trying to fuse in the padding of an unaligned mm, which defeats padding's purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #142401, #142402
2024-12-14 00:22:31 +00:00
70be7900bb Fix Tensor clear to properly clear slots (#143203)
Fixes a bug introduced in https://github.com/pytorch/pytorch/pull/137267

While the test ensures the finalizer ran to make sure things were cleared, the objects were not properly collected by the gc due to the faulty tp_clear implementation. So, although the finalizer ran, the object was still alive.
Fixing this by giving tp_clear the same treatment as tp_traverse and tp_dealloc on Tensor: make it a unique function that handles the full subclass hierarchy in one place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143203
Approved by: https://github.com/ezyang, https://github.com/colesbury
ghstack dependencies: #143202
2024-12-14 00:17:07 +00:00
8741d72e3c move function before modifying it (#143202)
This is a no-op, just to make the diff in the next PR easier to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143202
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2024-12-14 00:17:07 +00:00
3bfdf6f063 Exclude py 3.13t triton package from PyTorch 3.13t wheel (#143218)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Include triton only for 3.13 packages, not 3.13t.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143218
Approved by: https://github.com/kit1980
2024-12-14 00:12:45 +00:00
515abb7744 [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 23:45:47 +00:00
8621b9ff0c Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues which only do loads (like gathers) or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 and without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #142401
2024-12-13 23:25:15 +00:00
4e0de50eb5 Revert "[CI] Add Triton 3.13t build (#143212)"
This reverts commit 571cd92d7c4c7bd2d5f068b5a285e0e70b8d0a40.

Reverted https://github.com/pytorch/pytorch/pull/143212 on behalf of https://github.com/janeyx99 due to lint is failing, the other failures don't seem relevant but ci has turned red after this change haha ([comment](https://github.com/pytorch/pytorch/pull/143212#issuecomment-2542521875))
2024-12-13 23:03:45 +00:00
f406207af2 Revert "[ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)"
This reverts commit 1e2b841675e50a6abd8dab9a95b33fda64b12e2b.

Reverted https://github.com/pytorch/pytorch/pull/142827 on behalf of https://github.com/jeffdaily due to prematurely dropped support for gfx900/gfx906 ([comment](https://github.com/pytorch/pytorch/pull/142827#issuecomment-2542507857))
2024-12-13 22:48:44 +00:00
ad2faec8bb Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
2024-12-13 22:37:33 +00:00
b29fc52f82 [Profiler] Add Optional Flag to turn off external correlations (#142516)
Summary: External correlations are super spammy and oftentimes not even useful. Add a flag during init to remove them entirely.

Test Plan: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Dec_10_12_33_31.531106.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D67048206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142516
Approved by: https://github.com/ngimel
2024-12-13 22:32:09 +00:00
bb574abe73 [BC-Breaking]Remove capture_pre_autograd_graph references in quantization (#139505)
Summary:
As title

This is a BC-breaking change because a graph produced by "capture_pre_autograd_graph" can no longer be used as input to quantization. But this is OK, since this API has been deprecated for a while and is going to be deleted. We have removed all call sites of it.

We remove the deprecated API references in code, docs, and tests.

We also removed two tests that specific to capture_pre_autograd_graph API.

Test Plan: CI

Differential Revision: D65351887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139505
Approved by: https://github.com/tugsbayasgalan, https://github.com/andrewor14, https://github.com/jerryzh168
2024-12-13 22:26:22 +00:00
d25e6e623f Fix unused Python variables in test/[a-d]* (#134665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134665
Approved by: https://github.com/albanD
2024-12-13 22:13:12 +00:00
e19f493f02 add private config to temporarily preserve old FSDP guard behavior (#142871)
Summary: https://github.com/pytorch/pytorch/pull/138819 wobbled dynamo guards in a way that caused some performance regression, so this PR temporarily adds a config to get the old behavior back while we investigate.

Test Plan: CI

Differential Revision: D67096751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142871
Approved by: https://github.com/yf225
2024-12-13 22:06:48 +00:00
8fae4397b4 Add "inductor_pre_grad_graph" logging (#142717) (#143126)
Summary:

Add new structured logging "inductor_pre_grad_graph"

This is for inductor provenance tracking front-end to load this graph from tlparse.
ghstack-source-id: 257581974
exported-using-ghexport

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' //caffe2/test/dynamo:test_dynamo -- -r StructuredTraceTest
```

Differential Revision: D67150288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143126
Approved by: https://github.com/desertfire
2024-12-13 21:48:25 +00:00
8a04018329 [MPS] Fix conv backward for channels last (cont) (#143196)
This is a continuation of https://github.com/pytorch/pytorch/issues/140902 but extends the same logic to input.

Looks like the existing channels-last logic just produced incorrect results on pre-MacOS-15 versions and fails on MacOS-15, so removing it feels like the right idea.

Fixes https://github.com/pytorch/pytorch/issues/142344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143196
Approved by: https://github.com/manuelcandales
2024-12-13 21:32:42 +00:00
571cd92d7c [CI] Add Triton 3.13t build (#143212)
By just extending the matrix and invoking script with appropriate cpython runtime
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143212
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/seemethere
2024-12-13 21:28:52 +00:00
60c54467db [logging] Log runtime autotuning timing to scuba (#141919)
See test plan in internal diff [D66679369](https://our.internmc.facebook.com/intern/diff/D66679369)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141919
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-13 21:22:13 +00:00
0d6d29af38 [CUDA] Follow up to clean up some set_per_process_memory_fraction usage in tests (#142811)
follow-up to #140852 now that #140620 has landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142811
Approved by: https://github.com/Skylion007
2024-12-13 21:09:05 +00:00
65d0a25289 [associative_scan] patch inductor tests to always run with static shape (#143161)
fixes #143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143161
Approved by: https://github.com/eellison
2024-12-13 21:06:12 +00:00
52f31cc238 dynamo tracing perf: Guard slots: 51.76 -> 51.34 (#143060)
See #143056 for overall docs.

This PR: Add slots to Guard
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143060
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058, #143059
2024-12-13 21:02:50 +00:00
e87f07d3b8 Revert "Migrate compiler config to Config (#143152)"
This reverts commit 1ebdfd56053dafa8880a0dedf535fff70aa92e09.

Reverted https://github.com/pytorch/pytorch/pull/143152 on behalf of https://github.com/oulgen due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/143152#issuecomment-2542342073))
2024-12-13 20:55:14 +00:00
625b4edb97 [CD] Test torch.compile on 3.13 (#143207)
Follow up after https://github.com/pytorch/pytorch/pull/143162
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143207
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-12-13 20:01:36 +00:00
fe9365f3f5 Add check_binary workflow to pytorch/pytorch (#143201)
Migrated from pytorch/builder
Related to: https://github.com/pytorch/builder/issues/2054

Copying from : 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143201
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-13 19:30:10 +00:00
8f40446770 Fix precedence of bitwise and/or printing (#143197)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143197
Approved by: https://github.com/albanD, https://github.com/williamwen42
2024-12-13 19:29:42 +00:00
1ebdfd5605 Migrate compiler config to Config (#143152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143152
Approved by: https://github.com/ezyang
ghstack dependencies: #143150, #143151
2024-12-13 19:29:07 +00:00
f1ff8bc1c5 Add type to Config (#143151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143151
Approved by: https://github.com/ezyang
ghstack dependencies: #143150
2024-12-13 19:29:07 +00:00
9d05c8110d Require Config to have a default (#143150)
With aliases coming soon, we want to reject the alias + default combo, so we need defaults to be passed in. On top of this, it simplifies statically type-checking the config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143150
Approved by: https://github.com/ezyang
2024-12-13 19:28:59 +00:00
bf711a9cce [ROCm] Improve performance of reduce sum for 3D shapes (#143137)
Improve performance of reduce sum for 3D shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143137
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-13 19:02:00 +00:00
6178be822d dynamo tracing perf: direct Guard: 52.58 -> 51.76 (#143059)
See #143056 for overall docs.

This PR: Remove the explicit constant check from `VariableBuilder.install_guards()` and simplify the args calling convention. Also remove a lambda binding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143059
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #143066, #143056, #143058
2024-12-13 18:20:48 +00:00
6bcda3a21a dynamo tracing perf: cache on import_source: 52.9 -> 52.58 (#143058)
See #143056 for overall docs.

This PR: add cache to `InstructionTranslatorBase.import_source()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143058
Approved by: https://github.com/jansel
ghstack dependencies: #143066, #143056
2024-12-13 18:20:48 +00:00
b472d82c96 dynamo tracing perf: import in build: 60.48 -> 59.92 (#143056)
A series of directed perf improvements to drive down the dynamo tracing cost of
the given test. Before this PR stack the compile took about 60s, and after takes
30s. Individual improvements are listed below along with the approximate
improvement of that change.

Tested with this model:
```
@torch.compile(backend="eager")
def model_add(x, y):
    out = x
    for i in range(5000):
        out = torch.add(out, y)
    return out
```

This PR: Stop importing builder in the inner loop of `VariableTracker.build()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143056
Approved by: https://github.com/jansel
ghstack dependencies: #143066
2024-12-13 18:20:48 +00:00
63e1f97f4b dynamo tracing perf: don't unnecessarily call getframeinfo on the hot path: 47.26 -> 37.66 (#143066)
See #143056 for overall docs.

This PR: Stop using `getframeinfo()` when we only care about the function name
and throw the rest away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143066
Approved by: https://github.com/jansel
2024-12-13 18:20:48 +00:00
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, they could be the same, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False`, preventing duplicate indices and undefined behaviour.
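
Below is a minimal sketch (not from the PR) of why duplicate indices are only well-defined with `accumulate=True`:

```python
import torch

x = torch.zeros(5)
idx = (torch.tensor([1, 1]),)          # duplicate index
vals = torch.tensor([1.0, 2.0])

x.clone().index_put_(idx, vals, accumulate=True)   # well-defined: position 1 becomes 3.0
x.clone().index_put_(idx, vals, accumulate=False)  # undefined: position 1 may end up 1.0 or 2.0
```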

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
23b8ea3094 Allow disabling int specialization on nn.Modules (#142829)
Resolves issue #140464 by adding an option to not specialize int from nn.Modules (False by default to maintain existing behavior).

Test Plan: `buck2 test mode/opt caffe2/test/dynamo:test_dynamo -- test_modules.py::NNModuleTests::test_nn_module_unspec_int_attr`

Differential Revision: D66837042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142829
Approved by: https://github.com/ezyang, https://github.com/yanboliang
2024-12-13 17:26:11 +00:00
82a45d19b4 Expose sharedMemPerMultiprocessor device property to python (#143119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143119
Approved by: https://github.com/ezyang
2024-12-13 16:53:57 +00:00
3f62054de1 [ROCm] upgrade nightly wheels to rocm6.3 - 1 of 2 (docker images) (#142151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142151
Approved by: https://github.com/jeffdaily
2024-12-13 16:21:17 +00:00
7968732f5b Fix int8 mm V.ops.mul dispatching (#143127)
This is sort of subtle - because we were doing `V.ops.mul` at binding time, we don't redispatch later when we invoke the epilogue, and we then run into the assertion checking in the PR above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143127
Approved by: https://github.com/drisspg
ghstack dependencies: #143048
2024-12-13 16:17:23 +00:00
da67a6a7bb [inductor] Replace set by OrderedSet (#138466)
Uses the set_linter from https://github.com/pytorch/pytorch/pull/138454
and considerable manual editing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138466
Approved by: https://github.com/eellison
2024-12-13 16:08:45 +00:00
fbfc530442 [export][ez] Fix forward D67044185 (#143193)
Summary: Fixing forward D67044185 and T210459833 by adding the missing build file.

Test Plan: buck2 build --flagfile fbcode//mode/opt fbcode//admarket/training_data/augmentation/processors/tests:model_manager_test

Differential Revision: D67200056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143193
Approved by: https://github.com/tugsbayasgalan
2024-12-13 16:06:42 +00:00
04bb82f097 Linux Wheels: Remove triton dependency python < 3.13 constraint (#143162)
We do build the pytorch-triton package for Python 3.13: https://github.com/pytorch/pytorch/actions/runs/12304476674/job/34344764271
Hence the constraint is no longer needed.
This stack enabled torch.compile for Python 3.13: https://github.com/pytorch/pytorch/pull/141264
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143162
Approved by: https://github.com/kit1980
2024-12-13 15:08:44 +00:00
810808d97d Enable cutlass-based all-gather matmul when TORCH_SYMM_MEM_ENABLE_NATIVE_ASYNC_TP is set (#142283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142283
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-12-13 10:29:14 +00:00
3e1f587514 [AOTI] Fix an autotune block grid computation issue (#143098)
Summary: There is a grid computation issue after switching to one-pass codegen in https://github.com/pytorch/pytorch/pull/141980. When max-autotune is turned on, there is an incorrect grid codegen in some cases.

Reviewed By: henrylhtsang

Differential Revision: D67120987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143098
Approved by: https://github.com/henrylhtsang
2024-12-13 07:52:30 +00:00
9f90583ca2 [CI] Run aarch64 tests on Graviton3 (#143129)
Graviton3 is Armv8.6, which has SVE and BF16 capability

mkldnn_pattern_matcher skips are tracked in https://github.com/pytorch/pytorch/issues/143146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143129
Approved by: https://github.com/digantdesai
2024-12-13 07:39:22 +00:00
c37185c76a [BE] Stop using deprecated APIs in mkldnn_pattern_matcher (#143156)
This should fix
```
/var/lib/jenkins/workspace/test/inductor/test_mkldnn_pattern_matcher.py:157: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
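
For reference, a minimal sketch (not from the PR) of the replacement the warning suggests:

```python
import torch

# was: torch.cpu.amp.autocast(dtype=torch.bfloat16)
with torch.amp.autocast("cpu", dtype=torch.bfloat16):
    pass  # autocast-enabled region
```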

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143156
Approved by: https://github.com/kit1980
2024-12-13 06:37:20 +00:00
075905b7bd [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-12-13 06:22:13 +00:00
72fd7abb35 [ca] fix flex attention backward HOP capture in initial graph (#143155)
FIXES https://github.com/pytorch/pytorch/issues/142313

So with previous HOPs, compiled autograd could just inline into their body and get their post-dispatch aten representation. You can't do that with this flex attention HOP, which just wants any proxy tracing mechanism to insert it into its graph. Okay, compiled autograd does use proxy tracing, so we can do that.

This is safe because, other than the reenter_make_fx call, there was no other use of make_fx internals in the HOP. And compiled autograd specializes on the AOT backward's saved symints, which should cover any changes in shapes to the inputs of the HOP.

However, there's still an issue: Dynamo doesn't know how to handle `FlexAttentionBackwardHOP` and will graph break, so the flex attention backward runs in eager as of this PR. The tlparse output looks quite broken after the compiled autograd capture: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpMMHBEH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143155
Approved by: https://github.com/drisspg
2024-12-13 06:04:39 +00:00
b4f4c75e19 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-13 05:13:05 +00:00
b5d8d2444a add README.md for compile time benchmarks (#143145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143145
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517, #143143
2024-12-13 05:12:26 +00:00
b7ad52abb0 Use new group instead of split group on non-CUDA device (#141469)
Motivation:

Currently, `split_group` only works for the NCCL backend (https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L4745), so we need to use `new_group` on other, non-CUDA devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141469
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-13 05:11:33 +00:00
57c46af47a [Inductor][CPU] Add torchao da8w8 pattern with sym quantized act & wgt (#142110)
### Summary

Extends #142036 for Inductor pattern-matching pattern covered for torchao API `int8_dynamic_activation_int8_weight` in the following scenario (inference-only, freezing enabled) -

- int8 symmetrically quantized activation (per-token quantized).
- Statically, per-channel, symmetrically int8-quantized weights, which are constant because freezing is enabled (their scales are also constant; they would have been constant even in the case of dynamic quantization, since the weights themselves are constant).

The pattern that's matched is `torch._int_mm` -> convert to FP32/BF16 -> [optional expand for activation scale] -> `mul` -> `mul`.

We don't check whether the activation is dynamically quantized or whether the weights are statically quantized, though (since the implementation won't have any side-effects even if that weren't true).

In practice, it also matches the smooth-quant int8 quantized linear pattern if its output is not reshaped (if activation is 2D).
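
A hedged, eager-mode sketch of the computation this pattern corresponds to (names and shapes are illustrative, not from the PR, and assume `torch._int_mm` is available for the device in use):

```python
import torch

def da8w8_linear(act_int8, wgt_int8, act_scale, wgt_scale):
    acc = torch._int_mm(act_int8, wgt_int8)  # int8 x int8 -> int32 GEMM
    out = acc.to(torch.float32)              # convert to FP32/BF16
    out = out * wgt_scale                    # per-channel weight scale
    out = out * act_scale                    # per-token activation scale
    return out

act = torch.randint(-128, 127, (32, 64), dtype=torch.int8)
wgt = torch.randint(-128, 127, (64, 48), dtype=torch.int8)
out = da8w8_linear(act, wgt, torch.rand(32, 1), torch.rand(48))
```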

### More details

oneDNN int8 matmul supports application of per-channel weight scale but not a vector activation scale, which could be applied as a post op, but is currently unsupported in ATen. Bias addition (which could be supported with an add post-op) is also unfused.

The fusion pattern used in this PR is `torch._int_mm` -> convert to FP32/BF16 -> `mul`, which will be replaced by the oneDNN qlinear op.

The speedup over eager-mode is due to 2 reasons -
1. fusion of int8xint8 -> int32 GEMM, conversion to FP32/BF16 & application of weight scale. (In case of BF16, many intermediate conversions are also avoided).
2. weight is pre-packed & cached by Inductor, so a reorder is avoided at run-time.

But, in the future, the whole pattern (including the application of the activation scale, which would be a mul post-op) + bias could be fused if corresponding support were enabled in ATen.

### Verification

Added UT in this PR
```
python test/inductor/test_mkldnn_pattern_matcher.py -v -k test_da8w8_sym_act_sym_wgt_with_int_mm
```

#### Corresponding torchao UTs

1. int8 Smoothquant legacy API - `TORCHINDUCTOR_FREEZING=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" python test/integration/test_integration.py -v -k test_non_dynamically_quantizable_linear`.
The difference from #139595 is that there are no reshapes of the linear output in this pattern.

2. int8 da8w8 - symmetrically quantized activation (dynamically) & statically quantized weights -  ` TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor" TORCHINDUCTOR_FREEZING=1 python test/integration/test_integration.py -v -k test_int8_dynamic_quant_subclass_api_0_cpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142110
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #142036
2024-12-13 04:59:03 +00:00
b731ced91f Prologue Fusion (#134532)
This PR extends our existing ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into the inputs of triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute indices on each iteration from the k_idx of that loop. This did not make any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
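
For instance, a dequant-into-matmul candidate might look like this in user code (a hedged sketch; the dtypes, shapes, and CUDA device are illustrative assumptions, and whether it actually fuses depends on autotuning):

```python
import torch

@torch.compile(mode="max-autotune")
def dequant_mm(a_int8, scale, b):
    # The dequant (cast + scale) is a prologue feeding the matmul template,
    # rather than a separately materialized fp16 copy of `a`.
    return (a_int8.to(torch.float16) * scale) @ b

a = torch.randint(-128, 127, (64, 64), dtype=torch.int8, device="cuda")
scale = torch.rand(64, 1, dtype=torch.float16, device="cuda")
b = torch.randn(64, 64, dtype=torch.float16, device="cuda")
out = dequant_mm(a, scale, b)
```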

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This rules out reliably unprofitable fusions like fp32->fp16 casts inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now have essentially separate kernels for each input, and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-13 04:18:25 +00:00
ceb664aca6 add float_args benchmark (#143143)
71% improvement with automatic dynamic float arguments

with specialize_float=False
```
float_args,compile_time_instruction_count,346293869
```

with specialize_float=True
```
float_args,compile_time_instruction_count,1198546486
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143143
Approved by: https://github.com/laithsakka
ghstack dependencies: #141517
2024-12-13 03:35:59 +00:00
ab04f3aee1 [ca] set autograd graph task state (#143108)
GraphTask holds the metadata needed for a single execution of backward(); it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.

In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)

If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)

The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.

FIXES https://github.com/pytorch/pytorch/issues/142862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
2024-12-13 03:10:48 +00:00
dbe4b69df0 [Inductor] Fix cooperative reduction tests broken in recent refactor (#143135)
These tests were broken by https://github.com/pytorch/pytorch/pull/142020. This PR updates the fixed configs accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143135
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-12-13 02:03:43 +00:00
9f5ebf3fc6 Clang-format aten/src/ATen/native/Tensor*{cpp,h} (#143089)
These files are relatively stable, so it should be safe to format them without incurring conflicts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143089
Approved by: https://github.com/albanD
2024-12-13 00:06:48 +00:00
2533a5a843 upgrade sccache to 0.9.0 (#142854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142854
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-12-12 22:49:50 +00:00
fb93462904 [Reopen][Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#142036)
Reopen of https://github.com/pytorch/pytorch/pull/139595

**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` only supports a scalar activation scale, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Differential Revision: [D66796966](https://our.internmc.facebook.com/intern/diff/D66796966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142036
Approved by: https://github.com/jerryzh168, https://github.com/jgong5

Co-authored-by: sanchitintel <sanchit.jain@intel.com>
2024-12-12 21:18:03 +00:00
602c86a420 [DSD] Fix strict=False case for DDP (#143038)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143038
Approved by: https://github.com/mori360
2024-12-12 21:15:21 +00:00
a7509e98c5 [pipelining] fix backward_one_chunk when the output of the model is a… (#142237)
fixes #142229

if any element of ``stage_output`` is a view, it cannot be detached in place. Replacing it with ``t = t.detach()`` or similar would not free the graph for the output given to the user. Detaching the base tensor could cause a side effect.

The same code is used in ``_backward.py`` (b64a537993/torch/distributed/pipelining/_backward.py (L215)) but does not seem to cause any issue in my case. It may need some investigation.
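
A minimal illustration (not from the PR) of the in-place detach restriction on views:

```python
import torch

base = torch.randn(4, requires_grad=True) * 2   # non-leaf tensor requiring grad
view = base[:2]                                  # an autograd view of it

try:
    view.detach_()                               # in-place detach of a view
except RuntimeError as e:
    print(e)  # e.g. "Can't detach views in-place. Use detach() instead ..."

detached = view.detach()  # allowed, but only rebinds the new reference, as noted above
```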

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142237
Approved by: https://github.com/H-Huang
2024-12-12 20:59:35 +00:00
39cacc1d81 Fix missing tests on test tool lint job (#143052)
A follow-up from https://github.com/pytorch/pytorch/pull/142476#discussion_r1878888558 where some tests are not discovered correctly by pytest

### Testing

https://github.com/pytorch/pytorch/actions/runs/12287448581/job/34289531307?pr=143052#step:14:162 shows the correct number of tests now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143052
Approved by: https://github.com/ZainRizvi
2024-12-12 20:29:32 +00:00
82ce888273 c10::string_view -> std::string_view in more places (#142517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142517
Approved by: https://github.com/malfet
2024-12-12 19:45:59 +00:00
0b75b7ff2b [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-12-12 19:03:26 +00:00
c170248b78 [Profiler] Enable Iterative Step without profiler in fbcode (#142077)
Summary: Adds a post-optimizer hook for fbcode so that we can run iterative on-demand tracing without having to use a frontend profiler interface. Since this is being used more frequently, it is convenient for users to be able to trigger this on-demand feature without having to worry about being within some timing window.

Test Plan: Ran iterative tracing without profiler.profile

Differential Revision: D66734119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142077
Approved by: https://github.com/briancoutinho
2024-12-12 19:00:13 +00:00
e3fe5f62b6 Remove Checkout pytorch/builder for Linux Binary Builds (#143125)
Follow Up after: https://github.com/pytorch/pytorch/pull/142282

Remove Checkout pytorch/builder for Linux Binary Builds
I believe we were already not using builder. Hence remove this checkout.
We should be using scripts from this folder:
```
/pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh
```

TODO: Will followup with removing BUILDER_ROOT everywhere from PyTorch repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143125
Approved by: https://github.com/kit1980
2024-12-12 18:55:00 +00:00
d48b16a725 Revert "[Dynamo] only import einops if version is lower than 0.7.0 (#142847)"
This reverts commit 357e261b1eded933d98de18ddcef2b083f87259d.

Reverted https://github.com/pytorch/pytorch/pull/142847 on behalf of https://github.com/atalman due to Breaks binary builds, see the comment above ([comment](https://github.com/pytorch/pytorch/pull/142847#issuecomment-2539759580))
2024-12-12 18:44:35 +00:00
b0c3d39e0d [pipelining] Update tutorials and documentation (#143045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143045
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 18:42:17 +00:00
ee5bceaee6 [sigmoid] Write the new export schema format to archive without breaking compatibility. (#142511)
Summary:
This diff makes it possible to migrate from sigmoid to PyTorch's OSS export schema. Basically, we add a new field called "methods" to ExportedProgram in the Model definition, which contains the thrift schema generated based on schema.py from OSS. This way, we can keep writing the old fields while double-writing the new format in an equivalent form. Since thrift doesn't support inlining type definitions, we do it manually here, and it shouldn't break on-wire compatibility. As long as every sigmoid user uses sigmoid.frontend.serialization.serialize, we always guarantee that the new format is saved as well.

Eventually we will use json deserialization from OSS, so we will only keep this double-writing for a couple of months. In time, we will migrate every serialization path to the OSS workflow.

Test Plan:
buck test mode/opt sigmoid/frontend:serialization_test
buck test mode/opt sigmoid/frontend/test_gpu:serializer_test

Differential Revision: D67044185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142511
Approved by: https://github.com/desertfire
2024-12-12 18:41:10 +00:00
5dabe2d464 Fix NJT backward tests (#143072)
This PR fixes some issues with NJT backward / compile backward tests:
1. `requires_grad` was not being propagated appropriately during `SampleInput` generation, so a lot of backward cases were untested before (sad times). This PR utilizes a helper function `_clone()` to clone() / detach() NJTs for SampleInputs while preserving `requires_grad` status. Note: the clone() / detach() handling is for autograd; we can't have two SampleInputs as part of the same autograd graph.
2. Per-sample skips weren't fully working; the op logic would still be invoked even with a skip. I found this out thanks to `split_with_sizes`, which segfaults during backwards because it tries to use an NST-specific formula. As annoying as it is, I tried a number of approaches but ultimately had to split the `subtest_ctx` into that plus a `skip_xfail_ctx` within which to run the subtests.
    * Updated all uses of per-sample skips / xfails: 4 in `test_nestedtensor.py` and 1 in `test_vmap.py`
3. Added the appropriate skips / xfails to get everything passing. There are a great many bugs to fix!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143072
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-12-12 18:06:23 +00:00
d47a80246a [dynamo][pytree][3/N] make CXX pytree traceable: tree_map / tree_map_ (#137399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137399
Approved by: https://github.com/jansel
ghstack dependencies: #137398
2024-12-12 18:05:25 +00:00
7edeb1005a [dynamo][pytree][2/N] make CXX pytree traceable: tree_flatten / tree_unflatten / tree_structure (#137398)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137398
Approved by: https://github.com/jansel
2024-12-12 18:05:25 +00:00
c85323c5e8 Revert "Tests Generelization for multiple accelerator devices (#139184)"
This reverts commit b576a8c318201b63269f7ff25ec5830d00662a7a.

Reverted https://github.com/pytorch/pytorch/pull/139184 on behalf of https://github.com/clee2000 due to Failing internally when trying to pickle distributed test files D67098795 ([comment](https://github.com/pytorch/pytorch/pull/139184#issuecomment-2539610187))
2024-12-12 17:48:30 +00:00
2f0fe82f6d Revert "[14/N] Fix extra warnings brought by clang-tidy-17 (#141644)"
This reverts commit 24a5a2ef258d2b482ded674cdb9555afaf081402.

Reverted https://github.com/pytorch/pytorch/pull/141644 on behalf of https://github.com/clee2000 due to failing internally D67112938 ([comment](https://github.com/pytorch/pytorch/pull/141644#issuecomment-2539602023))
2024-12-12 17:43:36 +00:00
dc23f1944a Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-12 17:39:14 +00:00
7667235a23 c10::optional -> std::optional (#142514)
Fixes issues introduced in https://github.com/pytorch/pytorch/pull/141348 and https://github.com/pytorch/pytorch/pull/139578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142514
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-12 17:23:46 +00:00
520ba556cd [Inductor] Refactor "r" reduction prefix to {"r0_", "r1_"}. (#142020)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

This PR changes the `RINDEX` / `"r"` symbol type to `(R0_INDEX, R1_INDEX)` and `("r0_", "r1_")`, respectively. This allows the relevant code to support 2D (often ND) reductions. Unlike the parent PR, this one does not change the tiling algorithm, so `"r1_"` is never used. However, it prepares other parts of the system to handle `"r1_"` once we start using it. This should significantly reduce the chances of hitting merge conflicts, making the parent PR much easier to land.

The only change to the generated triton code is to rename `"rindex"` -> `"r0_index"`, `"RBLOCK"` -> `"R0_BLOCK"`, etc. To maintain compatibility with existing codegen, this also generates aliases to the old reduction variables like `rindex = r0_index`. If we generated 2D reductions (which this PR will not do), the aliases would be more complicated and would collapse 2D multi-indices to linear indices. See some example kernels in the parent PR.

These aliases can be eliminated by the Triton compiler, and should not impact the final machine code running on the GPU. See the perf testing in the parent PR which confirms the aliases do not impact perf.

# Test plan

The existing CI provides good coverage. This PR modifies the expected code in a few places, renaming reduction variables from `r.*` to `r0_.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142020
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-12-12 17:22:20 +00:00
cf538efd0c Revert "Hide torch_python symbols (#142214)"
This reverts commit da76e912a4c58c649061fc84b29a42714897a0ca.

Reverted https://github.com/pytorch/pytorch/pull/142214 on behalf of https://github.com/huydhn due to The MacOS failure looks legit as it shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/142214#issuecomment-2539543504))
2024-12-12 17:15:51 +00:00
15ee2960e1 [aot] Functionalize aot backward prologue and epilogue wrappers (#142415)
For functional compiled autograd, we're having dynamo trace through the aot backward implementation. To avoid graph breaking and imposing too many restrictions, we allow_in_graph the prologue and epilogue. This adds 2 restrictions:
- code must be available in the global context
- inputs other than tensors/symnodes must be const foldable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142415
Approved by: https://github.com/bdhirsh
2024-12-12 17:14:29 +00:00
30b61e521c [logging] Populate compile_time_autotune_time_us (#143104)
See testing in attached diff

Differential Revision: [D67128210](https://our.internmc.facebook.com/intern/diff/D67128210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143104
Approved by: https://github.com/ezyang
2024-12-12 17:08:43 +00:00
e3ddc0ca33 Support remote caching requiring redis auth (#141679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141679
Approved by: https://github.com/masnesral
2024-12-12 17:07:50 +00:00
0f78be5573 Fix search icon (#142808)
Removing:

```
.pytorch-left-menu-search input[type=text] {
    background-image: none;
}
```

so that the search icon correctly appears in the Sphinx search box.

Also, fixing scrolling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142808
Approved by: https://github.com/albanD
2024-12-12 16:09:30 +00:00
725526abc5 Fix scan dtypes (#143048)
Fix for https://github.com/pytorch/pytorch/issues/142883. We weren't getting test coverage of scan because the tests were being skipped; see https://github.com/pytorch/pytorch/issues/143053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143048
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister
2024-12-12 15:57:00 +00:00
d83a049232 [EZ] Update lintrunner in CI to 0.12.7 (#143073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143073
Approved by: https://github.com/wdvr
2024-12-12 15:35:37 +00:00
7cc3a591c2 [FlexAttention] Fix a few more symbolic shape issues (#142816)
# Summary

See  https://github.com/pytorch/pytorch/issues/139064 for more details. This fixes a number of issues with dynamic shapes. Thanks to @alexdremov for finding most of these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142816
Approved by: https://github.com/yanboliang, https://github.com/ezyang
2024-12-12 15:29:21 +00:00
84f791381a Python 3.13 CI add crossref test to existing linux-focal-py3_13-clang10-build (#143074)
Add  linux-jammy-py3_13-gcc11-build and test - similar to Py 3.9
Add crossref test to existing linux-focal-py3_13-clang10-build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143074
Approved by: https://github.com/malfet
2024-12-12 14:45:56 +00:00
cd1b5924d5 Revert "[Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)"
This reverts commit 79cf8fa75176a8f6bb79d426c6d0f9369d03ff98.

Reverted https://github.com/pytorch/pytorch/pull/142360 on behalf of https://github.com/jeanschmidt due to seems to have broken macos tests ([comment](https://github.com/pytorch/pytorch/pull/142360#issuecomment-2539143039))
2024-12-12 14:42:55 +00:00
30e2b322a1 Add <string> to uninteresting_files (#142984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142984
Approved by: https://github.com/albanD, https://github.com/IvanKobzarev
2024-12-12 14:35:30 +00:00
91261107e0 debug handler maintain through decomposition (#141612)
Add checks in the ao numberic debugger to guard the debug handle consistency between aten op decomposition

Differential Revision: [D66517480](https://our.internmc.facebook.com/intern/diff/D66517480/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141612
Approved by: https://github.com/jerryzh168
2024-12-12 12:26:45 +00:00
18785c1af9 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-12 10:53:48 +00:00
a5fb07af27 [Torch][#142396]Resolve Failure When Uploading To Remote Storage (#143046)
Summary: Catch the io.UnsupportedOperation exception so that streams without fileno support don't cause a failure

Test Plan: UT

Differential Revision: D67108487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143046
Approved by: https://github.com/saumishr
2024-12-12 08:17:15 +00:00
497f89ff83 fix dynamo nn module stack fqn (#142823)
Dynamo can produce sources that have funny patterns in their `.name()` that break `nn_module_stack` fqns. Added a test that used to have `._modules` inside nn_module_stack fqns, now doesn't. (Unfortunately couldn't repro a case mentioned in the GH issue where `.slice(...)` is claimed to appear as well.)

Fixes https://github.com/pytorch/pytorch/issues/141939

Differential Revision: [D67064189](https://our.internmc.facebook.com/intern/diff/D67064189/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142823
Approved by: https://github.com/pianpwk, https://github.com/zhxchen17
2024-12-12 07:02:13 +00:00
da76e912a4 Hide torch_python symbols (#142214)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142214
Approved by: https://github.com/ezyang
2024-12-12 07:00:54 +00:00
dcb128d495 [ROCm] TunableOp use thread-safe getenv functions (#142274)
Fixes #142403

~~PR fixes breakage due to this commit 8cd7ad8b48~~

PR is a partial reland of this https://github.com/pytorch/pytorch/pull/140594 with a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142274
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-12 06:49:26 +00:00
5ad7d5304c [DTensor][random] add HSDP+TP model init test (#143077)
**Summary**
1. Move the model init tests from `DistTensorRandomOpTest` to `DistTensorRandomInitTest`
2. Added an HSDP+TP meta init test to show the correct model init result in this use case. Note that this test requires 8 GPUs to run and our CI doesn't have that capacity, so this test will be skipped in CI. A local run shows that the test passes on an 8-GPU host.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_hsdp_tp_model_meta_init`

<details>
<summary> Test Result </summary>
<img width="3343" alt="image" src="https://github.com/user-attachments/assets/a960c5e6-37bc-49be-9e36-ecc29ed47eb0" />

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143077
Approved by: https://github.com/weifengpy
2024-12-12 06:46:16 +00:00
357e261b1e [Dynamo] only import einops if version is lower than 0.7.0 (#142847)
Fixes internal xref (https://fb.workplace.com/groups/257735836456307/posts/804793021750583/?comment_id=805229281706957&reply_comment_id=805232695039949)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142847
Approved by: https://github.com/zou3519
2024-12-12 06:38:22 +00:00
9701c50bdc [Dynamo] Add missing tensor builtins to allowed functions (#142841)
Fixes https://github.com/pytorch/pytorch/issues/141232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142841
Approved by: https://github.com/yanboliang
2024-12-12 06:38:19 +00:00
b25f64b613 Add -o pipefail for all bash scripts (#143050)
Fixes #142380
I have added `-o pipefail` to all bash scripts in pytorch/.ci/pytorch. Sorry I didn't double-check the submodule in my last PR; thanks for the correction! Please contact me again if there are any problems with this fix. (Contributing to the open source community is an assignment for one of my courses and today is the deadline, so I rushed to revise it when I saw the email early in the morning.)
 @ezyang @malfet @huydhn @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143050
Approved by: https://github.com/ezyang, https://github.com/huydhn

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
2024-12-12 06:18:41 +00:00
79cf8fa751 [Inductor] Use sleef implementation for CPP backend asinh codegen (#142360)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/142345. Previously, we used `asinh(x) = log(x + sqrt(1 + x**2))` to calculate the result of `asinh`; the issue happens for an input like `-10000.1`, which makes `x + sqrt(1 + x**2)` collapse to 0 in floating point, and log(0) is invalid. We use the `sleef` implementation in this PR to fix this issue.
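
A small repro of the cancellation described above (illustrative; exact values depend on dtype):

```python
import torch

x = torch.tensor([-10000.1])
naive = torch.log(x + torch.sqrt(1 + x * x))  # x + sqrt(1 + x**2) collapses to ~0
print(naive)           # -inf (or a wildly wrong value)
print(torch.asinh(x))  # ~ -9.9035, the reference result
```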

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_asinh_with_corner_inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142360
Approved by: https://github.com/jgong5
2024-12-12 05:40:48 +00:00
1e2b841675 [ROCm] Prune old gfx archs gfx900/gfx906 from binaries (#142827)
Remove the gfx900 and gfx906 archs, as they're long in the tooth. This should help reduce the increasing size of the ROCm binaries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142827
Approved by: https://github.com/jeffdaily
2024-12-12 05:33:40 +00:00
fda43c98d1 Improve implementation of quantized_batch_norm (#141570)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141570
Approved by: https://github.com/albanD
2024-12-12 04:35:00 +00:00
20df80a669 Remove unneeded optional dereference (#141578)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141578
Approved by: https://github.com/swolchok
2024-12-12 04:34:43 +00:00
f7b9533c3f [4/N] Apply bugprone-unchecked-optional-access (#142832)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142832
Approved by: https://github.com/albanD
2024-12-12 04:33:32 +00:00
fbbafd0320 Turn on AOTAutogradCache by default on open source (#141981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141981
Approved by: https://github.com/bdhirsh, https://github.com/oulgen
2024-12-12 04:21:11 +00:00
4d0775462e E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently we @parametrize on:
"ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp (context parallelism) to enable a 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-12 04:19:29 +00:00
2903cf0ad8 Re-enable some C++ warnings (#142332)
This enables some C++ warnings since the code base is fairly clean. Meanwhile, -Wextra-semi is disabled on CUDA-generated code since there is no way to fix those warnings without the cooperation of the CUDA team.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142332
Approved by: https://github.com/albanD, https://github.com/eqy
2024-12-12 04:02:12 +00:00
f892f9862a [ROCM] Enable *_load_dwordx4 ISA for BFloat16 and Half. (#141397)
Remove the input_vec_size constexpr and move it to a template parameter. This enables generation of vectorized loads in the ROCm AMDGPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141397
Approved by: https://github.com/jeffdaily

Co-authored-by: Jerry Mannil <jerry.mannil@amd.com>
2024-12-12 03:27:49 +00:00
4d8357e912 [CD] Use Anaconda cmake for Mac builds (#143054)
To find Anaconda-env-installed OpenMP
(as OpenMP from PyPI looks for it in different places)

For posterity: our build script names are very confusing:
 - [`.ci/wheel/build_wheel.sh`](6cb6e8d790/.ci/wheel/build_wheel.sh) is only used for MacOS wheel/libtorch builds
 - [`.ci/manywheel/build.sh`](6cb6e8d790/.ci/manywheel/build.sh) are used for Linux wheel/libtorch builds
 - [`.ci/pytorch/windows/build_pytorch.bat`](6cb6e8d790/.ci/pytorch/windows/build_pytorch.bat) is used for Windows wheel builds

Fixes https://github.com/pytorch/pytorch/issues/142873
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143054
Approved by: https://github.com/Jack-Khuu, https://github.com/atalman
2024-12-12 03:05:46 +00:00
cb354f8b47 [PGNCCL] Move NCCLComm impl to cpp (#142826)
BE as titled. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142826
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-12-12 02:45:52 +00:00
06075d3d18 [Inductor][CPP] Fix Mask Dtype mismatch (#142103)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/141559. The `vec_mask` store data type isn't aligned when doing `bitwise_and`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142103
Approved by: https://github.com/jgong5
2024-12-12 01:21:32 +00:00
d68403df3b filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-12 01:18:34 +00:00
6cb6e8d790 Python 3.11, 3.12 Remove tests covered by 3.13 (#143078)
We do have linux-focal-py3_13-clang10-build and test. Hence removing linux-focal-py3_11-clang10-build/test and linux-focal-py3_12-clang10-build/test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143078
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:12:00 +00:00
1dd6f21029 Cuda 12.1 - Remove from trunk tests (#143076)
Remove cuda 12.1 from trunk tests. This is covered by 12.4 tests.
Move ``libtorch-linux-focal-cuda12_4-py3_7-gcc9-debug-build`` -> ``libtorch-linux-focal-cuda12_4-py3_10-gcc9-debug-build``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143076
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-12 01:10:09 +00:00
bd7d81db9e Use validate-docker-images workflow from test-infra (#143081)
After PR: https://github.com/pytorch/test-infra/pull/6029 use validate-docker-images.yml from test-infra.
Related to: https://github.com/pytorch/builder/issues/2054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143081
Approved by: https://github.com/huydhn
2024-12-12 00:24:27 +00:00
db81a3f31c [TorchGen] remove remove_non_owning_ref_types from valuetype_type (#142449)
It is not used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142449
Approved by: https://github.com/ezyang
2024-12-12 00:15:44 +00:00
1b3f8b7589 Revert "[RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 209119424922b135fef39aba1f25da3b67f5879a.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:18 +00:00
dfe5669076 Revert "[RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit 734bb01460d59e661e9114e7aa17e04821e4b57a.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the new test is still very flaky on MacOS even when it does not segfault anymore ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537256522))
2024-12-11 21:47:17 +00:00
cd50bd8477 Revert "[BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)"
This reverts commit fb02b40d27737213e0547dec0e30977dfc50f2f3.

Reverted https://github.com/pytorch/pytorch/pull/140542 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert this in order to revert https://github.com/pytorch/pytorch/pull/133572#issuecomment-2537204202 due to a conflict ([comment](https://github.com/pytorch/pytorch/pull/140542#issuecomment-2537253665))
2024-12-11 21:44:23 +00:00
de313f1155 [foreach_map] Initial foreach map HOP impl for inference (#142098)
This is the initial foreach map HOP for pointwise ops which will be extended in the future to support grouped GEMMs and other ops.

This PR utilizes the PrimHOPBase class to represent foreach_map as a HOP with a single subgraph. The way this is implemented is that the user API `foreach_map` takes a single pointwise torch op, and internally this function calls a polyfill which has the same semantics as a foreach op (i.e., it iterates over lists of operands, applying the op elementwise). The higher order op is passed through the stack down to inductor, where a lowering in essence inlines the subgraph into the main graph. This is done by interpreting it with a pointwise subgraph lowering, grouping the outputs by device, and registering the output buffers as foreach groups as applicable. For testing, I was able to reuse the existing foreach tests by creating a wrapper function which matches the foreach op interfaces for those tests and then running all of the existing foreach tests on foreach_map.
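
As a rough mental model, the polyfill semantics described above amount to something like this (a hedged sketch; `foreach_map_polyfill` is an illustrative name, not the actual API):

```python
import torch

def foreach_map_polyfill(op, xs, ys):
    # foreach-op semantics: apply the pointwise op elementwise over the paired lists
    return [op(x, y) for x, y in zip(xs, ys)]

xs = [torch.randn(4) for _ in range(3)]
ys = [torch.randn(4) for _ in range(3)]
out = foreach_map_polyfill(torch.add, xs, ys)  # matches torch._foreach_add(xs, ys)
```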

TODO before landing:
* Add tests for general functions
* Test warning if unsupported op will block fusion

Followups:
* I need to add tests for backwards (this will be a followup PR because backwards will  require other work as well)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142098
Approved by: https://github.com/eellison
2024-12-11 21:32:11 +00:00
bd199bc754 [EZ] Move slow job from CU12.1 to CU12.4 (#142856)
I though it was migrated a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142856
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-12-11 21:12:35 +00:00
688f44824b DistributedDataParallel: add init_sync option to control collectives during initialization (#142824)
This controls whether or not we run collectives during the DDP init function. This makes it easier to use fault tolerant ProcessGroup implementations that may not be starting at the same time.

torchft uses a dummy process group and a comm hook to get around these checks. With this change torchft can use the normal ProcessGroup API via the stock comm hook.

https://github.com/pytorch-labs/torchft/blob/main/torchft/ddp.py#L50-L59
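
A hedged usage sketch (assumes a process group is already initialized; only the `init_sync` flag named in the title is new here):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(8, 8)
# init_sync=False skips the collectives DDP normally runs during construction,
# which helps when ranks of a fault-tolerant process group start at different times.
ddp_model = DDP(model, init_sync=False)
```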

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142824
Approved by: https://github.com/wconstab, https://github.com/fegin, https://github.com/H-Huang
2024-12-11 20:28:38 +00:00
fd65bd755d [BE] replace incorrect .. note:: invocations (#142868)
Something I've noticed is that a lot of the distributed sites don't render on our docs at all, but if they ever do, the notes will render properly now 😛

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142868
Approved by: https://github.com/albanD
2024-12-11 19:58:18 +00:00
0b96413dbf Upgrade expecttest to 0.3.0 (#142869)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142869
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-11 19:04:16 +00:00
e5f08c0cbf [TorchGen] Remove cpp_type_registration_declarations (#142452)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142452
Approved by: https://github.com/ezyang
2024-12-11 19:01:36 +00:00
e228381846 [TorchGen] Simplify argument_type_str (#142491)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142491
Approved by: https://github.com/ezyang
2024-12-11 19:01:20 +00:00
42d4eec5f3 Don't install lintrunner on S390 (#142876)
Not sure if there are many users of this platform, but hopefully this will fix https://github.com/pytorch/pytorch/issues/142872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142876
Approved by: https://github.com/jeanschmidt
2024-12-11 18:54:12 +00:00
e647b6d590 Fix undesired specialization on slice after split. (#142372)
Fix: #141251

This PR adds a few static guard checks when decomposing and lowering the `slice`
operation, so that we avoid adding unnecessary guards. Specifically, when clamping the end
values.

In summary, the changes are:

- `slice` dynamo decomposition: checks `end >= sizes[dim]` statically. If we don't know
  that, the following guard ensures that we (don't) need clamping.
- `evaluate_min` inductor `sizevar` function: checks whether we can solve it statically or
  not, before actually creating a new guard.

The latter had to be changed because `evaluate_min` (called by the `ir.SliceView` constructor)
would always try to create a guard based on the hint operation result. However, if both
the `left` and `right` hints were true, it would default to a `left <= right` guard. By checking
the guards statically beforehand, we can avoid that.

```python
N = 16

@torch.compile(backend="inductor", dynamic=False, fullgraph=True)
def fn(x):
    splits = torch.ops.aten.split.Tensor(x, N)
    first = splits[0]
    return torch.ops.aten.slice.Tensor(first, 0, 0, N)

x = torch.arange(N)
torch._dynamo.mark_dynamic(x, 0)

fn(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142372
Approved by: https://github.com/ezyang
2024-12-11 18:52:17 +00:00
0ddb33ba22 [ONNX] Avoid overwriting overlapped decomposed functions (#142831)
Fixes #141770

The decomposed function in `torch.export.default_decompositions().items()` is overwritten by `torch._decomp.decomposition_table`. From the `torch.onnx.export()` perspective, we should rather respect the table of decompositions in `torch.export.default_decompositions().items()` and avoid overwriting it with `torch._decomp.decomposition_table`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142831
Approved by: https://github.com/justinchuby
2024-12-11 18:47:40 +00:00
c632e29774 [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile need to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are of torch.SymInt type, and true_fn and false_fn will have closures of torch.SymInt type. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we previously handled SymBool inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #142185
2024-12-11 18:46:58 +00:00
a8fa98ccef skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when we run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
2024-12-11 18:46:58 +00:00
24a5a2ef25 [14/N] Fix extra warnings brought by clang-tidy-17 (#141644)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141644
Approved by: https://github.com/ezyang
2024-12-11 18:40:42 +00:00
be27dbf2b8 Enable CPP/CUDAExtension with py_limited_api for python agnosticism (#138088)
Getting tested with ao, but now there is a real test i added.

## What does this PR do?

We want to allow custom PyTorch extensions to be able to build one wheel for multiple Python versions, in other words, achieve python agnosticism. It turns out that there is such a way that setuptools/Python provides already! Namely, if the user promises to use only the Python limited API in their extension, they can pass in `py_limited_api` to their Extension class and to the bdist_wheel command (with a min python version) in order to build 1 wheel that will suffice across multiple Python versions.

Sounds lovely! Why don't people do that already with PyTorch? Well 2 things. This workflow is hardly documented (even searching for python agnostic specifically does not reveal many answers) so I'd expect that people simply don't know about it. But even if they did, _PyTorch_ custom Extensions would still not work because we always link torch_python, which does not abide by py_limited_api rules.

So this is where this PR comes in! We respect when the user specifies py_limited_api and skip linking torch_python under that condition, allowing users to enroll in the provided functionality I just described.
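
A hedged sketch of what opting in might look like in an extension's setup.py (the package name and the cp39 minimum are illustrative assumptions):

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_py_agnostic_ext",
    ext_modules=[
        CppExtension(
            "my_py_agnostic_ext._C",
            ["csrc/ext.cpp"],
            py_limited_api=True,  # promise to stick to the Python limited API
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
    # build a single abi3 wheel usable from CPython 3.9 onward
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```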

## How do I know this PR works?

I manually tested my silly little ultra_norm locally (with `import python_agnostic`) and wrote a test case for the extension showing that
- torch_python doesn't show up in the ldd tree
- no Py- symbols show up
It may be a little confusing that our test case is actually python-free (cleaner than python-agnostic), but it is sufficient (though not necessary) to show that this change works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138088
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-12-11 18:22:55 +00:00
fb02b40d27 [BE][accelerator] formalize API name {current,set}_device_{idx => index} (#140542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140542
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-12-11 17:57:56 +00:00
82aaf64422 [3/N] Apply py39 ruff fixes (#142115)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115
Approved by: https://github.com/ezyang
2024-12-11 17:50:10 +00:00
f7e621c3ce [ROCm] TunableOp do not log during exit (#142818)
Depending on the order of static object destruction, the TunableOp logger is not available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142818
Approved by: https://github.com/jeffdaily
2024-12-11 17:44:29 +00:00
233853a66f Revert "Prologue Fusion (#134532)"
This reverts commit 59ab3825e77451b29c3a118fd24304afcbf52c09.

Reverted https://github.com/pytorch/pytorch/pull/134532 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
f0b80d014d Revert "Update low prec codegen for div/mod (#142350)"
This reverts commit 1fb3d5a4e35d4ea5691287d4ce77da40578bda4a.

Reverted https://github.com/pytorch/pytorch/pull/142350 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
829a93562a Revert "[Easy] factor out inductor ophandler decompositions (#142400)"
This reverts commit fa746e3eeb8e1cdcbe3f47ded9e3ca30efac383c.

Reverted https://github.com/pytorch/pytorch/pull/142400 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:26 +00:00
9e88279737 Revert "Add a pass which analyzes whether a prologue preserves zero mask (#142401)"
This reverts commit 1a0bd402436af4c127817a31c76d7ae47d4668b2.

Reverted https://github.com/pytorch/pytorch/pull/142401 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
b118702a4e Revert "Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)"
This reverts commit f2d8d7b7acf12f079cadc41b9fdd91cbae94daac.

Reverted https://github.com/pytorch/pytorch/pull/142402 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
2dcba6eac8 Revert "dont attempt to fuse in unaligned accesses to mm (#142435)"
This reverts commit 22683195964398b37ba0d539cb1bb55bff197db6.

Reverted https://github.com/pytorch/pytorch/pull/142435 on behalf of https://github.com/clee2000 due to A couple of PRs in this stack are breaking internally on different tests ([comment](https://github.com/pytorch/pytorch/pull/134532#issuecomment-2536643675))
2024-12-11 17:32:25 +00:00
5c97ac9721 Revert "Remove unused Python variables in torch/[_-a]* (#133492)"
This reverts commit fda975a7b3071a20dab8fc2c4e453479e1bb7cf2.

Reverted https://github.com/pytorch/pytorch/pull/133492 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/133492#issuecomment-2536635516))
2024-12-11 17:29:12 +00:00
db51308d9c fix output node name (#142506)
Fixes #142227

Differential Revision: [D67043283](https://our.internmc.facebook.com/intern/diff/D67043283/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142506
Approved by: https://github.com/ydwu4
2024-12-11 17:28:28 +00:00
2374d460d0 Revert "filelock: Make waitcounter variant to use (#139816)"
This reverts commit 237c4b559c0f928dd89cf1e773458a1bdcea0b9d.

Reverted https://github.com/pytorch/pytorch/pull/139816 on behalf of https://github.com/clee2000 due to Sorry, I need to revert this in order to revert something else.  The only thing you need to do is rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/139816#issuecomment-2536616808))
2024-12-11 17:26:46 +00:00
498a7808ff Fix unused Python variables outside torch/ and test/ (#136359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136359
Approved by: https://github.com/albanD
2024-12-11 17:10:23 +00:00
241bf047b3 [Dynamo] Skip some unresolvable tests (#142508)
Fixes #127738
Fixes #127755

In the discussion in https://github.com/pytorch/pytorch/issues/127738 we
determined that this is not fixable, so we're just going to skip the
test.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142508
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502, #142503
2024-12-11 17:00:23 +00:00
00ac4237b2 [Dynamo] stop import third-party astunparse (#142503)
PyTorch's minimum version is 3.9, so we can now use ast.unparse.
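
A quick illustration of the stdlib replacement:

```python
import ast

tree = ast.parse("x = 1 + 2")
print(ast.unparse(tree))  # prints: x = 1 + 2
```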

Test Plan:
- wait for tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142503
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #142502
2024-12-11 17:00:23 +00:00
0268abd627 [Dynamo] Stop importing transformers (#142502)
This import was free because transformers should already have been
imported by this time.

Test Plan:
- CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142502
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang, https://github.com/mlazos
2024-12-11 17:00:22 +00:00
8fd4b26504 Revert "[dynamo] Support multiple inheritance for custom dict construction (#142416)"
This reverts commit a45326b6497e47d01527e141cdd16d91fee94c18.

Reverted https://github.com/pytorch/pytorch/pull/142416 on behalf of https://github.com/clee2000 due to The newly added test is faling internally D67056273 ([comment](https://github.com/pytorch/pytorch/pull/142416#issuecomment-2536537693))
2024-12-11 16:56:26 +00:00
c3b30c283f add additional CK BMM instances (#142409)
Summary: stacked changes to keep new codegen-ed instances below 2000 LOC

Reviewed By: zjing14

Differential Revision: D66738746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142409
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-11 16:54:05 +00:00
d622040ab1 [AOTI] Unskipped test_scaled_dot_product_efficient_attention for ROCm (#142138)
The test should no longer fail for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142138
Approved by: https://github.com/janeyx99
2024-12-11 16:36:04 +00:00
d5e00412c7 sparse_broadcast_to: less memory footprint, fewer kernel launches (#142364)
As per title.

The following implementation removes the usage of `repeat_interleave, tile` and `full_coo_indices` and replaces them with broadcasting. That way we reduce memory traffic (and are likely to hit cache a lot) and the total number of launched kernels.
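
For intuition, here is a toy sketch (purely illustrative, not the actual implementation) of how broadcasting can stand in for `repeat_interleave`/`tile` when expanding COO indices:

```python
import torch

nnz, extra = 3, 4
base_idx = torch.tensor([[0, 1, 2]])         # (1, nnz) original coordinate row
batch_idx = torch.arange(extra).view(-1, 1)  # (extra, 1) new broadcast dimension

# expand() creates broadcast views instead of materializing repeated copies
rows = batch_idx.expand(extra, nnz).reshape(-1)
cols = base_idx.expand(extra, nnz).reshape(-1)
indices = torch.stack([rows, cols])          # 2 x (extra * nnz) COO indices
```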

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142364
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-11 16:09:09 +00:00
eed9bb3a0e allow -E to be in any spot in the compiler command (#142813)
Follow up of TODO in https://github.com/pytorch/pytorch/pull/140614

It was found experimentally that, for one GPU architecture, `sccache` passes `-E` as the 1st, 2nd, or 3rd argument, but it's much better to handle `-E` passed as any argument

No need to worry about exit or elif chains, as `exec` aborts script execution

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142813
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-12-11 15:58:08 +00:00
ee817e8cf3 [ROCm] Second attempt to fix unit test: matmul_small_brute_force_tunableop (#142422)
Fixes #141458
Fixes #141635
Fixes #141636

~~Address OOM issue by clearing PyTorch's caching allocator.~~

Disabling this test on NVIDIA since it doesn't do much on NVIDIA hardware at the moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142422
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-12-11 15:36:37 +00:00
371bcc1e33 [checkpointing][oss] Throw an error when loading a different size than saved tensor (#141571)
Summary: Fixing issue reported in https://github.com/pytorch/pytorch/issues/126604

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner -- --exact 'caffe2/test/distributed/checkpoint:test_planner - test_planner.TestLoadPlanner: test_strict

Differential Revision: D66389578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141571
Approved by: https://github.com/mhorowitz
2024-12-11 15:35:48 +00:00
bacd68107a [inductor] Parenthesize expression in _helper_sqrt (#142352)
Fixes https://github.com/pytorch/pytorch/issues/142328. The implied cast-then-sqrt order matches the behavior of the `halide` backend.

2cc01cc6d3/torch/_inductor/codegen/halide.py (L115-L116)
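
For intuition only (this is not the inductor code itself), parenthesization matters whenever an expression string is spliced into a larger generated expression:

```python
# Hypothetical codegen helper: splice an expression string into a sqrt-with-cast call.
expr = "a + b"
unparenthesized = f"sqrt((float){expr})"    # parses as sqrt(((float)a) + b) -- casts only `a`
parenthesized   = f"sqrt((float)({expr}))"  # casts the whole sum before the sqrt
print(unparenthesized, parenthesized)
```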

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142352
Approved by: https://github.com/ezyang
2024-12-11 15:30:52 +00:00
86300965b6 Add automatic_dynamic_shapes_mark_as == "oblivious" (#141444)
Fixes https://github.com/pytorch/pytorch/issues/137100

Should also add a mark_oblivious API for manual control.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141444
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #141415
2024-12-11 14:39:13 +00:00
e53696bfdb automatic_dynamic_shapes_mark_as (#141415)
This adds an option to cause automatic dynamic shapes to trigger
unbacked SymInts rather than backed SymInts.  This can potentially
help if you are still seeing recompilations from 0/1 specialization
but it also might just cause your program to fail with
GuardOnDataDependent errors.
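
A hedged sketch of flipping the knob named in the title; the value string below is an assumption (the companion PR adds an "oblivious" value):

```python
import torch._dynamo.config as dynamo_config

# Assumed accepted value; the config name comes from the PR title.
dynamo_config.automatic_dynamic_shapes_mark_as = "unbacked"
```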

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141415
Approved by: https://github.com/bobrenjc93
2024-12-11 14:39:13 +00:00
b576a8c318 Tests Generelization for multiple accelerator devices (#139184)
Motivation: Generalize unit tests so that they can be executed on CUDA and non-CUDA devices.
Dependency: #133209, merged now.
There was a previous PR #135242 for these changes, closed due to incorrect commits. I have incorporated the changes suggested in the comments.
@kwen2501  @zeshengzong Please review the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139184
Approved by: https://github.com/kwen2501

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-12-11 13:31:20 +00:00
539c46b6e8 [Dynamo] Add register_hook as in-graph tensor method (#142820)
Fixes https://github.com/pytorch/pytorch/issues/141046
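
A small, hedged sketch of the pattern this enables (hook registered on a graph input inside the compiled region):

```python
import torch

def halve_grad(grad):
    return grad * 0.5

@torch.compile
def f(x):
    x.register_hook(halve_grad)  # tensor method now traceable in-graph
    return (x * 2).sum()

x = torch.randn(3, requires_grad=True)
f(x).backward()
print(x.grad)  # 1.0s instead of 2.0s because the hook halves the gradient
```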

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142820
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
2024-12-11 12:02:03 +00:00
c29b4edbb9 Remove no-op aot_compilation_time (#142490)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142490
Approved by: https://github.com/xuzhao9
2024-12-11 10:37:25 +00:00
30d8b30db7 refactor tensorify restart logic to use sources (#141517)
Differential Revision: [D67066706](https://our.internmc.facebook.com/intern/diff/D67066706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141517
Approved by: https://github.com/ezyang
2024-12-11 07:15:39 +00:00
bdbdbeeb3d Implements nonzero_static on cuda (#141838)
using blockwide cub primitives.
This adds CUDA functionality for nonzero_static, which was missing in https://github.com/pytorch/pytorch/pull/97417.

For `size` approx equal to number of nonzeros, the perf is very close to the regular version, for larger sizes filling in padding  indices takes additional time.
Disabled for cuda <=11.4
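
A usage sketch of the public op whose CUDA path this adds (needs a CUDA device; see the version note above):

```python
import torch

x = torch.tensor([0, 1, 0, 2], device="cuda")
# Fixed-shape output (size, ndim); rows past the true nonzero count are padded with fill_value.
idx = torch.nonzero_static(x, size=4, fill_value=-1)
print(idx)  # tensor([[ 1], [ 3], [-1], [-1]], device='cuda:0')
```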

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141838
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-11 06:44:48 +00:00
1d3b0108a6 [subclass] Fix unwrap subclass parametrization for Nested subclasses (#142481)
@tugsbayasgalan found a bug for nested subclasses:

E.g. we have

TwoTensor(TwoTensor(t1, t2), t0).
After right_inverse we have:

rebuilt_stack == [(TwoTensor, meta, ["a", "b"]), (TwoTensor, meta, ["a", "b"])]
plain_tensors == [t0, t1, t2]
We will first put plain tensors, and only then the nested TwoTensor.

But when we unflatten:
todo = [t0, t1, t2]
we first create TwoTensor(t1, t2),
put it into todo: [t0, TwoTensor(t1, t2)],
and as a result get

 TwoTensor(t0, TwoTensor(t1, t2))
which swaps the original a and b :)

So the fix needs to be different: we need to preserve the order of elements in the stack for plain tensors and subclasses.
I will think about the fix.

Fix:

Keep the order of inner_tensor_attr_names according to the order in which they were added to the stack (first plain-tensor attributes, then subclass attributes).

Test:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Differential Revision: [D67032477](https://our.internmc.facebook.com/intern/diff/D67032477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142481
Approved by: https://github.com/tugsbayasgalan, https://github.com/bdhirsh
2024-12-11 06:05:48 +00:00
7e92b02e09 add test for module list slice (#142520)
Nothing to fix for #142439

Differential Revision: [D67049962](https://our.internmc.facebook.com/intern/diff/D67049962/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142520
Approved by: https://github.com/ydwu4
2024-12-11 05:11:00 +00:00
256bfd1096 Rename 'cache limit' to 'recompile limit' (#141542)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141542
Approved by: https://github.com/oulgen, https://github.com/jansel
2024-12-11 05:05:11 +00:00
84cf94ee0b Make more of the reshape_symint stride calculation oblivious (#142488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142488
Approved by: https://github.com/albanD
2024-12-11 05:04:42 +00:00
921ba0a75e Mark torch._library.custom_ops / torch._dynamo.eval_frame as uninteresting (#142492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142492
Approved by: https://github.com/bobrenjc93
2024-12-11 05:03:30 +00:00
47a571e166 Document that load_inline requires having a compiler installed (#137521)
Prompted by this forum q: https://discuss.pytorch.org/t/are-the-requirements-for-using-torch-utils-cpp-extension-with-cuda-documented-anywhere/211222

Would be curious to know if we could get more precise.
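
For reference, a minimal load_inline example of the kind that fails without a working host compiler on PATH (names are illustrative; the torch extension header is included automatically):

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

# Compiles on the fly; raises if no compatible C++ compiler (and nvcc, for CUDA sources) is found.
ext = load_inline(name="add_one_ext", cpp_sources=cpp_source, functions=["add_one"])
print(ext.add_one(torch.zeros(3)))
```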

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137521
Approved by: https://github.com/zou3519
2024-12-11 03:47:54 +00:00
21833c9642 Added Diffentiable per_sample_weights Check to EmbeddingBag.cpp (#142338)
Added a check in aten/src/ATen/native/EmbeddingBag.cpp that checks if per_sample_weights needs a gradient in order to determine if at::_embedding_bag_forward_only or at::_embedding_bag should run.

Also, added two tests in test_embedding.py that check if the command now works.
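
A brief sketch of the user-visible behavior being checked (per_sample_weights is only supported with mode="sum"):

```python
import torch

emb = torch.nn.EmbeddingBag(10, 4, mode="sum")
indices = torch.tensor([1, 2, 3, 4])
offsets = torch.tensor([0, 2])
weights = torch.rand(4, requires_grad=True)  # differentiable per-sample weights

out = emb(indices, offsets, per_sample_weights=weights)
out.sum().backward()
print(weights.grad.shape)  # torch.Size([4]) -- gradients now flow into the weights
```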

Fixes #136457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142338
Approved by: https://github.com/soulitzer
2024-12-11 03:42:17 +00:00
92cc345683 Implement "torch.mtia.max_memory_allocated" API (#142406)
Summary: This diff implements the interface of the "torch.mtia.max_memory_allocated" API. The internal implementation will be addressed in a separate diff.

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
----------------------------------------------------------------------
Ran 15 tests in 16.862s

OK
I1127 11:31:14.613909 2272144 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-uqJKuNc0 executable has been unloaded
I1127 11:31:14.615438 2272144 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
```

Reviewed By: ttrung149, nautsimon

Differential Revision: D66553954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142406
Approved by: https://github.com/nautsimon
2024-12-11 03:06:18 +00:00
ed388394d1 add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-11 02:45:24 +00:00
082124a322 [Dynamo] Refactor to use install subgraph method in higher order ops (#141384)
Replaced the function in HOP infra with a method on output graph to make it more general and accessible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141384
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382, #141383
2024-12-11 02:22:21 +00:00
c31543c7ae [Dynamo] Initial deduplication pass impl (#141383)
This PR implements the deduplication pass (blocked by config currently) for dynamo where identical regions from https://github.com/pytorch/pytorch/pull/141381 are replaced with a common subgraph.

The two phases of deduplication are explained below.

**Subgraph creation**:
Subgraph creation works by taking one representative region from each region group and creating a subgraph from it, which will then be used to replace all regions in the group. This is implemented by first copying all nodes of the region to the new subgraph and then finding all inputs which are not within the region and creating placeholders for them. For the outputs, all regions in a region group need to be scanned to ensure the largest set of outputs is found, and then an output node is created which returns a tuple of all outputs.

**Graph replacement**:
To replace each region with the extracted subgraph, the node index in the region and argument index within the node's flattened args and kwargs are recorded once during subgraph creation. This allows us to determine which (external to the region) nodes and in which order these nodes are passed as inputs. For the outputs, getitem nodes are created for each output, and all nodes in the region with external outputs are replaced by the proper getitem node. Finally, all original nodes are erased (there should be no uses of these left in the graph).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141383
Approved by: https://github.com/zou3519
ghstack dependencies: #141381, #141382
2024-12-11 02:22:21 +00:00
49e4307686 [Dynamo] add debug logging for graph region expansion (#141382)
This PR adds debug logging for the region expansion algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141382
Approved by: https://github.com/williamwen42
ghstack dependencies: #141381
2024-12-11 02:22:21 +00:00
96c36a6947 [Dynamo] Implement graph region tracking for deduplication (#141381)
This PR implements graph region tracking for later extraction into common subgraphs. The algorithm is as follows:

`GraphRegionTracker` tracks each node added to the output graph and generates a key based on the source location, instruction pointer, input shapes, and global state at the time the node is inserted into the graph. Nodes with the same key are grouped together in a list of identical nodes.

Once graph capture is complete, these nodes are organized into region groups. A region group looks like this:
[[IdenticalNode1], [IdenticalNode2], [IdenticalNode3]] and each sublist is called a region. For each region group (starting at the topologically latest region group), the inner regions are gradually expanded one node at a time from the args and kwargs of the node in each region, provided that for all regions in the group the nodes being added are also identical (i.e. have the same key computed above). The `get_identical_regions` function is the main entry point which will be used by the graph replacement algorithm in #141383

Edge cases to add more testing for in future PRs (in progress):
* ~~multiple nodes on the same line~~ (implemented)
* ~~dynamic shapes checking (need to verify symbolic inputs are the same across subgraphs)~~ (implemented)
* ensure we don't expand regions where it will create a cycle during subgraph replacement
* ensure outputs are always tensors (or tuples of tensors iirc)
* ~~out of order kwargs, unevenly nested kwargs~~ (implemented)
* input aliasing - TBD, we may add support for this in `invoke_subgraph` or reuse the aliasing analysis here to not form regions with these properties
* ~~all global state~~ (implemented)

Other followups:
* consolidate global state checking across all caching infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141381
Approved by: https://github.com/zou3519
2024-12-11 02:22:21 +00:00
734bb01460 [RELAND] Add device-agnostic runtime Device/Stream C++ API (#138677)
# Motivation
This PR intends to add C++ accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is f84e533a2c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142468, #133572
2024-12-11 02:04:52 +00:00
2091194249 [RELAND] Add UTs for accelerator device-agnostic runtime APIs (#133572)
# Motivation
This PR intends to add UTs for accelerator device-agnostic APIs.

# Additional Context
This PR is relanded. It is reverted because `torch.Event` doesn't support mps backend. We have fixed it in https://github.com/pytorch/pytorch/pull/142468. The previous commit is 952514f0c8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #142468
2024-12-11 02:04:52 +00:00
88154024b3 [pipelining] Add ZBV schedule (#142084)
Adds ZBV schedule which is explained in https://arxiv.org/pdf/2401.10241, Section 6. Tested it works under the new PipelineScheduleRuntime by fixing a small bug in handling V-shaped schedules. This PR is a replacement for https://github.com/pytorch/pytorch/pull/138444

cc the original authors: @QPHutu @ufotalent https://github.com/pytorch/pytorch/pull/138444#issuecomment-2472684977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142084
Approved by: https://github.com/kwen2501
2024-12-11 02:00:57 +00:00
95b17f6346 [MPS] Add CompileShader method (#141478)
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
""")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-11 02:00:51 +00:00
d2e5e5b1a5 [AOTI] Remove redudant AOTI_TORCH_EXPORT (#142500)
Summary: Remove redundant AOTI_TORCH_EXPORT from shim_common.cpp since these functions are already declared with AOTI_TORCH_EXPORT in the corresponding header file. This is to solve the issue in https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716

Differential Revision: D67031626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142500
Approved by: https://github.com/frank-wei
2024-12-11 01:59:37 +00:00
cyy
7d98b3dcee [3/N] Apply bugprone-unchecked-optional-access (#142442)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142442
Approved by: https://github.com/albanD
2024-12-11 01:39:10 +00:00
2b105de2c1 [Monitor] Enable non-perf linux test monitor (#142168)
# Overview
Enable monitoring for non-perf Linux tests

# Other
- move the monitoring step to right before the build-artifact step for mac_test.yml; note that monitoring is not yet enabled for this test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142168
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-12-11 01:10:43 +00:00
393cf46f42 Revert "[MPS] Add CompileShader method (#141478)"
This reverts commit 0478fee42db16a0477add1d0a644ce713f31a875.

Reverted https://github.com/pytorch/pytorch/pull/141478 on behalf of https://github.com/malfet due to Broke doctests, by trying to run MPS example on Linux ([comment](https://github.com/pytorch/pytorch/pull/141478#issuecomment-2533351909))
2024-12-11 00:37:10 +00:00
b94a206414 [CI] Use sccache-0.8.2 for CUDA builds (#140614)
Instead of an ancient prebuilt binary

This is a followup from https://github.com/pytorch/pytorch/pull/121323
For some reason, newer `sccache` does not work when `gcc` is invoked with the `-E` option, so one has to special-case `-E` in the `/opt/ccache/bin/gcc` wrapper. That wrapper also had to be special-cased to work with `nvcc`, by checking whether `-E` is passed not only as the first or second, but also as the third argument (to be followed up by a generic check in https://github.com/pytorch/pytorch/pull/142813 ), i.e. to generate the following wrapper:
```shell
#!/bin/sh

if [ "$1" = "-E" ] || [ "$2" = "-E" ] || [ "$3" = "-E" ]; then
  exec /usr/bin/gcc "$@"
elif [ $(env -u LD_PRELOAD ps -p $PPID -o comm=) != sccache ]; then
  exec sccache /usr/bin/gcc "$@"
else
  exec /usr/bin/gcc "$@"
fi
```

Without it, `sccache nvcc hello.cu` failed with a non-descriptive
```
    sccache: error: failed to execute compile
    sccache: caused by: Compiler not supported: ""
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140614
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2024-12-11 00:34:38 +00:00
ea152d2472 [be] better error message for flight recorder status (#142505)
Summary:
Change the log back to VLOG(2) in waitForFutureOrTimeout. Instead, print a more user-friendly message if FR completes successfully.
This message is meant for developers only, so don't default to `INFO` in this function.

Also, change one more message from LOG(ERROR) to LOG(INFO).

Tested locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142505
Approved by: https://github.com/kwen2501
2024-12-11 00:10:59 +00:00
eqy
bd4071f0b0 [Matmul][CUDA][FP8] Skip rowwise scaling tests on non-sm90 (#141596)
Since the current kernel is using sm90-specific features, just pre-emptively skip the test for any non-sm90 compute capabilities

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141596
Approved by: https://github.com/drisspg
2024-12-10 23:16:19 +00:00
4a16a60052 [C10D] Add better profiling title for NCCL barrier, nccl:all_reduce to nccl:all_reduce_barrier (#140785)
Fixes [issue](https://github.com/pytorch/pytorch/issues/140257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140785
Approved by: https://github.com/wconstab
2024-12-10 23:08:15 +00:00
237c4b559c filelock: Make waitcounter variant to use (#139816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139816
Approved by: https://github.com/ezyang
2024-12-10 23:02:59 +00:00
e36fbbf826 Fix ARM bfloat16 fmsub & improve vec_test_all_types coverage (#142499)
This function was very broken and untested. Now it is tested, and vec_test_all_types is passing internally as well.

Differential Revision: [D67036894](https://our.internmc.facebook.com/intern/diff/D67036894/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142499
Approved by: https://github.com/malfet
2024-12-10 22:51:41 +00:00
0478fee42d [MPS] Add CompileShader method (#141478)
This allows one to do something like that
```python
import torch
x = torch.ones(10, device="mps")
m = torch.mps._compile_shader("""
   kernel void foo(device float* x, uint idx [[thread_position_in_grid]]) {
     x[idx] += idx;
   }
""")
m.foo(x)
```

And in general enables writing custom operators using Metal shaders purely in Python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141478
Approved by: https://github.com/manuelcandales
2024-12-10 22:43:17 +00:00
95e7fcf82e inductor: remove duplicate triton configs for autotuning (#142254)
Summary:
# Why

- sampling the same config multiple times is wasteful, especially on exhaustive
- for AMD we rewrite the configs to have a specific number of stages, which might lead to some configs appearing multiple times

# What

cast the configs, already defined as a tuple, through a set to remove duplicates

Test Plan:
taken from the `mm_kernel_configs` logic in the same file
```
>>> mm_kernel_configs = [        {"config": (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps), "cond": True}        for BLOCK_M, BLOCK_N, BLOCK_K in itertools.product(            [16, 32, 64, 128, 256], repeat=3        )        for num_stages in [1, 2, 3, 4, 5]        for num_warps in [2, 4, 8]    ]
>>> configs = [c['config'] for c in mm_kernel_configs]
>>> a = tuple((c[0], c[1], c[2], 0, c[4]) for c in configs)
>>> len(set(a))
375
>>> len(a)
1875
>>>
```

Differential Revision: D66893774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142254
Approved by: https://github.com/henrylhtsang
2024-12-10 22:19:06 +00:00
da29c13693 [while_loop] data-dependent op in body_fn (#142031)
The idea is the parent hop's fake tensor mode should ignore the newly allocated unbacked symints in subgraph because the bindings of unbacked symbols in the subgraph should already be done when we trace the subgraph. E.g. if there's an operator in subgraph that produces unbacked symints, the track_tensor_tree logic for that operator will take care of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142031
Approved by: https://github.com/zou3519
ghstack dependencies: #142162
2024-12-10 21:54:28 +00:00
7111cd6ee0 [hop][BE] add util diff_meta with prettier error message. (#142162)
The error message changes from:
```python
-torch._dynamo.exc.Unsupported: Expected branches to return tensors with same metadata. [(tensor_pair, difference)...]:[('pair0:', TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}), TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}))]
```
to
```python
+torch._dynamo.exc.Unsupported: Expect branches to return tensors with same metadata but find pair[0] differ in 'shape', where lhs is TensorMetadata(shape=torch.Size([4, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={}) and rhs is TensorMetadata(shape=torch.Size([2, 3]), dtype=torch.float32, requires_grad=False, stride=(3, 1), memory_format=None, is_quantized=False, qparams={})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142162
Approved by: https://github.com/zou3519
2024-12-10 21:54:28 +00:00
9ced54a51a [hop] lift free symbols in slice (#142385)
Before the change, we get an unfound proxy error when linting the subgraph.

After the change, we have the following dynamo graph for dynamic_shape test.

```python
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]  /data/users/yidi/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", L_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         l_x_ = L_x_
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:307 in f, code: i = x.size(0) - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub: "Sym(s0 - 2)" = s0 - 2
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:308 in f, code: j = x.size(1) - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         sub_1: "Sym(s1 - 3)" = s1 - 3
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]          # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in f, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap_body_0 = self.wrap_body_0
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         wrap = torch.ops.higher_order.wrap(wrap_body_0, s0, s1, s2, l_x_, sub, sub_1);  wrap_body_0 = s0 = s1 = s2 = l_x_ = sub = sub_1 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = wrap[0];  wrap = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         return (getitem,)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]     class wrap_body_0(torch.nn.Module):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]         def forward(self, s0: "Sym(s0)", s1: "Sym(s1)", s2: "Sym(s2)", l_x_: "f32[s0, s1, s2][s1*s2, s2, 1]cpu", sub: "Sym(s0 - 2)", sub_1: "Sym(s1 - 3)"):
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]              # File: /data/users/yidi/pytorch/test/dynamo/test_higher_order_ops.py:310 in <lambda>, code: return wrap(lambda x: x[:i, :j, k:], x)
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             getitem: "f32[s0 - 2, s1 - 3, 0][s1*s2, s2, 1]cpu" = l_x_[(slice(None, sub, None), slice(None, sub_1, None), slice(s2, None, None))];  l_x_ = sub = sub_1 = s2 = None
V1209 11:11:06.187000 4091124 torch/_dynamo/output_graph.py:1346] [0/2] [__graph_code]             return (getitem,)
```

We lift sub and sub_1 because they're compound expressions that are directly used in the arguments of the getitem node. We lift s0, s1, and s2 because they're basic symbols in the tensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142385
Approved by: https://github.com/zou3519
2024-12-10 21:52:30 +00:00
795ff0e9f7 [ROCm] Improve reduce sum calculation for low CU count (#141378)
Improve reduce sum calculation for low CU count by enabling splitting the rows across warps for some 2D tensor shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141378
Approved by: https://github.com/jeffdaily
2024-12-10 21:48:56 +00:00
fda975a7b3 Remove unused Python variables in torch/[_-a]* (#133492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133492
Approved by: https://github.com/albanD
2024-12-10 21:48:44 +00:00
2268319596 dont attempt to fuse in unaligned accesses to mm (#142435)
This isn't profitable - we were trying to fuse in a padding of unaligned mm, which defeats padding's purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142435
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401, #142402
2024-12-10 21:35:26 +00:00
f2d8d7b7ac Infer whether prologues can be computed without upcasting to fp32 without changing numerics (#142402)
For prologues which only do either loads like gathers or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 without changing numerics.

Prologues that actually do arithmetic will need to use invoke quant. But I would like to support upcasts/gathers out of the box.

We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401
2024-12-10 21:26:03 +00:00
bee445c3a3 [MPS] Support torch.Event for MPS (#142468)
# Motivation
Support `torch.Event` on mps backend.
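
A hedged sketch of what this enables (requires a machine with an MPS device):

```python
import torch

ev = torch.Event(device="mps", enable_timing=True)
x = torch.randn(1024, 1024, device="mps")
y = x @ x
ev.record()       # record on the current MPS stream
ev.synchronize()  # block the host until the event has completed
```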

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142468
Approved by: https://github.com/malfet
2024-12-10 21:17:25 +00:00
1a0bd40243 Add a pass which analyzes whether a prologue preserves zero mask (#142401)
We load inputs to prologue fusion with a mask. That mask must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tl.where(a_mask, tmp1, 0.0)
```
now we do not need to ->
```
        tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
        tmp1 = tmp0.to(tl.float32)
        a = tmp1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400
2024-12-10 21:16:13 +00:00
c30dd35877 [Device] Add "mps" to torch._utils._get_device_attr (#142447)
Follow up after  https://github.com/pytorch/pytorch/pull/141098
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142447
Approved by: https://github.com/kit1980
2024-12-10 20:57:17 +00:00
6f8751dcc9 Fix timeout check workflow lint job (#142476)
Fixes https://github.com/pytorch/pytorch/issues/142485

The workflow check lint job timed out in trunk, i.e. https://github.com/pytorch/pytorch/actions/runs/12261226178/job/34207762939, and here was what happened:

1. https://github.com/pytorch/pytorch/pull/142294 landed yesterday to build ROCm on 3.13, but the PR had a landrace with https://github.com/pytorch/pytorch/pull/142282 in the generated workflow file
2. The trunk lint check caught that in https://github.com/pytorch/pytorch/blob/main/.github/scripts/report_git_status.sh#L2
3. However, the script also attempted to print the difference with `git diff .github/workflows`. This command was the one that got stuck, because `git diff` uses a pager by default and waits for a prompt to display the next page ¯\_(ツ)_/¯

It took so long to debug this because a timed-out Nova GHA job doesn't print any progress.  I'll create an issue for this.

Bonus:

I also fixed the broken print from the test-tools lint job that confuses GitHub https://github.com/pytorch/pytorch/actions/runs/12261226178 with an annotation failure `Credentials could not be loaded, please check your action inputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142476
Approved by: https://github.com/wdvr
2024-12-10 20:47:22 +00:00
f57606ab85 Migrate smoke tests to pytorch/pytorch (#142482)
Related to https://github.com/pytorch/builder/issues/2054
This should fix the nightly xpu failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34180135207 and the rocm failure: https://github.com/pytorch/pytorch/actions/runs/12251477588/job/34182185374, both due to the missing `/builder/check_binary.sh`.

Builder Scripts revision: 3468139e81
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142482
Approved by: https://github.com/chuanqi129, https://github.com/kit1980, https://github.com/malfet, https://github.com/jeffdaily, https://github.com/huydhn
2024-12-10 20:43:36 +00:00
117b6c3e2c [Easy][Dynamo][TVM] remove unnecessary prints (#142445)
This PR intends to remove the unnecessary prints in the auto-scheduler of dynamo's TVM backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142445
Approved by: https://github.com/jansel
2024-12-10 19:52:02 +00:00
e95bd337e1 Some workflows to use oidc instead of AWS keys (#142264)
Roles defined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/563

With this, I think we can get rid of the AWS credentials in the upload-stats environment

Untestable because I can't add branches to the upload-stats environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142264
Approved by: https://github.com/huydhn
2024-12-10 19:40:23 +00:00
3c03bc2431 [dynamo] Expand support of enum attribute access (#142268)
This patch changes `EnumVariable` to support access to all types of
attributes, not just non-callable literals.
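
A hedged sketch of the kind of access this expands support for (the exact repro in the linked issue may differ):

```python
import enum

import torch

class Scale(enum.Enum):
    DOUBLE = 2

    def factor(self):  # a callable attribute, not a non-callable literal
        return float(self.value)

@torch.compile
def f(x, s):
    return x * s.factor()

print(f(torch.ones(3), Scale.DOUBLE))
```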

Fixes #142050.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142268
Approved by: https://github.com/jansel
ghstack dependencies: #142267
2024-12-10 19:32:40 +00:00
b117945918 [dynamo] Remove dead code in ConstantVariable.const_getattr (#142267)
This path is no longer reachable after #113390, which also updated
`test_access_class_method_from_user_class` to reflect that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142267
Approved by: https://github.com/jansel
2024-12-10 19:32:40 +00:00
f74ba5d30d [dynamo] Remove special graph break for self-referential list (#142438)
We introduced a special graph break to avoid max-recursion-depth error
in #100296.

After #111415, the original `test_list_self_reference` no longer
triggers the special graph break because we started modeling root frame
free variables with `LazyVariableTracker`.

After #117426, we no longer build the list items eagerly, and they'll hit
`variable_tracker_cache` when they get lazily constructed later.

As a result, this patch updates the `test_list_self_reference` test and
removes the special graph break.
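
For context, a small sketch of the self-referential structure involved (a simplification, not the test's exact contents):

```python
import torch

lst = [torch.ones(3)]
lst.append(lst)  # the list now contains itself

@torch.compile
def f(items):
    return items[0] + 1  # only the tensor element is touched

print(f(lst))
```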

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142438
Approved by: https://github.com/jansel
ghstack dependencies: #142437
2024-12-10 19:23:48 +00:00
4f75f1e80d [dynamo] Use proper item source for NamedTupleVariable (#142437)
Dynamo was generating `GetItemSource(tuple_source, index)` for items of
`NamedTupleVariable`, but that stops working when a user supplied named
tuple has a custom `__getitem__` function with different semantics.
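
A hedged sketch of the failure mode (the named tuple in issue #142399 may look different):

```python
import collections

import torch

class Pair(collections.namedtuple("Pair", ["x", "y"])):
    def __getitem__(self, key):
        # custom semantics: index into the first field rather than the tuple slots
        return self.x[key]

@torch.compile
def f(p):
    return p.x + p.y  # attribute access must not be rebuilt via p[0] / p[1]

p = Pair(torch.ones(3), torch.zeros(3))
print(f(p))
```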

This patch
- fixes the aforementioned issue by using `AttrSource` instead.
- handles named tuple outside `wrap_listlike`, by removing the special
  case of named tuple in `BaseListVariable.cls_for_instance`, since the
  semantics of named tuple is different enough.
- makes sure all constructions of `NamedTupleVariable` have items with
  proper sources.

Fixes #142399.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142437
Approved by: https://github.com/jansel
2024-12-10 19:23:48 +00:00
a45326b649 [dynamo] Support multiple inheritance for custom dict construction (#142416)
This patch applies a local and practical workaround for custom dict
construction when multiple inheritance is involved.
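
A hedged sketch of the shape of the target pattern (the class in issue #141118 is likely more involved):

```python
import torch

class Mixin:
    extra = 1

class MyDict(Mixin, dict):  # custom dict with multiple inheritance
    pass

@torch.compile
def f(x):
    d = MyDict(a=x)
    return d["a"] + MyDict.extra

print(f(torch.ones(2)))
```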

Handling multiple inheritance in general could be a lot more involved,
so I created #142414 to track that.

Fixes #141118.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142416
Approved by: https://github.com/jansel
2024-12-10 19:22:15 +00:00
67cf126cf8 Disable PIP version check in collect_env (#142308)
Disables version check which might require users to reach out to PyPI, reference: https://pip.pypa.io/en/latest/cli/pip/#cmdoption-disable-pip-version-check

Switches pip to be used directly as a python module (`python3 -mpip`) instead of relying on `pip3` or `pip`
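
A small sketch of the invocation style this switches to, with the version check disabled:

```python
import subprocess
import sys

# Run pip as a module of the current interpreter instead of relying on a `pip`/`pip3` binary,
# and skip the PyPI round-trip that the version check can trigger.
subprocess.run(
    [sys.executable, "-m", "pip", "list", "--format=freeze", "--disable-pip-version-check"],
    check=True,
)
```
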
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142308
Approved by: https://github.com/seemethere
2024-12-10 19:16:36 +00:00
3e28da1e06 Revert "skip test dynamo for aot_dispatch tests on ci (#142185)"
This reverts commit 7eda06b36674afa117b28ad807c3421c94e775c1.

Reverted https://github.com/pytorch/pytorch/pull/142185 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
9aefc59649 Revert "[hop][dynamo] support torch.SymInt inputs (#141524)"
This reverts commit 6713b457aee3e36ab2499fb31b733ecd7104c764.

Reverted https://github.com/pytorch/pytorch/pull/141524 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a landrace in trunk ([comment](https://github.com/pytorch/pytorch/pull/142185#issuecomment-2532605728))
2024-12-10 18:50:17 +00:00
d102cfa2cb [Profiler] Add CUDA Overhead to Auto-trace (#142271)
Summary: We already have CUDA OVERHEAD events enabled in on-demand so we should also add them to auto-trace

Test Plan: Tested using internal performance suites and found no noticeable performance change

Differential Revision: D66904879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142271
Approved by: https://github.com/ngimel
2024-12-10 18:39:59 +00:00
bce07deb96 [dtensor][cp][experiment] add CP experimental API to choose rotate method (#142093)
**Summary**
This PR adds a new experimental API `set_rotate_method` for Context Parallel. This API allows users to choose the desired communication method (all-to-all or all-gather) for shard rotation.
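
A hedged usage sketch; the function name comes from this PR, but the import path and accepted strings below are assumptions, so check the PR for the authoritative API:

```python
# Assumed module path and value strings.
from torch.distributed.tensor.experimental._attention import set_rotate_method

set_rotate_method("alltoall")  # or "allgather", per the PR description
```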

**Test**
`pytest test/distributed/_tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142093
Approved by: https://github.com/fegin
2024-12-10 18:25:23 +00:00
eb84788fee [fr] change back vlog(2) to LOG(INFO) (#142441)
Summary:
Change log message for future execution back from VLOG(2) to LOG(INFO).
This message is useful for Flight Recorder to verify that flight recorder dumps completed successfully (or not).

Test Plan: Tested manually on a mast job and noted that the INFO message was as expected.
(meta only link: https://fburl.com/mlhub/iui2tpc9)
```
[trainer5]:I1208 10:21:00.772841  7528 ProcessGroupNCCL.cpp:1294] [PG ID 0 PG GUID 0(precheck) Rank 21] future is successfully executed for: Flight recorder dump in heartbeatMonitor
```

Differential Revision: D66996439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142441
Approved by: https://github.com/fduwjj
2024-12-10 17:43:22 +00:00
6713b457ae [hop][dynamo] support torch.SymInt inputs (#141524)
Fixes https://github.com/pytorch/pytorch/issues/141305.

```python
        class M(torch.nn.Module):
            def forward(self, x, y, z):
                a = y.shape[0]
                b = z.shape[0]

                def true_fn(x):
                    return x + a

                def false_fn(x):
                    return x + b * z

                # When exporting with non-strict: a and b are symints,
                # so torch.compile need to wrap and trace symint inputs.
                return torch.cond(x.shape[0] > 5, true_fn, false_fn, (x,))
```

In non-strict export, when inputs are annotated with dynamic shapes, `a` and `b` in the above example are of torch.SymInt type, and true_fn and false_fn will have closures of torch.SymInt type. The error is triggered because we didn't handle SymInt inputs in dynamo and ended up using a UserDefinedObjectVariable for them, which doesn't have a proxy. We added support by following how we previously handled SymBool inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141524
Approved by: https://github.com/zou3519
ghstack dependencies: #141610, #142185
2024-12-10 17:33:57 +00:00
7eda06b366 skip test dynamo for aot_dispatch tests on ci (#142185)
A lot of tests in test_aotdispatch.py are not meaningful (from a user's perspective) when run with dynamo, so we skip them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142185
Approved by: https://github.com/zou3519
ghstack dependencies: #141610
2024-12-10 17:33:57 +00:00
b838bdd4d4 [dynamo] remove unnecessary set_example_value for SymBool input. (#141610)
These are automatically done in create_graph_input so we can remove them. Code refactoring only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141610
Approved by: https://github.com/zou3519
2024-12-10 17:33:48 +00:00
1986b46d63 [export] Change Tuple[()] to bool in schema to sync with thrift. (#142257)
Summary:
In thrift schema, we represent every None value as "True/False" while we represent None as () in OSS schema. This will cause some inconsistency between the type systems and the simplest thing to do here is changing Tuple[()] to bool in oss schema.

This change should NOT cause version bump, because on deserializer side we never read the value from as_none fields, as it doesn't have real meaning. Therefore this schema change should be considered as safe.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D66888892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142257
Approved by: https://github.com/yiming0416, https://github.com/hl475
2024-12-10 17:13:35 +00:00
b1b0afb8e8 [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

D-ifferential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 17:09:21 +00:00
09b2232fd1 Make core_aten_decomp to be alias to export table (#140086)
Differential Revision: [D64554098](https://our.internmc.facebook.com/intern/diff/D64554098/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140086
Approved by: https://github.com/bdhirsh
2024-12-10 17:04:59 +00:00
fa746e3eeb [Easy] factor out inductor ophandler decompositions (#142400)
Factor out inductor operator decompositions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142400
Approved by: https://github.com/Chillee, https://github.com/jansel
ghstack dependencies: #134532, #142350
2024-12-10 16:58:36 +00:00
1fb3d5a4e3 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
ghstack dependencies: #134532
2024-12-10 16:50:28 +00:00
59ab3825e7 Prologue Fusion (#134532)
This PR extends our existing ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes into the inputs of triton templates - prologue fusion.

Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`

And the modification api:

```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```

We have:

```{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}```

Because we are now loading the input with explicit indices and a mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration, and instead compute indices on each iteration from the k_idx of each loop. This did not make any perf difference.

There are a couple main use cases for prologue fusion:

- Fusing dequants into a matmul. particularly for more bandwidth bound scenarios.
- Fusing gather into a matmul. This is useful particularly in MOE. See https://github.com/pytorch/pytorch/issues/134535 for more details.

Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This restricts reliably unprofitable fusions like fp32->fp16 inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066

Other notes:

By default we will upcast to fp32 inside every kernel. This matches eager numerics. This is fine enough for epilogue because it is only done once (although it is probably unnecessary for say a relu) but tanks perf for prologue. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.

With prologue fusion, we now essentially have separate kernels for each input and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
2024-12-10 16:25:57 +00:00
a751558467 [logging] Fix bug involving missing compilation_metrics fields in tlparse logs (#142423)
Summary: The line of code that's compiling the set of compilation_metrics to include in the corresponding tlparse log is missing the "legacy" and "common" fields populated above. Fix is to make sure we consider all fields in the compilation_metrics object.

Test Plan:
Before: https://fburl.com/d6em8csg (e.g, https://fburl.com/c19s7ny0)
After: https://fburl.com/5zr6kbvf (e.g, https://fburl.com/3hp14ht2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142423
Approved by: https://github.com/ezyang
2024-12-10 15:58:43 +00:00
882b6af219 c10::string_view -> std::string_view in autograd (#142354)
Differential Revision: D66939966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142354
Approved by: https://github.com/Skylion007
2024-12-10 15:43:41 +00:00
7e41717a26 c10::string_view -> std::string_view in caffe2/jit (#142383)
Test Plan: Sandcastle

Differential Revision: D66939979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142383
Approved by: https://github.com/malfet
2024-12-10 15:42:28 +00:00
dd2d0c6b80 [FSDP2] Gate PT2 code for torch deploy (#142456)
See diff for internal details

Differential Revision: [D67003832](https://our.internmc.facebook.com/intern/diff/D67003832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142456
Approved by: https://github.com/yf225, https://github.com/weifengpy, https://github.com/fegin
2024-12-10 14:39:07 +00:00
ff059587c6 support condition branch in ao debug handler (#141516)
This diff introduces support for conditional statements in AO debug handle generation.

Most of the code is borrowed from ExecuTorch to avoid a circular dependency issue.

Differential Revision: [D66270691](https://our.internmc.facebook.com/intern/diff/D66270691/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141516
Approved by: https://github.com/jerryzh168
2024-12-10 14:05:12 +00:00
75530885ba Revert "[BE] Add type annotation to eliminate_dead_code (#142251)"
This reverts commit 3d04de6b2f78a78bc28ce82d6e3a4af1867ec7d8.

Reverted https://github.com/pytorch/pytorch/pull/142251 on behalf of https://github.com/jeanschmidt due to checking if reverting will fix 'FAILED [5.0221s] test_dataloader.py::TestIndividualWorkerQueue::test_ind_worker_queue' on windows ([comment](https://github.com/pytorch/pytorch/pull/142251#issuecomment-2531706362))
2024-12-10 13:57:00 +00:00
a3abe1a5ae Add support for bfloat16 atomic adds in fbcode (#141857)
This adds support for bfloat16 atomic add in fbcode (OSS will have to wait until those changes are upstreamed to triton)

Originally I attempted to write inline asm, but the triton API was not flexible enough to support this use case. In the long run the right answer is to implement this properly in OSS triton.

relevant issues:
* https://github.com/pytorch/pytorch/issues/137425 in fbcode only
* https://github.com/pytorch/pytorch/issues/97016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141857
Approved by: https://github.com/eellison
2024-12-10 11:40:15 +00:00
d51e6fa7f6 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their FlexAttention usage to CPUs, with solid and consistent support from torch.compile in both functionality and performance.
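
A short sketch of the now-CPU-capable path (fp32 shown; bf16 is also supported per this PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # a simple "attention variant": add a relative-position bias to each score
    return score + (q_idx - kv_idx)

q = torch.randn(1, 2, 128, 64)  # (batch, heads, seq, head_dim) on CPU, fp32
k = torch.randn(1, 2, 128, 64)
v = torch.randn(1, 2, 128, 64)

compiled = torch.compile(flex_attention)  # inductor now lowers this on CPU
out = compiled(q, k, v, score_mod=rel_bias)
print(out.shape)  # torch.Size([1, 2, 128, 64])
```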

For UT tests, in this PR, we include partial critical tests for CPUs as the following (conduct inference tests):
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the rest of the UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, which amount to a larger change (1500+ LOC) that would make this PR hard to review, we will submit another PR specifically to enable and refactor the CPU-device UTs.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-10 11:11:09 +00:00
5ba61d7fb8 [Fix][Profiler UT] Skip CPU for the UT test/profiler/test_execution_trace.py::test_execution_trace_with_pt2 (#142027)
[Fix] Skip CPU device for the UT `test_execution_trace_with_pt2`
Skip CPU because Triton is only for GPUs. This UT is designed to test profiling of the Triton kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142027
Approved by: https://github.com/aaronenyeshi
2024-12-10 09:29:19 +00:00
3d04de6b2f [BE] Add type annotation to eliminate_dead_code (#142251)
Test Plan: CI

Reviewed By: evanleed

Differential Revision: D66887283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142251
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-10 09:27:29 +00:00
539286a67b Inductor annotations (#130429)
Add NVTX annotations around training phases and buffer computations

RFC/discussion: https://dev-discuss.pytorch.org/t/rfc-performance-profiling-at-scale-with-details-nvtx-annotations/2224

<img width="2160" alt="Screenshot 2024-07-10 at 11 48 04" src="https://github.com/pytorch/pytorch/assets/1175576/9ade139c-d393-473f-9b68-6c25da367dc4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130429
Approved by: https://github.com/aorenste, https://github.com/eellison, https://github.com/albanD

Co-authored-by: Cedric GESTES <cedric.gestes@flex.ai>
2024-12-10 08:53:39 +00:00
24650c3caa Revert "[Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)"
This reverts commit e4ecb09b3513e0ee53ed87496d8bfdf5d2944042.

Reverted https://github.com/pytorch/pytorch/pull/142273 on behalf of https://github.com/huydhn due to Internal has been ninja unlanded D66906175 ([comment](https://github.com/pytorch/pytorch/pull/142273#issuecomment-2530751665))
2024-12-10 08:16:58 +00:00
d3d1a78774 [AOTInductor] Add standalone test for compilation from ExportedProgram (#142327)
Summary: Provide a standalone path to compile and run an ExportedProgram in C.

Test Plan:
(1) Generate a compiled model from ExportedProgram
```
python generate_lowered_cpu.py --input-path /tmp/$USER/ep.pt --output-path /tmp/$USER/final.pt
```
(2) Compile a standalone test runner
```
TORCH_ROOT_DIR=/data/users/$USER/pytorch sh standalone_compile.sh standalone_test.cpp standalone_test.out
```
(3) Run test for the compiled model in step (1)
```
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib ./standalone_test.out /tmp/$USER/final.pt
```

Differential Revision: D66872380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142327
Approved by: https://github.com/hl475
2024-12-10 06:50:09 +00:00
b9e253cb72 [inductor] update numbytes_hint for NoneLayout to allow more fusions (#141766)
We found that [this commit](6eca0aee76) caused a ~6% performance drop in ViT INT8. This was due to changes to the `numbytes_hint` for `NoneLayout`. In this PR, we reverted the changes in `numbytes_hint` to allow more fusions.

```
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense = torch.nn.Linear(768, 768)
        self.layernorm = torch.nn.LayerNorm(768, eps=1e-12)
    def forward(self, context_layer, hidden_states):
        attention_output = self.dense(context_layer)
        hidden_states = attention_output + hidden_states
        layer_output = self.layernorm(hidden_states)
        return layer_output
```
The generated code before (left) and after (right) this PR is as follows:
![image](https://github.com/user-attachments/assets/0ec65ae5-103e-4e2c-bf7c-e8bed24fc179)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141766
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-10 06:45:07 +00:00
daa27fe59d [DeviceMesh] Call no_dispatch before doing tensor slicing in DeviceMesh (#142287)
Summary:
DeviceMesh's tensor operation is a control plane operation not data plane and should not be affected by FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142287
Approved by: https://github.com/XilunWu
2024-12-10 06:33:01 +00:00
f26b75b7ac [aarch64] add CUDA 12.6 sbsa nightly binary (#142335)
related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142335
Approved by: https://github.com/atalman
2024-12-10 06:19:28 +00:00
1cb2ebd740 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
6680a83e89 [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
a1c6cf7e9f Revert "Add UTs for accelerator device-agnostic runtime APIs (#133572)"
This reverts commit 952514f0c8d8ff2e1719e0ca82b0d178a5c5ff45.

Reverted https://github.com/pytorch/pytorch/pull/133572 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
adbfdbd6a0 Revert "Add device-agnostic runtime Device/Stream C++ API (#138677)"
This reverts commit f84e533a2cb89a42c021dce7d22af7d5bd5f5ac1.

Reverted https://github.com/pytorch/pytorch/pull/138677 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but it segfaults on MacOS ([comment](https://github.com/pytorch/pytorch/pull/133572#issuecomment-2530354401))
2024-12-10 04:42:55 +00:00
08e9ceb0a4 Make sure the benchmark build config is tested on trunk for easy bisect (#142376)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142376
Approved by: https://github.com/atalman
2024-12-10 04:42:52 +00:00
4e7056d94d Fixes in-order test flakiness (#142389)
Fixes #142343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142389
Approved by: https://github.com/michael-diggin, https://github.com/divyanshk
2024-12-10 04:19:20 +00:00
e3886fb13c misc. fixes to unflatten (#142141)
Combining several fixes to unflatten for bugs revealed by random graph testing.

The fixes target two categories of bugs:
1. Some bugs show up as exponential blowups for largish systems of nn modules. These are fixed by converting lists to sets, using caching, or otherwise rewriting to reuse computation more efficiently.
2. Other bugs were due to intermediate modules not being created when attributes such as submodules and buffers are accessed through longish paths before the corresponding intermediate modules are called, or to attributes such as buffers and constants missing in submodules that correspond to multiple calls.

Differential Revision: [D66659795](https://our.internmc.facebook.com/intern/diff/D66659795/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142141
Approved by: https://github.com/ydwu4
2024-12-10 03:45:13 +00:00
e4ecb09b35 [Inductor][Easy] Fix a test failure in loop_ordering_after_fusion (#142273)
**Summary:**
(Since I am trying the other solution for https://github.com/pytorch/pytorch/pull/141082, I moved out the test case fixes from that pr to a separate pr to land first.)

-----
Testing float8 dynamic scaling case with `TORCHINDUCTOR_LOOP_ORDERING_AFTER_FUSION=1` didn't make any difference.

The test case for fp8 (https://github.com/pytorch/pytorch/blob/main/test/inductor/test_loop_ordering.py#L425) is also failing, https://www.internalfb.com/intern/test/844425111960859?ref_report_id=0

-------

The main change here is to modify the condition of calling `loop_reordering` from `shared_data_score == 0` to `shared_data_score < config.score_fusion_memory_threshold`.

Before the change:
`shared_data_score > 0 -> won't loop_reorder -> can't fused because of shared_data_score < config.score_fusion_memory_threshold`
After the change:
`shared_data_score > 0 -> loop_reorder (shared_data_score < config.score_fusion_memory_threshold) -> get a larger shared_data_score -> fused`

----
It's the same issue as fixed in https://github.com/pytorch/pytorch/pull/136782. But the condition to call loop_reorder might be changed later, causing the test case to fail again.

**Test Plan:**
```
buck2 test 'fbcode//mode/opt' caffe2/test/inductor:loop_ordering
```
And ran a float8 dynamic scaling training script to verify it e2e

-----

Differential Revision: D66906175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142273
Approved by: https://github.com/eellison
2024-12-10 02:58:04 +00:00
3291b0a013 [DataParallel] Skip for MPS device (#142448)
As `torch._C._scatter` is only defined for CUDA/ROCm (and maybe XPU?)

This is a regression introduced by https://github.com/pytorch/pytorch/pull/141098 that went unnoticed due to https://github.com/pytorch/pytorch/issues/142206

Test plan:
```
python test_autograd.py -v -k test_dataparallel_saved_tensors_hooks
```

Before this change it failed with
```
ERROR: test_dataparallel_saved_tensors_hooks (__main__.TestMultithreadAutograd.test_dataparallel_saved_tensors_hooks)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3108, in wrapper
    method(*args, **kwargs)
    ~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/test/test_autograd.py", line 13074, in test_dataparallel_saved_tensors_hooks
    model = torch.nn.DataParallel(Model())
  File "/Users/malfet/git/pytorch/pytorch/torch/nn/parallel/data_parallel.py", line 153, in __init__
    raise RuntimeError("no available devices were found")
RuntimeError: no available devices were found
```

After this change it passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142448
Approved by: https://github.com/kit1980
2024-12-10 02:49:23 +00:00
cyy
9a309fb4c6 Remove ConstQuantizerPtr in torchgen (#142375)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142375
Approved by: https://github.com/albanD
2024-12-10 02:37:01 +00:00
41757372c4 Set timeout value for remaining lint jobs (#142444)
Some lint jobs are using the default 30 minutes timeout, but the jobs could wait up to 90 minutes now for the Docker image to become available after https://github.com/pytorch/test-infra/pull/6013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142444
Approved by: https://github.com/wdvr
2024-12-10 02:29:44 +00:00
e83b0fa945 set CUB_VERSION to 200001 for USE_ROCM (#140861)
Summary:
Currently, CUB_VERSION is 0 for USE_ROCM.
CUB_VERSION is used to determine whether to use advanced CUB APIs in some implementations.

Test Plan:
`buck2 build --flagfile fbsource//arvr/mode/win/vs2022/cpp20/cuda12_5/dev --flagfile fbsource//arvr/mode/cuda/rtx30 fbsource//arvr/libraries/eye/apollo_visualizer:unit_test_apollo_hu_module_capability`

`buck2 build --flagfile fbcode//mode/amd-gpu fbcode//aiplatform/modelstore/checkpointing/pyper:tensor_save_load_utils`

Differential Revision: D63054638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140861
Approved by: https://github.com/eqy, https://github.com/zoranzhao, https://github.com/houseroad
2024-12-10 02:28:48 +00:00
2f1191fb6a Corrected metadata variable names (#142342)
Fixes #142341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142342
Approved by: https://github.com/janeyx99
2024-12-10 02:24:31 +00:00
5d6acd5a31 Register Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
### Motivation:

As illustrated in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two pieces of work are needed to enable Intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed backend registration in the PyTorch distributed package**. This PR contributes the change for item 2.

### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    # explicitly request the newly registered XCCL backend
    dist.init_process_group('xccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
2024-12-10 01:58:06 +00:00
b98f40a4d5 remove vulkan sdk installation on executorch build (#142424)
pytorch-linux-jammy-py3-clang12-executorch [started to fail](https://github.com/pytorch/pytorch/actions/runs/12244909721/job/34157668780) today due to a 404 on the Vulkan SDK we use/download (1.2.198.1, 3 years old, URL: https://sdk.lunarg.com/sdk/download/1.2.198.1/linux/vulkansdk-linux-x86_64-1.2.198.1.tar.gz )

The Vulkan SDK is probably no longer needed for building Executorch, and is not used down the line for testing.

This PR tests removing the installation of the SDK

https://github.com/pytorch/executorch/pull/7258
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142424
Approved by: https://github.com/huydhn
2024-12-10 01:50:16 +00:00
a1688d8607 Fix test_indexing on MacOS (#142440)
Where int64_t is long long rather than long

This fixes test regression introduced by https://github.com/pytorch/pytorch/pull/140597 that went undetected due to https://github.com/pytorch/pytorch/issues/142206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142440
Approved by: https://github.com/kit1980
2024-12-10 01:46:28 +00:00
871b524398 Revert "temporarily turn on keep-going/continue on error for mac (#142421)"
This reverts commit 17202ea8f6fb0eebdb14b346bb2610f08800a7df.

Reverted https://github.com/pytorch/pytorch/pull/142421 on behalf of https://github.com/malfet due to We've collected enough info for now ([comment](https://github.com/pytorch/pytorch/pull/142421#issuecomment-2530010220))
2024-12-10 01:45:21 +00:00
bef103934a [DeviceMesh][ROCm] skip ProcessGroup init test on ROCm because #ranks != #devices in CI (#142386)
**Summary**
Fixes #142361

Skip the DeviceMesh test since the test suite doesn't consider the case where `# ranks != # devices`.

**Test**
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142386
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-12-10 01:22:21 +00:00
a1b5067297 Enable py3.13 wheels for ROCm (#142294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142294
Approved by: https://github.com/huydhn
2024-12-10 01:10:24 +00:00
20718cdebb [Fast Packing] Add packing ukernels to gemm config (#142191)
Add file to buck build

Differential Revision: [D66692673](https://our.internmc.facebook.com/intern/diff/D66692673/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66692673/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142191
Approved by: https://github.com/kirklandsign, https://github.com/digantdesai
2024-12-10 01:06:17 +00:00
0f6bfc58a2 Introduce remote cache key prefix to break cache (#142148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142148
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-12-10 00:35:50 +00:00
1cb5f38328 [EZ] Skip test_zero_grid_with_backed_symbols on Mac (#142436)
As it expects to load traced module on CUDA, which is not available on Mac bd867d691b/test/inductor/test_aot_inductor.py (L1414)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142436
Approved by: https://github.com/kit1980
2024-12-10 00:25:32 +00:00
ec746f7026 Remove an unused variable from _prims_common/wrappers.py (#138480)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
* albanD thinks this is a bug!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138480
Approved by: https://github.com/albanD
2024-12-10 00:12:53 +00:00
5743b11039 Improve messaging of ProcessGroupNCCL destructor (#142297)
And removed some unnecessary conditions for calling `thread.join()` -- `thread.joinable()` should have covered it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142297
Approved by: https://github.com/wconstab
ghstack dependencies: #141510, #141511
2024-12-10 00:02:33 +00:00
bd867d691b [FSDP2] Fix backward-compatible imports (#142419)
Internal only: previously, `from torch.distributed._composable.fsdp import fully_shard` imported the module `fully_shard.py`, not the function `fully_shard`. For some reason, the resolution order is different from open source.

To fix this, we match the old import as closely as possible. Namely, we import `fully_shard.py` contents from `.fully_shard`. This should force that import to take precedence.

@diff-train-skip-merge

Differential Revision: [D66990327](https://our.internmc.facebook.com/intern/diff/D66990327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142419
Approved by: https://github.com/weifengpy
2024-12-09 23:56:32 +00:00
bcddae14ec Enhance "from_node" node meta to track source recursively (#142066)
Summary:
Change the "from_node" node meta format to be able to track the provenance of nodes recursively.

The new "from_node" format is a a list node NodeSource:

```
class NodeSource:
	self.node_name: str
	self.target: str
	self.graph_id: int
	self.pass_name: str
	self.action: str
	self.from_node: List[NoedSource]
```

This is in preparation for the inductor provenance tracking. For background, the inductor provenance tracking doc: https://docs.google.com/document/d/1dGh9myqNhywmbfP0Quzx_f04bghDFlj8cawj8MopiO8/edit?fbclid=IwZXh0bgNhZW0CMTEAAR0jUQ0Tf4ROLDED8Y_eIzrU0KVZVdRmyIQLp-avt-kGRPI_VgYVNyjH_q0_aem_HCQ_pxHDiwOkO9mQyWB2-g&tab=t.0 (internal only),

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_unflatten_multiple_graphs_state
buck run mode/dev-nosan caffe2/test:fx -- -r node_source
```

Differential Revision: D66737916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142066
Approved by: https://github.com/avikchaudhuri
2024-12-09 23:39:15 +00:00
42b222edef [AOTI] Fix an issue when fallback op does not return a value (#142339)
Summary: Refine https://github.com/pytorch/pytorch/pull/137660 to support fallback op without a return value.

Differential Revision: D66939108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142339
Approved by: https://github.com/henrylhtsang
2024-12-09 23:24:29 +00:00
17202ea8f6 temporarily turn on keep-going/continue on error for mac (#142421)
See https://github.com/pytorch/pytorch/pull/142270 for additional info.

Make all mac default shard tests run with keep going / continue on error so we can see all the test failures.

Red signal will show up later, but you can see failing tests mid run on HUD by clicking the additional test failures button

After the job is finished, searching for "consistently: " in the logs will find the failed tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142421
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 23:24:10 +00:00
beeffe77e4 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit db379ed1ada58608d4d3c5c35777da051e4e49e5.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/malfet due to This breaks tests on platforms compiled without MKLDNN, namely MacOS, see https://github.com/pytorch/pytorch/actions/runs/12245441371/job/34159967794 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2529710573))
2024-12-09 22:57:59 +00:00
8d24eb0c94 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)
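
A purely illustrative sketch of the shape of this change (hypothetical values, not taken from the PR):

```
# Before: positional tuple -- heuristics must know which slot maps to which dim
size_hints = (1024, 16)

# After: keyed by dimension prefix -- heuristics query by name, e.g. size_hints["r"],
# which scales naturally once there are multiple reduction dimensions
size_hints = {"x": 1024, "r": 16}
```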

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-09 22:31:53 +00:00
4b69a68c7c [Do not revert] Re-enable Mac testing (#142270)
The bash script modification in https://github.com/pytorch/pytorch/pull/135386 results in tests on mac in default shard not running.
This PR is expected to cause test failures, but we need to start getting signal, so landing with known failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142270
Approved by: https://github.com/malfet, https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-09 22:26:26 +00:00
274223d719 Add and use borrow_arrayref_tensor_as_tensor (#142183)
Differential Revision: [D66847773](https://our.internmc.facebook.com/intern/diff/D66847773/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142183
Approved by: https://github.com/desertfire, https://github.com/hl475
ghstack dependencies: #142340, #142182
2024-12-09 22:23:21 +00:00
18d25aa7aa Rename convert_arrayref_tensor_to_tensor to copy_arrayref_tensor_to_tensor (#142182)
Be explicit about what we are doing, in preparation for adding borrow_arrayref_tensor_as_tensor.

Differential Revision: [D66847772](https://our.internmc.facebook.com/intern/diff/D66847772/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142182
Approved by: https://github.com/desertfire
ghstack dependencies: #142340
2024-12-09 22:23:21 +00:00
dc1ef9afb4 Reapply #142091 (Unbreak dynamic shape minimal arrayref interface tests) (#142340)
Simple bug got introduced somewhere.

The original PR was reverted because it broke (caused unexpected successes for) some tests in test_aot_inductor_arrayref.py that still only run internally because #123691 hasn't been fixed. I've fixed those.

Differential Revision: [D66890276](https://our.internmc.facebook.com/intern/diff/D66890276/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142340
Approved by: https://github.com/hl475
2024-12-09 22:23:21 +00:00
cca33d50b9 [PGNCCL] Use long/short wait for different non-blocking calls (#142291)
In nonblocking mode, we always check if the NCCL communicator is ready between issuing commands to it.
Today this is done by the `waitReady()` function.
Unfortunately, the `waitReady()` function is hard-coded to use `C10D_NCCL_CHECK_TIMEOUT_SLEEP`, which sleeps for an interval between two consecutive checks.
While this is nice when waiting for comm init or finalize, it degrades performance of collective calls (which would almost certainly return success immediately.)

This PR adds a `bool longInterval` argument to `waitReady` and lets the call site determine whether a long wait is likely; if not, `waitReady` uses `sched_yield()` to check for readiness more eagerly.

Thanks @eqy for reporting the issue that small collectives have a perf impact in nonblocking mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142291
Approved by: https://github.com/eqy, https://github.com/fduwjj
2024-12-09 22:19:58 +00:00
452e1a7840 [c10d] Update backend arg documentation (#142404)
Update doc to reflect change brought by https://github.com/pytorch/pytorch/pull/142216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142404
Approved by: https://github.com/XilunWu
2024-12-09 21:53:44 +00:00
12f1989a4a [aoti package] seek 0 after loading buffer (#142204)
Differential Revision: D66855265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142204
Approved by: https://github.com/chenyang78, https://github.com/angelayi
2024-12-09 21:53:28 +00:00
4c7688ca06 Add pytest support for unittest.subTests to CI env (#142238)
Fixes #142157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142238
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #142243
2024-12-09 21:48:20 +00:00
5d3bc633ff [PGNCCL] Rework NCCLComm dtor to avoid clash with CUDA driver shutdown (#141511)
Making CUDA or NCCL calls during object destruction can be dangerous because the CUDA context may have exited before the destructor runs, in which case the CUDA calls would see a "CUDA driver shutting down" error.

This PR takes a destroy call away from the NCCLComm dtor and doesn't add a new one. If users are calling destroy_process_group or abort_process_group as recommended, then we are destroying for them; otherwise we are OK with letting them possibly leak resources (and get a warning).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141511
Approved by: https://github.com/eqy, https://github.com/wconstab
ghstack dependencies: #141510
2024-12-09 21:41:15 +00:00
4dbecf3ba7 Implement CPU pins functions for HPU hooks (#139495)
Link CPU pins function in HPU hooks to the host allocator in tensor_empty

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139495
Approved by: https://github.com/zou3519
2024-12-09 21:37:20 +00:00
e52a534994 [PGNCCL] Deprecate support of onCompletionHook (#142390)
The usage of `onCompletionHook` is mostly similar to what Flight Recorder does today -- for example, measuring how long a collective takes and put it into a profiler's "database".

Since FR already records and can dump info like this, we are considering deprecating the onCompletionHook support to save a side thread. (Each PG runs 3 side threads today, which is resource consuming and complicates the code)

User can file an issue if additional information needs to be recorded.
They can also file an RFC if Flight Recorder needs to accept plugins that customize the recording.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142390
Approved by: https://github.com/fduwjj, https://github.com/fegin
2024-12-09 21:11:33 +00:00
04312293a2 [Inductor] Fix wrong CSEVariable dtype for reduction. Fix #141861 (#142189)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142189
Approved by: https://github.com/jansel
2024-12-09 21:07:35 +00:00
a4dedf27b9 [Inductor] Generalize newly introduced device-bias code to align the behavior of XPU unroll reduction with cuda. (#142348)
Fix #141861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142348
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-12-09 20:58:35 +00:00
8bf28b3613 [EZ] Do not checkout builder for Linux builds (#142282)
All logic should have been migrated to .ci/manywheel folder from builder repo a while back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142282
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277, #142382
2024-12-09 20:52:13 +00:00
cb0a302dde Fix fallthrough behaviour when Meta in TLS include set (#141581)
Fixes https://github.com/pytorch/pytorch/issues/141120

Registering a fallthrough for a backend correctly alters nonFallthroughKeysPerBackend_[backend_idx]. However, the backend_idx calculation does not take into account the local dispatch key set, which is used to temporarily turn on Meta as a backend. This means that makeFallthrough does not behave exactly as if it was a normal function which redispatched rather than a "fake function" implemented with a key mask.

So e.g. impl::computeDispatchKeySet(ks, nonFallthroughKeysPerBackend_[backend_idx]); will exclude keys like Meta which may be in the TLS include set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141581
Approved by: https://github.com/bdhirsh
2024-12-09 20:32:44 +00:00
a1bd784ffd add CK BMM instances (#142002)
Summary:
Adds instances of CK DeviceBatchedGemmMultiD_Xdl_CShuffle_V3 for the aten.bmm CK backend, along with a simple heuristic that will need improving over time.
Adds support for TN, NT, TT and NN layouts.

Differential Revision: D66662554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142002
Approved by: https://github.com/mxz297, https://github.com/xw285cornell
2024-12-09 20:31:24 +00:00
5c76a2834d Revert "add torchrec collectives to enforce global ordering (#141970)"
This reverts commit ceb94d6a7d38930d662e7eb71b9c7620de8c2997.

Reverted https://github.com/pytorch/pytorch/pull/141970 on behalf of https://github.com/malfet due to Apologies for reverting this change, but it broke MacOS testing, but CI was broken at the time ([comment](https://github.com/pytorch/pytorch/pull/141970#issuecomment-2529367680))
2024-12-09 20:25:04 +00:00
960a81fdcd [EZ] Delete unsued binary_macos_test.sh (#142382)
According to https://github.com/search?type=code&q=binary_macos_test.sh+repo%3Apytorch%2Fpytorch (and grep in the repo) it's not used anywhere

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142382
Approved by: https://github.com/atalman
ghstack dependencies: #142276, #142277
2024-12-09 19:37:56 +00:00
cyy
b4c0973b59 [2/N] Apply bugprone-unchecked-optional-access (#141091)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141091
Approved by: https://github.com/Skylion007, https://github.com/albanD

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 19:30:19 +00:00
005c5694eb Refactor "torch.mtia.memory_stats" API (#141723)
Summary:
This diff refactors the code for the "torch.mtia.memory_stats" API to maintain the same file hierarchy as its CUDA counterpart:
- All device memory APIs are now located under ".../mtia/memory.py".
- Device memory APIs can be accessed using either "torch.mtia.XYZ" or "torch.mtia.memory.XYZ".

Test Plan:
Passed a local unit test: `buck run //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

```
Ran 14 tests in 16.657s

OK
I1127 11:06:06.505201 2133030 afg_bindings.cpp:943] afg-aten::mul.out-dtype_Float-bBtLGD6Y executable has been unloaded
I1127 11:06:06.506654 2133030 afg_bindings.cpp:943] afg-add-dtype_Float-fa37JncC executable has been unloaded
W1127 11:06:08.731138 2133030 HazptrDomain.h:148] Tagged objects remain. This may indicate a higher-level leak of object(s) that use hazptr_obj_cohort.
```

Differential Revision: D66549179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141723
Approved by: https://github.com/nautsimon
2024-12-09 19:19:19 +00:00
db379ed1ad [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings the FlexAttention inference support for the inductor backend in torch.compile (support precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR extends and implements a FlexAttention CPP template to support broad attention variants, and meanwhile brings optimized performance on CPUs.

With this, users can transparently extend their Flex Attention usages to CPUs with good and common support from torch.compile, both functionality and performance.

For UT coverage, this PR includes a subset of the critical tests for CPUs (inference only), listed below:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
For the remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py`, the changes are large (1500+ LOC) and would make this PR hard to review, so a follow-up PR will enable and refactor the CPU device UTs.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-09 18:44:39 +00:00
a0d49dc047 Fix to make GELU on aarch64 preserve COW input tensor (#142366)
Fixes #142365

The itensor_from_tensor call was causing the COW input tensor to materialize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142366
Approved by: https://github.com/malfet
2024-12-09 18:42:06 +00:00
0610b9730e Do not use builder repo for MacOS builds (#142277)
Added c7564f31f7/wheel/build_wheel.sh to `.ci/wheel/` folder

Commented out the call to 39532891a0/run_tests.sh because, since 2018, this script has just checked that the tests folder is there and exited, as there is no way to run all PyTorch tests in a single shard; see this logic:
```bash
#!/bin/bash
set -eux -o pipefail

# Essentially runs pytorch/test/run_test.py, but keeps track of which tests to
# skip in a centralized place.
#
# TODO Except for a few tests, this entire file is a giant TODO. Why are these
# tests # failing?
# TODO deal with Windows

# This script expects to be in the pytorch root folder
if [[ ! -d 'test' || ! -f 'test/run_test.py' ]]; then
    echo "builder/test.sh expects to be run from the Pytorch root directory " \
         "but I'm actually in $(pwd)"
    exit 2
fi

# Allow master skip of all tests
if [[ -n "${SKIP_ALL_TESTS:-}" ]]; then
    exit 0
fi
```

https://github.com/pytorch/pytorch/pull/123390 was an attempt that misread the above-mentioned logic, since run_tests will be skipped whenever `${SKIP_ALL_TESTS}` is a non-empty string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142277
Approved by: https://github.com/huydhn, https://github.com/atalman
ghstack dependencies: #142276
2024-12-09 18:33:58 +00:00
5e8e1d725a Remove some unused type ignores (round 1) (#142325)
Over time, a large number of the existing type ignores have become irrelevant/unused/dead as a result of improvements in annotations and type checking.

Having these `# type: ignore` linger around is not ideal for two reasons:

- They are verbose/ugly syntactically.
- They could hide genuine bugs in the future, if a refactoring would actually introduce a bug but it gets hidden by the ignore.

I'm counting over 1500 unused ignores already. This is a first PR that removes some of them. Note that I haven't touched type ignores that looked "conditional" like the import challenge mentioned in https://github.com/pytorch/pytorch/pull/60006#issuecomment-2480604728. I will address these at a later point, and eventually would enable `warn_unused_ignores = True` in the mypy configuration as discussed in that comment to prevent accumulating more dead ignores going forward.

This PR should have no effect on runtime at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142325
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2024-12-09 18:23:46 +00:00
a52d9f6f4c Fix torch.lerp RuntimeError when weight is CPU scalar while input & end are CUDA tensor (#141820)
Fixes #141811
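
A minimal sketch of the previously failing call, assuming the repro implied by the title (requires a CUDA device; not copied from the issue):

```
import torch

start = torch.randn(4, device="cuda")
end = torch.randn(4, device="cuda")
weight = torch.tensor(0.5)  # 0-dim CPU scalar tensor

# Used to raise a RuntimeError about mismatched devices; accepted after this fix.
out = torch.lerp(start, end, weight)
```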

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141820
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-12-09 18:14:54 +00:00
d99c9c2acb [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which was meant for the previous PG creation. The P2P comm creation may then use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through previously (and still does today) because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-12-09 17:56:03 +00:00
219e9c83a5 Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)"
This reverts commit 854d83133bd4b0bca8ba19477c56ef2dd896dfc7.

Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
6fcb294e18 Revert "[AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)"
This reverts commit 91d30546a4338b17f31d31a674662aa53d61b1aa.

Reverted https://github.com/pytorch/pytorch/pull/140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
90fc2b42e3 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 82544bd3a2f71e5995e6b035433139fad884e277.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still has failures internally when building, D66923759 ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2528760716))
2024-12-09 17:04:20 +00:00
dd5df002b9 [pt2e][quant] Make move_exported_model_to_train/eval idempotent (#142239)
Summary: Before we would recompile the model unnecessarily even
if the model is already in the desired mode. For training
frameworks that assume `model.train()` is idempotent and call
this before every single training step, this led to a bunch of
tiny graphs and poor performance. This commit makes these calls
no-ops if we're already in the target train/eval mode.
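
A rough sketch of the behavior change (hypothetical model; `move_exported_model_to_train` is assumed to live under `torch.ao.quantization`, and the export step is simplified relative to the full pt2e flow):

```
import torch
from torch.ao.quantization import move_exported_model_to_train

class Tiny(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.drop = torch.nn.Dropout(0.1)

    def forward(self, x):
        return self.drop(x)

m = torch.export.export_for_training(Tiny(), (torch.randn(2, 8),)).module()
m = move_exported_model_to_train(m)
m = move_exported_model_to_train(m)  # previously triggered another recompile; now a no-op
```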

Test Plan:
python test/test_quantization -k TestQuantizePT2E.test_allow_exported_model_train_eval_idempotent
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142239
Approved by: https://github.com/jerryzh168
2024-12-09 16:50:20 +00:00
d29e0ac9e9 Use set -o pipefail for build.sh (#142377)
This would have made https://github.com/pytorch/pytorch/pull/142359 a
hard failure.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142377
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-09 16:30:57 +00:00
9b6ef8abaf Update inductor jobs to use CUDA 12.4 (#142177)
CUDA 12.4 is the default now.  This frees up some resources.  This also fixes newly added Python 3.13 job by #140733. That PR missed adding the new Docker image `pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks` into docker build workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142177
Approved by: https://github.com/atalman
2024-12-09 16:18:38 +00:00
02848c2e14 [cpu/aarch64] fix compilation for Vec:bf16 (128bit) (#142370)
Fix typo causing compilation error on aarch64 architecture with BF16 support. (#139090)

tag: @swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-09 16:17:32 +00:00
a17ecd8668 Add missing bc CI dependency (#142359)
The [bc](https://www.geeksforgeeks.org/bc-command-linux-examples) command that I use to calculate the MAX_JOBS in https://github.com/pytorch/pytorch/pull/142164 isn't part of the Docker image https://github.com/pytorch/pytorch/actions/runs/12230618287/job/34113698986#step:14:321.  I missed this error when landing https://github.com/pytorch/pytorch/pull/142164.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142359
Approved by: https://github.com/Skylion007
2024-12-09 16:07:40 +00:00
1589c2bc4b [c10d][UCC] Add _reduce_scatter_base to c10d::ProcessGroupUCC (#138021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138021
Approved by: https://github.com/kwen2501
2024-12-09 16:02:24 +00:00
8d9ac9d94e [aarch64] Fix libcusparselt format for CUDA sbsa docker (#142363)
Corrects https://github.com/pytorch/pytorch/pull/141433/files

Error whe building arm wheel https://github.com/pytorch/pytorch/actions/runs/12226514901/job/34101913511
`/opt/rh/gcc-toolset-11/root/usr/bin/ld: /usr/local/cuda/lib64/libcusparseLt.so: error adding symbols: file in wrong format`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142363
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007, https://github.com/atalman
2024-12-09 15:29:36 +00:00
5fc9f419ef [AOTI] Fix multi-kernel codegen when using one-pass (#142333)
Summary: Update multi-kernel codegen to one-pass, following https://github.com/pytorch/pytorch/pull/141980.

Differential Revision: [D66936717](https://our.internmc.facebook.com/intern/diff/D66936717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142333
Approved by: https://github.com/chenyang78
ghstack dependencies: #141980
2024-12-09 14:49:10 +00:00
4d43ec2189 [AOTI] Switch GPU codegen to one-pass (#141980)
Summary: With autotune_at_compile_time enabled, AOTI now can perform CUDA codegen in one pass. CUDA kernel related code is generated in a deferred way, after autotuning is done. This one-pass implementation will eliminate any issue caused by disparity between passes in the previous two-pass implementation (which caused multiple bug reports in the past). One-pass implementation also avoids cloning mutated inputs needed in the two-pass implementation, which will reduce GPU memory consumption.

Differential Revision: [D66739414](https://our.internmc.facebook.com/intern/diff/D66739414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141980
Approved by: https://github.com/chenyang78
2024-12-09 14:40:34 +00:00
f14ce3a923 Update slow tests (#140248)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140248
Approved by: https://github.com/pytorchbot
2024-12-09 11:15:51 +00:00
7101dcfb98 Revert "[inductor][cpp] Add FlexAttention support for CPU inference (#141453)"
This reverts commit 7edbde3334df3223c009769d8226d06071e1fff9.

Reverted https://github.com/pytorch/pytorch/pull/141453 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing periodic NO_AVX2 ([comment](https://github.com/pytorch/pytorch/pull/141453#issuecomment-2527377475))
2024-12-09 09:26:20 +00:00
a108b282ff [4/N] Avoid copy in std::get (#142285)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142285
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-09 07:59:35 +00:00
2cc01cc6d3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 with relu (#141556)
**Description**
Fuse and prepack weight for `linear_dynamic_fp16` with post op relu. In Inductor, the pattern we see is
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape) <- relu
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add) <- relu
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_relu_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_relu_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141556
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #141549
2024-12-09 05:05:11 +00:00
7435f57f60 [BE] Remove unusued channels arg in col2im (#142336)
Number of channels is passed to col2im kernel/device function, but is not used during the computations at all
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142336
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-12-09 01:49:41 +00:00
75e72e1408 Adding lowering to persistent-tma device kernel for _scaled_mm (#142045)
# Summary
This PR adds an alternative triton lowering for _scaled_mm. This uses an updated mm template that utilizes persistent scheduling + TMAs on A and B matrices.

Limitations:
* This implementation does not work with bias values: 0602676c8d/torch/_inductor/kernel/mm_scaled.py (L106) The plan is to remove this workaround and enforce that both scaling + bias are properly done as epilogues onto the existing templates
* K dim must be 32 or greater for these to take effect
* Gated by a config flag (currently defaults to off; maybe it should be on)

## Testing
We don't have any tests exercising this code in CI/CD, but I updated the relevant tests in test_fp8 and they are all green:
<img width="1680" alt="Screenshot 2024-12-05 at 7 24 07 PM" src="https://github.com/user-attachments/assets/9c520541-d97a-416f-9af7-e68b366ec90f">

## Follow Ups
* Work to update the base mm triton templates and utilize the same template from mm/addmm/scaled_mm w/ respective epilogues
* Tuning on Persistent kernel configs. I found ones that work for my problem shapes but need to do some more NCU work

### Some profiling code I was using

Code I am using to iterate w/
```Python
import torch
from dataclasses import dataclass
from jsonargparse import CLI
import logging
from pathlib import Path

from transformer_nuggets.utils.benchmark import ProfileConfig, profile_function
from torchao.float8.inference import (
    addmm_float8_unwrapped_inference,
    preprocess_data,
    Float8MMConfig,
)
from transformer_nuggets.fp8.fp8_matmul import (
    matmul_persistent,
    matmul_tma_persistent,
    matmul_device_tma_persistent,
)
from enum import Enum

logging.getLogger("transformer_nuggets").setLevel(logging.INFO)

class FP8Kernel(Enum):
    PERSISTENT = "Persistent"
    PERSISTENT_TMA = "Persistent-TMA"
    DEVICE_TMA = "Device-TMA"
    SCALED_MM = "Scaled-MM"

class ScalingStrategy(Enum):
    PER_TENSOR = "PerTensor"
    PER_ROW = "PerRow"

@dataclass(frozen=True)
class ExperimentConfig:
    M: int
    K: int
    N: int
    scaling_strategy: ScalingStrategy
    fp8_kernel: FP8Kernel
    compile: bool

def get_fp8_matmul(
    A: torch.Tensor,
    B: torch.Tensor,
    scaling_strategy: ScalingStrategy,
    fp8_kernel: FP8Kernel,
):
    A_fp8 = A.to(torch.float8_e4m3fn)
    B_fp8 = B.to(torch.float8_e4m3fn)
    A_fp8, B_fp8 = preprocess_data(A_fp8, B_fp8, Float8MMConfig(use_fast_accum=True))

    if scaling_strategy == ScalingStrategy.PER_TENSOR:
        a_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
        b_scale = torch.tensor(1, device="cuda", dtype=torch.float32)
    elif scaling_strategy == ScalingStrategy.PER_ROW:
        a_scale = torch.ones((A_fp8.size(0), 1), device="cuda", dtype=torch.float32)
        b_scale = torch.ones((B_fp8.size(1), 1), device="cuda", dtype=torch.float32).T
    else:
        raise ValueError(f"Invalid scaling strategy: {scaling_strategy}")

    assert fp8_kernel == FP8Kernel.SCALED_MM
    return lambda: addmm_float8_unwrapped_inference(
        A_fp8, a_scale, B_fp8, b_scale, output_dtype=torch.bfloat16, use_fast_accum=True
    )

def run_matmul(config: ExperimentConfig):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = torch.randn(config.M, config.K, device=device, dtype=torch.bfloat16)
    B = torch.randn(config.K, config.N, device=device, dtype=torch.bfloat16)

    fp8_matmul = get_fp8_matmul(A, B, config.scaling_strategy, config.fp8_kernel)

    if config.compile and config.fp8_kernel == FP8Kernel.SCALED_MM:
        fp8_matmul = torch.compile(fp8_matmul, mode="max-autotune-no-cudagraphs")

    _ = fp8_matmul()

    return

def main():
    torch.random.manual_seed(123)

    # Define your experiment configuration here
    config = ExperimentConfig(
        M=8192,
        K=8192,
        N=8192,
        scaling_strategy=ScalingStrategy.PER_TENSOR,
        fp8_kernel=FP8Kernel.SCALED_MM,
        compile=True,
    )

    run_matmul(config)

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142045
Approved by: https://github.com/eellison
2024-12-09 01:48:40 +00:00
29e985b7b0 [dim_order] raised runtime error when tensor has ambiguous dim order (#141632)
This diff makes tensor.dim_order() raise an error when the tensor's dim order is ambiguous. A detailed discussion can be found at https://fb.workplace.com/groups/894363187646754/permalink/2039987243084337/
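
A hedged illustration of the behavior described above (the ambiguous example is hypothetical, not taken from the diff):

```
import torch

x = torch.empty(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(x.dim_order())  # (0, 2, 3, 1)

# A tensor whose strides admit more than one valid ordering (e.g. expanded,
# stride-0 dims) is ambiguous; per this diff, dim_order() raises a RuntimeError
# for such tensors instead of silently picking one.
ambiguous = torch.randn(4).expand(2, 3, 4)  # strides (0, 0, 1)
# ambiguous.dim_order()  # would raise after this change
```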

Differential Revision: [D65133579](https://our.internmc.facebook.com/intern/diff/D65133579/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141632
Approved by: https://github.com/larryliu0820
2024-12-08 23:16:57 +00:00
e1196dfe51 Deprecate torch._utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------
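
A small sketch of the migration this deprecation implies, assuming `torch.compiler.is_compiling()` is the intended public replacement:

```
import torch

def branch_on_compile(x):
    # torch._utils.is_compiling() is deprecated; use the public API instead
    if torch.compiler.is_compiling():
        return x * 2
    return x + 1

print(branch_on_compile(torch.ones(3)))                 # eager path
print(torch.compile(branch_on_compile)(torch.ones(3)))  # compiled path
```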

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-12-08 22:55:36 +00:00
869665c44c [torchgen] Fix an unused variable in api/python.py (#142337)
Extracted from https://github.com/pytorch/pytorch/pull/136359

Changes behavior, but the original code seems like it was an obvious oops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142337
Approved by: https://github.com/Skylion007
2024-12-08 21:48:08 +00:00
ef26f1c57e Migrate windows build scripts from builder to pytorch (#142156)
Move builder windows build scripts to pytorch/pytorch
Remove builder checkout during windows build
Pending remove windows build scripts https://github.com/pytorch/builder/tree/main/windows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142156
Approved by: https://github.com/malfet, https://github.com/chuanqi129
2024-12-08 21:43:59 +00:00
05c1f37188 Enable memory swap on Linux docker build (#142293)
Lots of CUDA build jobs are OOM-ing in trunk and the hotspot seems to come from building flash attention, for example https://github.com/pytorch/pytorch/actions/runs/12208390090/job/34061532155#step:14:9369.  There are several options around:

* Mimic the logic from https://github.com/Dao-AILab/flash-attention/blob/main/setup.py#L495-L508.  We are using `linux.2xlarge` for the build with 8 CPU and 16GB.  The current max number of parallel jobs is `(8 - 2)/3 = 2` while the logic from upstream repo has `16 / 9 = 1.7`, so it's very close.
* Upgrade to `linux.2xlarge.memory` with 8 CPU and 64GB for all CUDA build, it could afford up to 7 max parallel jobs according to the above logic.
* Enable swap.

These approaches can work together, so I want to experiment with swapping first as this technique, if working, could be useful in other context too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142293
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-08 20:59:43 +00:00
46dc2965de Adding missing space to pybind_utils.h error message (#142258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142258
Approved by: https://github.com/Skylion007
2024-12-08 20:46:32 +00:00
0c66cee9a2 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
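
A minimal eager-mode analogue of the idea (a sketch only; the actual Inductor decorator rewrites generated Triton code rather than operating on tensors):

```
import functools
import torch

def maybe_upcast_float32(convert_output=True):
    # Decorator sketch: run the math op in float32 when the input is half
    # precision, then cast the result back unless the op returns booleans.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(x):
            orig_dtype = x.dtype
            if orig_dtype in (torch.float16, torch.bfloat16):
                x = x.to(torch.float32)
            out = fn(x)
            if convert_output and orig_dtype in (torch.float16, torch.bfloat16):
                out = out.to(orig_dtype)
            return out
        return wrapper
    return decorator

@maybe_upcast_float32()
def rsqrt(x):
    return torch.rsqrt(x)

@maybe_upcast_float32(convert_output=False)
def isnan(x):  # boolean-returning ops skip the downcast
    return torch.isnan(x)

print(rsqrt(torch.rand(4, dtype=torch.bfloat16)).dtype)  # torch.bfloat16
print(isnan(torch.rand(4, dtype=torch.float16)).dtype)   # torch.bool
```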

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-08 19:42:48 +00:00
c814dd08aa Fixed installing dependencies instructions in CONTRIBUTING.md (#142334)
In the original code, “pip install -r requirements” was missing the “.txt” suffix, so this PR adds it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142334
Approved by: https://github.com/malfet
2024-12-08 19:35:36 +00:00
e343f46464 [inductor] Refactor is_big_gpu (#142220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142220
Approved by: https://github.com/yanboliang
ghstack dependencies: #142219, #142033, #142222
2024-12-08 18:51:36 +00:00
dc7461d6f5 docstring_linter finds long classes and functions without docstrings (#140426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140426
Approved by: https://github.com/eellison
2024-12-08 17:03:57 +00:00
d0b9874603 Teach ruff_linter to report syntax errors (fix #140228) (#142312)
Now syntax errors look like this:

```
torch/_dynamo/variables/base.py:20:

  Error (RUFF) E999
    SyntaxError: Expected ',', found indent.
    See https://beta.ruff.rs/docs/rules/
    >>>  19  |class SourceType(Enum]:
         20  |    """
         21  |    This Enum divides VariableTracker into 2 cases, depending on the variable
         22  |    it represents:

[...more errors...]
```

Note that most syntax errors lead to a cascade of other errors, so the exception is generally wrong, but the location and name are good.

Before they looked like this:

```
>>> General linter failure:

  Error (RUFF) Linter failed
    Linter failed. This a bug, please file an issue against the linter
    maintainer.

    CONTEXT:
    Linter command failed with non-zero exit code.
    STDERR:
    <MainThread:DEBUG> $ /home/rec/.conda/envs/pytorch-dev-constant/
    bin/python3 -m ruff check --exit-zero --quiet --output-format=json
    --config=pyproject.toml /home/rec/git-constant/pytorch/torch/_dynamo/
    variables/base.py
    <MainThread:DEBUG> took 38ms
    Traceback (most recent call last):
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 465, in <module>
        main()
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 424, in main
        lint_messages = check_files(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 273, in check_files
        return [
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 288, in <listcomp>
        severity=severities.get(vuln["code"],
    get_issue_severity(vuln["code"])),
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 172, in get_issue_severity
        if any(
      File "/home/rec/git-constant/pytorch/tools/linter/adapters/
    ruff_linter.py", line 173, in <genexpr>
        code.startswith(x)
    AttributeError: 'NoneType' object has no attribute 'startswith'

    STDOUT:
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142312
Approved by: https://github.com/Skylion007
2024-12-08 16:48:05 +00:00
2c6d094869 [AOTI] Assert misaligned input (#142136)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141891. JIT Inductor relies on copy_misaligned_inputs to fix misaligned inputs. For AOTInductor's use scenario, this is an unacceptable performance hit, so we codegen an input alignment check at the entry point and throw an error if any misalignment exists.
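
A hypothetical illustration of what a "misaligned" input looks like (requires a CUDA device; no AOTI call shown):

```
import torch

base = torch.randn(1024, device="cuda")
misaligned = base[1:]  # storage_offset of one float32 leaves the pointer 4 bytes past alignment

print(base.data_ptr() % 16)        # 0: freshly allocated storage is aligned
print(misaligned.data_ptr() % 16)  # 4: not 16-byte aligned; AOTI now raises instead of copying
```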

Differential Revision: [D66881038](https://our.internmc.facebook.com/intern/diff/D66881038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142136
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #142133
2024-12-08 15:13:01 +00:00
5035ff0796 [AOTI] Refactor codegen_inputs signature (#142133)
Summary: Since codegen_inputs only writes to self.prefix, drop IndentedBuffer from its parameters, to make the API consistent with other similar functions.

Differential Revision: [D66881040](https://our.internmc.facebook.com/intern/diff/D66881040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142133
Approved by: https://github.com/chenyang78
2024-12-08 15:05:03 +00:00
32b94644fc Add manylinux_2_28_x86_64 tags to wheel builds (#141988)
Tag the wheels with the appropriate manylinux 2.28 tags.
Initially used auditwheel, but it does much more than just adding tags: it also tries to package multiple libs into the wheel, which we don't want at this point. Hence just changed the tag and filename. If no libs are repackaged by auditwheel, all it does is tag and rename.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141988
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-08 14:36:24 +00:00
0ecba57561 [CI] Add xpu new docker image name into docker builds workflow (#142298)
Add the missing new XPU docker image name to adapt to the new mechanism introduced by https://github.com/pytorch/test-infra/pull/6013
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142298
Approved by: https://github.com/huydhn
2024-12-08 09:34:20 +00:00
7edbde3334 [inductor][cpp] Add FlexAttention support for CPU inference (#141453)
This PR brings FlexAttention inference support to the Inductor backend of torch.compile (supported precisions: bf16 and fp32) on CPUs.

Based on the existing CPP template, this PR implements a FlexAttention CPP template that supports a broad range of attention variants while delivering optimized performance on CPUs.

With this, users can transparently extend their FlexAttention usage to CPUs with solid torch.compile support, in both functionality and performance.
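
A hedged usage sketch (shapes, dtype, and the score_mod are illustrative; the point is only that the same torch.compile path now works for CPU inference):
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # simple relative-position score modification
    return score + (q_idx - kv_idx)

q, k, v = (torch.randn(1, 4, 128, 64, dtype=torch.bfloat16) for _ in range(3))
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=rel_bias)
```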

For unit tests, this PR includes a subset of the critical tests for CPUs (inference only), listed below:
```
pytest test/inductor/test_flex_attention.py
`TestFlexAttention`
#common functions:
run_test
preprocess_paged_attention
run_paged_attention
run_test_with_paged_attention
run_test_with_call
run_dynamic_test
run_automatic_dynamic_test

#test functions:
test_builtin_score_mods
test_builtin_score_mods_automatic_dynamic
test_builtin_score_mods_different_seqlen
test_builtin_score_mods_different_block_size
test_kv_batch_broadcast
test_GQA
test_cpu_error_message_return_lse
test_validate_cpu_dtype_error_message

`TestPagedAttention`
#test function:
test_paged_builtin_score_mods
```
The remaining UTs in `test/inductor/test_flex_attention.py` and `test/inductor/test_flex_decoding.py` will be submitted in a follow-up PR dedicated to enabling and refactoring the CPU device UTs, since the larger change (1500+ LOC) would make this PR hard to review.

Besides, more optimizations are also planned in follow up PRs, including:

- Block sparse computation
- Flash decoding tuning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141453
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-12-08 07:57:21 +00:00
0bd7b7ae58 Add version check for C++ pytree availability (#142299)
Resolves #142256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142299
Approved by: https://github.com/jansel, https://github.com/weifengpy
2024-12-08 06:27:32 +00:00
2682e5e0d4 [BE]: Add TypeGuard to is_symbolic (#142304)
Improves type inference for is_symbolic. If it returns True, the value must currently be either a SymInt or a torch Tensor.
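
A minimal sketch of what the annotation buys (not the exact PyTorch signature; TypeGuard lives in typing on Python 3.10+, typing_extensions on older versions):
```python
from typing import TypeGuard, Union

import torch
from torch import SymInt

def is_symbolic(x: Union[int, SymInt, torch.Tensor]) -> TypeGuard[Union[SymInt, torch.Tensor]]:
    # A True result lets type checkers narrow x to SymInt | Tensor at the
    # call site instead of leaving it as the full union.
    return isinstance(x, (SymInt, torch.Tensor))
```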

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142304
Approved by: https://github.com/jansel
2024-12-08 02:18:17 +00:00
2fc8bac091 [ROCm] Fix unit test: matmul_offline_mgpu_gpu_tunableop (#142269)
Fixes #141652

This PR fixes (at least in part) the unit test failure. However, we may also need to do a separate flush of the untuned results-- if this test continues to be flaky, another PR would be needed to flush the untuned results as well.

Tested locally and it seems to be working.

Also fixing code that was accidentally commented out in the unit test from the prior multi-GPU offline tuning PR https://github.com/pytorch/pytorch/pull/139673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142269
Approved by: https://github.com/jeffdaily
2024-12-08 02:18:00 +00:00
b1bb860d3c c10::string_view -> std::string_view in aten (#141903)
D66560348 passes internally, but won't export, so I'm rebuilding here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141903
Approved by: https://github.com/Skylion007
2024-12-07 23:23:52 +00:00
8cb68b136f Proper modeling of recursive types (#142300)
Currently there are a few type annotations that falsely state that mypy doesn't support recursive types.

Recursive type support is available in mypy for a few years already. It has been officially enabled in [version 0.991](https://mypy-lang.blogspot.com/2022/11/mypy-0990-released.html). Pyright even had support for recursive types earlier (https://github.com/microsoft/pyright/issues/569), so there is probably no reason not to model these types correctly.

This PR models these types properly now. Since this has turned a few implicit `Any` into fully typed variables that are not narrowed cleanly, a small number of type ignores were necessary.

Note that regarding the `Argument` type, it is desirable to model it in a covariant way (i.e. using `Sequence` and `Mapping`) instead of making it unnecessarily invariant (using `List` and `Dict`). If it were modeled invariantly, it would, for instance, mean that a `List[Node]` would not type check as `Argument`, because invariance would require it to literally be a `List[Argument]` (i.e., including all branches of the union type). Since even the name "argument" strongly suggests that the type is semantically used as an argument, covariance is the natural choice anyway.
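
A hedged sketch of the difference (the real definition lives in torch.fx and is more elaborate; `Node` here is a stand-in):
```python
from typing import Dict, List, Mapping, Sequence, Union

class Node:  # stand-in for torch.fx.Node
    pass

# Covariant, recursive modeling: a List[Node] or Dict[str, Node] is accepted
# as an Argument, because Sequence/Mapping are covariant in their element type.
Argument = Union[None, int, float, str, Node,
                 Sequence["Argument"], Mapping[str, "Argument"]]

# An invariant version would require literally List[Argument]/Dict[str, Argument],
# so a plain List[Node] would be rejected by the type checker:
InvariantArgument = Union[None, int, float, str, Node,
                          List["InvariantArgument"], Dict[str, "InvariantArgument"]]

nodes: List[Node] = [Node()]
arg: Argument = nodes  # OK under the covariant definition
```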

There are no changes in this PR that affect runtime behavior.

CC @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142300
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-12-07 21:30:45 +00:00
17f1a42c13 Add missing py::bytes to pybind_utils tryToInferType (#142265)
I'm not sure what the best way to fix this is, but this does unbreak an internal test.

Test Plan: Sandcastle

Reviewed By: itamaro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142265
Approved by: https://github.com/houseroad
2024-12-07 20:31:57 +00:00
3b531f18c7 [BE] Improve Flight Recorder efficacy (#142178)
Summary:
This is an attempt to improve the flight recorder efficacy.
We have a small subset of jobs that are timing out (i.e. failing to write out FR logs in 1 minute) and some that are throwing a `std::exception - broken promise`.

There are two changes in here.
1. We attempt to write out the FR buffer with stack traces. If this fails, we attempt to capture the FR buffer again, but this time without stack traces. The assumption here is that FR could be locking up while unwinding the stack.
Note, to keep things simple, I'm re-using the same file name for both the with- and without-stack-traces cases.
2. Add additional catch statements in the Manifold writer. There might be something going on here, so we'll get a log statement if this is failing.
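
A rough sketch of change 1 in Python pseudocode rather than the actual C++ (the helper names are made up):
```python
def capture_fr_buffer(include_stack_traces: bool) -> bytes:
    # Stand-in for the real Flight Recorder capture; unwinding the stack
    # is the step suspected of hanging when include_stack_traces=True.
    return b"{}"

def dump_flight_recorder(write_to_destination) -> None:
    try:
        payload = capture_fr_buffer(include_stack_traces=True)
    except Exception:
        # Retry without stack traces, re-using the same file name.
        payload = capture_fr_buffer(include_stack_traces=False)
    write_to_destination(payload)
```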

TODO:
- There's nothing in the output that differentiates whether stack traces were omitted on purpose or not.
This info might be useful for the analyzer - so I'll add this in a follow on diff.

Test Plan: Unit tests.

Differential Revision: D66843194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142178
Approved by: https://github.com/kwen2501
2024-12-07 19:32:28 +00:00
91d30546a4 [AOTI] Fix #140546 and support AOTI package load for Intel GPU. (#140664)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026

Fix #140546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140664
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268, #140269
2024-12-07 19:22:04 +00:00
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR add XPU support for AOT Inductor, and reuse the corresponding UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
3d227ae315 [Intel GPU] Support getStreamFromExternel for XPU. (#140268)
In the AOT Inductor scenario, the GPU stream can be created outside of the pool of `XPUStream`, and we need to create an `XPUStream` which refers to this stream for the common logic of AOTI; for example, a stream guard is a guard for `XPUStream`. So we add getStreamFromExternel following the design of CUDAStream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140268
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
2024-12-07 19:22:04 +00:00
843018f407 [inductor] Refactor split factor into V.choices.reduction_split_factor (#142222)
I want to reuse this for cooperative reduction heuristics (in a later PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142222
Approved by: https://github.com/eellison
ghstack dependencies: #142219, #142033
2024-12-07 17:48:45 +00:00
81edca08ab [inductor] Refactor some DeviceProperties usage (#142033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142033
Approved by: https://github.com/eellison
ghstack dependencies: #142219
2024-12-07 17:48:45 +00:00
0367a31401 [inductor] Minor typing changes (#142219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142219
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2024-12-07 17:48:37 +00:00
524395edf4 [aarch64] build cuda 12.6 manywheel dockers (#139988)
Add sbsa CUDA 12.6 manywheel docker builds to the workflow
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139988
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-12-07 15:38:41 +00:00
82544bd3a2 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue: (screenshot of the failing run omitted)

After fixing: (screenshot of the passing run omitted)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-07 15:23:38 +00:00
f84e533a2c Add device-agnostic runtime Device/Stream C++ API (#138677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138677
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #133572
2024-12-07 13:14:10 +00:00
952514f0c8 Add UTs for accelerator device-agnostic runtime APIs (#133572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133572
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-07 13:14:10 +00:00
b6a64b64de Add ncu profile to final output_code.py (#142259)
This PR adds `--ncu` to the output code benchmark utils to generate ncu profile reports.

Test Plan:
```
% python torch_compile_debug/run_2024_12_05_22_27_59_182730-pid_4112931/torchinductor/model__0_forward_1.0/output_code2.py --ncu
0.000160
Peak GPU memory usage 671.220 MB
==PROF== Connected to process 502514 (python3.10)
==PROF== Connected to process 503187 (python3.10)
==WARNING== Unable to access the following 6 metrics: ctc__rx_bytes_data_user.sum, ctc__rx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__rx_bytes_data_user.sum.per_second, ctc__tx_bytes_data_user.sum, ctc__tx_bytes_data_user.sum.pct_of_peak_sustained_elapsed, ctc__tx_bytes_data_user.sum.per_second.

==PROF== Profiling "distribution_elementwise_grid..." - 0: 0%....50%....100% - 38 passes
==PROF== Profiling "vectorized_elementwise_kernel" - 1: 0%....50%....100% - 38 passes
==PROF== Profiling "triton_poi_fused_embedding_0" - 2: 0%....50%....100% - 38 passes
6.891588
==PROF== Disconnected from process 502514
==PROF== Disconnected from process 503187
==PROF== Report: /tmp/ncu_output_20241206_131245.ncu-rep

NCU profiling results for benchmark None:
NCU report has been written to /tmp/ncu_output_20241206_131245.ncu-rep
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142259
Approved by: https://github.com/eellison
2024-12-07 07:54:43 +00:00
fc831f76f8 Revert "[Inductor] Represent size_hints as a dict (#142249)"
This reverts commit f870ee2cc4f3dd1babd3043b5291d54f487a2999.

Reverted https://github.com/pytorch/pytorch/pull/142249 on behalf of https://github.com/blaine-rister due to would break internal tests ([comment](https://github.com/pytorch/pytorch/pull/142249#issuecomment-2524991008))
2024-12-07 07:43:51 +00:00
f870ee2cc4 [Inductor] Represent size_hints as a dict (#142249)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243.

# Feature

Follow up to https://github.com/pytorch/pytorch/pull/141751. Since we now represent `numels` as a dict, it's natural to extend this to `size_hints`. The latter are basically just the former rounded up to the nearest power of 2. This simplifies various heuristics such as the coordinate descent tuner. Where we previously needed to determine which index in `size_hints` corresponds to each dimension, now we can just query by prefix. This will be especially important when we enable 2D reductions, as it becomes harder to keep track of these things when we have multiple reduction dimensions. (See the previous PR for some examples.)

# Test plan

The existing CI provides good coverage. This PR modifies a few tests which explicitly constructed size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142249
Approved by: https://github.com/jansel
2024-12-07 06:43:05 +00:00
a58d2f14e8 [DTensor] Add a private util for sharding tensor (#142288)
Locally shards a full tensor based on the indicated sharding arrangement, and returns a DTensor containing the local shard.

warning: This is a private API intended to skip the communication otherwise required by `distribute_tensor`. It is only applicable when all ranks have the same `full_tensor`.
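
A hedged sketch of the idea using public DTensor pieces (the function name and the Shard(0), even-split assumptions are illustrative; this is not the private API added here):
```python
import torch
from torch.distributed.tensor import DTensor, Shard

def shard_without_comm(full_tensor: torch.Tensor, mesh) -> DTensor:
    # Every rank holds the same full_tensor, so slice out the local shard
    # directly and wrap it, skipping the scatter that distribute_tensor
    # would otherwise perform. Assumes dim 0 splits evenly across ranks.
    world = mesh.size()
    rank = mesh.get_local_rank()
    local = full_tensor.chunk(world, dim=0)[rank]
    return DTensor.from_local(local, mesh, [Shard(0)])
```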

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142288
Approved by: https://github.com/wz337
2024-12-07 05:30:18 +00:00
2d9b081012 Move knapsack algorithms into a separate file (#141451) (#141614)
Summary:

Original Context Doc: https://docs.google.com/document/d/1Gv5ZqN3UY7kTuCUd0JLU3L_onwVIr2vd3AiZyim1Muc/edit?disco=AAABZX_tqdk

### Changes

This diff restructures the Partitioners.py file in the _functorch package.

* Moves the three knapsack problem algorithms (greedy, ILP, DP) into a separate file

Test Plan:
### Unit Testing

```
$ buck test mode/opt //caffe2/test/functorch:test_ac
File changed: fbsource//xplat/caffe2/test/functorch/TARGETS
File changed: fbsource//xplat/caffe2/test/functorch
File changed: fbsource//xplat/caffe2/test
7 additional file change events
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/a2f91f8a-5326-435e-9075-5af0de930b8b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/7036874660108924
Network: Up: 28MiB  Down: 3.5GiB  (reSessionID-19af3d71-7528-448c-9126-5615d27b3bd7)
Jobs completed: 423656. Time elapsed: 3:57.2s.
Cache hits: 99%. Commands: 146147 (cached: 145758, remote: 317, local: 72)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

```

### Tested on Local Training Run

```
CUDA_VISIBLE_DEVICES=5,6 AOT_PARTITIONER_DEBUG=1 PARTITIONER_MEMORY_BUDGET_PARETO=0 buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2 2>&1 | tee log_2024-11-2320:41:50.757256__bento_trigger.txt
```

Output Summary Paste: P1685697066

Differential Revision: D65800097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141614
Approved by: https://github.com/jansel, https://github.com/Chillee
2024-12-07 03:21:52 +00:00
c863227be3 [Quant][Inductor][X86] add fusion pass for linear_dynamic_fp16 (#141549)
**Description**
For `linear_dynamic_fp16`, we insert `quantize` and `dequantize` between x/w and linear to have the following pattern:
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```
In Inductor, the pattern we finally see will be
```
fp32 activation
  |
(reshape)
  |
mm/addmm <- t <- to_fp32 <- to_fp16 <- weight
  |
(reshape)
```
Or
```
fp32 activation
  |
expand
  |
 bmm <- expand <- t <- to_fp32 <- to_fp16 <- weight
  |
(add)
```
The second pattern is for x.ndim > 2 and x is not contiguous. The first pattern is for other cases.

Fuse the pattern with weight prepack, and we get
```
fp32 activation
  |
onednn.linear_dynamic_fp16 <- onednn.linear_prepack_fp16 <- weight
```
After freezing, the prepack op is gone.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_dynamic_fp16
```

Differential Revision: [D66802159](https://our.internmc.facebook.com/intern/diff/D66802159)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141549
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-12-07 03:08:08 +00:00
7939b5f5f9 remove sccache from bazel, to go together with #140614 (#142241)
Removes sccache from bazel builds. Will move bazel builds to periodic if the builds succeed.

CUDA bazel test succeeded, moving to periodic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142241
Approved by: https://github.com/malfet
2024-12-07 02:08:06 +00:00
40d1b5f490 Revert "Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)"
This reverts commit add4a42ea2c56f7687a3564aefe9e017cd118936.

Reverted https://github.com/pytorch/pytorch/pull/140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](https://github.com/pytorch/pytorch/pull/140320#issuecomment-2524742845))
2024-12-07 01:28:51 +00:00
db313c87f9 [OSS] Enable Flight Recorder buffer for all (#142260)
Summary: Enable collecting Flight Recorder data for all.

Test Plan: This has been rolled out internally for a while now.

Differential Revision: D66897635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142260
Approved by: https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/wconstab
2024-12-07 01:28:12 +00:00
78425bff30 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Changes for Reland**
- Preserved the public objects from `torch/distributed/_composable/fsdp/fully_shard.py` so that the import path still works internally
- Added a unit test that we can do `from torch.distributed._composable.fsdp.fully_shard import FSDPModule`

Differential Revision: [D66890387](https://our.internmc.facebook.com/intern/diff/D66890387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy, https://github.com/fegin, https://github.com/XilunWu

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-07 01:24:28 +00:00
868d62552d [aoti] Add load_constants to package api (#142246)
Summary:
With the changes in https://github.com/pytorch/pytorch/pull/140755 and https://github.com/pytorch/pytorch/pull/141997, I added a load_constants function to the packaging API. Currently this doesn't work for cpu.

The workflow is something like:

```
ep = torch.export.export(model, example_inputs)
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs=inductor_configs)
compiled = torch._inductor.aoti_load_package(package)

print(compiled.get_constant_fqns())  # see what are the fqns needed/available

compiled.load_constants(new_state_dict, check_full_update=True)  # update the constants in AOTI
```

You can also use the `aot_inductor.package_constants_in_so` config to stop including the constants in the so:
```
package = torch._inductor.aoti_compile_and_package(ep, inductor_configs={"aot_inductor.package_constants_in_so": False})
compiled = torch._inductor.aoti_load_package(package)
compiled(*inputs)  # segfaults because there are no constants --> we should probably have a better error msg

compiled.load_constants(new_state_dict, check_full_update=True)
compiled(*inputs)
```

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_so_without_weight"  `

Differential Revision: D66796206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142246
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire
2024-12-07 01:18:42 +00:00
3a3638be50 [BE] Enable Scalar.h compilation on 32-bit system (#142235)
By hiding ambiguous Scalar(long long) constructor behind `std::enable_if_t<sizeof(void *) == 8>`

Followup after https://github.com/pytorch/pytorch/pull/141244

Test Plan: Run `printf "#include <c10/core/Scalar.h>\n c10::Scalar x(3);" | gcc -x c++ -std=c++17 -I. -Ibuild - -c` on an ARMv7 system.
Before this change it failed with:
```
In file included from <stdin>:1:
./c10/core/Scalar.h:83:3: error: ‘c10::Scalar::Scalar(long long int)’ cannot be overloaded with ‘c10::Scalar::Scalar(int64_t)’
   83 |   Scalar(long long vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/Scalar.h:50:3: note: previous declaration ‘c10::Scalar::Scalar(int64_t)’
   50 |   Scalar(type vv) : Scalar(vv, true) {}
      |   ^~~~~~
./c10/core/ScalarType.h:288:3: note: in expansion of macro ‘DEFINE_IMPLICIT_CTOR’
  288 |   _(int64_t, Long)                                \
      |   ^
./c10/core/Scalar.h:52:3: note: in expansion of macro ‘AT_FORALL_SCALAR_TYPES_AND7’
   52 |   AT_FORALL_SCALAR_TYPES_AND7(
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142235
Approved by: https://github.com/Skylion007
2024-12-07 01:05:55 +00:00
022cbf2f31 Back out "[Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)" (#142226)
Summary: Failed to write the auto-tune result to `PYTORCH_TUNABLEOP_FILENAME` with this change (empty file), reverting to unblock.

Test Plan: CI

Differential Revision: D66870750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142226
Approved by: https://github.com/leitian
2024-12-07 00:59:22 +00:00
c6e18a1ed1 [EZ] Remove unused binary_linux_build.sh (#142276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142276
Approved by: https://github.com/huydhn
2024-12-07 00:48:56 +00:00
716a06d22c Mark async-tp ops as needs_fixed_stride_order (#142252)
Inductor seems not to respect the input striding of these ops, which is required for fp8 async-TP and has performance implications in other cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142252
Approved by: https://github.com/weifengpy
2024-12-07 00:42:27 +00:00
be16dd678e [BE][MPS] Fix unused parameter warning
Before this change running `xcrun metal -c Indexing.metal -Wall -Wextra -fno-fast-math` resulted in 
```
Indexing.metal:75:10: warning: unused parameter 'thread_index' [-Wunused-parameter]
    uint thread_index [[thread_position_in_grid]]) {
         ^
1 warning generated.
```
After the change, no warnings are generated.

Also, remove redundant semicolons
2024-12-06 16:41:27 -08:00
a30bfab224 random dag (#142180)
Utils for creating random dags and generating code for nn modules based on such dags.

In particular, this was used for fuzz testing of unflatten, where the random dags drive the generation of calls, const accesses, and buffer mutations in a system of nn modules.

Example of generated test:
```python
    def test_unflatten_random_dag_const_preserving_3(self):
        class N2(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)

            def forward(self, x):
                return x + 1

        class N1(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n2 = N2()

            def forward(self, x):
                x = x + self.n2.const
                x = self.n2(x + 1)
                return x + 1

        class N0(torch.nn.Module):
            def __init__(self):
                super().__init__()
                self.const = torch.ones(1)
                self.n1 = N1()

            def forward(self, x):
                x = x + self.n1.n2.const
                x = self.n1(x + 1)
                x = self.n1.n2(x + 1)
                return x + 1

        inp = (torch.ones(1),)
        eager = N0()(*inp)
        ep = torch.export.export(
            N0(),
            inp,
            strict=False,
            preserve_module_call_signature=(
                "n1",
                "n1.n2",
            ),
        )
        epm = ep.module()
        ufm = torch.export.unflatten(ep)
        assert torch.allclose(epm(*inp), eager)
        assert torch.allclose(ufm(*inp), eager)
```

Differential Revision: [D66838348](https://our.internmc.facebook.com/intern/diff/D66838348/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142180
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-12-07 00:39:43 +00:00
0052943bee [SymmetricMemory] reorganize the op registry (#140763)
- Separates the definition and implementation
- Removes the false pt2_compliant flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140763
Approved by: https://github.com/weifengpy
2024-12-06 23:59:11 +00:00
6e203ae6de [REFACTOR] Implement AOTDispatchCompiler wrapper (#142205)
This implements a new wrapper class, AOTDispatchCompiler, which is just a wrapper around a callable that returns an OutputCode. We can then use it in AOTDispatch to decide whether or not to use the cache: if fw_compiler, bw_compiler and inference_compiler are all AOTDispatchCompilers, then we enable caching.

This type is pretty close to _CompiledFxGraphCallable, except it's not allowed to take any kwargs. Not sure how to consolidate the two ideas together just yet: unfortunately, there's no way to properly annotate the types to make them related. But a lot of the time, the input to this function will be a partially applied _CompiledFxGraphCallable.

This allows the PR above this one to enable AOTAutogradCache everywhere, without increasing instruction count or enabling the cache on unit tests that use aot_eager or other non-Inductor compilers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142205
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
2024-12-06 23:23:20 +00:00
5663ad99e7 Fix per-sample xfails for NJT tests (#142243)
#140736 fixed some xfails, but these were not properly failing in CI due to #142157. This PR removes the xfails so we can land a fix to that issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142243
Approved by: https://github.com/huydhn
2024-12-06 22:39:35 +00:00
ceb94d6a7d add torchrec collectives to enforce global ordering (#141970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141970
Approved by: https://github.com/yf225
2024-12-06 22:38:54 +00:00
UV
7597ab6370 Corrected AMSGrad max equation in Adam and AdamW (#142051)
Fixes #142041
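
For reference, the AMSGrad variant keeps a running maximum of the (bias-corrected) second-moment estimate and uses it in the denominator; written loosely here, the exact notation in the docs may differ:

$$\widehat{v}^{\,max}_t = \max\bigl(\widehat{v}^{\,max}_{t-1},\ \widehat{v}_t\bigr), \qquad \theta_t = \theta_{t-1} - \gamma\,\frac{\widehat{m}_t}{\sqrt{\widehat{v}^{\,max}_t} + \epsilon}$$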

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142051
Approved by: https://github.com/janeyx99
2024-12-06 21:55:26 +00:00
424156c26c [ROCm] Update to AOTriton 0.8b (#140172)
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
    + The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCM 6.2.4, notably for
   its firmware updates.

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as GPU kernel compression algorithm for better compression ratio: aotriton0.8b (.so + aotriton.images take 350MB) compared to aotriton0.7b .so: 800MB
3. The compression cannot be disabled now, and `liblzma` is hard run-time dependency.
    + Should not be a problem, since `lzma` is part of Python Standard Library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-06 21:45:18 +00:00
eqy
0a619a212f [CUDA] Cleanup per-process-memory-fraction in test_cuda.py tests (#140852)
Otherwise certain sequences of tests will fail with OOM, e.g.:
```
# python test/test_cuda.py -k max_split_expandable -k test_assigning_back_deleter_fns_to_tensor --repeat 100
..
----------------------------------------------------------------------
Ran 2 tests in 0.311s

OK
E.
======================================================================
ERROR: test_assigning_back_deleter_fns_to_tensor (__main__.TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/pytorch/torch/testing/_internal/common_utils.py", line 3058, in wrapper
    method(*args, **kwargs)
  File "/workspace/pytorch/test/test_cuda.py", line 4320, in test_assigning_back_deleter_fns_to_tensor
    graph, outputs = cudagraphify(foo, [inp])
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/pytorch/test/test_cuda.py", line 4080, in cudagraphify
    fn(*inputs)
  File "/workspace/pytorch/test/test_cuda.py", line 4316, in foo
    int8_cuda(LARGE_BUFFER) + x,
    ~~~~~~~~~~~~~~~~~~~~~~~~^~~
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 31.73 GiB of which 31.30 GiB is free. Process 2916661 has 442.00 MiB memory in use. 120.00 MiB allowed; Of the allocated memory 52.00 MiB is allocated by PyTorch, and 6.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

To execute this test, run the following from the base repo dir:
    python test/test_cuda.py TestBlockStateAbsorption.test_assigning_back_deleter_fns_to_tensor
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 2 tests in 0.136s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140852
Approved by: https://github.com/Skylion007
2024-12-06 21:26:54 +00:00
660845a1aa [AOTI] Add deprecation warning for torch._export.aot_load (#142212)
Summary: Add a deprecation warning for torch._export.aot_load, and encourage users to move to the new torch._inductor.aoti_load_package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142212
Approved by: https://github.com/angelayi
2024-12-06 21:12:34 +00:00
f36cccba2e Revert "[Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)"
This reverts commit 80ca6dd892613fd4f1dee9040b8273ddeadb1c50.

Reverted https://github.com/pytorch/pytorch/pull/140864 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/140864#issuecomment-2524168602))
2024-12-06 21:03:06 +00:00
cyy
1fa27f6e82 [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 20:13:36 +00:00
add4a42ea2 Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#140320)
Fixes #140318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140320
Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
2024-12-06 20:09:56 +00:00
37c4b19e4d make sure ukernel prod is everywhere XNNPACK is (#142086)
Just double checking that ukernel prod (which should be linked with XNNPACK) is in all the places XNNPACK is
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142086
Approved by: https://github.com/kirklandsign
2024-12-06 20:09:30 +00:00
18ef3a09cd Add option in data loader for out of order data (#141833)
Fixes #105203

Facing a similar problem to the linked issue: with variable-sized input data, a handful of slow-to-process samples can hold up smaller, faster-to-process samples from being used, which also leads to lower GPU utilization. In certain cases, e.g. evaluation epochs, inference pipelines, or other scenarios where reproducibility isn't important, relaxing the ordering can bring significant speed-ups.

This PR adds an `allow_out_of_order` bool input to the `DataLoader` class, defaulting to `False`, which when set to `True` will return data from workers in whatever order they are ready/processed in, rather than in strict index order.
Instead of storing data that was returned out of order, it is passed directly to the main thread and the entry in `_task_info` is deleted. The main changes are checking that an entry in `_task_info` actually exists, and only increasing `self._rcvd_idx` when the lowest remaining index gets returned.

Two tests are added to test this for iterable type datasets and index type datasets.
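
A minimal usage sketch, assuming the flag keeps the `allow_out_of_order` name described above (dataset and worker count are illustrative):
```python
from torch.utils.data import DataLoader

dataset = list(range(1000))  # any map-style dataset works here

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    allow_out_of_order=True,  # hand back batches as workers finish them
)

for batch in loader:
    ...  # batches may arrive in a different order than their indices
```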

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141833
Approved by: https://github.com/andrewkho
2024-12-06 19:55:58 +00:00
61a7c83c64 [Inductor] fix device error for NopKernelSchedulerNode (#141372)
This PR adds device guard support for NopKernelSchedulerNode, which may create a tensor. Prior to this PR, we did not codegen a device guard for NopKernelSchedulerNode, leading to errors.

Prior to the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # TODO: ERROR here. Should be cuda:1
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        breakpoint()
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

After the PR:
```python
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1, arg9_1, arg10_1 = args
    args.clear()
    assert_size_stride(arg0_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg1_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg2_1, (1, 1, 2048, 128), (262144, 262144, 128, 1))
    assert_size_stride(arg3_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg4_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg5_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg6_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg7_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg8_1, (1, 1, 16, 16), (256, 256, 16, 1))
    assert_size_stride(arg9_1, (1, 1, 16), (16, 16, 1))
    assert_size_stride(arg10_1, (1, 1, 16, 16), (256, 256, 16, 1))
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((1, 1, 2048), (2048, 2048, 1), torch.float32) # New: move into device guard
        buf1 = empty_strided_cuda((1, 1, 2048, 128), (262144, 262144, 128, 1), torch.bfloat16)
        # Topologically Sorted Source Nodes: [flex_attention], Original ATen: []
        stream1 = get_raw_stream(1)
        triton_tem_fused_0.run(arg0_1, arg1_1, arg2_1, buf0, arg3_1, arg4_1, arg5_1, arg6_1, buf1, grid=torch._inductor.kernel.flex_attention.flex_attention_grid(1, 1, 2048, 128, meta0), stream=stream1)
        del arg0_1
        del arg1_1
        del arg2_1
        del arg3_1
        del arg4_1
        del arg5_1
        del arg6_1
        del buf0
    return (buf1, )
```

Fixes #141010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141372
Approved by: https://github.com/eellison
2024-12-06 19:27:50 +00:00
3fd51e079d Revert "[Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)"
This reverts commit 35752cb1ba8324a00b06d72ed388f6437c82c5e5.

Reverted https://github.com/pytorch/pytorch/pull/141759 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141759#issuecomment-2523983558))
2024-12-06 19:14:36 +00:00
db13bd9ac2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit b8eb4b56d8dbcd07570bec616f7ea58e9dd58fb4.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/atalman due to Break internal tests see errors like: csrc\inductor\aoti_torch\shim_common.cpp(481): error C2491: 'aoti_torch__embedding_bag': definition of dllimport function not allowed ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2523968128))
2024-12-06 19:04:04 +00:00
cf58de59d7 [cuBLASLt][Memtracker] Allocate temporary cuBLASLt workspaces using tensors rather than going to the caching allocator directly (#139442)
CC @zdevito @janeyx99

This isn't ideal, but cuBLASLt workspaces are not currently cached, so this additional untracked allocation will cause `test_cuda_tracker_equivalence` to fail with a large enough workspace size, e.g. `CUBLAS_LT_WORKSPACE_SIZE=32768`. One solution is to just use byte tensors for the workspace instead of going directly to the caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139442
Approved by: https://github.com/Aidyn-A, https://github.com/albanD, https://github.com/janeyx99
2024-12-06 19:01:12 +00:00
b7b56576d8 Allow user to manually pass module.name associated with global in {add}_safe_global (#142153)
Fixes #142144

A global x is saved in the checkpoint as `GLOBAL x.__module__ x.__name__`. So, after allowlisting a GLOBAL, it is expected to match any GLOBAL instruction of the form `GLOBAL x.__module__ x.__name__`, but there are edge cases where, for the same API from the same module, what `__module__` returns changes between versions, which prevents users from allowlisting the global.

In this case, in numpy < 2.1

```
torch.save(np_array, "bla")
# checkpoint has GLOBAL "np.core.multiarray" "_reconstruct"
```
In np version 2.1

```
with safe_globals([np.core.multiarray._reconstruct]):
    torch.load("bla")
```
np.core.multiarray._reconstruct.__module__ gives "np._core.multiarray" (note the extra _ before core); see what was done [here](https://github.com/numpy/numpy/blob/main/numpy/core/multiarray.py)

Since the dictionary to access safe globals is keyed on "{foo.__module__}.{foo.__name__}", __module__, __name__ will no longer match that in the checkpoint so "np.core.multiarray._reconstruct" can no longer be properly allowlisted (instead np._core.multiarray._reconstruct is a key in the dict).

We allow `add_safe_globals`/`safe_globals` to optionally take tuples of (global, string of "module.name") to work around such edge-case situations.
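
A hedged sketch of the tuple form described above (assuming numpy >= 2.1, where `__module__` now reports the `_core` path):
```python
import numpy as np
import torch

# Pass (global, "module.name") so the allowlist key matches what the
# checkpoint recorded under the older module path.
with torch.serialization.safe_globals(
    [(np.core.multiarray._reconstruct, "numpy.core.multiarray._reconstruct")]
):
    obj = torch.load("bla", weights_only=True)
```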

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142153
Approved by: https://github.com/albanD
2024-12-06 18:56:39 +00:00
1a7da6e7e9 [export] Add test to enforce consistency between synced thrift and generated thrift from schema.py (#141989)
Summary:
In this diff we implement a way to ensure the internal thrift schema from cfgr (configerator/structs/caffe2/torch/export/schema.thrift) and the schema in OSS (torch/_export/serde/schema.thrift) are in sync, by adding a unittest to reflect on the type names and fields from each schema and compare them field by field.

When we detect new fields/types in torch/_export/serde/schema.thrift, there'll be a test failure on trunk, and the error message prompts people to add the missing field/type to the thrift schema in cfgr, so that they are always in sync in practice.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_thrift_schema_in_sync

Differential Revision: D66716834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141989
Approved by: https://github.com/yiming0416
2024-12-06 18:42:20 +00:00
bab15df40a Revert "[FSDP2] Move to public torch.distributed.fsdp (#141868)"
This reverts commit 45583a5df907a7948693c047e5fe2c8349622069.

Reverted https://github.com/pytorch/pytorch/pull/141868 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/141868#issuecomment-2523925180))
2024-12-06 18:38:12 +00:00
4af7aa5e64 Revert "E2E composability testing (#141398)"
This reverts commit ad93aa854d2d7837c917ae81cfb8f3bf05ee58c9.

Reverted https://github.com/pytorch/pytorch/pull/141398 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/141868, we can try rebase and reland this after ([comment](https://github.com/pytorch/pytorch/pull/141398#issuecomment-2523908998))
2024-12-06 18:28:51 +00:00
683ec42958 Revert "Unbreak dynamic shape minimal arrayref interface tests (#142091)"
This reverts commit 2bfc600644ed59332f9da7b94558b9c4c9562b0d.

Reverted https://github.com/pytorch/pytorch/pull/142091 on behalf of https://github.com/atalman due to Breaks internal changes ([comment](https://github.com/pytorch/pytorch/pull/142091#issuecomment-2523906048))
2024-12-06 18:25:54 +00:00
f2f95ba813 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.

D66838774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-06 17:34:59 +00:00
c0ffeab02f [dynamo] Simplify handling of functools.wraps (#142000)
Previously, when Dynamo encountered a `functools.wraps(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if that check failed.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wraps(...)`; in those cases we don't want to give up
   tracing because of an early `can_reconstruct` check. Instead we can just
   let it fall through, and graph break in the actual `reconstruct` call
   later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.

Fixes #134731, #141514.
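
For context, this is the kind of pattern involved; a hedged sketch (whether a given case still graph-breaks depends on what `reconstruct` actually needs):
```python
import functools
import torch

def my_decorator(fn):
    @functools.wraps(fn)   # Dynamo now traces through this call instead of
    def wrapper(*args):    # bailing out on an early can_reconstruct check
        return fn(*args) + 1
    return wrapper

@torch.compile
def f(x):
    g = my_decorator(torch.sin)  # functools.wraps(...) runs under tracing
    return g(x)

print(f(torch.ones(3)))
```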

D66838708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-06 17:34:59 +00:00
5872a8c6b0 Use task submitter TLS in gloo working threads (#142184)
Fixes: #86830

CC: @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142184
Approved by: https://github.com/albanD
2024-12-06 17:03:17 +00:00
692b5e75ed [logging] Add triton_compile_time_us column to dynamo_compile (#142068)
Test Plan: See internal diff [D66799565](https://www.internalfb.com/diff/D66799565)

Differential Revision: [D66799565](https://our.internmc.facebook.com/intern/diff/D66799565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142068
Approved by: https://github.com/c00w
2024-12-06 16:11:57 +00:00
b64a537993 [CD] xpu nightly manylinux whl with cxx11-abi (#142210)
Follow https://github.com/pytorch/pytorch/issues/123649
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142210
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-12-06 15:10:47 +00:00
34033cce4d Enable concat support through inductor using pointwise kernels (#141966)
Summary: Add the ability to always force a pointwise kernel for concat codegen through Inductor.

Differential Revision: D66669372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141966
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/jansel
2024-12-06 14:28:07 +00:00
661d1f0372 [aotd] non-contiguous NestedTensor mutation in compile (#139630)
Allow mutations for subclasses that are non-contiguous.

Changes:

Removing assert in collect_metadata_analysis

Main requested testcase:
Compilation of NJT.index_put()

Adding test in test_nestedtensor.py, that compiles NJT.index_put()

It is decomposed into NJT split/unbind, which needed additional `torch._check` / `torch._check_is_size` calls for NJT.unbind() and guard_size_oblivious() usage in _meta_registrations and _inductor/lowering.py.

Special case:
If a tangent is mutated outside of the graph, it does not participate in the backward graph. Autograd in this case will set this tangent to a zeros tensor.

We handle it separately in CompiledFunction.backward: we do no processing for this tangent and broadcast it to the number of expected subclass unwrapped arguments.

Disabling 2 tests for dynamo:
1/ For nested tensor: a symbolic shapes issue on the nested_tensor index operation that does splits [0, 0, 0] - there is a failure with "pending unbacked symints". This PR does not add more .tolist()/item() ops than there were before.

2/ As we no longer fail with an exception in collect_metadata_analysis, new paths for dynamo started working, and one of them fails with something strange: the set_/storage_offset handling (because of the test for views) updates storage "cpu" -> "meta".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139630
Approved by: https://github.com/bdhirsh
2024-12-06 12:18:46 +00:00
c683839e6e [AOTI] Clean up temporary files generated by AOTI package loader. (#141773)
Fix #141772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141773
Approved by: https://github.com/desertfire, https://github.com/EikanWang
2024-12-06 11:46:47 +00:00
c8c669ce74 When using a third-party device to test DeviceMesh, the error check for 'test_raises_invalid_device_type' only prints 'GPU' (#142038)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142038
Approved by: https://github.com/kwen2501
2024-12-06 11:14:00 +00:00
cc64ad659d Detect accelerator type when backend is not specified (#142216)
Today, when the user calls `init_process_group()` without specifying `backend` or `device_id`, we auto-translate it into `cuda:nccl,cpu:gloo`. The idea was to initialize all **default** backends to cover whatever the user may do later.

A side effect is increased initialization time and resource usage.

This PR changes it to detect the accelerator type on the machine and initialize only the backend for that accelerator.
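
A hedged sketch of the user-visible effect (single-process illustration; the env vars stand in for a real launcher):
```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# No backend given: only the backend matching the detected accelerator is
# initialized (e.g. just NCCL on a CUDA machine), instead of eagerly
# setting up cuda:nccl,cpu:gloo.
dist.init_process_group(rank=0, world_size=1)
print(dist.get_backend())
dist.destroy_process_group()
```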

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142216
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-12-06 10:55:56 +00:00
cyy
5d3622447d Enable Wtype-limits (#142099)
Since it can detect underflow bugs of unsigned integers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142099
Approved by: https://github.com/ezyang
2024-12-06 08:14:18 +00:00
01d7644dc9 [dynamo] Undo some jvp old workarounds in functorch (#142082)
This basically undoes most of the workarounds introduced in #123118, the
root causes of which have been fixed by #142078.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142082
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080, #142081
2024-12-06 08:06:53 +00:00
9d54cd1504 [dynamo] Undo some jvp old workarounds in functorch (#142081)
This basically undoes some workarounds introduced in #119926, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs an update:
- removing the `_jvp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142081
Approved by: https://github.com/zou3519
ghstack dependencies: #142078, #142080
2024-12-06 08:06:53 +00:00
59de5e867b [dynamo] Undo some vjp old workarounds in functorch (#142080)
This basically undoes most of the workarounds introduced in #119405, the
root causes of which have been fixed by #142078 and other changes in
Dynamo.

Now that Dynamo traces the spec comparison code, the test also needs an update:
1. renaming `o` to `primals_out`
2. removing the `_vjp_treespec_compare` calls in fx graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142080
Approved by: https://github.com/zou3519
ghstack dependencies: #142078
2024-12-06 08:06:53 +00:00
aab0f32ea4 [dynamo] Properly handle != under user-defined __eq__ (#142078)
Previously Dynamo modelled `object.__ne__` as just a comparison over value
identity; however, in CPython the default `!=` dispatches to `__eq__`,
which might've been overridden by the user. This patch fixes the behavior
divergence.
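
A small illustration of the CPython semantics being matched here (plain Python, independent of Dynamo):
```python
class Point:
    def __init__(self, x):
        self.x = x

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x

a, b = Point(1), Point(1)
assert a is not b        # distinct objects...
assert a == b            # ...but the user-defined __eq__ says equal
assert not (a != b)      # default != dispatches to __eq__ and negates it
```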

Fixes #142055.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142078
Approved by: https://github.com/jansel, https://github.com/zou3519
2024-12-06 08:06:53 +00:00
c5cfc6a4c9 [pipelining] forward fix for _validate_schedule (#142211)
https://github.com/pytorch/pytorch/pull/142009 broke CSV loading since it can no longer handle schedules with `I` and `W`. This was caught in the torchtitan tests which loads a custom CSV file using `I` and `W` https://github.com/pytorch/torchtitan/actions/runs/12188167461/job/34000683921?pr=689.

Follow up would be to test a real custom schedule in PyTorch rather than torchtitan. The custom schedule in titan is here:  https://github.com/pytorch/torchtitan/blob/main/test/assets/custom_schedule.csv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142211
Approved by: https://github.com/mori360
ghstack dependencies: #142009
2024-12-06 08:04:31 +00:00
eqy
8fc6d3a5d8 [SDPA] Allow user-specified priority order with context manager (#140467)
TODO: docs changes?
For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298

Better testing, current sketch:

``` Python
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda')
print(torch._C._get_sdp_priority_order())

orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]]
import time
times = list()
for order in orders:
    print(order)
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    times.append(t1 - t0)
print(times)
assert times[0] < times[1]
assert times[0] > times[2]
assert times[1] > times[2]
print(torch._C._get_sdp_priority_order())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467
Approved by: https://github.com/drisspg
2024-12-06 07:56:35 +00:00
e7de245ee1 Revert "[reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)"
This reverts commit 8bfc0094e468b0abefe087d671903a1ca738edf0.

Reverted https://github.com/pytorch/pytorch/pull/141085 on behalf of https://github.com/williamwen42 due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/141085#issuecomment-2522403360))
2024-12-06 07:50:10 +00:00
4a6c056466 Revert "[3/N] Avoid copy in std::get (#141843)"
This reverts commit 671e9c7aba2dc72b65391aa4bba1b9e079c2f1b2.

Reverted https://github.com/pytorch/pytorch/pull/141843 on behalf of https://github.com/huydhn due to Sorry fo reverting your change but a bunch of CUDA builds are failing in trunk after this lands due to OOM ([comment](https://github.com/pytorch/pytorch/pull/141843#issuecomment-2522335911))
2024-12-06 07:32:07 +00:00
8bdcdae733 [DTensor] Support matmul in inference_mode (#142197)
Fixes #142190 .

The solution is to add a `decompose_handler` for `aten.matmul`, similar to how we handle `aten.linear`.
With the decomposition, `aten.matmul` becomes `aten.mm` which has sharding strategy registered with DTensor.
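
A hedged repro-style sketch of the case this enables (run with `torchrun --nproc_per_node=2`; shapes and placements are illustrative assumptions, not from the PR):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

mesh = init_device_mesh("cuda", (2,))
a = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0)])
b = distribute_tensor(torch.randn(16, 4), mesh, [Replicate()])
with torch.inference_mode():
    out = torch.matmul(a, b)  # now decomposes to aten.mm and uses its sharding strategy
```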

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142197
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-12-06 07:15:05 +00:00
02c509669a Aoti minifier flatten (#141156)
Flatten the inputs to minifier so AOTI Minifier can handle unflattened inputs and kwargs.

- flatten the inputs in minifier
- changed the "load_and_run" part of the minifier verification to run on the flattened inputs.
- refactored code to keep `torch._inductor.__init__.py` clean
- update doc

`python test/inductor/test_minifier.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141156
Approved by: https://github.com/desertfire
2024-12-06 07:12:45 +00:00
23e2f8ab3a [Inductor] add flag for linear binary folding and turn it off by default (#142108)
Fix https://github.com/pytorch/pytorch/issues/141755.

Summary:
Linear binary folding causes an accuracy regression in a timm model (levit_128), so this PR adds the flag `enable_linear_binary_folding` for linear binary folding and turns it off by default.
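
A short hedged sketch of opting back in; the flag name comes from this PR, while the config module location (`torch._inductor.config`) is an assumption:

```python
import torch._inductor.config as inductor_config

# Re-enable linear binary folding (off by default after this PR); assumes the
# flag lives in torch._inductor.config like other Inductor toggles.
inductor_config.enable_linear_binary_folding = True
```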

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142108
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 07:12:29 +00:00
67ba79676f [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [7/N] (#140922)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140922
Approved by: https://github.com/williamwen42
2024-12-06 07:07:29 +00:00
52b7f0ba12 [DTensor] fix stride of fake tensor produced by shard_dim_alltoall (#141835)
currently, DTensor redistributions involving all2all `Shard(n)->Shard(m)` will generate faulty inductor code when compiled:
```python
# torchrun --nproc_per_node=2 crash.py
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, DTensor

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
dt = DTensor.from_local(torch.randn(2, 4, device='cuda'), mesh, [Shard(0)]).requires_grad_()
def f(dt): return dt.redistribute(placements=[Shard(1)]).to_local()
f(dt).sum().backward() # no crash
f = torch.compile(f)
f(dt).sum().backward() # crash
```
resulting:
```python
[rank1]: Traceback (most recent call last):
[rank1]:   File "/crash.py", line 11, in <module>
[rank1]:     f(dt).sum().backward() # crash
[rank1]:     ^^^^^
...
[rank1]:   File "/tmp/torchinductor_main/gu/cgurkeb7tzx7kfsnooolsjefrgoizzylrldrugc52n4avmgiccas.py", line 41, in call
[rank1]:     assert_size_stride(buf0, (4, 2), (4, 1))
[rank1]: AssertionError: expected size 4==4, stride 2==4 at dim=0
```

This happens because the current [`register_fake` implementation for `shard_dim_alltoall` ops](5deca07c0d/torch/distributed/tensor/_collective_utils.py (L32)) returns an erroneous stride:

```python
import torch
import torch.distributed as dist
from torch._C._distributed_c10d import _register_process_group
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor._collective_utils import _shard_dim_alltoall_meta, _get_group_size_by_name

mesh = init_device_mesh('cuda', (2,), mesh_dim_names=('ep',))
_register_process_group('ep', mesh['ep'].get_group())
x = torch.randn(2, 4, device='meta')
y = _shard_dim_alltoall_meta(x, 0, 1, 'ep')
if dist.get_rank() == 0:
    print(x.shape, x.stride()) # torch.Size([2, 4]) (4, 1)
    print(y.shape, y.stride()) # torch.Size([4, 2]) (4, 1)
```

---

The proposed fix in this pull request makes the provided example code compile correctly and stop erroring. However, I know very little about torch internals and expect there may be something wrong with this patch. Any corrections are appreciated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141835
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-12-06 06:56:03 +00:00
35752cb1ba [Inductor] Constrain the shape of other tensor for Conv/Linear + broadcast add fusion. (#141759)
Fix https://github.com/pytorch/pytorch/issues/141671.

Summary:
The performance regression of these two timm_models is caused by the Conv/Linear + broadcast add fusion running into the oneDNN reference path. This PR constrains the shape of the other tensor for Conv/Linear + broadcast add fusion to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141759
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 06:20:41 +00:00
b8eb4b56d8 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-06 04:54:42 +00:00
20f24e3fbd [inductor][cpp] Add BMM kernel template for autotuning (#129772)
This PR adds the Cpp template for BMM, for FP32, FP16, and BF16. See #125683 for more background.

1.  Adds `CppBmmTemplate` class which inherits from `CppPackedGemmTemplate`. Given a number of worker threads `num_threads` and batch size `B`, execute the Gemm kernel. For the first `B - (B % num_threads)` batch inputs, run one sub-gemm problem per thread. Then for the remaining `B % num_threads` sub-gemms, we execute each subproblem using the parallelized Gemm kernel.
To manage this code, the `GEMM_TEMPLATE` from `CppPackedGemmTemplate` is rendered two different times, one with a single thread and one which includes the parallel OMP pragma.
2. Adapts `CppPackedGemmTemplate` to allow for child class. The `GEMM_TEMPLATE` is separated into different strings to allow for rendering by the child class. Slicing/indexing are adapted to allow for 3D BMM inputs. Additional methods `get_options()` and `_get_params_for_choices()` are added to reduce code duplication.

BMM within the `dlrm` benchmark has a single input buffer which is used for both the X and W inputs. This case is currently not supported in this PR.

### Performance
On Granite/Sapphire Rapids, the cpp_bmm template code uses AMX, which requires an expensive transpose operation, so the BMM op is rarely selected as faster than the existing external bmm kernel. As a result, the speedup on SPR is identical with and without the BMM code. The pass rate matches the rates for main exactly.

#### Test Summary on Granite Rapids
Test Scenario | Comp Item | Config | Compiler | torchbench | huggingface | timm_models
-- | -- | -- | -- | -- | -- | --
Single Socket Multi-Threads | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | Geomean Speedup | gemm autotune | inductor | 2.15x | 1.91x | 2.52x
 | | bmm + gemm autotune | inductor | 2.15x | 1.96x | 2.53x
Single Core Single-Thread | Pass Rate | gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | | bmm + gemm autotune | inductor | 91%, 73/80 | 100%, 46/46 | 100%, 61/61
 | Geomean Speedup | inductor_locally_benchmark_586 | inductor | 2.43x | 1.56x | 2.60x
 | | inductor_locally_benchmark_585 | inductor | 2.45x | 1.56x | 2.63x

This is not the case on an older Skylake Xeon machine.
For the BMM ops contained in torchbench models, bmm performance improves by 1.10-2.64x.

#### BF16 28-core Skylake Xeon
| Model | Inductor | GemmAutotune | Gemm+BMM Autotune |
|--------|--------|--------|--------|
| BERT_pytorch | 1.233x | 2.597x | 2.608x |
| hf_DistilBert | 1.128x | 2.242x | 2.368x |
| hf_Reformer | 1.124x | 1.419x | 1.590x |
| hf_T5_base | 1.012x | 1.257x | 1.382x |
| hf_T5_large | 1.085x | 2.228x | 2.345x |

## Example BMM Code
```cpp
#include <c10/util/Unroll.h>
#include <torch/csrc/inductor/aoti_torch/c/shim.h>

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_32_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 32 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_loadd(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
        _tile_zero(2);
        _tile_zero(3);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(4, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(6, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 4, 6);
        _tile_loadd(7, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 4, 7);
        _tile_stream_loadd(5, A + 16 * lda + k, lda * sizeof(bfloat16));
        _tile_dpbf16ps(2, 5, 6);
        _tile_dpbf16ps(3, 5, 7);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
        _tile_stored(2, C + 16 * ldc + 0, ldc * sizeof(float));
        _tile_stored(3, C + 16 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 32 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm_amx_kernel_16_2(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc,
    uint8_t tilecfg_rows
) {
    // TODO(jgong5): add prefetch hint for A, B, C
    auto loadconfig = [](const amx_tilecfg& cfg) {
        _tile_loadconfig(&cfg);
    };
    const auto last_k_offset = K / 32 * 32;
    const auto tail_k_size = K - last_k_offset;
    if C10_LIKELY (last_k_offset > 0) {
        amx_state.configure(tilecfg_rows, 64, 16 / 16, 2, loadconfig);
    } else {
        amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
    }
    auto load_c = [&]() {
        _tile_loadd(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_loadd(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };
    auto zero_c = [&]() {
        _tile_zero(0);
        _tile_zero(1);
    };

    if constexpr (accum) {
        load_c();
    } else {
        zero_c();
    }

    auto compute = [&](int k) {
        _tile_stream_loadd(2, A + 0 * lda + k, lda * sizeof(bfloat16));
        _tile_loadd(3, B + k * ldb + 0, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(0, 2, 3);
        _tile_loadd(4, B + k * ldb + 32, ldb * 2 * sizeof(bfloat16));
        _tile_dpbf16ps(1, 2, 4);
    };

    #pragma GCC unroll 4
    for (int k = 0; k < last_k_offset; k += 32) {
        compute(k);
    }

    auto store_c = [&]() {
    // store to C
        _tile_stored(0, C + 0 * ldc + 0, ldc * sizeof(float));
        _tile_stored(1, C + 0 * ldc + 16, ldc * sizeof(float));
    };

    // TODO(jgong5): move tail k computation to separate loopnest to save tile configuration overhead
    if C10_UNLIKELY (tail_k_size > 0) {
        if C10_LIKELY (last_k_offset > 0) {
            store_c();
            amx_state.configure(tilecfg_rows, tail_k_size * sizeof(bfloat16), 16 / 16, 2, loadconfig);
            load_c();
        }
        compute(last_k_offset);
    }

    store_c();
}

template <bool accum>
inline void cpp_bmm_micro_gemm(
    AMXState& amx_state,
    const bfloat16* __restrict__ A,
    const bfloat16* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    AOTI_TORCH_CHECK(N % 32 == 0, "N dimension must be multiple of 32");
    AOTI_TORCH_CHECK(K % 2 == 0, "K dimension must be multiple of 2");
    // TODO(jgong5): loop unroll for M and N
    for (int64_t n = 0; n < N; n += 32) {
        for (int64_t m = 0; m < M; m += 32) {
            int64_t block_m = std::min<int64_t>(M - m, 32);
            int64_t m_tail = m;
            if (block_m >= 32) {
                cpp_bmm_micro_gemm_amx_kernel_32_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 32;
                m_tail += 32;
            }
            else
            if (block_m >= 16) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m * lda,
                    B + n,
                    C + m * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    16
                );
                block_m -= 16;
                m_tail += 16;
            }
            if (block_m > 0) {
                cpp_bmm_micro_gemm_amx_kernel_16_2<accum>(
                    amx_state,
                    A + m_tail * lda,
                    B + n,
                    C + m_tail * ldc + n,
                    K,
                    lda,
                    ldb,
                    ldc,
                    block_m
                );
            }
        }
    }
}
void threaded_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 48;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 1;
    constexpr int64_t Nt_blocks = 1;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 1;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 48 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    #pragma omp parallel num_threads(48)
    {
        const int tid = omp_get_thread_num();
        const int64_t k_group_id = tid / num_Kt_blocks;
        const int64_t k_slice_id = tid % num_Kt_blocks;
        const int64_t n_group_id = k_group_id / num_Nt_blocks;
        const int64_t n_slice_id = k_group_id % num_Nt_blocks;
        const int64_t k_block_start = k_slice_id * Kt_blocks;
        const int64_t k_block_end = std::min(k_block_start + Kt_blocks, Kr_blocks);
        const int64_t n_block_start = n_slice_id * Nt_blocks;
        const int64_t n_block_end = std::min(n_block_start + Nt_blocks, Nr_blocks);
        const int64_t m_block_start = std::min(n_group_id * Mt_blocks, Mr_blocks);
        const int64_t m_block_end = std::min(m_block_start + Mt_blocks, Mr_blocks);
        const int64_t num_Mc_blocks_per_thread = (m_block_end - m_block_start + Mc_blocks - 1) / Mc_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
void single_thread_mm(const bfloat16* X, const bfloat16* W, bfloat16* Y, const int64_t ks_b_index)
{

    constexpr int64_t num_threads = 1;
    constexpr int64_t N = 64;
    constexpr int64_t K = 96;
    constexpr int64_t Mr = 32;
    constexpr int64_t Nr = 32;
    constexpr int64_t Kr = 32;
    constexpr int64_t Nr_blocks = (N + Nr - 1) / Nr;
    constexpr int64_t Kr_blocks = (K + Kr - 1) / Kr;
    constexpr int64_t M = static_cast<int64_t>(384L);
    constexpr int64_t Mr_blocks = (M + Mr - 1) / Mr;
    constexpr int64_t Mt_blocks = 12;
    constexpr int64_t Nt_blocks = 2;
    constexpr int64_t Kt_blocks = 3;
    constexpr int64_t Mc_blocks = 12;
    constexpr int64_t Nc_blocks = 1;
    constexpr int64_t Kc_blocks = 3;
    constexpr int64_t num_Mc_blocks = (Mr_blocks + Mc_blocks - 1) / Mc_blocks;
    constexpr int64_t num_Nc_blocks = (Nr_blocks + Nc_blocks - 1) / Nc_blocks;
    constexpr int64_t num_Mt_blocks = (Mr_blocks + Mt_blocks - 1) / Mt_blocks;
    constexpr int64_t num_Nt_blocks = (Nr_blocks + Nt_blocks - 1) / Nt_blocks;
    constexpr int64_t num_Kt_blocks = (Kr_blocks + Kt_blocks - 1) / Kt_blocks;

    // make sure all partitions are assigned
    AOTI_TORCH_CHECK(
        Mt_blocks * Nt_blocks * Kt_blocks * 1 >= Mr_blocks * Nr_blocks * Kr_blocks,
        "Not all partitions are assigned."
    );
    {
        constexpr int tid = 0;
        constexpr int64_t k_group_id = 0;
        constexpr int64_t k_slice_id = 0;
        constexpr int64_t n_group_id = 0;
        constexpr int64_t n_slice_id = 0;
        constexpr int64_t m_block_start = 0;
        constexpr int64_t n_block_start = 0;
        constexpr int64_t n_block_end = Nr_blocks;
        constexpr int64_t k_block_start = 0;
        constexpr int64_t k_block_end = Kr_blocks;
        constexpr int64_t num_Mc_blocks_per_thread = num_Mc_blocks;
        constexpr int64_t m_block_end = Mr_blocks;
        AMXState amx_state;
        auto _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); auto local_acc_buf = _local_acc_buf.get();
        for (int64_t mc_block_id = 0; mc_block_id < num_Mc_blocks_per_thread; mc_block_id++) {
            const int64_t my_mc_block_id = (mc_block_id + n_slice_id) % num_Mc_blocks_per_thread;
            const int64_t mc = m_block_start + my_mc_block_id * Mc_blocks;
            const int64_t m_start = mc * Mr;
            const int64_t m_end = std::min(std::min(mc + Mc_blocks, m_block_end) * Mr, M);
            const int64_t m_size = m_end - m_start;
            for (int64_t nc = n_block_start; nc < n_block_end; nc += Nc_blocks) {
                const int64_t n_start = nc * Nr;
                const int64_t n_end = std::min(std::min(nc + Nc_blocks, n_block_end) * Nr, N);
                const int64_t n_size = n_end - n_start;
                // NB: assume we pad N, nc_block_end won't exceed padded N here.
                const int64_t nc_block_end = std::min(nc + Nc_blocks, n_block_end);
                if (_local_acc_buf == nullptr) { _local_acc_buf = std::make_unique<float[]>(static_cast<int64_t>(Mc_blocks*Mr*Nc_blocks*Nr)); local_acc_buf = _local_acc_buf.get(); }
                for (int64_t kc = k_block_start; kc < k_block_end; kc += Kc_blocks) {
                    int64_t k_start = kc * Kr;
                    int64_t k_end = std::min(std::min(kc + Kc_blocks, k_block_end) * Kr, K);
                    for (int64_t nci = nc; nci < nc_block_end; nci++) {
                        if (kc == k_block_start) {
                            cpp_bmm_micro_gemm<static_cast<bool>(false)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        } else {
                            cpp_bmm_micro_gemm<static_cast<bool>(true)>(
                                amx_state,
                                &(X[static_cast<int64_t>(k_start + (96L*m_start) + (36864L*ks_b_index))]),
                                &(W[static_cast<int64_t>((32L*k_start) + (3072L*nci) + (6144L*ks_b_index))]),
                                &(local_acc_buf[static_cast<int64_t>((Nr*nci) + ((-1L)*Nr*nc))]),
                                static_cast<int64_t>(m_end + ((-1L)*m_start)),
                                static_cast<int64_t>(Nr),
                                static_cast<int64_t>(k_end + ((-1L)*k_start)),
                                static_cast<int64_t>(96L),
                                static_cast<int64_t>(32L),
                                static_cast<int64_t>(Nc_blocks*Nr)
                            );

                        }
                    }
                }
                {
                    {
                        #pragma GCC ivdep
                        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(m_end + ((-1L)*m_start)); x0+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(16));
                            }
                            for(int64_t x1=static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))); x1<static_cast<int64_t>(n_end + ((-1L)*n_start)); x1+=(static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))) == 0 ? 1 : static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L)))))))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<int64_t>(x1 + (Nc_blocks*Nr*x0)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                                auto tmp1 = at::vec::convert<bfloat16>(tmp0);
                                tmp1.store(Y + static_cast<int64_t>(n_start + x1 + (64L*m_start) + (64L*x0) + (24576L*ks_b_index)), static_cast<int64_t>(n_end + ((-1L)*n_start) + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>((n_end + ((-1L)*n_start))), static_cast<int64_t>(16L))))));
                            }
                        }
                    }

                }
            }
        }
        amx_state.release([]() { _tile_release(); });
    }
}
extern "C"
void cpp_bmm(const bfloat16* X, const bfloat16* W, bfloat16* Y)
{
    const int64_t B = static_cast<int64_t>(5L);
    constexpr int64_t num_threads = 48;
    int64_t B_single_thread_block = (B / num_threads) * num_threads;

    #pragma omp parallel for num_threads(48)
    for (int64_t b_start = 0; b_start < B_single_thread_block; ++b_start) {
        single_thread_mm(X, W, Y, b_start);
    }
    for (int64_t b_start = B_single_thread_block; b_start < B; ++b_start) {
        threaded_mm(X, W, Y, b_start);
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129772
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-12-06 04:54:00 +00:00
39425feac7 Filter pattern matching tests based on ACL (#141921)
There are a number of cases where pattern matching differs based on the presence of ACL, causing the tests to fail. This adds `TEST_ACL` and `skipIfACL` so that these tests can still run with different values or be entirely skipped if necessary.
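
A hedged sketch of how a test might use the new helpers; only the `TEST_ACL` and `skipIfACL` names come from this PR, while the import path and the bare-decorator form are assumptions:

```python
# Assumed import location; the PR does not state where the helpers live.
from torch.testing._internal.common_utils import TEST_ACL, TestCase, skipIfACL

class TestPatternMatcher(TestCase):
    @skipIfACL
    def test_mkldnn_fusion(self):
        ...  # skipped entirely when oneDNN is backed by ACL

    def test_conv_binary_fold(self):
        expected_matches = 1 if TEST_ACL else 2  # or adjust the expected count
        ...
```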

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141921
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-06 04:19:41 +00:00
cyy
671e9c7aba [3/N] Avoid copy in std::get (#141843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141843
Approved by: https://github.com/Skylion007
2024-12-06 04:15:31 +00:00
cyy
4bc8de334f Remove __ubsan_ignore_undefined__ in some cases (#142120)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142120
Approved by: https://github.com/ezyang
2024-12-06 04:13:57 +00:00
646024e823 Add convnext_base to higher tolerance (#142159)
See https://github.com/pytorch/pytorch/issues/141498 https://github.com/pytorch/pytorch/issues/141703

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142159
Approved by: https://github.com/bertmaher, https://github.com/huydhn
2024-12-06 04:00:13 +00:00
80ca6dd892 [Inductor] Expand dtype aware codegen for libdevice and tl.math ops (#140864)
# Feature
Previously, only the codegen for `torch.sqrt` was dtype aware. This PR updates most of the `libdevice`/`tl.math` ops to support dtype-aware codegen as well. This is often necessary to get correct code when `config.triton.codegen_upcast_to_fp32=False`, as most Triton math ops do not support float16/bfloat16.

This PR enables dtype aware codegen via the `maybe_upcast_float32` decorator. This wraps `TritonOverrides` macros to upcast arguments to float32, and downcast the result back to the original dtype. The exception is for ops that return booleans, in which case we set `convert_output=False` and skip the output cast.
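
A conceptual sketch (plain PyTorch, not the generated Triton) of the upcast-compute-downcast pattern that `maybe_upcast_float32` makes the codegen emit when global upcasting is disabled:

```python
import torch

def fp16_safe_log(x: torch.Tensor) -> torch.Tensor:
    y = torch.log(x.to(torch.float32))  # run the math op in float32
    return y.to(x.dtype)                # cast the result back to the original dtype

out = fp16_safe_log(torch.rand(4, dtype=torch.float16))
```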

# Test Plan
Added CI tests for all the new ops. The list of ops to test is automatically generated based on uses of the `maybe_upcast_float32` decorator, and stored in the new `OpDtypeSupport` class. In each new test, we search the generated code for upcasts/downcasts using a regex.

Also added a unit test for `OpDtypeSupport` which checks that we have correct dtype info for ops that require upcasts.

This PR also moves some existing tests around, to collect all the dtype aware codegen tests in one file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140864
Approved by: https://github.com/eellison, https://github.com/arui-meta

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-12-06 03:15:20 +00:00
0602676c8d [CUTLASS][AOTI] Fixes undefined symbol: cudaLaunchKernelExC (#142094)
Summary:
### Context
* When compiling the object file for a CUTLASS kernel, CUDA RT symbols are left undefined.
* When compiling the final shared object file, we statically link with `libcudart_static.a`.
* One important thing is that ordering matters when specifying the lib search paths (-L).

Test Plan:
```
// before diff
RuntimeError: Failure loading .so: /tmp/tmpqhz_dnza/model.so: undefined symbol: cudaLaunchKernelExC
```

Differential Revision: D66793974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142094
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-12-06 02:18:54 +00:00
8bfc0094e4 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine, and also the right thing to do, because it will catch the case where the user changes the flag in the same process while compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-06 01:49:55 +00:00
ce22a01e11 Add an option for classic search (#142018)
Fixes https://github.com/pytorch/tutorials/issues/3143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142018
Approved by: https://github.com/albanD
2024-12-06 01:24:52 +00:00
e803a3d83a Fix reductions for NJTs with ragged_idx != 1 (#142173)
**Background:** conversion from outer dim -> inner dim makes the (previously valid) assumption that the ragged dim is immediately next to the batch dim. This is no longer the case after #137125.

This PR:
* Updates the outer dim -> inner dim conversion logic to match the actual ragged_idx. Since ragged_idx tells us where the packed ragged / batch dim is, both ragged and batch outer dims should map to this inner dim. The conversion logic must now take in `ragged_idx` to make this possible, so the PR updates all call-sites to pass this.
* Fixes outputs across keepdim settings when reducing over ragged / batch dims.
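
A hedged sketch of the setup this fixes (shapes are illustrative; the transpose is one way to obtain `ragged_idx != 1`):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)                           # shape (B, j1, 8), ragged_idx == 1
nt_t = nt.transpose(1, 2)   # shape (B, 8, j1), ragged dim no longer next to batch
out = nt_t.sum(dim=-1)      # reduction over the ragged dim, fixed by this PR
```
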
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142173
Approved by: https://github.com/drisspg
2024-12-06 01:23:17 +00:00
6b0df2f720 [torch.func] expand stack_module_state's typing (#142169)
Summary:
https://github.com/pytorch/pytorch/pull/141894 made this API actually
typed w.r.t. pyre, which is causing some internal type failures. This PR
expands the typing for stack_module_state to squash those failures.

Test Plan:
- pyre
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142169
Approved by: https://github.com/albanD
2024-12-06 01:08:53 +00:00
93214aad30 [ROCM] Fix unit test: matmul_small_brute_force_tunableop (#142089)
Fixes #141636
Fixes #141635
Fixes #141458

Changes include:

- TunableOp filename that wasn't set properly
- Activate numerical check (see additional test comment)
- Entire test in try-finally clause to avoid OS environment variable leakage (see additional test comment)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142089
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-06 00:58:10 +00:00
ad93aa854d E2E composability testing (#141398)
Add a 3D (pp+tp+fsdp) test `test_3d_with_tp_dp_pp` in test_pp_composability.
Currently provides @parametrize on:
"ScheduleClass" for pp in [ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, ScheduleLoopedBFS, ScheduleInterleavedZeroBubble]
"MixedPrecisionParam" for fsdp in [torch.bfloat16, torch.float32]

Future work:
1. add fp8
2. add cp(context parallelism) to enable 4D test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141398
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-12-06 00:53:22 +00:00
461bd2c2f7 Update nested tensor warning to recommend layout=torch.jagged (#142140)
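
A hedged one-liner showing the construction the warning now recommends:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)
```
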
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142140
Approved by: https://github.com/YuqingJ
2024-12-06 00:40:30 +00:00
90052a8ae2 Revert "[ROCm] unskip hermite_polynomial_h unit tests (#141150)"
This reverts commit 69f8b3e269641fae93ed7afba49b6df8e44ed3c9.

Reverted https://github.com/pytorch/pytorch/pull/141150 on behalf of https://github.com/jeffdaily due to this PR is tied to #141955 and that one was reverted so need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/141150#issuecomment-2521830067))
2024-12-06 00:39:56 +00:00
efab8c433f [subclass] Fix unwrap_subclass_parameters parametrization (#142155)
Parametrization cannot be registered for non-direct child parameters of the module.
We have to iterate through all submodules and register parametrization at every level.

The original test case did not cover nested modules, so a submodule is added to the test.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142155
Approved by: https://github.com/bdhirsh
2024-12-05 23:53:36 +00:00
2bfc600644 Unbreak dynamic shape minimal arrayref interface tests (#142091)
Simple bug got introduced somewhere.

Differential Revision: [D66792420](https://our.internmc.facebook.com/intern/diff/D66792420/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142091
Approved by: https://github.com/desertfire, https://github.com/hl475
2024-12-05 23:26:35 +00:00
ca7be75e0a Reduce the nproc when building FA on 8.9 (#142164)
The newly introduced sm89 build is failing consistently in trunk now because of OOM https://github.com/pytorch/pytorch/actions/runs/12186328178/job/33994606556. I suspect that FlashAttention is the cause.

### Testing

CI to see if the build works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142164
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-05 23:23:40 +00:00
4981bd8355 Make cache keys consistent between OSS and internal (#142147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142147
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-05 22:29:07 +00:00
a9e3281e94 [rfc][be] static assert that nccl version is >= 2.4 (#142023)
Summary:
Statically assert that the NCCL version is at least 2.4.
This is in preparation for enabling error checking by default in the PyTorch library and removing some macros.
This is in PR #141914.
The rationale behind this version is:
1. 2.4 was released ~2 years ago, so it's unlikely that someone is still using an older library.
2. Enabling error checking is beneficial to the community as it helps debug subtle bugs in production environments.

Test Plan: unit tests

Differential Revision: D66737055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142023
Approved by: https://github.com/kwen2501
2024-12-05 22:11:14 +00:00
5513e2ec35 [SymmetricMemory] use the python version of empty() and rendezvous() for tests and library ops (#142154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142154
Approved by: https://github.com/weifengpy
2024-12-05 22:09:36 +00:00
16ea0ddcdb Ignore logger methods to avoid graph breaks (#139403)
Fixes #132635

Calls to logging.logger cause a graph break; this PR allows the user to avoid these graph breaks (for specific methods) by setting DISABLE_LOGS_WHILE_COMPILING to 1.
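
A hedged sketch based on the description above; whether the variable must be set before importing torch is an assumption:

```python
import os
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"  # opt-in flag from this PR

import logging
import torch

log = logging.getLogger(__name__)

@torch.compile(fullgraph=True)  # the log call below would previously force a graph break
def f(x):
    log.info("running with shape %s", x.shape)
    return x * 2

f(torch.ones(3))
```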

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139403
Approved by: https://github.com/williamwen42
2024-12-05 20:12:26 +00:00
41952c1876 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 38e0f72274cfad88e0f2ca40f27c79cd49413f5e.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/malfet due to This broke sm89 builds ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2521290457))
2024-12-05 20:07:29 +00:00
39482907be [AOTI] Refactor codegen_inputs in wrapper codegen (#141965)
Summary: Fork codegen_inputs for CppWrapperCodegen, because the behavior between python and cpp needs to diverge. On the python side, input backed symbols need to be generated for the autotune block. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718225](https://our.internmc.facebook.com/intern/diff/D66718225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141965
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387, #141979
2024-12-05 19:49:34 +00:00
2fd8a7be71 [AOTI] Refactor additional_files generation (#141979)
Summary: https://github.com/pytorch/pytorch/pull/140675 adds logic to collect all the generated cubin file paths into an additional_files list, but the collection should only happen when DeferredGpuKernelLine is materialized. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66718227](https://our.internmc.facebook.com/intern/diff/D66718227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141979
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388, #141387
2024-12-05 19:49:02 +00:00
7e49da6077 DLPack: add support to PyTorch/XLA (#138470)
Taking over: #128176.

In summary, this PR:

- `__dlpack__`: Calls PyTorch/XLA `to_dlpack` function, if the tensor lives in an XLA:CUDA device
- `__dlpack_device__`: Correctly maps PyTorch/XLA tensors to `kDLGPU`, if XLA:CUDA is being used

The tests are introduced in https://github.com/pytorch/xla/pull/7213.
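
A hedged sketch of the interop path described above; it requires torch_xla built with XLA:CUDA support, and the helper names here are assumptions:

```python
import torch
import torch_xla.core.xla_model as xm  # assumed torch_xla entry point

t = torch.randn(4, device=xm.xla_device())  # XLA:CUDA tensor
print(t.__dlpack_device__())                # maps to kDLGPU with this PR
cuda_t = torch.from_dlpack(t)               # delegates to torch_xla's to_dlpack
```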
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138470
Approved by: https://github.com/albanD

Co-authored-by: iefgnoix <isaacwxf23@gmail.com>
2024-12-05 19:36:36 +00:00
5f28c42746 [AOIT] Remove several overloaded members from WrapperCodegen (#141387)
Summary: Remove several overloaded string members from WrapperCodegen classes, including open_bracket, closed_bracket, size, stride. Instead of relying on polymorphism, we explicitly generate different strings for PythonWrapperCodegen and CppWrapperCodegen. This is to prepare for one-pass AOTI CUDA codegen.

Differential Revision: [D66459991](https://our.internmc.facebook.com/intern/diff/D66459991)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141387
Approved by: https://github.com/chenyang78
ghstack dependencies: #141388
2024-12-05 19:29:38 +00:00
4cc0fc2707 [AOTI] Remove WrapperCodegen.expr_printer (#141388)
Summary: Avoid using expr_printer as an overriden class member for WrapperCodegen. Instead, use pexpr and cexpr explicitly for python and cpp expression print respectively. This is to prepare for one-pass AOTI CUDA codegen, where PythonWrapperCodegen is used to generate the autotune block and CppWrapperCodegen is used to generate the model code.

Differential Revision: [D66459992](https://our.internmc.facebook.com/intern/diff/D66459992)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141388
Approved by: https://github.com/chenyang78
2024-12-05 19:20:39 +00:00
12b8c2fd8b Remove lintrunner windows exclusion (#142150)
As it's available right now, see https://pypi.org/project/lintrunner/0.12.7/#files
And no longer ask developers to install rust on the platform

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142150
Approved by: https://github.com/wdvr
2024-12-05 19:02:21 +00:00
a9d84875a9 Fix mha torch._check in jit tracing (#142059)
Test Plan: `buck2 run @//mode/dev-nosan //mobile-vision/d2go/projects_oss/detr:tests -- -r test_detr_fbnet_export`

Differential Revision: D66769339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142059
Approved by: https://github.com/ezyang
2024-12-05 18:38:17 +00:00
540dc0c114 [aoti] Prototype loading from bytes (#142070)
The loader needs an official solution -- I'm pretty sure miniz can do this out of the box, but I haven't had time to look at it yet. For now it just loads the buffer into a file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142070
Approved by: https://github.com/henrylhtsang
2024-12-05 18:38:02 +00:00
a5ec09d0cd Flip specialize_float to default False in fbcode (#142111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142111
Approved by: https://github.com/ezyang
2024-12-05 18:23:47 +00:00
7ff42f7f04 Use the correct CSV filenames for MPS benchmark (#142034)
After https://github.com/pytorch/pytorch/pull/135386 and https://github.com/pytorch/pytorch/pull/141999, MPS benchmark has been running for a while and the data has been uploaded correctly.  However, the dashboard is still using the old schema that requires the output CSV files to be named in a certain way for its query to work https://github.com/pytorch/test-infra/blob/main/torchci/clickhouse_queries/compilers_benchmark_performance/query.sql#L32-L40.  Specifically, the filename needs to be in the following format `inductor_${backend}_${suite}_${dtype}_${mode}_${device}_${target}.csv`.

The new schema gets away with all this hacky setup, but the dashboard hasn't been migrated to the new schema yet. So, this is a quick way to just get the data to show up first.

### Testing

https://github.com/pytorch/pytorch/actions/runs/12153886764
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142034
Approved by: https://github.com/skotapati, https://github.com/malfet
2024-12-05 17:53:58 +00:00
cb70d9fd05 Revert "[ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)"
This reverts commit 51b7528e274d350c1d5091acc40572d6b43879b8.

Reverted https://github.com/pytorch/pytorch/pull/141955 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/141955#issuecomment-2521024701))
2024-12-05 17:39:32 +00:00
ae9cda0221 Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Initial `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting
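
A hedged sketch of a program whose serialization exercises true division over a symbolic size; the module and shapes are illustrative, not taken from the PR:

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.sum() / x.shape[0]  # truediv involving a symbolic int

batch = torch.export.Dim("batch")
ep = torch.export.export(
    M(), (torch.randn(4, 3),), dynamic_shapes={"x": {0: batch}}
)
torch.export.save(ep, "m.pt2")  # round-trips through the serializer
```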

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-12-05 17:33:33 +00:00
07edb2ec4d Update documentation for torch.mean() to note behavior with empty tensors (#142039)
This PR updates the documentation for `torch.mean()` to explicitly mention that computing the mean over an empty tensor returns `nan`. This clarification helps users understand the behavior and handle it appropriately in their code.
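
The documented behavior, shown concretely:

```python
import torch

print(torch.empty(0).mean())         # tensor(nan)
print(torch.ones(2, 0).mean(dim=1))  # tensor([nan, nan])
```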

Fixes #141057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142039
Approved by: https://github.com/albanD
2024-12-05 17:21:53 +00:00
5bc09ac5e9 Remove option for fork-based compile pool (#142001)
Summary: This has been set to "subproc" for a while internally and externally, so we can remove and simplify some of the code. Note that there's no pressing need here -- just that since we've had internal outage with the legacy "fork" implementation, it doesn't seem helpful to leave it available. But if people aren't in the mood for this sort of cleanup, I won't be offended to abandon it.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142001
Approved by: https://github.com/eellison, https://github.com/jansel
2024-12-05 17:02:08 +00:00
3cdd997f4c Update torch-xpu-ops commit pin (#142113)
Update the torch-xpu-ops commit to [7ecb0b](7ecb0b1a56), which includes:

- Capture rrelu_with_noise noise mutation in compile (Resolves https://github.com/pytorch/pytorch/issues/142102)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142113
Approved by: https://github.com/EikanWang
2024-12-05 17:00:29 +00:00
f8c212a925 Transform unbacked int expressions into a fresh unbacked int. (#141917)
Fix: #141419

This PR introduces the `torch.sym_fresh_size` API, which transforms an unbacked int
expression into a fresh unbacked int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141917
Approved by: https://github.com/ezyang
2024-12-05 16:53:44 +00:00
c376b29c67 [CI] Add more tests to the numpy 2 CI (#141925)
Related to #107302

This PR adds all the tests that failed with NumPy 2, which all have been fixed, to the CI to test with NumPy 2 to prevent regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141925
Approved by: https://github.com/albanD
2024-12-05 16:46:21 +00:00
822e8a01c6 [ROCm][Inductor][CK] Add batched gemms into gemm max autotune with CK backend (#141520)
## Testing
```
TORCH_LOGS=+torch._inductor pytest --capture=no test/inductor/test_ck_backend.py -k bmm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141520
Approved by: https://github.com/chenyang78
2024-12-05 16:03:12 +00:00
ca9aeedf40 Revert "[dynamo] Simplify handling of functools.wraps (#142000)"
This reverts commit f8cb692d77fb1ab75d6663eb32d71037b82e9107.

Reverted https://github.com/pytorch/pytorch/pull/142000 on behalf of https://github.com/atalman due to Newly added test test_functions.py::DefaultsTests::test_tree_map is failing internally ([comment](https://github.com/pytorch/pytorch/pull/142000#issuecomment-2520611808))
2024-12-05 15:23:53 +00:00
82c140327e Install magma from a tarball (#140417)
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.

There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal.

Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
2024-12-05 15:20:58 +00:00
86d08f0b4a Revert "[dynamo] Remove workaround for functools.wraps in functorch (#142014)"
This reverts commit ed77901ec521f3516c96f9ac2a48e659816c8905.

Reverted https://github.com/pytorch/pytorch/pull/142014 on behalf of https://github.com/atalman due to Sorry https://github.com/pytorch/pytorch/pull/142000 is failing internally, need to revert this ([comment](https://github.com/pytorch/pytorch/pull/142014#issuecomment-2520601186))
2024-12-05 15:18:56 +00:00
d24b147520 Update dead reference link for triplet margin loss (#142071)
The current link for _Learning local feature descriptors with triplets and shallow convolutional neural networks_ (https://www.bmva.org/bmvc/2016/papers/paper119/index.html) is dead (404). The paper is archived here: https://bmva-archive.org.uk/bmvc/2016/papers/paper119/index.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142071
Approved by: https://github.com/albanD
2024-12-05 15:01:10 +00:00
08df79819d Uniformly pass secrets: inherit to all jobs that go to _linux-build/_linux-test (#141995)
There's also a new lint to make sure you did it right.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141995
Approved by: https://github.com/albanD, https://github.com/malfet
2024-12-05 14:52:43 +00:00
c6c45467a3 Use cxx11-abi for Linux CUDA 12.6 builds (#142064)
Manylinux 2.28 and cxx11-abi migration. Please see: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142064
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-12-05 14:51:50 +00:00
2d1d125d60 Add BFloat16 support and use a new pack method for flash attention forward kernel (#138783)
* Add BFloat16 support for BRGEMM flash attention forward kernel
* Use a new pack method instead of oneDNN pack for flash attention forward kernel to avoid the output leading dimension limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138783
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-12-05 14:50:26 +00:00
470b775d7a Remove functorch config: _max_aliased_inputs_with_dynamic_shapes_enabled. (#141680)
This PR removes the functorch config that set an upper limit on the number of aliased inputs with dynamic shapes. After moving the overlap checks to run at runtime in C++, compilation time and runtime (in true-alias cases) improved, rendering the error no longer relevant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141680
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555, #140013
2024-12-05 14:43:58 +00:00
12d28a5929 Move overlapping guards to C++. (#140013)
This PR moves the logic for computing the overlapping relations between input tensors that
share a storage instance to C++.

In summary, this PR:

- Moves both `tensors_definitely_do_not_overlap` and part of `compute_overlapping_tensors`
to C++
- Introduces a `check_overlapping` function that re-runs `compute_overlapping_tensors`,
checking that the result is consistent with what is expected
- Introduces the `StorageOverlapChecker` class
    - Keeps track of overlapping and non-overlapping tensors
    - Actually checks the overlapping relation (call `check_overlapping`) when all tensors
    are collected
- Introduces the `STORAGE_OVERLAPPING` relational guard
    - Has a reference to a `StorageOverlapChecker`
    - Stores the to-be-checked tensors in the checker, and triggers its check
- Introduces `install_storage_overlapping_guard` python function
    - Creates an instance of `StorageOverlapChecker`
    - Creates 2 instances of the `STORAGE_OVERLAPPING` guard (for overlapping and
    non-overlapping tensors), referencing the same `StorageOverlapChecker` instance

**Why is `StorageOverlapChecker` needed?**

The way `GuardManager` is implemented, we have no control over the order in which the
check methods are called, i.e. no control over the order in which the tensors are collected. So, we
can't easily split them into "overlapping" and "non-overlapping" kinds.

Instead, we create 2 instances of the `STORAGE_OVERLAPPING` guard, each of which helps
collect the tensors for one of the kinds mentioned above. They are then used in a
single `StorageOverlapChecker` instance.
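
A rough Python sketch of this design, using only the names from the description above (everything beyond those names is an assumption, not the actual C++ implementation):

```python
class StorageOverlapChecker:
    """Collects tensors from both STORAGE_OVERLAPPING guard instances and runs
    the real overlap check only once every expected tensor has been seen."""

    def __init__(self, n_overlapping, n_non_overlapping, compute_overlapping_tensors):
        self.n_overlapping = n_overlapping
        self.n_non_overlapping = n_non_overlapping
        self.overlapping, self.non_overlapping = [], []
        self.compute_overlapping_tensors = compute_overlapping_tensors

    def add(self, tensor, overlapping):
        (self.overlapping if overlapping else self.non_overlapping).append(tensor)
        if (len(self.overlapping), len(self.non_overlapping)) == (self.n_overlapping, self.n_non_overlapping):
            # All tensors collected: verify the overlap relation is still the expected one.
            return self.compute_overlapping_tensors(self.overlapping, self.non_overlapping)
        return True  # not enough tensors yet; defer the check


def install_storage_overlapping_guard(checker):
    # Two guard callables share one checker: one feeds it the tensors expected to
    # overlap, the other the tensors expected not to overlap.
    return (lambda t: checker.add(t, overlapping=True),
            lambda t: checker.add(t, overlapping=False))
```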

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140013
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554, #139555
2024-12-05 14:43:58 +00:00
3a1ded5caa Add tensor overlapping guards. (#139555)
Fix: #118214

This PR replaces the guards introduced by running `_tensors_definitely_do_not_overlap` at
compile-time by a single `___check_overlapping` guard. When evaluated, this function calls
the original `_tensors_definitely_do_not_overlap` so as to check whether the current state
of the inputs is consistent, i.e. tensors that should overlap do overlap, and those that
shouldn't don't.
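
A minimal sketch of what the consolidated check amounts to (the helper signature here is assumed, not the actual code):

```python
def check_overlapping(overlapping_pairs, non_overlapping_pairs, definitely_do_not_overlap):
    # Pairs expected to overlap must not be provably non-overlapping, and
    # pairs expected not to overlap must be provably non-overlapping.
    return (all(not definitely_do_not_overlap(a, b) for a, b in overlapping_pairs)
            and all(definitely_do_not_overlap(a, b) for a, b in non_overlapping_pairs))
```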

In summary, the changes are:

- Introduce `StorageOverlap` derived class from `GuardEnvExpr`
- Plumb `AOTConfig` to the `compute_overlapping_inputs` function, so as to have access to
AOTAutograd input sources
- Suppress the guards generated by `_tensors_definitely_do_not_overlap` function at runtime
- Issue a `StorageOverlap` AOTAutograd guard, specifying the sources that should and
shouldn't overlap

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139555
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139554
2024-12-05 14:43:58 +00:00
cbfab8b4de Add tensor._base as a tracked fake for ShapeEnv guards. (#139554)
This PR fixes the issue where AOTAutograd would produce a guard that used a symbolic value
that came from one of the input's base.

```python
@torch.compile(backend="aot_eager", dynamic=True)
def f(a, b):
    a.add_(1)
    b.add_(1)
    return a

x = torch.ones(10)
f(x[1:], x[1:])
```

In the example above, AOTAutograd functionalizes the mutation by making use of
`as_strided_scatter` operation, which produces the guard: `s0 >= s1 + 1`, where:

- `s0`: corresponds to `x.size()[0]`
- `s1`: corresponds to `a.size()[0]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139554
Approved by: https://github.com/bdhirsh
2024-12-05 14:43:58 +00:00
27bf7d62e7 Enable retry on A100 perf nightly (#142074)
This is a quick mitigation while the investigation on https://github.com/pytorch/pytorch/issues/142069 is ongoing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142074
Approved by: https://github.com/jeanschmidt
2024-12-05 14:35:53 +00:00
6183c90e99 Avoid recursion in FloorDiv constructor (#142057)
Addresses https://github.com/pytorch/pytorch/issues/141215 and a max-recursion issue in the FloorDiv constructor.
This also optimizes perf by avoiding a lot of sympy expression construction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142057
Approved by: https://github.com/ezyang
2024-12-05 14:25:28 +00:00
895c8ce5b3 MetaTensorDesc changes for reconstructing proper FakeTensors (#141926)
A few changes to MetaTensorDesc and friends:

1. Change view_func from a raw method to an ADT where the common case (FakeTensor._view_func_unsafe) is a simple representation instead.
2. (minor) Remove and fix some `type: ignore`s added by #141839
3. (minor) Fix _UNSERIALIZABLE to be a set instead of a dict which is converted into a set each time it's used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141926
Approved by: https://github.com/ezyang
2024-12-05 14:21:57 +00:00
65c2086d45 fix the lint from D66795414 (#142122)
Summary: this diff is to fix the lint issues from D66457500 / https://github.com/pytorch/pytorch/pull/142056

Test Plan: OSS CI

Reviewed By: houseroad, FulinHuang

Differential Revision: D66795414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142122
Approved by: https://github.com/houseroad
2024-12-05 12:05:51 +00:00
38e0f72274 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-05 11:25:55 +00:00
ad2cc96218 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-12-05 09:57:08 +00:00
8dd4673cea Support torch.xpu.mem_get_info API (#141230)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/130599
This PR intends to add a new API, `torch.xpu.mem_get_info`, which is widely used in popular model workloads.
For example, [here](403c0714d1/src/accelerate/utils/modeling.py (L721)) we need to get the current GPU memory usage to split or load the model.
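
A minimal usage sketch, assuming the new API mirrors `torch.cuda.mem_get_info` and returns `(free, total)` device memory in bytes:

```python
import torch

if torch.xpu.is_available():
    free_bytes, total_bytes = torch.xpu.mem_get_info()  # current device by default
    print(f"free: {free_bytes / 2**30:.2f} GiB / total: {total_bytes / 2**30:.2f} GiB")
```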

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141230
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-12-05 08:17:25 +00:00
0be004ff37 Enable fuse_by_partitions to always return output as tuple (#142056)
Summary:
aot_compile only accepts a graph with a tuple output.
We introduce an option to fuse_by_partitions to always return outputs as a tuple, even if there is only a single entry.

Test Plan: OSS CI

Differential Revision: D66457500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142056
Approved by: https://github.com/angelayi, https://github.com/hl475
2024-12-05 08:07:41 +00:00
f675f644fd Cleanup between each test in test/test_utils_config_module.py (#142087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142087
Approved by: https://github.com/ezyang
2024-12-05 07:17:27 +00:00
cyy
aa95618268 [2/N] Apply py39 ruff fixes (#141938)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141938
Approved by: https://github.com/ezyang
2024-12-05 06:26:06 +00:00
cyy
653efe14e4 [3/N] Enable UBSAN tests (#142022)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142022
Approved by: https://github.com/ezyang
2024-12-05 06:06:53 +00:00
b31d3b2f41 Update torch-xpu-ops commit pin (#141949)
Update the torch-xpu-ops commit to [f31219](f312190a92), includes:

- Add lazy init for empty_xpu
- Fix nan propagation error for soft_shrink

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141949
Approved by: https://github.com/EikanWang
2024-12-05 05:22:38 +00:00
31f2d4eb4e [export] Update docs (#142011)
Summary:
Update export docs. Including:
1. Update the output graph.
2. Misc fixes for examples.

Test Plan: CI

Differential Revision: D66726729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142011
Approved by: https://github.com/angelayi
2024-12-05 03:44:46 +00:00
471017cbc9 avoid specializing strides with DDPOptimizer + inductor (#140751)
Fixes https://github.com/pytorch/pytorch/issues/140229

Fixes https://github.com/pytorch/pytorch/issues/139474

The issue was that:

(1) DDPOptimizer has some logic to partition the dynamo graph into buckets, and run AOTAutograd/inductor on each bucket

(2) doing so requires knowing the **exact** strides of the outputs of each subgraph, so we can have example inputs (with correct strides) to each of the later subgraphs to compile with

(3) there is some existing logic to do this today: we have a `fakify_first_call` flag in AOTAutograd that lets you run it with fake tensor inputs (to handle the calling convention changes that AOTAutograd performs at runtime). During this process, we query inductor for the output strides that it compiled with

(4) these outputs strides are stored in the FX graph cache as raw strings of sympy expressions. We have a function, `evaluate_symexpr`, which given the sympy string, and the ShapeEnv's `var_to_val` mapping, will evaluate the sympy string to generate concrete strides

(5) evaluating this expression will specialize on the exact values of any variables in our shape env, however. In DDPOptimizer, we want to know what inductor's stride outputs are symbolically. This requires converting the (string) sympy expression into actual `SymInts` that we can return.
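
A rough sketch of the two behaviors described in (4) and (5); the helper names here are assumptions rather than the exact inductor utilities:

```python
import sympy

def evaluate_symexpr(expr_str, var_to_val):
    # Concrete evaluation: substituting the current hint values specializes the stride.
    expr = sympy.sympify(expr_str)
    return int(expr.subs({sympy.Symbol(name): val for name, val in var_to_val.items()}))

def symbolic_stride(expr_str, shape_env):
    # What DDPOptimizer needs instead: keep the expression symbolic as a SymInt
    # (assumes a ShapeEnv.create_symintnode-style entry point).
    return shape_env.create_symintnode(sympy.sympify(expr_str), hint=None)
```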

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140751
Approved by: https://github.com/eellison
2024-12-05 03:41:12 +00:00
b08bc07cd7 [AOTInductor] Option to not include weight in .so (#141997)
Summary: Add an option in config to not include weights in .so

Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`

Reviewed By: desertfire

Differential Revision: D65968885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
2024-12-05 03:35:54 +00:00
51cbac4e6a [export] Change fx graph _replace_hook to a list of Callable (#142006)
Summary: Change fx graph module's _replace_hook from a single hook, to a list of hooks. This is to prepare to registering more hooks for inductor provenance tracking, where we might need to register multiple hooks for node replacement.

Test Plan:
```
buck run mode/dev-nosan caffe2/test:fx -- -r test_hooks_for_node_update
buck run mode/dev-nosan caffe2/test:test_export -- -r test_replace_hook
```

Differential Revision: D66726724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142006
Approved by: https://github.com/zhxchen17
2024-12-05 03:26:48 +00:00
45583a5df9 [FSDP2] Move to public torch.distributed.fsdp (#141868)
**Overview**
This PR moves `torch/distributed/_composable/fsdp` to `torch/distributed/fsdp/_fully_shard` and makes public APIs available from `torch.distributed.fsdp`, e.g.:
```
from torch.distributed.fsdp import fully_shard
```
This is targeting 2.6 release. I rewrote some of the documentation with (hopefully) improved phrasing.

**Follow-Ups**
- [x] Add some explanation in the docs about FSDP1 vs. FSDP2
- [ ] Move unit tests from `test/distributed/_composable/fsdp` to `test/distributed/fsdp/fully_shard/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141868
Approved by: https://github.com/kwen2501, https://github.com/wconstab, https://github.com/weifengpy

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-12-05 03:04:01 +00:00
f9af86de01 [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-05 02:28:16 +00:00
ff0cfec4c0 AsyncCollectiveTensor: fix _are_we_tracing() in dynamo (#142075)
Fixes https://github.com/pytorch/pytorch/issues/142076. Under compile, functional collectives are supposed to **not** return `AsyncCollectiveTensor`, and instead immediately issue calls to `wait_tensor()` (which we rely on the compiler to reorder as necessary).

This is done with a function `_are_we_tracing()`, that tries to detect if we are running from inside of the compiler. One of the checks it performs is `is_torchdynamo_compiling()` ([here](https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L808C8-L808C34)).

Unfortunately, this will always return False, even if dynamo is indeed tracing. The problem is that this function only returns true if dynamo **intercepts** the bytecode for `is_torchdynamo_compiling()`. However, this function is called during fake-tensor propagation, which is run as part of dynamo, but is not actually intercepted by dynamo itself.

One thing that we know is the case during dynamo tracing, however, is that a `FakeTensorMode` is active. So I tweaked the logic to assume that we are tracing if there is an active fake mode.
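
A sketch of the tweaked heuristic (not the exact code in `_functional_collectives`):

```python
from torch._guards import detect_fake_mode

def _are_we_tracing():
    # If a FakeTensorMode is active, assume we are inside compiler tracing and
    # issue wait_tensor() eagerly instead of returning an AsyncCollectiveTensor.
    return detect_fake_mode() is not None
```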

This could potentially have consequences for anybody running functional collectives with a fake mode directly, without compile in the loop. Although hopefully it's not too unreasonable to issue wait() calls immediately if you are running with fake tensor (presumably you only care about fake tensor propagation, in which case the wait() calls should technically be a no-op).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142075
Approved by: https://github.com/yifuwang, https://github.com/kwen2501
ghstack dependencies: #141725, #141728
2024-12-05 02:01:18 +00:00
dbd7b820dd Revert "[ROCm] port CK rowwise F8 from fbgemm (#140856)"
This reverts commit 291626fb22832f9381524be73241b495efa60532.

Reverted https://github.com/pytorch/pytorch/pull/140856 on behalf of https://github.com/atalman due to Failing internal build ([comment](https://github.com/pytorch/pytorch/pull/140856#issuecomment-2518911997))
2024-12-05 01:51:40 +00:00
7e77c5ffba cpp_wrapper: input kwargs to custom ops (#141370)
Fixes a situation where kwargs were being passed to a Python fallback op, but as args rather than kwargs. This does not work for arguments that are kwarg-only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141370
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580, #141369
2024-12-05 00:58:01 +00:00
dd7debdbe8 cpp_wrapper: rethrow Python exceptions, when present (#141369)
When running fallback operations in `cpp_wrapper` mode, Python errors thrown in the fallback should be propagated up the stack. This PR fixes the current situation, which discards all Python errors thrown in the fallback op in favor of an uninformative `RuntimeError`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141369
Approved by: https://github.com/desertfire
ghstack dependencies: #141368, #141580
2024-12-05 00:58:01 +00:00
4613bd393d cpp_wrapper: Add support for torch.device arguments (#141580)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141580
Approved by: https://github.com/desertfire
ghstack dependencies: #141368
2024-12-05 00:58:01 +00:00
923a778f97 cpp_wrapper: Complete support for Layout arguments (#141368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141368
Approved by: https://github.com/desertfire
2024-12-05 00:58:01 +00:00
3fdc74ae29 Fix dumb typo (#142079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142079
Approved by: https://github.com/jainapurva, https://github.com/soulitzer
2024-12-05 00:43:49 +00:00
60a192036b Refactor optional graph module into CompiledFxGraphConstants (#141897)
FXGraphCache supports freezing, but AOTAutogradCache does not. This is due to the fact that when freezing is turned on, instead of using the constants from the graph module that was saved on cache miss, we have to take the constants from the AOTAutograd generated graph module. This PR does two things:

- It bypasses AOTAutogradCache when freezing is turned on. We should have always been doing this.

- It refactors the code to be way more clear about the constants we're using and when we're using them.

Basically, there are two possible sets of constants we can grab from the compiled fx graph.

1. If freezing is turned off, we save the constants directly in CompiledFxGraph.
2. If freezing is turned on, we save the *names* of the constants in CompiledFxGraph, and use the runtime GraphModule's actual constant values: we reconstruct them from the saved names + the new graph module from AOTDispatch.

We implement two different classes for doing just this: one that has access to the post aotdispatch gm, which supports freezing, and one that doesn't have it, which does not support freezing. Then we construct the wrappers and unwrap the result as needed.
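
A rough sketch of the two constant sources (names other than `CompiledFxGraphConstants` are assumptions, not the actual implementation):

```python
class CompiledFxGraphConstants:
    # Freezing off: constant values were saved directly on the compiled graph.
    def get_constants(self, compiled_graph):
        return compiled_graph.constants

class CompiledFxGraphConstantsWithGm:
    # Freezing on: only the constant *names* were saved; the values come from the
    # AOTDispatch-produced GraphModule available at runtime.
    def __init__(self, gm):
        self.gm = gm

    def get_constants(self, compiled_graph):
        return {name: getattr(self.gm, name) for name in compiled_graph.constant_names}
```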

This makes it clear that the gm passed to AOTAutogradCache is *not* part of post compile, only the cache key generated from it is.

The whole flow is pretty confusing, but hopefully this gives us better types and static information for understanding what the different codepaths are doing.

Will add a specific AOTAutogradCache test to confirm we bypass freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141897
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-12-05 00:34:14 +00:00
25d9fa84ea [CI, 3.13] enable dynamo_wrapped unittests in 3.13 (#141264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141264
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950, #141951
2024-12-05 00:33:26 +00:00
797a347cd0 [ci, 3.13] disable segfaulting dynamo-wrapped profiler test (#141951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141951
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887, #141950
2024-12-05 00:33:26 +00:00
ae71240780 [ci, 3.13] fix/skip failing numpy 2.0+ dynamo-wrapped tests (#141950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141950
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886, #141887
2024-12-05 00:33:26 +00:00
fbd130a41f [ci, 3.13] skip failing module tracker dynamo-wrapped test (#141887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141887
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860, #141886
2024-12-05 00:33:26 +00:00
9e474231d7 [ci, 3.13] skip failing torch.package dynamo-wrapped test (#141886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141886
Approved by: https://github.com/PaliC
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859, #141860
2024-12-05 00:33:26 +00:00
408669a559 [dynamo, 3.13] disable 3.13.0 warning in dynamo-wrapped tests (#141860)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141860
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733, #141859
2024-12-05 00:33:26 +00:00
d34235a2a3 [dynamo, 3.13] add JUMP_BACKWARD_NO_INTERRUPT to terminal opcodes (#141859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141859
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533, #140733
2024-12-05 00:33:26 +00:00
2f45484331 [ci] add 3.13 inductor unittests to CI (#140733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140733
Approved by: https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862, #139533
2024-12-05 00:33:26 +00:00
3baf8859e6 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [4/N] (#140253)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140253
Approved by: https://github.com/soulitzer
2024-12-05 00:30:00 +00:00
416f500bfe [CI, 3.13] enable 3.13 CI (#139533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139533
Approved by: https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858, #141862
2024-12-05 00:25:03 +00:00
abc4111348 [ci, 3.13] skip dynamo-xpass'd numpy tests in numpy >= 2.0 (#141862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141862
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674, #141858
2024-12-05 00:25:02 +00:00
76d1047629 [dynamo, 3.13] support CONVERT_VALUE (#141858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141858
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673, #141674
2024-12-05 00:24:55 +00:00
40c959484c [ci, 3.13] disable segfaulting profiler tests in 3.13 (#141674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141674
Approved by: https://github.com/sraikund16, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623, #141673
2024-12-05 00:24:48 +00:00
c946d82077 [ci, 3.13] disable another failing cpp_extension test in 3.13 (#141673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141673
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621, #141623
2024-12-05 00:24:42 +00:00
cd56cd30f2 [ci, 3.13] disable failing cpp_extension test due to weights_only error in numpy 2.1 (#141623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141623
Approved by: https://github.com/mikaylagawarecki, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605, #141621
2024-12-05 00:24:35 +00:00
2be8d16247 [ci, 3.13] disable some quantization tests affected by numpy 2.1 overflow error (#141621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141621
Approved by: https://github.com/jerryzh168, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577, #141605
2024-12-05 00:24:29 +00:00
314e5dd1d1 [ci, 3.13] skip some parts of a failing jit test in 3.13 (#141605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141605
Approved by: https://github.com/davidberard98, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572, #141577
2024-12-05 00:24:22 +00:00
1a44f01beb [ci, 3.13] update test_testing.py usage of locals() for 3.13 (#141577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141577
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003, #141572
2024-12-05 00:24:14 +00:00
9459952175 [ci, 3.13] update tensorboard version for 3.13 to fix broken tests (#141572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141572
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409, #142003
2024-12-05 00:24:07 +00:00
c93dd531d3 format test_monitor.py and test_tensorboard.py (#142003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142003
Approved by: https://github.com/StrongerXi, https://github.com/atalman
ghstack dependencies: #141409
2024-12-05 00:23:54 +00:00
22ae34af88 [torch.package, 3.13] fixes to torch.package for 3.13 (#141409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141409
Approved by: https://github.com/PaliC, https://github.com/atalman
2024-12-05 00:23:47 +00:00
e6e75ebd0a Silent TD warnings when there is no td_results.json (#142083)
Despite the fact that we have `continue-on-error: true` there, GH behaves noisily when `td_results.json` doesn't exist.
For example, all benchmark jobs in https://github.com/pytorch/pytorch/actions/runs/12149624686 finished successfully, but they all showed up as errors on the GH UI. To make this worse, the log classifier sometimes picks up the error https://github.com/pytorch/pytorch/actions/runs/12149624686/job/33882285001#step:16:37
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142083
Approved by: https://github.com/clee2000
2024-12-04 23:43:29 +00:00
UV
0318589e87 Changed 'standard-deviation' to 'variance' in GroupNorm documentation (#141982)
Fixes #141315

Updated the GroupNorm documentation to replace 'standard-deviation' with 'variance' to accurately reflect the calculation
method.
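
For reference, the normalization described in the `GroupNorm` docs is:

```math
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta
```

The term under the square root is the variance, hence the wording change.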

@pytorchbot label "topic: not user facing"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141982
Approved by: https://github.com/mikaylagawarecki
2024-12-04 22:49:45 +00:00
326f487809 Bypass AutogradCache when view replay affects the mutation meta (#141978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141978
Approved by: https://github.com/bdhirsh
2024-12-04 22:13:12 +00:00
fa2fe9cafb Delete linux-focal-cuda12.4-py3.10-gcc9-sm86 from trunk (#142073)
As it exactly mirrors the job on pull.yml, see
c83b739f14/.github/workflows/pull.yml (L479-L495)

And also HUD [permalink](53768d67ab/3):

<img width="1201" alt="image" src="https://github.com/user-attachments/assets/d3f0f81c-843b-4f96-82ce-9fd18ebfe2ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142073
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman
2024-12-04 22:04:55 +00:00
69f8b3e269 [ROCm] unskip hermite_polynomial_h unit tests (#141150)
Large n input caused a regression starting in ROCm 6.1. The for loop will run for an excessive number of iterations. The root cause seems to be how static_cast<int64_t> behaves for large float values such as 1e20 that certain unit tests will use. The workaround is to break out of the loop once the returned value reaches nan.
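
A conceptual Python sketch of the workaround (the actual fix lives in the ROCm C++ kernel):

```python
import math

def hermite_polynomial_h(x: float, n: int) -> float:
    # Recurrence H_k(x) = 2x*H_{k-1}(x) - 2(k-1)*H_{k-2}(x)
    if n == 0:
        return 1.0
    if n == 1:
        return 2.0 * x
    p, q, r = 1.0, 2.0 * x, float("nan")
    for k in range(2, n + 1):
        r = 2.0 * x * q - 2.0 * (k - 1) * p
        if math.isnan(r):
            break  # once NaN, every later iteration stays NaN, so stop early
        p, q = q, r
    return r
```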

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141150
Approved by: https://github.com/eqy, https://github.com/malfet
2024-12-04 22:01:57 +00:00
53768d67ab Fix unit test failures with SciPy 1.13+ (#141986)
Related to #107302

To use `numpy>=2`, we need to upgrade `scipy>=1.13.0` from `1.11.0`.
This PR fixes a failed test caused by the `scipy` upgrade.

The `scipy` implementation of `logsumexp` has changed and deviated from the torch implementation.
So, we replace it with a simple custom implementation as the ground truth.
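
A minimal sketch of what such a reference implementation could look like (not necessarily the exact helper added in the test):

```python
import torch

def reference_logsumexp(x: torch.Tensor, dim: int, keepdim: bool = False) -> torch.Tensor:
    # Numerically stable log-sum-exp: shift by the max before exponentiating.
    m = x.amax(dim=dim, keepdim=True)
    out = m + (x - m).exp().sum(dim=dim, keepdim=True).log()
    return out if keepdim else out.squeeze(dim)
```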

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141986
Approved by: https://github.com/rgommers, https://github.com/albanD
2024-12-04 21:41:38 +00:00
54324fc2d9 [MPS] Release MetalShaderLibrary cached resources (#142053)
By releasing retained `id<MTLFunction>` and `id<MTLComputePipelineState>`
Please note that the `id<MTLLibrary>` associated with the class is currently leaked, which is by design; all dynamic shader allocations should use `DynamicMetalShaderLibrary`

Test plan: `leaks --atExit -- ./bin/mps_test_metal_library`

Before:
```
STACK OF 1 INSTANCE OF 'ROOT LEAK: <_MTLFunctionInternal>':
18  dyld                                  0x197a94274 start + 2840
17  mps_test_metal_library                0x1002cb420 main + 68
16  mps_test_metal_library                0x1002fa388 testing::UnitTest::Run() + 124
15  mps_test_metal_library                0x1002fa40c bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
14  mps_test_metal_library                0x1002fac50 testing::internal::UnitTestImpl::RunAllTests() + 1588
13  mps_test_metal_library                0x1002e9934 testing::TestSuite::Run() + 1032
12  mps_test_metal_library                0x1002e8688 testing::TestInfo::Run() + 960
11  mps_test_metal_library                0x1002e715c testing::Test::Run() + 812
10  mps_test_metal_library                0x1002e7200 void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
9   mps_test_metal_library                0x1002c5518 MPSTestMetalLibrary_ArangeShader_Test::TestBody() + 420
8   libtorch_cpu.dylib                    0x10fdd3804 at::native::mps::MetalShaderLibrary::getKernelFunction(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 56
7   libtorch_cpu.dylib                    0x10fdd3394 at::native::mps::MetalShaderLibrary::getLibraryPipelineState(id<MTLLibrary>, std::__1::basic_string<char, id<MTLLibrary>::char_traits<char>, id<MTLLibrary>::allocator<char>> const&) + 268
6   com.apple.Metal                       0x1a2be43b4 -[_MTLLibrary newFunctionWithName:] + 28
5   com.apple.Metal                       0x1a2be4498 -[_MTLLibrary newFunctionWithNameInternal:] + 148
4   com.apple.Metal                       0x1a2be4580 MTLLibraryContainer::functionWithName(NSString*, id<MTLDevice>) + 68
3   com.apple.Metal                       0x1a2be4724 MTLLibraryDataWithArchive::newFunction(NSString*, id<MTLDevice>) + 368
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    2 (592 bytes) ROOT LEAK: <_MTLFunctionInternal 0x1325e5550> [448]
       1 (144 bytes) _functionQueue --> <dispatch_queue_t (serial) 0x13254c340> [144]  "function queue" (from Metal)
```
After:
```
Process:         mps_test_metal_library [30687]
Path:            /Users/USER/*/mps_test_metal_library
Load Address:    0x100f74000
Identifier:      mps_test_metal_library
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [30686]

Date/Time:       2024-12-04 07:57:01.020 -0800
Launch Time:     2024-12-04 07:56:59.030 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         177.2M
Physical footprint (peak):  236.5M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 30687: 40691 nodes malloced for 5575 KB
Process 30687: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142053
Approved by: https://github.com/manuelcandales
ghstack dependencies: #142052
2024-12-04 21:40:50 +00:00
e8200a507d [MPS] Fix memory leak (#142052)
`NSProcessInfo` was allocated inside an autorelease pool, but was not added to the pool

Test plan: `leaks --atExit -- ./bin/mps_test_print`

Before it reported the leaks as follows
```
leaks Report Version: 4.0, multi-line stacks
Process 30066: 39595 nodes malloced for 5034 KB
Process 30066: 7 leaks for 448 total leaked bytes.

STACK OF 1 INSTANCE OF 'ROOT LEAK: <NSProcessInfo>':
29  dyld                                  0x197a94274 start + 2840
28  mps_test_print                        0x10224440c main + 68
27  mps_test_print                        0x1022733e4 testing::UnitTest::Run() + 124
26  mps_test_print                        0x102273468 bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 80
25  mps_test_print                        0x102273cac testing::internal::UnitTestImpl::RunAllTests() + 1588
24  mps_test_print                        0x102262990 testing::TestSuite::Run() + 1032
23  mps_test_print                        0x1022616e4 testing::TestInfo::Run() + 960
22  mps_test_print                        0x1022601b8 testing::Test::Run() + 812
21  mps_test_print                        0x10226025c void testing::internal::HandleExceptionsInMethodIfSupported<testing::TestSuite, void>(testing::TestSuite*, void (testing::TestSuite::*)(), char const*) + 80
20  mps_test_print                        0x102240f88 MPSPrintTest_PrintFloatMatrix_Test::TestBody() + 88
19  mps_test_print                        0x1022414f4 torch::randn(c10::ArrayRef<long long>, c10::TensorOptions) + 72
18  libtorch_cpu.dylib                    0x10de1cb34 at::_ops::randn::call(c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 280
17  libtorch_cpu.dylib                    0x10de1cf1c at::_ops::randn::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 152
16  libtorch_cpu.dylib                    0x10d9b1078 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 60
15  libtorch_cpu.dylib                    0x10d9b1220 at::native::randn(c10::ArrayRef<long long>, std::__1::optional<at::Generator>, std::__1::optional<c10::ScalarType>, std::__1::optional<c10::Layout>, std::__1::optional<c10::Device>, std::__1::optional<bool>) + 256
14  libtorch_cpu.dylib                    0x10e0151f8 at::_ops::normal_::call(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 476
13  libtorch_cpu.dylib                    0x10f08ceac c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>)>, at::Tensor&, c10::guts::typelist::typelist<at::Tensor&, double, double, std::__1::optional<at::Generator>>>, at::Tensor& (at::Tensor&, double, double, std::__1::optional<at::Generator>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor&, double, double, std::__1::optional<at::Generator>) + 84
12  libtorch_cpu.dylib                    0x10f037674 at::(anonymous namespace)::(anonymous namespace)::wrapper_MPS__normal_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 72
11  libtorch_cpu.dylib                    0x111d8bde8 at::native::normal_mps_(at::Tensor&, double, double, std::__1::optional<at::Generator>) + 132
10  libtorch_cpu.dylib                    0x111d8c334 at::native::mps::normal_mps_impl(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 884
9   libtorch_cpu.dylib                    0x111d8b8d8 at::Tensor& at::native::mps::random_mps_impl<double>(at::Tensor&, double, double, std::__1::optional<at::Tensor> const&, std::__1::optional<at::Tensor> const&, MPSGraphRandomDistribution, std::__1::optional<at::Generator>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, MPSGraphTensor* (at::native::mps::RandomCachedGraph*, MPSGraphTensor*) block_pointer) + 2508
8   libtorch_cpu.dylib                    0x111d453bc at::native::mps::Placeholder::Placeholder(MPSGraphTensor*, at::Tensor const&, NSArray<NSNumber*>*, bool, MPSDataType, bool) + 5120
7   libtorch_cpu.dylib                    0x111d2dbc8 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const + 404
6   libtorch_cpu.dylib                    0x111d2ddf0 at::mps::MPSDevice::isMacOS13Plus(at::mps::MacOSVersion) const::$_0::operator()(int, int) const + 48
5   libobjc.A.dylib                       0x197a7b3f4 objc_alloc_init + 80
4   com.apple.Foundation                  0x19995fbe4 +[NSProcessInfo alloc] + 112
3   com.apple.Foundation                  0x19995faec +[NSProcessInfo allocWithZone:] + 120
2   libobjc.A.dylib                       0x197a49ddc _objc_rootAllocWithZone + 48
1   libsystem_malloc.dylib                0x197c3baf8 _calloc + 88
0   libsystem_malloc.dylib                0x197c4e9bc _malloc_zone_calloc_instrumented_or_legacy + 128
====
    1 (64 bytes) ROOT LEAK: <NSProcessInfo 0x102ce4de0> [64]
```
After test run finishes with no leaks reported
```
Process 29875 is not debuggable. Due to security restrictions, leaks can only show or save contents of readonly memory of restricted processes.

Process:         mps_test_print [29875]
Path:            /Users/USER/*/mps_test_print
Load Address:    0x10223c000
Identifier:      mps_test_print
Version:         0
Code Type:       ARM64
Platform:        macOS
Parent Process:  leaks [29874]

Date/Time:       2024-12-04 07:43:15.287 -0800
Launch Time:     2024-12-04 07:43:14.400 -0800
OS Version:      macOS 15.1.1 (24B2091)
Report Version:  7
Analysis Tool:   /usr/bin/leaks

Physical footprint:         172.0M
Physical footprint (peak):  234.1M
Idle exit:                  untracked
----

leaks Report Version: 4.0, multi-line stacks
Process 29875: 39508 nodes malloced for 5021 KB
Process 29875: 0 leaks for 0 total leaked bytes.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142052
Approved by: https://github.com/manuelcandales
2024-12-04 21:40:50 +00:00
c0c8f41679 [ROCm] add gfx1101 to wheels (#141667)
- Remove older ROCm 5.x build condition

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141667
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-04 21:21:29 +00:00
760b8ec10a [easy] Log bypass reasons if we're unable to serialize or deserialize a saved graph (#141911)
When we fail to deserialize/serialize a graph, we should alert and log it somewhere so that it's debuggable.

This can happen in OSS if we use view_replay and encounter an output that requires functional tensor to be serialized to work.

Differential Revision: [D66669993](https://our.internmc.facebook.com/intern/diff/D66669993/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141911
Approved by: https://github.com/oulgen, https://github.com/ezyang
2024-12-04 21:03:32 +00:00
f24a9d0755 [PGNCCL] Fix behavior of destroy_process_group (#141510)
Today `destroy_process_group()` is implemented via `ncclCommAbort`.
When users call it from the CPU, the risk is that a healthy NCCL kernel gets preempted, which causes data corruption.

Instead of aborting kernels, we should flush collectives in `destroy_process_group`, i.e. let them complete normally, before we tear down resources.

This PR implements such "flushing" behavior using `ncclCommFinalize`, then reclaims resources via `ncclCommDestroy`.

Expected behaviors:
For a bad program, a hang is expected at `destroy_process_group()`. If the PG uses non-blocking communicators, such a hang is recoverable, because we attach a timeout to the flush behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141510
Approved by: https://github.com/wconstab
2024-12-04 20:30:47 +00:00
f7bd0c6b60 [doc] Fix the toctree level (#142008)
Changing this back to 1 in order to not expand on the index.html page.
Before:
![Screenshot 2024-12-04 at 11 47 54 AM (2)](https://github.com/user-attachments/assets/40d730ee-61b9-4d60-ab13-9b9075cb3cba)
After:
![Screenshot 2024-12-04 at 11 48 30 AM (2)](https://github.com/user-attachments/assets/5eb711a0-e76c-4573-9fdf-88b6b94b31a9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142008
Approved by: https://github.com/sekyondaMeta, https://github.com/malfet
2024-12-04 19:52:14 +00:00
d552625920 [xla] Update pin to current xla/master (#142065)
The previous xla pin update pointed to a branch on xla, which I force-pushed to remove .torch_pin from the xla PR, so that commit became unavailable. Updating to the merged xla/master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142065
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-12-04 19:44:50 +00:00
ed77901ec5 [dynamo] Remove workaround for functools.wraps in functorch (#142014)
This is no longer needed after #142000.

Fixes #123365.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142014
Approved by: https://github.com/zou3519
ghstack dependencies: #142000
2024-12-04 19:10:46 +00:00
f8cb692d77 [dynamo] Simplify handling of functools.wraps (#142000)
Previously when Dynamo encounters a `functools.wrap(...)` call, it would
check `VariableTracker.can_reconstruct` and graph break if it failed.

That has 2 issues:
1. Implementation of `can_reconstruct` is incorrect, since logic of
   reconstructability isn't necessarily encapsulated in
   `VariableTracker.reconstruct` -- for some VTs like `CellVariable`,
   it's also in `SideEffects.codegen_save_tempvars`. This is exposed by
   #134731.
2. We don't always need to reconstruct the result of
   `functools.wrap(...)`, for those cases we don't want to give up
   tracing by an early `can_reconstruct` check. Instead we could just
   let it fall through, and graph break in the actual `reconstruct` call
   later, if needed.

This patch removes the `can_reconstruct` check altogether. It was
introduced in #114279, but the added tests pass even without the check
now; this might be because of some recent bug fixing on cells and side
effects.

Fixes #134731, #141514.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142000
Approved by: https://github.com/zou3519
2024-12-04 19:10:45 +00:00
51b7528e27 [ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955)
Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to prematurely return NaN if it is known that at this value it will be NaN.

According to my short test
```Python
import torch
device = "cuda"
dtype = torch.float32

x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype)

for n in range(1024):
    if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_h: all outputs are nans! n = {n}")
        break

for n in range(1024):
    if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]:
        print(f"hermite_polynomial_he: all outputs are nans! n = {n}")
        break
```

The output values become NaNs at these orders:
```
hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32
hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32
hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64
hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64
```

Surely, it makes sense to increase the limit as a safety margin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955
Approved by: https://github.com/malfet
2024-12-04 18:21:44 +00:00
c47dae8646 [functional autograd] refactor CopyBackward to be functional (#141719)
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141719
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278, #141348
2024-12-04 18:06:31 +00:00
215f5d77b5 [functional autograd] Refactor validate_outputs into a functional variant (#141348)
Today, validate_outputs is stateful (it depends on the autograd graph).
This PR refactors it into a stateless form that just depends on
InputMetadata.

Test Plan:
- new unittest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141348
Approved by: https://github.com/soulitzer
ghstack dependencies: #141278
2024-12-04 18:06:31 +00:00
2b4f1f4990 [functional autograd] Refactor built-in autograd nodes into functional variants (#141278)
This PR refactors all builtin autograd nodes (e.g. MulBackward0) from
having a single MulBackward0::apply into having:
- a "pure function variant" `MulBackward0_apply_functional`
- a stateful variant MulBackward0::apply that ends up calling
  `MulBackward0_apply_functional`.

In order to do this we left the stateful pieces in MulBackward0::apply
(like unpacking of saved vars, determining which gradients actually need
computing).
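
A conceptual Python sketch of the split (the real change is in the generated C++ nodes; these names are illustrative only):

```python
def mul_backward_apply_functional(grad_output, saved_self, saved_other, needs_input_grad):
    # Pure function: everything it needs is passed in explicitly.
    grad_self = grad_output * saved_other if needs_input_grad[0] else None
    grad_other = grad_output * saved_self if needs_input_grad[1] else None
    return grad_self, grad_other

class MulBackward0:
    def __init__(self, saved_self, saved_other, needs_input_grad):
        self.saved_self, self.saved_other = saved_self, saved_other
        self.needs_input_grad = needs_input_grad

    def apply(self, grad_output):
        # Stateful wrapper: unpack saved variables, then delegate to the pure variant.
        return mul_backward_apply_functional(
            grad_output, self.saved_self, self.saved_other, self.needs_input_grad
        )
```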

The motivation is that this will be useful for compiled autograd in a
future PR. We might refactor this more later, but I wanted to get
something reviewed, shipped, and tested in-tree because the entire stack
is going to be big and this change by itself might have subtle perf issues.

The new codegen looks like the following:
- https://gist.github.com/zou3519/84721cfbef71bb640ddf1a64ef8583a3

Here's the old codegen for comparison:
- https://gist.github.com/zou3519/73f925fe6aca6dd3ceb0a6e6fcf5f77d

Test Plan:
- existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141278
Approved by: https://github.com/soulitzer
2024-12-04 18:06:31 +00:00
fd35be2fd3 TritonTemplate dtype fixes (#141991)
- Set the dtype of "acc" appropriately so that epilogue fusion will have args with dtype
- Update dtype propagation to use `type_to_dtype` instead of instantiating tensor
- Throw if we have a string arg where we should have a proper CSEVariable, unless we're doing the Modification Subgraph thing which is nyi. everything else is appropriately typed (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @drisspg ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141991
Approved by: https://github.com/drisspg
ghstack dependencies: #139945, #140057, #141495, #141882
2024-12-04 17:24:23 +00:00
920e4364b7 [BE] Remove "$PACKAGE_TYPE" == 'conda' logic from build scripts (#142019)
Please see: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142019
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-12-04 16:05:43 +00:00
0582b32f6c Enable Extension Support (#142028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142028
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-12-04 15:54:06 +00:00
38d10a1b17 Revert "[Inductor] Represent tiling as a dict (#141751)"
This reverts commit 5deca07c0dcf1482eba99bf93b805cf1cc41ad6c.

Reverted https://github.com/pytorch/pytorch/pull/141751 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/141751#issuecomment-2517815899))
2024-12-04 15:43:16 +00:00
7830c213d7 [FlexAttention] Fix max-autotune bug with captured buffer grads (#141531)
# Summary
Fix tensor argument ordering for autotuning flex attention, and change how we enable scatter codegen for Triton. We used to go through the existing store_output triton codegen, but now we just short-circuit and generate the correct expression earlier on.

Instead of relying on arg.python_defs to thread arguments through via input_buffers, this enables us to reuse the exact same mutated-buffer infra as we did for multiple outputs before.

Test cases added for both default and max-autotune-no-cudagraphs modes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141531
Approved by: https://github.com/Chillee
2024-12-04 14:56:58 +00:00
6ad422d778 set_linter finds and replaces built-in set in Python code (#138454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138454
Approved by: https://github.com/eellison
2024-12-04 14:31:24 +00:00
7666c8263a [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-04 14:21:04 +00:00
f85e238186 [aotd] capture rrelu_with_noise noise mutation in compile (#141867)
Rebase copy of the long-standing, already-approved PR https://github.com/pytorch/pytorch/pull/138503 that was blocked from landing by xla build issues.

Created a new PR with the same content (ghstack checkout was failing due to changed submodules).

Corresponding xla PR:
https://github.com/pytorch/xla/pull/8363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141867
Approved by: https://github.com/bdhirsh
2024-12-04 12:18:58 +00:00
61dc5e9c0a Enforce contiguity for alltoall (#141816)
Summary: We cannot relax the alltoall contiguity requirement; doing so will lead to wrong results.

Test Plan: Added a test.

Differential Revision: D66560930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141816
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/fduwjj, https://github.com/fegin, https://github.com/yoyoyocmu
2024-12-04 10:17:39 +00:00
eff99a4b4b fix linalg.SVD docs typo: wrong V* shape in reduced SVD (#142037)
https://en.wikipedia.org/wiki/Singular_value_decomposition#Reduced_SVDs

In reduced SVD:
V* shape is (k, n)
V shape is (n, k)

In the docs it was wrong.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142037
Approved by: https://github.com/lezcano
2024-12-04 09:18:33 +00:00
16676fd17b Disable unused ARM SME to reduce android app binary size (#141942)
Summary: ARM SME kernels aren't currently used, so disable their build to reduce binary size.

Reviewed By: digantdesai

Differential Revision: D66336599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141942
Approved by: https://github.com/digantdesai
2024-12-04 07:24:50 +00:00
9dffd12f90 Upgrade ROCm wheels to manylinux2_28 - 2 of 2 (binaries) (#141423)
Depends on https://github.com/pytorch/pytorch/pull/140681 and https://github.com/pytorch/pytorch/pull/141609

Highlights:
* Upgrade binaries to ROCm6.2.4 to use latest docker images
* Remove pre-cxx11 builds for libtorch on ROCm
* Use manylinux2_28 docker images for ROCm
* Set `DESIRED_DEVTOOLSET=cxx-abi` (and hence `_GLIBCXX_USE_CXX11_ABI=1`) for ROCm manylinux2_28 wheels (ROCm RHEL8 packages also have GCC_ABI=1, so it keeps it consistent)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141423
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-12-04 07:00:25 +00:00
c0e1fc4919 Avoid casting low precision inputs to high precision for XPU Tensor in torch.linalg.vector_norm (#141954)
Fixes https://github.com/pytorch/pytorch/issues/141953

For mixed-precision cases, tensors on the cpu device are cast to `out_dtype`, while tensors on cuda devices are not, for computational efficiency. For Intel xpu tensors, low-precision inputs should likewise not be converted to high precision (same as cuda).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141954
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-12-04 06:44:19 +00:00
75d57b04ec [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [9/N] (#140933)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933

> This is the last one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140933
Approved by: https://github.com/ezyang
2024-12-04 06:28:08 +00:00
d6481333ad [MPS] Add scatter_reduce.two (#141948)
Which has been requested 20+ times on https://github.com/pytorch/pytorch/issues/77764 and is just a flavor of the out-of-the-box scatter-reduce, so all this op does is redispatch to the existing implementation.
Unsupported dtype/reduction type combinations:
 - min/max for int64
 - min/max for int32 on MacOS-14 or older
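
A minimal usage sketch of the op on an MPS-capable machine (values and shapes are illustrative):

```python
import torch

if torch.backends.mps.is_available():
    src = torch.tensor([1.0, 2.0, 3.0, 4.0], device="mps")
    index = torch.tensor([0, 1, 0, 1], device="mps")
    out = torch.zeros(2, device="mps")
    # Accumulate src into out at the given indices; "amax"/"amin"/"prod"/"mean" also work.
    print(out.scatter_reduce(0, index, src, reduce="sum"))  # -> tensor([4., 6.])
```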

The following Swift code demonstrates the problem with the scatterAlongAxis MPS call
```swift
import Metal
import MetalPerformanceShadersGraph

func scatterMPS(device: MTLDevice,
                inp_buf: MTLBuffer, upd_buf: MTLBuffer,
                idx_buf: MTLBuffer, out_buf: MTLBuffer,
                inp_elem: Int, upd_elem: Int) {
  let graph = MPSGraph()
  let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .int64, name: nil)
  let updatesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil)
  let outNode = graph.scatterAlongAxis(0, data: inputPlaceholder, updates: updatesPlaceholder, indices: indicesPlaceholder, mode: .min, name: nil)
  let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  let mpsUpdatesBuffer = MPSGraphTensorData(upd_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64)
  let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .int64)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer,
                               updatesPlaceholder: mpsUpdatesBuffer,
                               indicesPlaceholder: mpsIndicesBuffer ],
            targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer])
}

func makeBufferWithValues(device: MTLDevice, values: [Int64]) -> MTLBuffer {
  guard let buf = device.makeBuffer(length: values.count * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let buf_data = buf.contents().assumingMemoryBound(to: Int64.self)
  for i in 0..<values.count {
    buf_data[i] = values[i]
  }
  return buf
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let inp_elem = 4
let upd_elem = 4
let inp_buf = makeBufferWithValues(device: device, values: [1, 2, 3, 4])
let upd_buf = makeBufferWithValues(device: device, values: [Int64.max - 1, Int64.max - 2 , Int64.max >> 16 , 11])
let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3])
guard let out_buf = device.makeBuffer(length:inp_elem * MemoryLayout<Int64>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

scatterMPS(device: device,
           inp_buf: inp_buf, upd_buf: upd_buf,
           idx_buf: idx_buf, out_buf: out_buf,
           inp_elem: inp_elem, upd_elem: upd_elem)

let obuf_data = out_buf.contents().assumingMemoryBound(to: Int64.self)
for i in 0..<inp_elem {
    print("out_buf[\(i)] = \(obuf_data[i])")
}
```
that prints `4294967294, 4294967293, 4294967295, 4` instead of the expected `1, 2, 3, 4`,
whereas `torch.tensor([[1, 9223372036854775806], [2, 9223372036854775805], [3, 140737488355327], [4, 11]], dtype=torch.int64, device='mps').max(1)` yields the expected results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141948
Approved by: https://github.com/manuelcandales
2024-12-04 04:56:43 +00:00
deffbbdb91 Update submodule ideep for pd cache changes (#141555)
Fixes https://github.com/pytorch/pytorch/issues/141327.
Fixes https://github.com/pytorch/pytorch/issues/141328.
Fixes https://github.com/pytorch/pytorch/issues/141329.
Fixes https://github.com/pytorch/pytorch/issues/141330.
Fixes https://github.com/pytorch/pytorch/issues/141331.

Summary:
1. Modify to_bytes function to include binary_src shape information into the keys of pd cache.
2. Modify inner_product_forward to support broadcast add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141555
Approved by: https://github.com/jgong5
2024-12-04 04:55:33 +00:00
e8e65764d1 [pipelining] Improve schedule csv loading (#142009)
Add small changes based on feedback from Less when testing out https://github.com/pytorch/torchtitan/pull/707
- expose `validate_schedule` as a function
- handle spaces around actions in csv file
- add error arrow to `_format_pipeline_schedule()` to better show where the step errored

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142009
Approved by: https://github.com/lessw2020
2024-12-04 04:15:34 +00:00
86f306b15e _inductor: Add dynamo_timed for async_compile.precompile and turn on (#141920)
waitcounters

This fixes some review comments from https://github.com/pytorch/pytorch/pull/141379
and gives us another dynamo_timed event for local compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141920
Approved by: https://github.com/masnesral
2024-12-04 04:03:46 +00:00
30d907c6fb When serializing treespec context, support enum as well (#141525)
Following https://github.com/pytorch/pytorch/pull/102716, per @angelayi's suggestion.

Note that in general enum as an input is not supported.

repro:
```
import enum
from enum import auto

import torch

class TestEnum(enum.Enum):
    A = auto()
    B = auto()

    @staticmethod
    def from_string(s):
        return TestEnum[s.upper()]

class M(torch.nn.Module):
    def forward(self, x, en):
        return x.clone()

input1 = (
    torch.rand(10, device="cuda"),
    {TestEnum.A: torch.rand(10, device="cuda")},
)
inputs = [input1]
model = M().cuda()

_ = model(*input1)

ep = torch.export.export(model, input1, strict=False)
path = torch._inductor.aot_compile(ep.module(), input1)
```

Differential Revision: D66269157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141525
Approved by: https://github.com/angelayi
2024-12-04 03:08:50 +00:00
288b73cb14 [Redo] Set remote cache version and backend type once in compilation metrics (#141967)
(Got reverted due to a silly bug, fixed now.)

This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times

Differential Revision: [D66701970](https://our.internmc.facebook.com/intern/diff/D66701970/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141967
Approved by: https://github.com/laithsakka
2024-12-04 03:07:53 +00:00
7dfb439a2a Only write predicate once when there are multiple torch.cond (#141528)
Fixes #140606

TEST PLAN:

```
python test/inductor/test_aot_inductor.py -k cond_share
python test/inductor/test_aot_inductor_arrayref.py -k cond_share
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141528
Approved by: https://github.com/desertfire
2024-12-04 01:56:10 +00:00
cyy
bffaddf9ea Format caffe2/serialize (#141850)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141850
Approved by: https://github.com/cpuhrsch
2024-12-04 01:14:24 +00:00
941da90e8a Add macos perf run to the dashboard upload (#141999)
Adjust the inductor workflow to ensure the macOS perf run gets uploaded
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141999
Approved by: https://github.com/huydhn
2024-12-04 01:08:13 +00:00
291626fb22 [ROCm] port CK rowwise F8 from fbgemm (#140856)
This ports (copies) FBGEMM's implementation from @jwfromm.

https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140856
Approved by: https://github.com/drisspg, https://github.com/atalman
2024-12-04 00:32:24 +00:00
a51a048027 [AOTI][refactor] Move stack allocation related configs (#139093)
Summary: Move allow_stack_allocation and use_minimal_arrayref_interface configs into the aot_inductor subclass.

Test Plan: CI

Differential Revision: D65064301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139093
Approved by: https://github.com/chenyang78
2024-12-04 00:15:19 +00:00
0190d929f2 [BE] Remove unused argument (#141983)
Summary: As title, the `node_filter` argument is not used.

Test Plan: CI

Differential Revision: D66712599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141983
Approved by: https://github.com/tugsbayasgalan
2024-12-04 00:07:33 +00:00
9286c21b22 Fix fbcode tests for automatic dynamic unspecialize float (#141975)
Differential Revision: D66708552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141975
Approved by: https://github.com/bdhirsh, https://github.com/atalman
2024-12-03 23:59:06 +00:00
20912ba582 fix incorrect c10::SymFloat::sqrt (#141728)
Fixes the silent correctness for SDPA in https://github.com/pytorch/pytorch/issues/141710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141728
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #141725
2024-12-03 23:34:16 +00:00
af3e7389ef guard on flash attention SymFloat scale instead of incorrectly casting to float (#141725)
Fixes https://github.com/pytorch/pytorch/issues/141710. Previously, if we called flash attention with a `SymFloat` scale that was properly symbolic, we would unsafely cast its raw `SymFloat._data` into a `float`, which is pretty much guaranteed to give `NaN`.

This avoids the NaNs in the linked issue, but I'm not sure if it's worth landing yet because we'll start specializing and recompiling for every distinct `scale` that is passed in (which in the dynamic shapes case, is some function of `query.size(-1)`).

The real fix would be to ensure that the flash attention (and related) ops all accept a symbolic version of the `scale`. I'm not sure if we should use `SymFloat` or `Scalar` though - more discussion in the issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141725
Approved by: https://github.com/ezyang
2024-12-03 23:34:16 +00:00
da5b281f23 Generate op variants for core CIA ops (#141797)
There are four core ATen ops with Composite Implicit Autograd (CIA) dispatch: upsample_bilinear2d.vec, upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Op variant auto-generation is currently skipped for CIA ops. In preparation to disable the decompositions for upsample ops by default in export, we need to generate out variants for these ops.

This change enables autogen for core-tagged CIA ops, which enables generation of upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out.

Test Plan:
Added a new test test_functional_variant_autogen_out_variant_core to cover this case in test_codegen.py.
Confirmed that upsample_bilinear2d.vec_out and upsample_nearest2d.vec_out op overloads are registered (they were previously not available).

Differential Revision: D66590257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141797
Approved by: https://github.com/larryliu0820
2024-12-03 22:57:46 +00:00
f0b33658f8 Dont use constant mask if ynumel potentially overflows ygrids (#139751)
If (ynumel / YBLOCK)  > get_max_ygrids(), the z dimension will be used if znumel is None. However, if (ynumel / YBLOCK) % get_max_ygrids() != 0, there will be program launches with inputs that require masking, and so this needs to be considered when determining if the y dimension has a constant mask.
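As a rough numeric illustration (a sketch with made-up values; 65535 is the usual CUDA limit for the y grid dimension):

```python
# Sketch: if the number of y-programs spills into the z dimension and does not
# divide evenly by the max y-grid size, the last launches along y cover a
# partial tile, so a constant (mask-free) y access pattern is unsafe.
YBLOCK = 32
max_ygrids = 65535                      # typical CUDA limit for gridDim.y
ynumel = 3_000_000 * YBLOCK
num_yprograms = ynumel // YBLOCK        # 3,000,000 > max_ygrids -> z dim is used
needs_y_mask = (num_yprograms % max_ygrids) != 0
print(num_yprograms, needs_y_mask)      # 3000000 True
```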

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139751
Approved by: https://github.com/eellison

Co-authored-by: George White <georgew@graphcore.ai>
2024-12-03 22:56:18 +00:00
cc98a1b599 _inductor: Add WaitCounter for triton.compile calls. (#141379)
_inductor: Add WaitCounter for async_compile.wait calls.

This will start recording how long these async_compile.wait calls take.

Note that we want to just unify dynamo_timed in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141379
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-12-03 22:56:04 +00:00
f86a1753d1 Add option to split Linear gates for Quantizable LSTM into separate ops (#141366)
Add option to split Linear gates for Quantizable LSTM into separate ops (#141366)

Summary:

Reattempt to land D65283170, adding pyre-fixmes / mypy ignores following D52890934

For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87

For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.

This functionality can be enabled by specifying `split_gates=True` (default False is original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM`  or via `_get_lstm_with_individually_observed_parts`).
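A minimal sketch of the idea (illustrative only, not the actual `torch.ao.nn.quantizable` code; class and attribute names below are made up):

```python
import torch
import torch.nn as nn

class FusedGates(nn.Module):
    """split_gates=False: one wide Linear per state, then chunk into 4 gates."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.igates = nn.Linear(input_dim, 4 * hidden_dim)
        self.hgates = nn.Linear(hidden_dim, 4 * hidden_dim)

    def forward(self, x, h):
        gates = self.igates(x) + self.hgates(h)   # large [*, 4*hidden_dim] intermediate
        return gates.chunk(4, dim=-1)

class SplitGates(nn.Module):
    """split_gates=True: four narrow Linears per state; no 4*hidden_dim intermediate."""
    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.igates = nn.ModuleList([nn.Linear(input_dim, hidden_dim) for _ in range(4)])
        self.hgates = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])

    def forward(self, x, h):
        # add is applied per gate, so the concatenated gates tensor is never built
        return tuple(ig(x) + hg(h) for ig, hg in zip(self.igates, self.hgates))
```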

Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/4503599884152725

This test is quite long running now (more than double original).

---

shorter test to confirm original `LSTMCell` passes
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_fx -- --exact 'caffe2/test:quantization_fx - test_static_lstm_with_custom_fixed_qparams (quantization.fx.test_quantize_fx.TestQuantizeFx)'
```
https://www.internalfb.com/intern/testinfra/testrun/11258999127933996

Reviewed By: Ninja91

Differential Revision: D66380336
2024-12-03 17:21:44 -05:00
80705d3abf Convert assert to torch._check in MHA (#141918)
Fixes https://github.com/pytorch/pytorch/issues/139610
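As a sketch of the kind of rewrite this refers to (not the actual MHA code), `torch._check` keeps the validation while remaining traceable for symbolic shapes, unlike a plain Python `assert`:

```python
import torch

def check_embed_dim(query: torch.Tensor, embed_dim: int) -> None:
    # before: assert query.shape[-1] == embed_dim, "embed_dim mismatch"
    torch._check(
        query.shape[-1] == embed_dim,
        lambda: f"expected last dim {embed_dim}, got {query.shape[-1]}",
    )
```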
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141918
Approved by: https://github.com/ezyang
2024-12-03 21:58:02 +00:00
5303af2d27 Structured compile_fx (#141505)
- Turn fx_codegen_and_compile() into a class (FxCompile) so we can override the implementation.
- Pull the current body into an implementation (_InProcessFxCompile) which just performs the existing behavior.
- Add an async interface. (See below)

The intended future behavior of Async Compile will be to allow dynamo functions to start compiling in the background (and on a separate machine) while we continue to run eager in the foreground. As such we'll need to put the compilation behind some kind of Future implementation - it makes sense to reuse the existing python futures for that.  An async function is just a syntactic way to return an asyncio.Future.

Because asyncio.run() adds confusion to the stack traces when the called function isn't actually being used in an asynchronous way we also provide a synchronous interface which can be directly called.
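A rough sketch of the structure described above (the class names follow this description; the helper function and async wiring are assumptions added for illustration):

```python
import asyncio
from abc import ABC, abstractmethod

def _codegen_and_compile_impl(gm, example_inputs):
    # Stand-in for the existing body of fx_codegen_and_compile(), which runs
    # Inductor codegen and returns a compiled callable.
    return lambda *args: gm(*args)

class FxCompile(ABC):
    @abstractmethod
    def codegen_and_compile(self, gm, example_inputs):
        ...

class _InProcessFxCompile(FxCompile):
    # Existing behavior, just moved behind the interface.
    def codegen_and_compile(self, gm, example_inputs):
        return _codegen_and_compile_impl(gm, example_inputs)

class _AsyncFxCompile(FxCompile):
    # Async flavor: compilation can run in the background (or on another
    # machine) while eager execution continues in the foreground.
    async def codegen_and_compile_async(self, gm, example_inputs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, _codegen_and_compile_impl, gm, example_inputs)

    # Synchronous wrapper so callers that don't need async avoid asyncio.run()
    # noise in their stack traces.
    def codegen_and_compile(self, gm, example_inputs):
        return asyncio.run(self.codegen_and_compile_async(gm, example_inputs))
```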

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141505
Approved by: https://github.com/ezyang
ghstack dependencies: #141502
2024-12-03 21:27:32 +00:00
02147fe0f9 codecache: pull out some Graph serialization code into common helpers (#141502)
Moved some code from FxGraphCache.lookup_graph() which dealt with serializing and deserializing CompiledFxGraph into CompiledFxGraph itself so it can be reused later by Async Compile.

Async Compile will need to serialize the compiled CompiledFxGraph from one process and deserialize it in another - so it's very similar to the cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141502
Approved by: https://github.com/ezyang
2024-12-03 21:27:32 +00:00
8e9873d0a3 Allow attribute mutation for MutableMappingVariable (#141376)
Fixes #141375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141376
Approved by: https://github.com/vmoens
2024-12-03 21:00:10 +00:00
b4ea913978 Check /var/lib/jenkins/workspace exists before setting permissions (#141767)
Currently, if you run these CI scripts in a non-jenkins environment, they fail due to the folder not existing. This ensures the CI scripts can be re-used in different runners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141767
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-12-03 20:56:20 +00:00
cyy
7c1d5db1f3 [2/N] Enable UBSAN tests (#141740)
Apply c10::load in more places. The function was introduced to cast a byte to valid boolean values, thus fixing the UBSAN errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141740
Approved by: https://github.com/ezyang
2024-12-03 20:52:26 +00:00
28efc17d2c [pytorch/profiler] Honor escape quotes arg in a profiler metadata log formatter (#141527) (#141626)
Summary:

We were ignoring the `with_escaped_quotes` param in the `format_list` inline function in utils.cpp in the case where we had to truncate a list of more than kTruncatelength items.

In that case we would truncate the list into a string but always return it wrapped in escaped quotes. This causes issues if the string is meant to be added to other lists which also go through formatting, leading to cases like `"["[a, b, c, ...]"]"`.

Now the above is correctly formatted as `"[[a, b, c, ...]]"`, since the escape-quote requests are honored.

Differential Revision: D66521676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141626
Approved by: https://github.com/sraikund16
2024-12-03 20:42:57 +00:00
78e53a92c3 Remove monkeypatch of has_frozen_params in test/inductor/test_codecache.py (#141898)
Summary: This particular test isn't really needed since the code path is already exercised in `test_freezing`. While I was here, I beefed up testing in that method to consider whether the frozen parameter is inlinable or not, since the caching behavior is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141898
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-12-03 20:38:10 +00:00
42547f8d48 Add support for blackwell codegen (#141724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141724
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
2024-12-03 20:34:43 +00:00
8b0fcad0fd [AOTInductor] Add update_constant_buffer pybind support (#140755)
Summary: We add update_constant_buffer python support for testing purpose.

Test Plan: Included in commit

Differential Revision: D65968613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140755
Approved by: https://github.com/22quinn
2024-12-03 20:34:25 +00:00
e5f5283ab2 Fix cuda arch full version for 12.6 (#141976)
follow up for https://github.com/pytorch/pytorch/pull/141433/files
build still showing up as 12.6.2 in the name, see latest https://github.com/pytorch/pytorch/actions/runs/12134985224/job/33833276884.

related to https://github.com/pytorch/pytorch/issues/138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141976
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/Skylion007
2024-12-03 20:33:01 +00:00
f472b3aee1 improve typings around torch.export (#141829)
This is another follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `torch._export`. Even though the PR introduces a few runtime type asserts, the runtime behavior should stay equivalent, because the failed assertions should have been immediate crashes anyway.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141829
Approved by: https://github.com/ezyang
2024-12-03 19:57:21 +00:00
43c5f59190 flip capture_autograd_function to default to true and warn if false (#141972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141972
Approved by: https://github.com/zou3519
ghstack dependencies: #141932
2024-12-03 19:50:14 +00:00
96a35716d1 [aoti] Improve OSSProxyExecutor error messages (#141501)
For debugging issues like https://fb.workplace.com/groups/1028545332188949/permalink/1092584242451724/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141501
Approved by: https://github.com/henrylhtsang
2024-12-03 19:32:49 +00:00
6b620423a3 dynamo_timed: Add a log_waitcounter option. (#141402)
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.

Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).
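A hedged usage sketch (the option name comes from this PR; the exact `dynamo_timed` signature and call sites may differ):

```python
from torch._dynamo.utils import dynamo_timed

def compile_phase():
    # In addition to the usual dynamo_timed accounting, this would also emit a
    # waitcounter named pytorch.dynamo_timed.compile_phase.
    with dynamo_timed("compile_phase", log_waitcounter=True):
        ...  # the timed region
```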

Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-03 19:24:29 +00:00
d35358b271 [FlexAttention] Remove failing num_warps=8 in bwds (#141653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141653
Approved by: https://github.com/BoyuanFeng
2024-12-03 19:22:52 +00:00
9125e9119c Fix memory leak in ModuleTracker (#141960)
Thanks @drisspg and @albanD for finding the fix

**TEST PLAN**
```
import gc
import torch
import torch.nn as nn
from torch.utils.module_tracker import ModuleTracker

class MyModel(nn.Module):
    def forward(self, x):
        return x * x

print(f"torch=={torch.__version__}")
m = MyModel()
m.cuda()
m.to(torch.bfloat16)
mt = ModuleTracker()
for i in range(1000):
    if i % 100 == 0:
        gc.collect()
        print("memory_allocated:", torch.cuda.memory_allocated())
    x = torch.randn([128, 256], device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with mt:
        m(x)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141960
Approved by: https://github.com/albanD
2024-12-03 18:36:15 +00:00
7bb2228ffd Test cpp_wrapper_hipify string comparison (#141353)
Updating the test to match this code that takes device warpsize into account: cf1d95a965/torch/_inductor/codegen/cuda/device_op_overrides.py (L120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141353
Approved by: https://github.com/desertfire
2024-12-03 18:25:32 +00:00
8b5c26287d Initialize lr as a tensor if it is originally a tensor (#141620)
Fix https://github.com/pytorch/pytorch/issues/139575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141620
Approved by: https://github.com/kwen2501
2024-12-03 18:10:23 +00:00
314e08eb52 [fr_trace][bugfix] Log missing ranks when provided (#141924)
Summary: For missing ranks issues, `build_collectives` doesn't log any errors (5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)), which means that when `EntryState.to_collective` is called [here](5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)), errors will be empty and `to_collective` will enter the first if statement. But that codepath doesn't log `missing_ranks`, meaning it will be absent from the `Collective` returned. This diff fixes that oversight.

Test Plan:
eyes

Sandcastle run

Differential Revision: D66679224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924
Approved by: https://github.com/c-p-i-o
2024-12-03 17:54:43 +00:00
5c59f4a55a Remove old FSDP1 fully_shard (#141875)
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.

The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows similar design principles as FSDP1's `fully_shard`.

To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we put the deprecation warning in open source in 2.4 saying it will be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)

We are skipping the PR sanity check because this PR is only removing code, not adding new code, and should not require this sanity check.

Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
2024-12-03 17:00:47 +00:00
ed4831b93c Improve torch.library.opcheck and register_autograd docs (#141883)
Fixes https://github.com/pytorch/pytorch/issues/141618
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141883
Approved by: https://github.com/albanD
ghstack dependencies: #141894, #141880
2024-12-03 16:28:56 +00:00
827c322290 Make torch.library.triton_op public (#141880)
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
   knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.

triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.
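For context, a small end-to-end sketch roughly following the documented usage pattern (the kernel, op name, and block size below are made up; running it requires a GPU and Triton):

```python
import torch
import triton
import triton.language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@triton_op("mylib::add", mutates_args={})
def my_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    # Each triton kernel called from a triton_op must go through wrap_triton.
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out
```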

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
2024-12-03 16:28:56 +00:00
ac600fdce6 Type exposed_in decorator (#141894)
Test Plan:
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141894
Approved by: https://github.com/albanD
2024-12-03 16:28:17 +00:00
7a806a839d [FP8] Expand MaskedSelect to float8 (#141928)
Needed for printing those.
Though I wonder if a better solution would be to change those ops to use element size rather than the actual type (to extend them automatically to unsigned integral types as well).
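After this change, something like the following sketch should work (assuming a build with float8 dtypes available):

```python
import torch

x = torch.randn(4).to(torch.float8_e4m3fn)
mask = torch.tensor([True, False, True, False])
# Before this PR, masked_select was not implemented for float8 dtypes.
print(torch.masked_select(x, mask))
```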

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141928
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-12-03 15:14:26 +00:00
78543e6002 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
2024-12-03 11:17:39 +00:00
9990b47ea3 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-03 09:26:43 +00:00
ff73e2e679 [dynamo] Validate mutation_type and source in VariableTracker.__init__ (#141717)
As title, this also uncovered a few invalid use cases; the cases that
cause errors are fixed in separate patches prior to this one, and the
rest are fixed in this patch.

This patch also moves a few `.source` mutation to variable construction,
to increase the coverage of the validation.

Fixes #133027.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141717
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902, #141716
2024-12-03 09:18:06 +00:00
0efd184685 [dynamo] Fix side effects for range iterator that escapes the graph (#141716)
`wrap_range_iterator` mistakenly used `ValueMutationNew`, when it
should've used `ValueMutationExisting`, because this code path always
has a source.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141716
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902
2024-12-03 09:18:06 +00:00
7c3c8a662e [dynamo] Add RANGE_ITERATOR_MATCH to properly guard on range iterators (#141902)
A subsequent patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an existing issue with guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```

This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
2024-12-03 09:18:06 +00:00
ff3f4a164c [dynamo] Fix aliasing issue for dict.copy that escapes the graph (#141715)
Dynamo accidentally passed the original `ConstDictVariable.source` to
the result of `dict.copy(...)`, which caused an aliasing issue when the
result escapes the graph (e.g., is a return value).

This patch fixes that and adds a regression test.
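A minimal repro sketch of the affected pattern (an assumption about what the regression test covers, not the actual test):

```python
import torch

@torch.compile(backend="eager")
def f(d):
    c = d.copy()
    c["x"] = 1          # must not be visible through the caller's dict
    return c

d = {"x": 0}
out = f(d)
# Expected: False 0; with the aliasing bug, the returned dict could be the caller's dict.
print(out is d, d["x"])
```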

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141715
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714
2024-12-03 09:18:06 +00:00
9eb0520d75 [dynamo] Fix side-effect handling for pre-existing collections.deque (#141714)
Previously we never replayed side effects to `DequeVariable` with a
source; the bug was already in the `test_deque_input` test, but went
unnoticed because we didn't check the deque objects.

This patch adds limited but practical support for this (see comments in
`side_effects.py` for why limited), and updates the deque tests to check
for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141714
Approved by: https://github.com/jansel
ghstack dependencies: #141713
2024-12-03 09:18:06 +00:00
f2ce2d435b [dynamo] Add test for returning a nested recursive function and update documentation (#141713)
Addresses https://github.com/pytorch/pytorch/pull/137905#discussion_r1806923085.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141713
Approved by: https://github.com/jansel
2024-12-03 09:18:06 +00:00
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
e1e3bbc2e1 Set capture_autograd_function=False by default (#141932)
https://github.com/pytorch/pytorch/pull/136959 cleaned up the flag and added a warning. @Chillee pointed out that we should really default this flag to false otherwise we subject all users that go down this code path to log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141932
Approved by: https://github.com/jansel
2024-12-03 07:59:03 +00:00
e499b46465 Speed up half tensors printing (#141927)
This PR removes the copy-cast of reduced precision types to float before printing, which was added in https://github.com/pytorch/pytorch/pull/14418, probably to unblock printing at a time when many operations, like `is_nan` and `max`, were not supported for those types on CPU.

(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

after the PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also, this allows printing 15Gb Metal tensors on a 32GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```

Fp8 dtypes are converted to float16 for printing instead, as the full float range is overkill.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141927
Approved by: https://github.com/ezyang
2024-12-03 07:01:49 +00:00
d035db3d86 [AMD] [submodule] aten.bmm CK-backend prototype (#140758)
Summary:
Early prototype of adding CK backend for aten.bmm. Currently, it is very limited in that:

1. BF16 only
2. A single CK instance
3. NT layout only
4. Alpha=1, Beta=0 only

Reviewed By: xw285cornell, zjing14

Differential Revision: D65954695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140758
Approved by: https://github.com/bradleyhd
2024-12-03 06:54:51 +00:00
6afcec0c58 Assert is GraphModule in compile_fx_aot (#141575)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141575
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-12-03 05:39:44 +00:00
ce86119503 Revert "Set remote cache version and backend type once in compilation metrics (#141707)"
This reverts commit d633cf1f55f87e5536f63981357d543ac46e48f7.

Reverted https://github.com/pytorch/pytorch/pull/141707 on behalf of https://github.com/malfet due to It breaks tests by referencing FbRemoteFxGraphCache, but CI was green ([comment](https://github.com/pytorch/pytorch/pull/141707#issuecomment-2513555185))
2024-12-03 05:01:02 +00:00
2999dbfd21 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 3ab4a28eaa7dc67d5c46c2016bbfe9932b36de06.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Job are failing en masse after this lands, so it looks like a land race ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2513552752))
2024-12-03 04:57:58 +00:00
38bbe37187 Enable CI on SM89 (#140305)
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376

To enable more balanced sharding, had to push 148ae19935

Added `@xfailIfSM89` to the following tests:
 - test_fp8_pattern_2
 - test_original_aten_preserved_split_addmm
 - test_sparse_semi_structured_scaled_mm
 - test_sparse_semi_structured_scaled_mm_fp8
 - test_sparse_fp8fp8_mm

Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`

Skipped following inductor tests (that either flaky OOMs or timeouts):
 - test_reduction_fn_std_float64
 - test_reduction_fn_var_mean_float64
 - test_multi_output_unbacked_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
2024-12-03 04:49:46 +00:00
af88326250 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-03 04:45:05 +00:00
9cfc9e636d [while_loop] change to guard_equals for checking output and carry (#141734)
Inputs with the same shape can be represented with different symbols, e.g.
```python
def body_fn(a, b):
  return b.sin(), a.sin()
```
, where a = torch.randn(3, 4) and b = torch.randn(3, 4). There could be 4 symbols allocated for a and b. So instead of checking that their shape and stride symbols are the same, we just use guard_equals to enforce the constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-12-03 04:00:21 +00:00
871b93bc59 [associative_scan] Fixing shape checks (#141698)
This PR fixes the shape checks that are done in the associative_scan operation.
Previously, all shapes of the input leaves were required to be the same. With this PR, only the shapes of the outputs of the combine_fn need to match those of the corresponding input leaves; the shapes no longer need to match across the input leaves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141698
Approved by: https://github.com/ydwu4
2024-12-03 03:49:11 +00:00
3ab4a28eaa [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-03 03:48:23 +00:00
ecbb8a8800 Mention version of flip in weights_only error message (#141304)
Fixes https://github.com/pytorch/pytorch/issues/141139

How the 3 versions of the error message now look

### Version 1

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
### Version 2

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
```

New error message:
```

_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
 In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.

```

### Version 3

Old error message
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141304
Approved by: https://github.com/zou3519
2024-12-03 03:26:27 +00:00
4cbb3b4bd2 [ROCm] Enable finding HIP and ROCm libraries on Windows (#137279)
This PR introduces support for finding HIP-SDK Libraries on Windows.

Since reading the code changes using the diff view is a bit cumbersome due to the introduced if branch, let me explain what was changed:
- The Linux-specific steps to find HIP packages have been moved into the `if(UNIX)` block
- Windows steps follow in the `else()` clause

The separation was needed, because of several factors:
- HIP SDK for Windows typically names its components using `hip` in their names (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of the HIP SDK, but can be built separately and as of now requires an additional path to be defined using an env var.
- Windows can only find the hip package in a version greater than 1.0, and its libraries, if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later for inclusion of macros such as `hip_add_library` and related macros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
2024-12-03 03:26:01 +00:00
33573488d0 Make Dtypepropagation singleton (#141882)
Should fix compile time regression, it was doing fairly expensive meta programming in init and being instantiated multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141882
Approved by: https://github.com/ezyang
ghstack dependencies: #139945, #140057, #141495
2024-12-03 03:15:16 +00:00
f911361de1 Correctly specify size of sparse_csr tensors in maskedtensor binary ops (#134335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134335
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-03 02:55:57 +00:00
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0 . Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
34127fc688 Only reconstruct dict if needed (#141606)
Fixes #141452

This is a follow-up to PR #134876, which optimized dict reconstruction to codegen only if any value changed. In this PR we cover the general case and do not codegen any instruction if the dictionary remains the same.
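A small sketch of the case this covers (illustrative, not the actual test): the input dict is only read inside the compiled region, so no reconstruction bytecode should be emitted for it.

```python
import torch

@torch.compile(backend="eager")
def f(d, x):
    return x + d["a"]   # d is read but never modified

f({"a": torch.ones(2)}, torch.ones(2))
```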

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141606
Approved by: https://github.com/zou3519
2024-12-03 02:22:34 +00:00
a6bea3d86d Fix DCe in training IR to reflect correct record function op (#141899)
Summary: The exit function is actually exit._recordFunction not exit.default

Test Plan: CI

Differential Revision: D66665359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141899
Approved by: https://github.com/ydwu4
2024-12-03 01:59:37 +00:00
d633cf1f55 Set remote cache version and backend type once in compilation metrics (#141707)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.

(Same problem, different fix of D66502138)

Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
2024-12-03 01:49:11 +00:00
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix this UT failure introduced by https://github.com/pytorch/pytorch/pull/140865. An unrelated failure had been masking it.
It started to happen once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
09ce760fef Revert "Add missing data types at torch export serialization (#138561)"
This reverts commit 1ef1b3b39123255483c51fafbd21217d76e140e7.

Reverted https://github.com/pytorch/pytorch/pull/138561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138561#issuecomment-2513343401))
2024-12-03 01:32:50 +00:00
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
5c33c9202f Skip test_cpu_repo.py::CPUReproTests::test_auto_zvec_vsx_simd on AArch64 (#141155)
The skipping logic clearly states it shouldn't be running on this architecture. The test then fails due to `VecNEON` returning `128` from `bit_width()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141155
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet
2024-12-03 00:19:06 +00:00
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936) (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
e41a0b33ec Allow Fakified subclass to have different device for inner and outer tensor (#141839)
Previously if a wrapper tensor subclass is fakified, the inner tensors would end up having the same device as the outer tensor. This PR makes it so that inner and outer tensors can have different devices.

See OffloadTensor PR https://github.com/pytorch/pytorch/pull/141840/files#diff-3bc0cf540b694f4ec0a3749f78b047456657a53a5657e495ffb68e5970c5fdaaR1955 for an application. A simpler test has been added in this PR.

This is technically bc-breaking because now the callback passed to MetaConverter needs to accept an extra argument, but no one external should be using this anyway?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141839
Approved by: https://github.com/bdhirsh
ghstack dependencies: #141166
2024-12-03 00:09:41 +00:00
9830e7b1e4 Update OpenBLAS to 0.3.28 (#137263)
This includes a number of performance improvements, such as threading optimisations and forwarding GEMM calls to GEMV for calls where N=1 or M=1.

See: https://github.com/OpenMathLib/OpenBLAS/releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137263
Approved by: https://github.com/malfet
2024-12-03 00:05:34 +00:00
9f9105a67b [MPS] Write/Invoke Metal shaders from C++ (#141547)
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`
Add unittests that also serves as an example of how API works

Using this primitive, one can compile and dispatch any 1D or 2D shader over MPS tensor using the following pattern
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
   func->startEncoding();
   func->setArg(0, x);
   func->setArg(1, x.strides());
   func->dispatch({8, 16});
});

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
2024-12-02 23:57:59 +00:00
5c2584a14c [ROCm] Enable inductor GEMM lowering for gfx11 (#141687)
This check doesn't make sense for some AMD GPUs: they have the right number of CUs, but `multi_processor_count` returns WGPs on RDNA even though these GPUs still perform adequately. A lot of tests fail on modern archs because this check makes them default to not using the GEMM backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` returns two outputs: `output` and `buffer`.
On the CPU path, `F.logsigmoid` uses `buffer` to store intermediate values that are reused when computing gradients, so it returns a `buffer` tensor with nonzero size. On the CUDA and XPU paths the buffer is unused, so the `buffer` tensor returned by XPU `F.logsigmoid` has zero size, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case; when a fake tensor on an XPU device reaches it, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the implementation of the concrete Intel XPU tensor. Therefore this PR adds conditions to handle the XPU case and makes sure the two returned buffer sizes match.
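For reference, a small sketch of the op's two outputs on CPU (per the description above, the CUDA/XPU paths return an empty buffer for the same call):

```python
import torch

x = torch.randn(4)
out, buf = torch.ops.aten.log_sigmoid_forward(x)
print(out.shape, buf.numel())  # on CPU, buf is non-empty; on CUDA/XPU it has zero elements
```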

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
74eb92ed6e fix deep copy of empty graph (#141660)
Differential Revision: D66532131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141660
Approved by: https://github.com/ezyang
2024-12-02 22:03:13 +00:00
41e59754b4 [CI] Remove inductor-perf-test-nightly-a10g.yml (#141895)
Summary: Deprecate the A10g nightly perf run. The workflow was introduced as an experiment and doesn't seem to be used by developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141895
Approved by: https://github.com/huydhn
2024-12-02 21:55:20 +00:00
cyy
55250b324d [1/N] Apply py39 ruff fixes (#138578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138578
Approved by: https://github.com/Skylion007
2024-12-02 21:46:18 +00:00
b47bdb06d8 Revert "[inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)"
This reverts commit 942a2438e263a2632b8934dd245060c9b237f4be.

Reverted https://github.com/pytorch/pytorch/pull/141334 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141334#issuecomment-2512891840))
2024-12-02 21:29:02 +00:00
6b05e31042 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 61534391ba8204286f5c9ed15ab636e94bd3daf2.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2512890426))
2024-12-02 21:26:13 +00:00
64d44a39a1 remote_cache: Add a waitcounter for gets and sets (#141307)
This adds a basic waitcounter to help show if we're spending a lot of
time doing gets and sets to remote caches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141307
Approved by: https://github.com/masnesral
2024-12-02 20:48:47 +00:00
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af2ce50560fa5a74473665ea229e6c9d.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
30574380a3 [REFACTOR] Factor _fx_graph_cache_key and _time_taken_ns to common base class (#141878)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141878
Approved by: https://github.com/jamesjwu
ghstack dependencies: #141877
2024-12-02 20:07:12 +00:00
61534391ba [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-12-02 19:48:05 +00:00
fe68f61c59 Migrate micro benchmark results to benchmark database schema v3 (#141745)
Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried.

~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm.  The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745
Approved by: https://github.com/yanboliang
2024-12-02 19:45:51 +00:00
cyy
ab5467897a Fix NOLINTNEXTLINE (#141794)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-02 19:22:00 +00:00
161a2340ee Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-12-02 19:17:30 +00:00
2d708752f0 [dynamo] Remove AutoDerefLocalSource and simplify cell handling (#141629)
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
   removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
   only `reconstruct` implemented, to programmatically enforce that these
   cells should never be used by other components like guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
2024-12-02 19:09:30 +00:00
e14d8c980f [dynamo][NFC] Rename NewCellVariable to CellVariable (#141628)
It was named `NewCellVariable` because we originally used it to
represent cells by the code Dynamo is tracing through. However, now we
use it to represent pre-existing cells as well, so this patch renames it
to avoid confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141628
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-02 19:09:30 +00:00
0989871ac9 pytorch/feature: Record if parallel compile is enabled (#141074)
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.

Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
2024-12-02 19:09:11 +00:00
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It has support for orjson cache serialization, which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
9012e7a62f Revert "[dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)"
This reverts commit 07850bb2c1771ba3f5578b0aa85792e5cd70de1c.

Reverted https://github.com/pytorch/pytorch/pull/137397 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/137397#issuecomment-2511934283))
2024-12-02 16:05:14 +00:00
eb7deb2db5 Revert "Fix NOLINTNEXTLINE (#141794)"
This reverts commit 7dd9b5fc4343d101294dbbab4b4172f2859460bc.

Reverted https://github.com/pytorch/pytorch/pull/141794 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/12087979418/job/33711943084) [HUD commit link](7dd9b5fc43) ([comment](https://github.com/pytorch/pytorch/pull/141794#issuecomment-2511789484))
2024-12-02 15:07:50 +00:00
a34a56f69f Revert "Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)"
This reverts commit 795f28ac552eb61d02ea02fd64637ba814133bd8.

Reverted https://github.com/pytorch/pytorch/pull/141625 on behalf of https://github.com/albanD due to Broken main ([comment](https://github.com/pytorch/pytorch/pull/141625#issuecomment-2511639687))
2024-12-02 14:10:38 +00:00
ec96597e47 Revert "ILP for auto FSDP wrapping (#140298)"
This reverts commit d4cdc098817a0af10b478256b524533ed67285a9.

Reverted https://github.com/pytorch/pytorch/pull/140298 on behalf of https://github.com/xuanzhang816 due to for other PR ([comment](https://github.com/pytorch/pytorch/pull/140298#issuecomment-2511638743))
2024-12-02 14:08:04 +00:00
942a2438e2 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise the mkldnn pattern matcher UTs to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the optional `matcher_count`/`matcher_nodes` params in `_test_common` and make `matcher_check_fn` a required param, as sketched below.
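
A minimal sketch of what such a `matcher_check_fn` could look like, assuming the counters named above are exposed under `torch._dynamo.utils.counters["inductor"]` (the expected values here are illustrative, not taken from the PR):

```python
from torch._dynamo.utils import counters

def matcher_check_fn():
    # Check the specific weight-pack / fusion counters instead of a total matched number.
    assert counters["inductor"]["mkldnn_conv_weight_pack_matcher_count"] == 1
    assert counters["inductor"]["mkldnn_unary_fusion_matcher_nodes"] == 2
```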

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-12-02 08:42:10 +00:00
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we verify whether any node in the current graph shares the same storage as the node we intend to prune. The implementation assumed that when creating the `GraphLowering` in the post-grad phase there would be no `submodules`, and that all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when enabling `FlexAttention`. In that scenario, `submodules` are present as `get_attr` nodes in the post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check to ensure we only compare `get_attr` nodes that correspond to a `torch.Tensor`. It is difficult to reproduce this issue using pure higher-order operators; adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
90f4d60672 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit daed864f7b3ca3b3e64ed13624369fd3007ad47d.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/xuhancn due to need to fix on XPU. ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2510737212))
2024-12-02 07:10:41 +00:00
cyy
8cada5cbe5 Use std::apply (#141834)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141834
Approved by: https://github.com/Skylion007
2024-12-02 05:49:10 +00:00
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the case where there are no kwargs in the config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
daed864f7b export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-02 03:20:29 +00:00
81ab2cc757 Update torch-xpu-ops commit pin (#141201)
Update the torch-xpu-ops commit to [1e32bbc](1e32bbc3d9), which includes:

- Improve XPU aten operator coverage
- Support basic `SparseXPU` operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141201
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-02 01:49:07 +00:00
795f28ac55 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-02 00:35:29 +00:00
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op will generate incorrect output (see the example image in #141471). This PR checks if the op is 3D and then attempts to convert the input tensor to contiguous.

Added a regression test that verifies the output by running the same op on the CPU.
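
A minimal sketch of that regression idea (illustrative only; assumes an MPS device is available):

```python
import torch

conv = torch.nn.Conv3d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 8, 16, 16).to(memory_format=torch.channels_last_3d)

ref = conv(x)                      # CPU reference
out = conv.to("mps")(x.to("mps"))  # previously produced wrong values for channels_last_3d input
torch.testing.assert_close(out.cpu(), ref, rtol=1e-4, atol=1e-4)
```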

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
cyy
96be048f06 [1/N] Avoid copy in std::get (#141812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
394c339691 improve typings in unflatten (#141817)
A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817
Approved by: https://github.com/Skylion007
2024-11-30 22:12:15 +00:00
8a81f7a4b6 Refactor functions in functorch for functional (#141808)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808
Approved by: https://github.com/Skylion007
2024-11-30 20:15:40 +00:00
0f3f801fc2 Add windows CUDA 12.6 nightly builds (#141805)
Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-30 14:39:47 +00:00
eqy
9532589b53 [CUDA][64-bit indexing] Support 64-bit indexing in distribution_elementwise_grid_stride_kernel (#141613)
For #141544
Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements)
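
For context, the kind of workload this targets is an elementwise distribution kernel over a tensor with more than 2**31 - 1 elements (illustrative sketch; needs a CUDA device with a few GB free):

```python
import torch

n = 2**31 + 1                                          # exceeds 32-bit signed indexing
x = torch.empty(n, dtype=torch.float16, device="cuda")
x.uniform_()                                           # hits the distribution elementwise grid-stride kernel
```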

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2024-11-30 06:55:02 +00:00
7fafaa9c82 Introduce CompiledAOTI (#141695)
Stacked on https://github.com/pytorch/pytorch/pull/141691

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691
2024-11-30 00:05:41 +00:00
2f72635a5c automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 22:36:53 +00:00
cyy
e29dabbd71 Fix performance-unnecessary-copy-initialization (#141792)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792
Approved by: https://github.com/Skylion007
2024-11-29 22:10:06 +00:00
a23ac6f8bd [CD] Enable pypi dependencies both for XPU linux and Windows whls (#141135)
Enable xpu runtime pypi packages as dependencies of XPU CD wheels both for Linux and Windows.
Fixes https://github.com/pytorch/pytorch/issues/135867
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141135
Approved by: https://github.com/atalman
2024-11-29 21:35:07 +00:00
44707b0667 Pass rounding_mode for div reference inputs through kwargs (#136308)
Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op.
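
For reference, this is how `rounding_mode` changes `torch.div` results (illustrative, not taken from the PR's test code):

```python
import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor([2.0, 2.0])

torch.div(a, b)                          # tensor([ 3.5000, -3.5000])
torch.div(a, b, rounding_mode="trunc")   # tensor([ 3., -3.])
torch.div(a, b, rounding_mode="floor")   # tensor([ 3., -4.])
```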

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308
Approved by: https://github.com/albanD

Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com>
Co-authored-by: Bob Ren <bobren@meta.com>
Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Co-authored-by: siahuat0727 <tansiahuat@gmail.com>
2024-11-29 21:28:24 +00:00
ed092e2161 [2/N] Rename NCCLTraceBuffer to FlightRecorder (#141712)
Just name change. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141712
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #141648
2024-11-29 21:15:31 +00:00
a8a570512b [export] Generate compatible thrift schema out of schema.py (#141611)
Summary: To make sure schema.py and schema.thrift are kept in sync, we use the int keys from thrift and use Python Annotated type to associate fields between thrift and schema.py. Later we will use this association to build a single source of truth between the schemas.

Test Plan: CI

Differential Revision: D66253157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141611
Approved by: https://github.com/yiming0416
2024-11-29 20:09:49 +00:00
7dd9b5fc43 Fix NOLINTNEXTLINE (#141794)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-29 16:23:59 +00:00
9e98b3d73c Revert "automatic dynamic unspecialize float (#141647)"
This reverts commit 1a32daeb17cd56601c60cb4000a4ef75120af37f.

Reverted https://github.com/pytorch/pytorch/pull/141647 on behalf of https://github.com/atalman due to functorch/test_aotdispatch.py::TestAOTAutogradWithCache::test_inner_grad [GH job link](https://github.com/pytorch/pytorch/actions/runs/12080983316/job/33697901875) [HUD commit link](1a32daeb17) ([comment](https://github.com/pytorch/pytorch/pull/141647#issuecomment-2507980876))
2024-11-29 15:00:33 +00:00
3c63e76b03 [PT2E Quantization] Fix RecursionError when prepare_pt2e graph with concat of the same node (#141651)
Fixes #129038

Related PR #129567

Here is the new PR against main, thanks! @jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141651
Approved by: https://github.com/jerryzh168
2024-11-29 09:19:22 +00:00
ce572fedfc [dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow (#141532)
**Summary**
DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call.
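
A minimal illustration of the fix (the exact call site inside DTensor may differ):

```python
import torch

seed = 2**64 - 1
# torch.tensor([seed])                        # overflows: the value cannot be represented as int64
t = torch.tensor([seed], dtype=torch.uint64)  # explicit dtype avoids the overflow
```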

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220, #141223
2024-11-29 07:59:34 +00:00
93cbb287c2 [dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223)
**Summary**
This PR proposes 4 changes to DTensor RNG management:
1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`.
2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set a different initial seed for each pipeline stage by calling
```
world_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp"),
)
pp_mesh = world_mesh["pp"]
pp_rank = pp_mesh.get_local_rank()
spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd")  # this flattening is only needed if you need to call collective over this mesh
torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh)
```

In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they are responsible for passing in the right value, and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize.

3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen.

4. Enforce that the `manual_seed` call only accepts a "full mesh", i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and that the SPMD ranks have their RNG states synchronized.

**Motivation**
tl;dr

1. Lazily initializing the DTensor RNG tracker causes a hang in non-SPMD code such as Pipeline Parallel.
2. Users may want to set different seeds on ranks within one device mesh.
3. We want to keep the old behavior if users prefer not curating the RNG state and want to have DTensor take care of it.

see detail in https://github.com/pytorch/pytorch/issues/140301

**Test**
`pytest test/distributed/_tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220
2024-11-29 07:59:34 +00:00
7f5bc9dd87 [dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP (#141220)
**Summary**
The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior, but it turns out to be incompatible with PyTorch Distributed FSDP2+TP, so we decided to deprecate it and replace it with `OffsetBasedRNGTracker`, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results).

**Motivation**
`TensorParallelRNGTracker` was designed for DDP+TP, where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However, this does not apply to the new FSDP+TP composable combination, where the model weights are sharded along the data parallel mesh dimension as well. Therefore we decided to remove this outdated RNG tracker type for now. If users need an exact match between PyTorch Distributed and Megatron random number generation results, feel free to file an issue.

**Impact**
`TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. calling `parallelize_module`).

For non-FSDP users, the "replicas get the same random numbers and shards get different ones" behavior remains unchanged. Unlike `TensorParallelRNGTracker`, which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (default value is 1234, but users can call `torch.distributed.tensor._random.manual_seed` to modify it) on all ranks and chooses the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant.

For FSDP2 users, an improvement should be observed: DTensors sharded within the DP group now get different random number samples, which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence.

**Test**
1-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init`

2-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init`

TP model weight init test:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

FSDP+TP model weight init test:
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220
Approved by: https://github.com/wconstab
ghstack dependencies: #141731
2024-11-29 07:59:26 +00:00
c55191f3a2 [dtensor][random] add 1d and 2d model meta init tests (#141731)
**Summary**
Added tests for model meta init on a 1-d mesh (TP) and a 2-d mesh (FSDP+TP). These exercise the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731
Approved by: https://github.com/wconstab
2024-11-29 07:59:20 +00:00
1a32daeb17 automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 07:53:53 +00:00
9827d677b4 [Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 (#141480)
Annotate the linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`.
After `convert_pt2e`, the pattern will be
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-11-29 07:48:39 +00:00
b7a45dbae3 Add monitor script (#141438)
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.

# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging

# Next step
Enable the monitor script in non-perf-test workflows

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
2024-11-29 04:14:31 +00:00
4d5c096a55 [MPS] Add autocast rule for SDPA (#141776)
Fixes #141774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-29 03:34:03 +00:00
b97a786125 Inline compile_to_fn at its only call site (#141691)
Stacked on https://github.com/pytorch/pytorch/pull/141689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688, #141689
2024-11-29 01:15:38 +00:00
9e4723cc6e Unify post_compile1 and CompiledFxGraph constructor (#141689)
Stacked on https://github.com/pytorch/pytorch/pull/141688

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141689
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688
2024-11-29 01:15:38 +00:00
29326b9d29 Hoist post_compile1 into fx_codegen_and_compile (#141688)
Stacked on top of https://github.com/pytorch/pytorch/pull/141685

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141688
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685
2024-11-29 01:15:31 +00:00
cf3daf723f Unify cache disable and cache bypass paths (#141685)
I was constantly annoyed by the fact that we had a separate else branch for when the cache was disabled, distinct from when the cache was bypassed. This diff gets rid of the disabled-cache branch, so we use the same logic for bypass/disable. This change probably didn't matter much for the POC, but I think it's cleaner.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141685
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683
2024-11-29 01:15:24 +00:00
7224cd4471 [BE]: Update 12.6 builds to CUDA 12.6.3 (#141433)
Update CUDA 12.6 to Update 3 and bump cusparse-lt to 0.6.3 (see #141365). I was going to leave some comments on #141365, but thought it was just faster to open a PR here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141433
Approved by: https://github.com/atalman
2024-11-28 22:01:47 +00:00
ae6519cb74 [codemod] c10::string_view -> std::string_view in fields (#141736)
Summary: `c10::string_view` is being removed, so we need to migrate.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D65830276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141736
Approved by: https://github.com/Skylion007
2024-11-28 21:35:53 +00:00
09a3eddc07 Revert #141066 and #141494 (#141721)
manual revert due to merge conflicts

note: #141494 was reverted out of order, blocking automatic revert of #141066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721
Approved by: https://github.com/avikchaudhuri
2024-11-28 20:18:19 +00:00
d08bd6d627 Revert "Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)"
This reverts commit 8a3317cd41d0442d13090932ae5548e7b9fe45bd.

Reverted https://github.com/pytorch/pytorch/pull/141587 on behalf of https://github.com/atalman due to inductor/test_torchinductor_strided_blocks.py::TritonBlockPointerTestGPU::test_expand_broadcast_x_size0_y_size0_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/12072823884/job/33669367764) [HUD commit link](8a3317cd41) ([comment](https://github.com/pytorch/pytorch/pull/141587#issuecomment-2506690095))
2024-11-28 19:41:03 +00:00
907c31f529 [ROCm] devtoolset / GCC11 upgrade on manylinux images - 1b of 2 (docker images) (#141609)
Upgrade gcc version from 9 to 11 on ROCm manylinux images.

Needed for #141423 since the almalinux8-based manylinux2_28 images for ROCm (#140681) install gcc-toolset-9, which provides [gcc 9.2.1](https://pkgs.org/download/gcc-toolset-9-gcc-c++). However, PyTorch's CMakeLists.txt enforces a [minimum gcc version of 9.3](5318bf8baf/CMakeLists.txt (L61)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141609
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2024-11-28 19:18:09 +00:00
f4187050fe [ONNX] Remove special handling of torchvision.ops imports in onnx export (#141569)
Fixes #141568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141569
Approved by: https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
2024-11-28 18:05:40 +00:00
6d204cb5ed Hoist set_feature_use out of conditional, rename some variables (#141683)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141683
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #141681
2024-11-28 17:43:11 +00:00
229daf7470 Inline FxGraphCache.load into its sole call site (#141681)
I need to restructure the body of FxGraphCache.load with the outer if-else in its call site, so inline it goes!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141681
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 17:43:11 +00:00
b9a8df4bdd [CD] Add triton xpu build back (#141775)
The Triton xpu build was stopped temporarily by https://github.com/pytorch/pytorch/pull/139206 to wait for the triton xpu upgrade PR https://github.com/pytorch/pytorch/pull/137886 to land.

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141775
Approved by: https://github.com/atalman
2024-11-28 17:37:42 +00:00
cyy
6b430c26bd Fix bugprone-argument-comment (#141777)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141777
Approved by: https://github.com/Skylion007
2024-11-28 16:56:50 +00:00
8a3317cd41 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-11-28 16:45:25 +00:00
5aacfa037b [Inductor] fix broadcast logic for Triton (#141027) (#141693)
Summary:

Fix the logic for inserting a broadcast in kernels where a load goes directly to a store. In that case, we insert a tl.broadcast on the store, regardless of the block size on the load. When the broadcast is not required, the downstream Triton compiler is expected to remove the no-op broadcast instruction.

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
2024-11-28 16:38:25 +00:00
f684dbd002 Try to simplify FloorDiv axioms implications when needed during evaluations. (#141267)
Summary:
This is very much the same solution proposed by bobrenjc93, except that it restricts it to expressions and axioms that have FloorDiv, since those are the only ones that could have become CleanDiv and the only ones that can change as the shape env changes.

This also does not break the torchrec benchmarks. It might be worth knowing why the generalization of this does break the torchrec benchmarks, but we could just be hitting another bug or NYI situation.

Overhead?
None on
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

Differential Revision: D66307433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141267
Approved by: https://github.com/ezyang
2024-11-28 15:35:35 +00:00
d49f0bf466 [CI] Fix xpu linux ci build environment duplicated issue (#141546)
We found that there are duplicated build environments in the XPU Linux CI tests, which could lead to test jobs downloading the wrong PyTorch build artifact file. Refer to https://github.com/pytorch/pytorch/actions/runs/12023238798/job/33518351906#step:14:633

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141546
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-28 14:21:21 +00:00
0f261e8f77 Add Manylinux2014 and Manylinux 2.28 config to triton builds. Run auditwheel on triton binaries (#141704)
This PR combines Manylinux 2_28 and Manylinux 2014 builds of triton under one workflow. This is required in order to support torch CPU, CUDA 11.8, and CUDA 12.4 wheels built with Manylinux 2014 and torch CUDA 12.6 wheels built with Manylinux 2_28.

Manylinux 2014 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl``
Manylinux 2_28 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-28 13:40:39 +00:00
f83361b274 inductor dtype propagation fixes (#141495)
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but that is not done in this PR.
- Bug fix in view dtype: we were always upcasting back to fp32 when the input was in bf16/fp16; we should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring fewer explicit upcast_compute_types calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it in this PR.
- Be more consistent on our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00
1ef1b3b391 Add missing data types at torch export serialization (#138561)
Related to #131654

Added missing FP8 data types to torch export serialization.
Added test cases for FP8 data types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138561
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-11-28 08:35:03 +00:00
5212ec3879 Add admonition about as_float_unchecked() (#141742)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141742
Approved by: https://github.com/bdhirsh
2024-11-28 06:25:18 +00:00
d905f1350a Friendly catch exception when fail to initialize XPU devices (#141658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141658
Approved by: https://github.com/EikanWang
2024-11-28 05:17:08 +00:00
60fe50aa42 Move post compile steps into post_compile1/post_compile2 method (#141656)
The intention for turning these into methods is so that the AOTInductor compile path can implement them differently. I haven't worked out the implications yet though, but this seemed like a good stopping point for now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141656
Approved by: https://github.com/aorenste, https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 04:45:40 +00:00
9f48881ba8 [BE]: Enable RUF013 ban implicit optional (#141706)
Enables RUF013 rule to ban implicit Optional (from areas not already checked by mypy).
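
For context, RUF013 flags parameters whose default is `None` but whose annotation does not say so (illustrative example, not from this PR):

```python
from typing import Optional

def bad(name: str = None):             # implicit Optional -- flagged by RUF013
    ...

def good(name: Optional[str] = None):  # explicit Optional -- accepted
    ...
```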

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141706
Approved by: https://github.com/ezyang
2024-11-28 04:03:01 +00:00
b33f770574 Revert "[inductor] Fix 3d tiling (#141709)"
This reverts commit ca9bfa1a384ed6871d4b1874bae81e72c747fd11.

Reverted https://github.com/pytorch/pytorch/pull/141709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is one failed test showing up in trunk.  It was missed by target determination ([comment](https://github.com/pytorch/pytorch/pull/141709#issuecomment-2505213481))
2024-11-28 03:55:31 +00:00
3becdaf8a7 [c10] Fix static_assert for 32-bit systems (#141244)
The `__ANDROID__` macro was used as a proxy to check whether compilation is targeting a 32- or 64-bit system, causing build failures on non-Android 32-bit Linux targets like ARMv7.

This modification adjusts the check to fail if and only if `int64_t` and `long` are not the same type on 64-bit systems, i.e. systems where `sizeof(void*) == 8`.

Like I said in issue #141043, I'm not sure whether a different `Scalar` constructor should be defined in the 32-bit case. My code does not break, but I'm not sure other people's code won't.

Fixes #141043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141244
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-28 03:11:52 +00:00
54d26d670e [CP] Add assertion for unsupported load-balance + non-causal (#141622)
We actually do not support load-balance mode when non_causal = True, due
to changes in data shuffling for load_balance mode.  This PR just adds
an assertion to make this limitation clear.

Fixes #141429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141622
Approved by: https://github.com/XilunWu
2024-11-28 02:52:35 +00:00
b556549357 Use default context on Windows for Intel GPU (#138049)
# Motivation
Use the default context on Windows to keep consistency with Linux. This makes it easy to interact with external libraries like `dlpack`.
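
An illustrative sketch of the kind of interop this enables (assumes an XPU build; not taken from the PR):

```python
import torch
from torch.utils.dlpack import to_dlpack, from_dlpack

x = torch.randn(4, device="xpu")
capsule = to_dlpack(x)       # with a single default context, the capsule's memory
y = from_dlpack(capsule)     # is usable by any consumer that shares that context
assert y.device == x.device
```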

# Additional Context
This PR depends on Intel GPU oneAPI 2025.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138049
Approved by: https://github.com/gujinghui
2024-11-28 02:49:46 +00:00
a8482ab3a8 [Reland] Enable XPUEvent elapsed_time function (#140873)
# Motivation
This PR intends to reland https://github.com/pytorch/pytorch/pull/134666 that has been reverted in https://github.com/pytorch/pytorch/pull/140872
We reverted it because I forgot to support `elapsed_time` for `XPUGuardImpl`, which resulted in `c10::Event` not supporting `elapsed_time` and blocked XPU CI.

# Additional Context
We split https://github.com/pytorch/pytorch/pull/134666 into two parts: one part, PR #140865, supports `elapsed_time` for `torch.Event`, and the other, this PR, adds support for `torch.xpu.elapsed_time`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140873
Approved by: https://github.com/gujinghui
ghstack dependencies: #140865
2024-11-28 02:41:11 +00:00
b1a8be6b0a Support torch.Event elapsed_time method on XPU (#140865)
# Motivation
This PR aims to support the c10::Event/torch.Event `elapsed_time` method on XPU. We create a profiling-tag Event when the timing flag is enabled.
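
A minimal sketch of the timing idiom this enables (assumes an XPU device; mirrors the familiar CUDA event pattern):

```python
import torch

start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)

start.record()
y = torch.randn(1024, 1024, device="xpu") @ torch.randn(1024, 1024, device="xpu")
end.record()
torch.xpu.synchronize()
print(f"matmul took {start.elapsed_time(end):.3f} ms")
```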

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140865
Approved by: https://github.com/Samkm0084, https://github.com/gujinghui
2024-11-28 02:41:11 +00:00
d70b7029c8 [MTIA] Support torch.mtia.empty_cache() (#141533)
Summary: As title

Test Plan:
Passed a local unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testrun/4785074861101240

Reviewed By: nautsimon

Differential Revision: D66481778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141533
Approved by: https://github.com/nautsimon
2024-11-28 02:24:19 +00:00
f35bb55256 Update triton xpu commit pin (#137886)
# Motivation
Due to the code change of https://github.com/pytorch/pytorch/pull/135567, triton-xpu needs to fetch `tensor.data_ptr()` via `uint64` instead of `int64`, refer to https://github.com/intel/intel-xpu-backend-for-triton/pull/2192

# Additional Context
triton commit comes from release branch: https://github.com/intel/intel-xpu-backend-for-triton/tree/release/3.2.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137886
Approved by: https://github.com/EikanWang, https://github.com/atalman
ghstack dependencies: #135567
2024-11-28 02:01:52 +00:00
ac0b0d11ab [Reland] Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted as a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)), which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantics of `wrap`.

# Additional Context
This PR had been reverted (relanded in place with no further changes; revert commit 2e8d431a8f) due to the change of `tensor.data_ptr()`, which needs to be synced to the Intel XPU Triton side, see [#2192](https://github.com/intel/intel-xpu-backend-for-triton/pull/2192). So we have to update the XPU Triton commit pin together with this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-11-28 02:01:52 +00:00
cyy
5ca75ac1df Enable UBSAN tests (#141672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141672
Approved by: https://github.com/ezyang
2024-11-28 01:55:15 +00:00
ca9bfa1a38 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-11-28 01:34:28 +00:00
ad3986498a [Partitioner] Speed up the update of partition map (#136616)
We can update the partition map by iterating over the direct users of a node rather than all of its downstream users. The former is faster than the latter, which performs many duplicate insertions.
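
A rough sketch of the difference (hypothetical helper names, assuming FX-style nodes that expose `.users`):

```python
def all_downstream_users(node):
    # Old approach: walk every transitive user, revisiting and re-inserting nodes many times.
    seen = set()
    stack = list(node.users)
    while stack:
        user = stack.pop()
        if user not in seen:
            seen.add(user)
            stack.extend(user.users)
    return seen

def direct_users(node):
    # New approach: only the immediate users are needed to update the partition map.
    return set(node.users)
```
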
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136616
Approved by: https://github.com/jgong5, https://github.com/tarun292
2024-11-28 01:11:44 +00:00
cyy
45ed7c13fa Remove unneeded std::make_optional (#141567)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141567
Approved by: https://github.com/albanD
2024-11-28 00:05:21 +00:00
fea771dcce Revert "Install magma from a tarball (#140417)"
This reverts commit 30ab10247d07d1682388313c3982e05dd73a055c.

Reverted https://github.com/pytorch/pytorch/pull/140417 on behalf of https://github.com/atalman due to Caused failures in calculate docker image ([comment](https://github.com/pytorch/pytorch/pull/140417#issuecomment-2504968996))
2024-11-27 23:22:43 +00:00
e24190709f [BE] Remove Model Dump utility (#141540)
So I found this utility by accident, trying to find how many html files we have in the repo so I could convert them to markdown

Turns out we package some HTML and JS files in PyTorch to visualize TorchScript models. This seems kinda strange and probably shouldn't be in core, so I removed the tests I could find. Maybe some internal tests will break, but considering TorchScript is being superseded, it might make sense to do this.

The last meaningful update to the test for this file was about 2 years ago by @digantdesai; since then it's been a bunch of routine upgrades.

It seems like this package is unused: https://github.com/search?type=code&auto_enroll=true&q=torch.utils.model_dump&p=1. I skimmed through 5 pages of results, and the only time this shows up in code search is when someone is either cloning PyTorch or checking their venv into GitHub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141540
Approved by: https://github.com/malfet
2024-11-27 22:52:55 +00:00
533798ef46 [dynamo] Enforce some invariants on ConstantVariable.create (#140984)
This addresses https://github.com/pytorch/pytorch/pull/140745#issuecomment-2480854259.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140984
Approved by: https://github.com/jansel
ghstack dependencies: #141504
2024-11-27 21:58:35 +00:00
3141e038f0 [dynamo] Fix VariableBuilder._wrap on frozenset and enforce invariants on ConstantVariable (#141504)
Prior to this patch, we are using `ConstantVariable.create` to create VT
for frozenset objects, and intended yet failed to predicate that on all
items being literals (see https://github.com/pytorch/pytorch/pull/140984#discussion_r1847393736).

The code was from https://github.com/pytorch/torchdynamo/commit/7c03434 and
the original goal was to help DBR quantization, but as the new test in
this patch shows, it could lead to silent incorrectness.

Upon a closer look, this exposes some subtleties in how Dynamo handles
`ConstantVariable` and `LOAD_CONST`, so this patch both fixes the
aforementioned issue and documents, enforces, and makes explicit the
invariants around `ConstantVariable` and `LOAD_CONST` -- only immutable
objects are supported.

Specifically, this patch:
1. refine the checks for wrapping a `frozenset` object, document why we
   can't just wrap its items directly due to lack of `Source` for set
   items, and use a safe workaround (`SourcelessBuilder`) to ensure
   soundness while keeping the DBR quantization support.
2. Adds more types to `common_constant_types`, thereby making
   `ConstantVariable.is_base_literal` more lenient, and strictly checks
   this property in the constructor of `ConstantVariable`.
3. Change relevant uses of `create_instruction("LOAD_CONST", ...)` to
   `create_load_const` which checks `is_safe_constant`, and makes
   developer overrides explicit by using `create_load_const_unchecked`
   when needed.
4. In a few places, use more specific `VariableTracker`, e.g.,
   `TypingVariable` rather than `ConstantVariable`, and
   `FrozensetVariable` rather than `SetVariable`.

(2) and (3) are mainly to future-proof Dynamo against bugs like (1).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141504
Approved by: https://github.com/jansel
2024-11-27 21:58:35 +00:00
a962ae511d Extend gpt-fast LLM dashboard to support torchao autoquant (#140627)
Summary:
We want to test autoquant on relevant LLM models

right now only llama2 and mixtral, but want to extend to more models like https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models

Test Plan:

```
                      Llama-2-7b-chat-hf   Mixtral-8x7B-v0.1
gpt-fast int8                     112.98              147.92
torchao autoquant                  87.41               85.90
torchao autoquantv2               131.12               79.59
```

https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch

in pytorch/benchmarks/gpt_fast
```
python benchmark.py
```

output:
```
Loading model Llama-2-7b-chat-hf
Using int8 weight-only quantization!
Time to load model: 2.80 seconds
Compilation time: 170.24 seconds
Average tokens/sec: 112.98 tokens/sec
Average bandwidth achieved: 746.86 GB/s
Memory used: 7.95 GB

Loading model Mixtral-8x7B-v0.1
Using int8 weight-only quantization!
Time to load model: 0.24 seconds
Compilation time: 181.81 seconds
Average tokens/sec: 147.92 tokens/sec
Average bandwidth achieved: 953.06 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Using autoquant
Compilation time: 109.31 seconds
Average tokens/sec: 87.17 tokens/sec
Average bandwidth achieved: 1151.86 GB/s
Memory used: 32.45 GB

Loading model Llama-2-7b-chat-hf
Time to load model: 0.11 seconds
Compilation time: 48.08 seconds
Average tokens/sec: 87.41 tokens/sec
Average bandwidth achieved: 1155.05 GB/s
Memory used: 36.86 GB

Loading model Mixtral-8x7B-v0.1
Time to load model: 0.20 seconds
Using autoquant
Compilation time: 47.32 seconds
Average tokens/sec: 85.90 tokens/sec
Average bandwidth achieved: 1106.37 GB/s
Memory used: 66.81 GB

local test (autoquant v2):
Loading model Mixtral-8x7B-v0.1
Compilation time: 124.40 seconds
Average tokens/sec: 90.41 tokens/sec
Average bandwidth achieved: 1164.47 GB/s
Memory used: 53.91 GB

Loading model Llama-2-7b-chat-hf
TODO

```

gpt_fast_benchmark.csv:
```
name,metric,target,actual,dtype,device,arch,is_model
Llama-2-7b-chat-hf,token_per_sec,144,112.98,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,746.86,int8,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,170.24,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,147.92,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,953.06,int8,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,181.81,int8,cuda,NVIDIA PG509-210,True
gemv,memory_bandwidth(GB/s),870,867.06,int8,cuda,NVIDIA PG509-210,False
gemv,memory_bandwidth(GB/s),990,1092.43,bfloat16,cuda,NVIDIA PG509-210,False
layer_norm,memory_bandwidth(GB/s),950,573.57,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,144,87.17,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,1151.86,autoquant,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),136,109.31,autoquant,cuda,NVIDIA PG509-210,True
gather_gemv,memory_bandwidth(GB/s),990,945.38,int8,cuda,NVIDIA PG509-210,False
gather_gemv,memory_bandwidth(GB/s),1060,1188.29,bfloat16,cuda,NVIDIA PG509-210,False
mlp_layer_norm_gelu,flops_utilization,0.8,0.82,bfloat16,cuda,NVIDIA PG509-210,False
Llama-2-7b-chat-hf,token_per_sec,94,87.41,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,1155.05,bfloat16,cuda,NVIDIA PG509-210,True
Llama-2-7b-chat-hf,compilation_time(s),133,48.08,bfloat16,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,token_per_sec,175,85.90,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,1106.37,autoquant,cuda,NVIDIA PG509-210,True
Mixtral-8x7B-v0.1,compilation_time(s),133,47.32,autoquant,cuda,NVIDIA PG509-210,True
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140627
Approved by: https://github.com/huydhn
2024-11-27 21:57:48 +00:00
30ab10247d Install magma from a tarball (#140417)
Magma is built for specific CUDA versions and stored in the ossci-linux bucket. Install it from there rather than the deprecated conda package.

There are two places where magma is installed today:
- `install_conda.sh`: extract the magma package in the same exact location where conda would install it, using a dedicated `install_magma_conda.sh` script. The new script is included in the relevant Dockerfiles where CUDA+magma is needed
- `install_magma.sh`: this script already uses a tarball. Use the new tarball instead of the tarball from the conda package. The format of the new tarball is compatible with the old one, so changes here are minimal:wq

Fixes #140538
Test PR: https://github.com/pytorch/pytorch/pull/141584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140417
Approved by: https://github.com/atalman
2024-11-27 21:56:20 +00:00
02b52572db Lint: switch oncall owner for test_transformers (#141722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141722
Approved by: https://github.com/malfet
2024-11-27 21:45:43 +00:00
5f004f455a [Dynamo][Distributed] Fix ProcessGroup getattr (#141638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141638
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-11-27 21:42:33 +00:00
dbbebee9d7 Code motion CompiledFxGraph to a dedicated file (#141654)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141654
Approved by: https://github.com/aorenste, https://github.com/jansel
ghstack dependencies: #141491, #141492, #141574
2024-11-27 20:42:21 +00:00
a7ca6a9113 Enable autograd cache on inductor tests (#140890)
This turns on AOTAutogradCache for all inductor tests. It clears AOTAutogradCache on each test as well, by virtue of the local cache using the same directory to store cache entries.

I've also tested with INDUCTOR_TEST_DISABLE_FRESH_CACHE=1, running all the tests. AOTAutogradCache successfully caches 99% of these. There are a few tests that use view_replay and therefore save functional tensors, which cause AOTAutogradCache to fail to pickle its result. Will look into next steps there, but for now, it seems okay if the cache just misses on those cases where it can't serialize the result. It would be better to check before pickling, though.

I've made the following small bugfixes to get this working:
- Inductor is sometimes used in a standalone mode without dynamo, which leads to attribute errors in check_can_cache. In general, we should *never* crash in cache checking, only bypass. So I changed a try/except to catch Exception instead of just a specific exception.
- Add extra structured logging for metadata on cache hits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140890
Approved by: https://github.com/bdhirsh
2024-11-27 20:41:43 +00:00
ab63b679e9 Save indexing for getitem nodes when do custom replacements (#140193)
Fixes #137280

When we have multiple indexings of the same array as returned items in a pattern replacement, we shouldn't ignore their indexing numbers; otherwise, we may create a wrong pattern_to_node mapping.

A unit test is added in this PR. In this unit test, the function `rms_pattern_static` is replaced with `rms_replacement_static` when called. The function `rms_pattern_static` calls two functionalized custom operators, `torch.ops.vllm.rms_norm.default` and `torch.ops.vllm.static_scaled_int8_quant.default`, and it returns at2[1] and at2[2] as outputs. The function `rms_replacement_static` calls one functionalized custom operator `torch.ops.vllm.fused_rms_norm_quant_static.default`, which returns two corresponding items.

Run `python test/inductor/test_pattern_matcher.py -k test_multioutput_register_replacement` to test. After setting `TORCH_COMPILE_DEBUG` to 1, the final part of `fx_graph_readable.py` looks like the following.
```python
# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1673 in rms_pattern_static, code: at1 = auto_functionalized(
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.rms_norm.default, result = permute_1, input = convert_element_type, weight = convert_element_type_1, epsilon = 1e-06);  permute_1 = convert_element_type = convert_element_type_1 = None
getitem_1: "bf16[5, 4]" = auto_functionalized[1];  auto_functionalized = None

# File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1680 in rms_pattern_static, code: at2 = auto_functionalized(
auto_functionalized_1 = torch.ops.higher_order.auto_functionalized(torch.ops.vllm.static_scaled_int8_quant.default, result = permute, input = getitem_1, scale = full_default, azp = None);  permute = getitem_1 = full_default = None
getitem_3: "i8[5, 4]" = auto_functionalized_1[1]
getitem_4: "f32[1, 1]" = auto_functionalized_1[2];  auto_functionalized_1 = None
return (getitem_3, getitem_4)
```
This happens before pattern matching, so it is expected to call `static_scaled_int8_quant` and `rms_norm` and return items of auto_functionalized_1 as outputs.

However, with PyTorch before this PR, `fx_graph_transformed.py`, which is generated after pattern matching, has the following code.
```python
 # File: /home/yhao/p9/pytorch/test/inductor/test_pattern_matcher.py:1748 in my_func_static, code: scale = torch.ones((1, 1))
full_default: "f32[1, 1]" = torch.ops.aten.full.default([1, 1], 1, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False)

# No stacktrace found for following nodes
as_strided_default: "i8[20]" = torch.ops.aten.as_strided.default(permute, [20], [1], 0)
clone_default: "i8[20]" = torch.ops.aten.clone.default(as_strided_default);  as_strided_default = None
as_strided_default_1: "i8[5, 4]" = torch.ops.aten.as_strided.default(clone_default, [5, 4], [4, 1], 0);  clone_default = None
as_strided_default_2: "f32[1]" = torch.ops.aten.as_strided.default(full_default, [1], [1], 0)
clone_default_1: "f32[1]" = torch.ops.aten.clone.default(as_strided_default_2);  as_strided_default_2 = None
as_strided_default_3: "f32[1, 1]" = torch.ops.aten.as_strided.default(clone_default_1, [1, 1], [1, 1], 0);  clone_default_1 = None
static_scaled_int8_quant_default = torch.ops.vllm.static_scaled_int8_quant.default(as_strided_default_1, permute_1, as_strided_default_3);  as_strided_default_1 = permute_1 = static_scaled_int8_quant_default = None
fused_rms_norm_quant_static_default = torch.ops.vllm.fused_rms_norm_quant_static.default(permute, convert_element_type, convert_element_type_1, full_default, None, 1e-06);  convert_element_type = convert_element_type_1 = full_default = fused_rms_norm_quant_static_default = None
return (permute, as_strided_default_3)
```
Here, it returns `(permute, as_strided_default_3)`, where `permute` is written by `fused_rms_norm_quant_static` and `as_strided_default_3` is written by `static_scaled_int8_quant`. This is wrong: `static_scaled_int8_quant` should have been removed, since it is replaced with `fused_rms_norm_quant_static`, and the graph is supposed to return `(permute, full_default)`.

The root cause is the following. When we [generate patterns](5f4a21dc58/torch/_inductor/pattern_matcher.py (L1580)) from the traced fx graph and call the function below, the integer indexing numbers in the traced graph are dropped because `int` is included in `ignore_types`. As a result, the final arguments of the patterns for those two output items look like `(CallFunction(auto_functionalized, XXX), *)`.

5f4a21dc58/torch/_inductor/pattern_matcher.py (L1839-L1847)

When we do pattern matching after generating the patterns in the part below, `sorted(itertools.chain.from_iterable(nodes), reverse=True)` is `[getitem_4, getitem_3, getitem_1]`. The iteration for `getitem_4` is always a FailedMatch, because we always use the first element to do the pattern match here (it fails in different match functions before and after this PR, but the reason is always the indexing-number issue): d4cdc09881/torch/_inductor/pattern_matcher.py (L848). However, when we do pattern matching for `getitem_3`, the `child_match` returns a match for `getitem_3` again, because the `*` pattern can match anything. The pattern matching for `getitem_3` then returns `[getitem_3, getitem_3]` as outputs, which is wrong.
d4cdc09881/torch/_inductor/pattern_matcher.py (L856)

d4cdc09881/torch/_inductor/pattern_matcher.py (L1750-L1774)

This PR stops ignoring the `int` type when generating patterns for getitem functions, because integer indexing numbers are important to them. The indexing information is therefore kept in the patterns, ensuring correct matching. With this PR, the `child_match` above returns a match for `getitem_4`, and the pattern matching for `getitem_3` returns the correct `[getitem_3, getitem_4]`.
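
The effect can be sketched outside the pattern-matcher API (a conceptual toy, not inductor code): once the integer index is kept, the two output patterns stop matching the same getitem node.

```python
import operator

# A toy "pattern" is (fn, source, index); index=None plays the role of the old wildcard.
def matches(pattern, node):
    fn, src, idx = pattern
    return node[0] is fn and node[1] == src and (idx is None or idx == node[2])

getitem_3 = (operator.getitem, "auto_functionalized_1", 1)

old_patterns = [(operator.getitem, "auto_functionalized_1", None)] * 2   # index ignored
new_patterns = [(operator.getitem, "auto_functionalized_1", 1),
                (operator.getitem, "auto_functionalized_1", 2)]          # index kept

print([matches(p, getitem_3) for p in old_patterns])  # [True, True]  -> ambiguous match
print([matches(p, getitem_3) for p in new_patterns])  # [True, False] -> index disambiguates
```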

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140193
Approved by: https://github.com/eellison
2024-11-27 20:19:13 +00:00
b37cfddeb3 Refactor ShapeGuardPrinter for future C++ addition (#140968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140968
Approved by: https://github.com/anijain2305
ghstack dependencies: #140597
2024-11-27 20:09:58 +00:00
e5d02e0cfb Fix non-determinism in the partitioner (#141682)
When multiple nodes have the same size and are part of `banned_nodes` (which is a `set`, not a `list`), the partitioner is non-deterministic because it sorts only by node size.

This PR fixes this by also sorting by node name.
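
A minimal illustration of the idea (hypothetical node objects, not the partitioner's real data structures):

```python
from collections import namedtuple

Node = namedtuple("Node", ["name", "size"])
banned_nodes = {Node("add_2", 128), Node("mul_1", 128), Node("relu_3", 64)}

# Sorting only by size leaves equal-size nodes in set-iteration order, which can
# change between runs; breaking ties by name makes the order reproducible.
deterministic = sorted(banned_nodes, key=lambda n: (n.size, n.name))
print([n.name for n in deterministic])  # ['relu_3', 'add_2', 'mul_1'] on every run
```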

It would be good to add some tests, but I'm not sure about the best way to do it here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141682
Approved by: https://github.com/Chillee, https://github.com/yf225
2024-11-27 19:33:15 +00:00
8c90a9a030 Revert "fix non termination in unflatten + state (#141494)"
This reverts commit 5d7c3701e40374113921771097ebc65d9c2876bf.

Reverted https://github.com/pytorch/pytorch/pull/141494 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/141494#issuecomment-2504639230))
2024-11-27 19:30:55 +00:00
63e3cfc00b Enable both training and inference perf benchmark on GPU by default when using workflow dispatch (#141708)
Feedback from @bobrenjc93: while https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml doesn't run inference perf by default, the dashboard shows that mode on its landing page https://hud.pytorch.org/benchmark/compilers. This is a source of confusion because folks won't see their branches unless they choose the correct mode.

IMO, it makes sense to run both training and inference by default when using workflow dispatch. This ensures that the branch will show up in both modes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141708
Approved by: https://github.com/bobrenjc93
2024-11-27 19:00:25 +00:00
53f8a5fde2 [FR] Include mismatch rank into mismatch_collectives and update log message (#141631)
Summary: We want to return the mismatch ranks info in the `mismatch_collectives` field. Also update the logging message when no error is found and it's not partial analysis.

Test Plan: CI

Differential Revision: D66522602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141631
Approved by: https://github.com/c-p-i-o
2024-11-27 18:57:21 +00:00
17fd53d8e5 [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update a donated buffer when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-27 18:51:52 +00:00
75fbcc5743 [ARM] Expand linux aarch64 unit test list (#140799)
Expand the list of unit tests for test_linux_aarch64

These have been verified externally as passing on neoverse n1 and v1 based machines.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140799
Approved by: https://github.com/snadampal, https://github.com/malfet
2024-11-27 18:43:55 +00:00
ad39a2fc46 [1/N] Decouple Flight Recorder from NCCL utils (#141648)
Part of the effort to make Flight Recorder device agnostic.

Step 1: Move it out of NCCLUtils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141648
Approved by: https://github.com/fduwjj
2024-11-27 18:29:42 +00:00
fd553b9817 Add remaining method and tests for dtype propagation (#140057)
Adds the remaining unimplemented ops as well as an assertion failure if someone adds a new op without a dtype rule.

We test all unique pointwise operators registered as lowerings which have an opinfo. There will be some follow ups for this to work well with both `codegen_upcast_to_fp32` as True and False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140057
Approved by: https://github.com/arui-meta, https://github.com/blaine-rister, https://github.com/ezyang
ghstack dependencies: #139945
2024-11-27 17:06:44 +00:00
566ceb3e7e Refactor dtype propagation (#139945)
A couple changes.

- Tries to reuse dtype propagation rules that were already registered in inductor. These were present both in `pointwise_overrides_data` and in the `boolean_ops` list. Additionally, the registration of pointwise ops already specified dtype propagation rules. Saves those registrations and reuses them later.

- Factors out `get_promoted_dtype`, which uses `functools.lru_cache` and takes non-CSEVariable args, because CSEVariable args will not work with the functools cache.

Tests get added later in the stack when everything is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139945
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
2024-11-27 16:57:02 +00:00
8012ff96ba [MPS] Add MetalShaderLibrary::getFunctionNames() (#141499)
That returns the names of all the functions in a shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141499
Approved by: https://github.com/manuelcandales, https://github.com/Skylion007
ghstack dependencies: #141474, #141475, #141476, #141477
2024-11-27 16:53:38 +00:00
381213ee8a test_torchinductor: Improve cpp_wrapper skip message (#141176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141176
Approved by: https://github.com/desertfire
2024-11-27 16:35:54 +00:00
893ca5f671 Remove check_metrics_vec_kernel_count from test_cpu_repro.py::CPUReproTests::test_transpose_non_contiguous (#141246)
The test was initially added due to accuracy issues, which are sufficiently covered by the `self.common(fn, (x,))` assertion.

Unfortunately, the kernel-count check fails due to tiling logic for the 128-bit vector size, which is outside the scope of this test, so the assertion was overly specific.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141246
Approved by: https://github.com/desertfire
2024-11-27 16:04:21 +00:00
b75bb64eb4 [AOTI XPU] Rename test_cuda_cpp_wrapper.py to test_gpu_cpp_wrapper.py, (#135320)
[Inductor] Rename test_cuda_cpp_wrapper.py to test_gpu_cpp_wrapper.py, since the test suite is shared by cuda and xpu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135320
Approved by: https://github.com/jansel, https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #135318
2024-11-27 14:08:06 +00:00
7ea0da2d57 Modest code motion in compile_fx (#141574)
Do code review with whitespace changes off. Check comments for what I changed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141574
Approved by: https://github.com/bobrenjc93, https://github.com/jansel
ghstack dependencies: #141491, #141492
2024-11-27 13:38:14 +00:00
4ae1c4cbb5 Implement nonzero for large inputs (#141592)
Fixes #51871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141592
Approved by: https://github.com/ezyang
2024-11-27 10:19:53 +00:00
aa827e319e [Inductor][CPP] Extract common functions to be reused in other CPP Template (#141554)
**Summary**
Extract common internal functions from the GEMM template into public functions, so these functions can be reused by the subsequent group GEMM template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141554
Approved by: https://github.com/jgong5
2024-11-27 09:52:18 +00:00
763038db66 Clarify torch.arange floating-point rounding behavior (#141655)
Added documentation note clarifying the rounding behavior of `torch.arange` when using floating-point dtypes, particularly for reduced precision types like `bfloat16`. This helps users understand potential issues like repeated values and provides guidance on using integer dtypes for precise sequences.
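
For instance (illustrative; exact outputs depend on the build):

```python
import torch

# bfloat16 has ~8 bits of mantissa, so around 2.0 its spacing (~0.0156) is larger
# than the requested step and consecutive values may round to the same number.
print(torch.arange(2.0, 2.1, 0.01, dtype=torch.bfloat16))  # may contain repeated values

# Recommended for precise sequences: generate with an integer dtype, then scale.
print(torch.arange(200, 210) / 100.0)
```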

## Changes
- Added explanatory note about floating-point rounding behavior and its effects
- Included specific mention of `bfloat16` dtype issues
- Added recommendation to use integer dtypes for precise sequences

Fixes [#137774](https://github.com/pytorch/pytorch/issues/137774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141655
Approved by: https://github.com/cpuhrsch
2024-11-27 09:31:39 +00:00
43a2a231d3 Support linear/BN fusion and follow the API guideline (#141585)
The current `fuse` function supports conv/BN fusion only. This commit adds support for linear/BN fusion as well. Changes to follow the API guidelines are also applied.
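
For reference, the fusion arithmetic itself can be sketched standalone (this is not the API added by the PR; it assumes a `BatchNorm1d` in eval mode, which is just an affine map that can be folded into the preceding `Linear`):

```python
import torch

def fuse_linear_bn_weights(w, b, bn):
    # Fold y = gamma * (Wx + b - mean) / sqrt(var + eps) + beta into a single Linear.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    return w * scale.unsqueeze(1), (b - bn.running_mean) * scale + bn.bias

lin = torch.nn.Linear(4, 3)
bn = torch.nn.BatchNorm1d(3).eval()
fused_w, fused_b = fuse_linear_bn_weights(lin.weight, lin.bias, bn)

x = torch.randn(2, 4)
assert torch.allclose(bn(lin(x)), torch.nn.functional.linear(x, fused_w, fused_b), atol=1e-6)
```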

(This will close the PR #141352 which I created for the same topic and got approval but had lint and API guideline problems.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141585
Approved by: https://github.com/ezyang
2024-11-27 06:52:00 +00:00
9e299b883b [c10d] Test needs abort; otherwise will hang (#141509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141509
Approved by: https://github.com/wz337, https://github.com/fduwjj
2024-11-27 05:47:17 +00:00
5accae4197 [sparse] add extra options to _cslt_spare_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-27 05:32:45 +00:00
e3161ba6ec [BE] Fix incompatible-std-redefinition warning (#141630)
Fixes following warning during CUDA bazel builds
```
nvcc-real warning : incompatible redefinition for option 'std', the last value of this option was used
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141630
Approved by: https://github.com/cyyever, https://github.com/kit1980
2024-11-27 05:06:36 +00:00
3d5fe0ce78 torch._scaled_mm: support dims of size 0 for tensorwise scaling (#140967)
Summary:

Ensures we support dims of size 0 properly in `torch._scaled_mm`. Follows the behavior from `torch.mm`.

For now we only enable support for tensorwise scaling; we can tackle rowwise in a future PR.
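
A hedged sketch of what the test exercises (shapes, dtypes, and the column-major requirement here are assumptions; running it needs a CUDA device with float8 support, e.g. H100):

```python
import torch

a = torch.randn(0, 32, device="cuda").to(torch.float8_e4m3fn)        # zero rows
b = torch.randn(16, 32, device="cuda").to(torch.float8_e4m3fn).t()   # (32, 16), column-major
scale = torch.tensor(1.0, device="cuda")

out = torch._scaled_mm(a, b, scale_a=scale, scale_b=scale, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([0, 16]), matching torch.mm's zero-size behavior
```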

Test Plan:

```
python test/test_matmul_cuda.py -k test_zero_dim
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140967
Approved by: https://github.com/eqy, https://github.com/drisspg
2024-11-27 04:07:52 +00:00
6e61ff4fd3 Revert "Add truediv support in export serializer (#136364)"
This reverts commit 1df440dc4e7ece40db597ce8e477e14b9c44fea7.

Reverted https://github.com/pytorch/pytorch/pull/136364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/136364#issuecomment-2502620732))
2024-11-27 03:24:31 +00:00
19d01a1ef0 Apply clang-format for ATen/core/boxing headers (#141105)
Code change via add path config in `.lintrunner.toml` file and running

```bash
 $ lintrunner -a --take CLANGFORMAT --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141105
Approved by: https://github.com/cyyever, https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-27 02:49:24 +00:00
c9e2b3fefe NJT: Return correct number of outputs for chunk() on the batch dim (#141604)
The old logic was completely wrong, returning `chunk_size` chunks instead of the intended number. The original test didn't catch this because `chunk_size == num_chunks` :p. The new OpInfo-based testing covers it, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141604
Approved by: https://github.com/soulitzer
ghstack dependencies: #141500, #140736, #140161, #141392, #141506
2024-11-27 02:31:23 +00:00
43121b6f0d Adjust output NJT ragged_idx for reductions and select() (#141506)
This fixes some bugs when performing reductions / select() on dims before the ragged dim. In this case, the output NJT has a smaller number of dims, and its ragged_idx should reflect that correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141506
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #141500, #140736, #140161, #141392
2024-11-27 02:25:53 +00:00
807a7dbf9f Don't generate modindex (#141601)
Fixes https://github.com/pytorch/pytorch/issues/141591
The generated index looks ugly. Attempting to not generate it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141601
Approved by: https://github.com/malfet, https://github.com/albanD
2024-11-27 02:07:21 +00:00
0c587c324d DOC: Correct torch.trapezoid docstring (#141459)
This is super duper minor, but I believe this corrects a typo in the documentation of `torch.trapezoid`.

The documentation says the input is a 1-dimensional tensor $y_0, \dots, y_n$, but it uses summations going from 1 to n-1. Since it's summing over terms $y_i - y_{i-1}$, stopping at n-1 excludes the last partition $y_n - y_{n-1}$, which doesn't match the implementation...

```python
import torch

# (just showing that the sum does include $y_n - y_{n-1}$)
assert torch.trapezoid(torch.tensor([0.0, 0.0, 9999.0])) == 9999 / 2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141459
Approved by: https://github.com/colesbury
2024-11-27 01:54:14 +00:00
fca0f34b83 Switch c10::string_view to std::string_view (#139635)
Shortens `string_view_starts_with` to `starts_with`. Adds some missing headers. Isolates `c10_string_view` to use with `get_fully_qualified_name`.

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D64833558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139635
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-27 01:41:18 +00:00
d6276c2fbd Remove double space from warning (#141566)
Removes a double space from a warning in a way consistent with prior lines.

(Sorry, I saw this a few times when running vllm and the double space was killing me)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141566
Approved by: https://github.com/colesbury
2024-11-27 01:32:00 +00:00
3e90c00a87 Missing space in torch.autograd.Function deprecation warning (#141562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141562
Approved by: https://github.com/colesbury
2024-11-27 01:31:26 +00:00
136ff97095 [dynamo][log] Remove print torch inner stacktrace to let users focus on their code error (#141553)
Fixes #140394

**Test Result**

```bash
TORCH_LOGS="graph_breaks" python test.py
```

```python
# test.py
from typing import List
import torch

def fn002(x):
    x = x + 1
    torch._dynamo.graph_break()
    x = x + 1
    return x

def fn001(x):
    return fn002(x)

torch.compile(fn001, backend="eager")(torch.randn(1))

```
**Before log**
```
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Graph break in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] User code traceback:
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 11, in fn001
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return fn002(x)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 6, in fn002
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     torch._dynamo.graph_break()
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Traceback (most recent call last):
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 641, in wrapper
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return inner_fn(self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2314, in CALL
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self._call(inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2308, in _call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.call_function(fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 879, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 328, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return super().call_function(tx, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 129, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return tx.inline_user_function_return(self, [*self.self_args(), *args], kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 885, in inline_user_function_return
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 3045, in inline_call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return cls.inline_call_(parent, func, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 3171, in inline_call_
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     tracer.run()
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 1032, in run
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     while self.step():
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]           ^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 944, in step
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.dispatch_table[inst.opcode](self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 641, in wrapper
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return inner_fn(self, inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]            ^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2314, in CALL
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self._call(inst)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 2308, in _call
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.call_function(fn, args, kwargs)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/symbolic_convert.py", line 879, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     self.push(fn.call_function(self, args, kwargs))  # type: ignore[arg-type]
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/variables/functions.py", line 708, in call_function
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     unimplemented(msg)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     raise Unsupported(msg, case_name=case_name)
V1126 16:01:41.701000 1303718 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] torch._dynamo.exc.Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:41.722000 1303718 torch/_dynamo/symbolic_convert.py:424] [1/0] [__graph_breaks] Graph break (details suppressed) in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:41.722000 1303718 torch/_dynamo/symbolic_convert.py:424] [1/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py
```

**After log**
```
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Graph break in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks] User code traceback:
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 11, in fn001
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     return fn002(x)
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]   File "/home/zong/code/pytorch/../scripts/dynamo.py", line 6, in fn002
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]     torch._dynamo.graph_break()
V1126 16:01:19.900000 1303438 torch/_dynamo/symbolic_convert.py:416] [0/0] [__graph_breaks]
V1126 16:01:19.918000 1303438 torch/_dynamo/symbolic_convert.py:423] [1/0] [__graph_breaks] Graph break (details suppressed) in user code at /home/zong/code/pytorch/../scripts/dynamo.py:6
V1126 16:01:19.918000 1303438 torch/_dynamo/symbolic_convert.py:423] [1/0] [__graph_breaks] Reason: Unsupported: 'skip function graph_break in file /home/zong/code/pytorch/torch/_dynamo/decorators.py'
```

**Using tlparse get stacktrace**

The trace log implement for graph breaks in
5318bf8baf/torch/_dynamo/symbolic_convert.py (L417-L424)

**Get trace log by running**

```bash
TORCH_TRACE=/tmp/my_traced_log python test.py
```

**Using tlparse to get report**

```
tlparse dedicated_log_torch_trace_9unwqrxn.log  -o out1
```

**Result**

![image](https://github.com/user-attachments/assets/01d2ff25-90ec-4b9f-bcb6-5ae59ba65b35)

Stack info in `0_0_0/dynamo_graph_break_reason_0.txt`
![image](https://github.com/user-attachments/assets/c4a04bd0-496a-4862-8230-c01f85e6f3c3)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141553
Approved by: https://github.com/shink, https://github.com/ezyang
2024-11-27 01:26:11 +00:00
8c8a484d72 Add some symbolic shapes guard logs to tlparse by default (#140867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140867
Approved by: https://github.com/bdhirsh
2024-11-27 01:00:14 +00:00
0221e3a960 Fix CTC cuda backend out-of-bound access (#141607)
Fixes #140777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141607
Approved by: https://github.com/eqy
2024-11-27 00:53:02 +00:00
cyy
2f082e1e56 [13/N] Fix extra warnings brought by clang-tidy-17 (#140897)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140897
Approved by: https://github.com/ezyang
2024-11-27 00:35:19 +00:00
1df440dc4e Add truediv support in export serializer (#136364)
Fixes #136113

- [x] Inital `truediv` coverage
- [ ] Expand/reduce coverage?
- [x] Add tests
- [x] Re-check docstrings
- [ ] Linting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136364
Approved by: https://github.com/pianpwk

Co-authored-by: Angela Yi <angelayi@meta.com>
Co-authored-by: Pian Pawakapan <pianpwk@meta.com>
2024-11-27 00:31:47 +00:00
9b89fa44ba [MPS] Modify missing op message (#141314)
To point to https://github.com/pytorch/pytorch/issues/141287 as well as reference the commit hash (to clearly distinguish between ops that have been implemented in trunk and ones that are still missing)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141314
Approved by: https://github.com/manuelcandales, https://github.com/albanD
ghstack dependencies: #141313
2024-11-27 00:24:33 +00:00
07850bb2c1 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
ghstack dependencies: #141360
2024-11-27 00:21:58 +00:00
cdde73033e [dynamo] fix generic namedtuple support when the class is created via class MyTuple(NamedTuple, Generic[T]): ... (#141360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141360
Approved by: https://github.com/jansel
2024-11-27 00:21:58 +00:00
54f4621ca5 Add missing explicit include directive for <cerrno> in c10/util/error… (#141593)
`c10/util/error.cpp` uses the symbol `errno` but is missing an explicit header include directive for `<cerrno>`.

cc) @malfet , @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141593
Approved by: https://github.com/Skylion007
2024-11-27 00:00:23 +00:00
cyy
199d3da632 [9/N] Don't skip ASAN on some tests (#141534)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141534
Approved by: https://github.com/ezyang
2024-11-26 23:52:53 +00:00
605392bd06 add float8 types to LoggingTensor (#141385)
Summary:

float8 dtypes were missing from this map, adding

Test Plan:

CI, and unbreaks debugging in torchao

If there is an existing test I can add this to - lmk

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141385
Approved by: https://github.com/soulitzer
2024-11-26 23:39:57 +00:00
5b0b16ca62 [torch/distributed] Make _SymmetricMemory.has_multicast_support() ret… (#141598)
`SymmetricMemory.has_multicast_support()` throws an exception rather than returning `False` when called with a `DeviceType` that it does not support. For example:

```
from torch._C._distributed_c10d import _SymmetricMemory
from torch._C._autograd import DeviceType

try:
    supports_multicast = _SymmetricMemory.has_multicast_support(DeviceType.CPU, 0)
except RuntimeError as exc:
    assert str(exc) == "SymmetricMemory does not support device type cpu"
```

This is problematic when building PyTorch from source without `CUDASymmetricMemory.cu`, since the [`@requires_multicast_support`](https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_distributed.py#L353) test decorator will throw an exception rather than skipping the test (as intended).

This PR makes `_SymmetricMemory.has_multicast_support()` properly return `False` when multicast is not supported on the passed device.
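
With the change, the same call is expected to report the lack of support directly (hedged example, assuming a build that includes this PR):

```python
from torch._C._distributed_c10d import _SymmetricMemory
from torch._C._autograd import DeviceType

# Previously this raised RuntimeError; now it simply returns False.
assert not _SymmetricMemory.has_multicast_support(DeviceType.CPU, 0)
```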

cc) @malfet , @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141598
Approved by: https://github.com/yifuwang
2024-11-26 23:36:32 +00:00
43afaa4aac Allow users to overwrite ld with environment variable in linker optimization script (#137331)
This should help in the case of cross compilation.

xref: https://github.com/conda-forge/pytorch-cpu-feedstock/pull/261

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137331
Approved by: https://github.com/isuruf, https://github.com/seemethere
2024-11-26 22:54:24 +00:00
23793cf93d NJT unsqueeze() fixes (#141392)
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125

Note that `unsqueeze()` still doesn't support operating on the batch dim, and arguably never should.
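
A hedged illustration of item 3 (assumes a build that includes this change and the jagged nested-tensor layout):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)                        # shape (B=2, j1, D=8)
out = nt.unsqueeze(1)    # shape (B, 1, j1, D), the example given above
print(out.shape)
```
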
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736, #140161
2024-11-26 22:38:35 +00:00
9ee5d6f83c Initial NJT testing over dim type / views (#140161)
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.

Test logic (sample input generation, references) has been updated to utilize this data. It allows for a fairly generic set of sample inputs and a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").

Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`

Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.

I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500, #140736
2024-11-26 22:08:08 +00:00
7671dd436e [SDPA-CPU] Fix Edge case w/ fused flash cpu kernel (#141519)
Fixes https://github.com/pytorch/pytorch/issues/141128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141519
Approved by: https://github.com/jgong5, https://github.com/jbschlosser
2024-11-26 22:07:56 +00:00
f3d16ec76f Add doc preview command (#141590)
Convenience when building the PyTorch docs:
1. The build docs weren't clear that `make html` is the main command intended to be run
2. Once you run `make html` you need to view the result; opening a simple HTTP server seems like the simplest solution, so this adds a `make serve` command

Usage

```shell
numpy ❯ make serve PORT=8080 # Add port optionally
Serving HTTP on :: port 8080 (http://[::]:8080/) ...
::1 - - [26/Nov/2024 10:05:41] "GET / HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/copybutton.css HTTP/1.1" 200 -
::1 - - [26/Nov/2024 10:05:41] "GET /_static/katex-math.css HTTP/1.1" 200 -
```

![Screenshot 2024-11-26 at 10 05 46 AM](https://github.com/user-attachments/assets/3b275c33-1515-4e21-b540-f5a68c8a8e55)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141590
Approved by: https://github.com/svekars, https://github.com/malfet
2024-11-26 21:56:54 +00:00
65dbd5cc2d Revert "[Inductor] Inplacing with Donated Buffer (#140113)"
This reverts commit eecc8e362c2eb192cbe13322af941d09ca647a6b.

Reverted https://github.com/pytorch/pytorch/pull/140113 on behalf of https://github.com/BoyuanFeng due to break test_donated_buffer_inplace internally since donated_buffer = False if is_fbcode() else True ([comment](https://github.com/pytorch/pytorch/pull/140113#issuecomment-2501954300))
2024-11-26 21:20:59 +00:00
869d629c0f Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #141500
2024-11-26 21:19:58 +00:00
9f4f061f89 PyProcessGroup: support rank, world size, group name/desc overrides (#141529)
This improves `PyProcessGroup` so you can override rank, world size and group name/desc methods from Python. These will be needed to support resizable process groups in torchft.

This also has some small fixes in test_c10d_pypg.py to use threads instead of processes which speeds up the test execution by ~10x.

Test plan:

```
pytest test/distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141529
Approved by: https://github.com/fegin
2024-11-26 20:56:57 +00:00
5696df439b tools: Add script to do split build in one command (#141359)
Usage:
```bash
python3 tools/packaging/split_wheel.py bdist_wheel
python3 tools/packaging/split_wheel.py install
python3 tools/packaging/split_wheel.py develop
```
Ideally this should make it easier to do the split build locally while
we're doing development.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141359
Approved by: https://github.com/kit1980
2024-11-26 20:51:05 +00:00
8ba555ec8a Fix where() for NJT (#141500)
**Background:** It's common to use `scalar_tensor()` in the input to `where()` to convert any scalars present to compatible tensors with matching options, *including layout*. This shows up in various places, notably including derivative formulas ([example](78491d6afc/tools/autograd/derivatives.yaml (L432-L434))). It causes problems for NJTs because they have `layout=torch.jagged` and it never makes sense to create a scalar tensor with this layout. Some of the breakage only seems to happen in CI for reasons I don't fully understand (see the revert of #140736 due to softshrink's derivative formula).

**This PR:**
* Allows non-contiguous NJT inputs to `where()` + adds tests for this
* Handles scalar tensor / dense tensor inputs for `condition` / `other` + adds tests for this
    * Uses limited `broadcast_tensors()` / `broadcast_to()` support
    * Improves `expand()` to work on non-contig NJTs
* Changes `scalar_tensor()` to use `torch.strided` instead of `torch.jagged` in both eager and torch.compile (i.e. meta registration)
* Changes backward formulas for `sinc`, `pow`, `special.i1`, and `special.i1e` to uses `scalar_tensor()` instead of e.g. `zeros({})`

**Alternative approach:** Update all problematic usages of `scalar_tensor()` to avoid ever passing `layout=torch.jagged`. This is an extensive change and includes `torch.where()` logic, a bunch of derivative formulas, and likely other places not yet discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141500
Approved by: https://github.com/malfet, https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-26 20:13:27 +00:00
011650adc5 [sigmoid] Refactor out a helper function to insert const graph into top level graph. (#140854)
Summary: Add the helper function to put a const graph back to the toplevel graph, can be useful when we're taking const graphs from delegates.

Test Plan: CI

Reviewed By: trieuat

Differential Revision: D63031982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140854
Approved by: https://github.com/SherlockNoMad
2024-11-26 20:07:46 +00:00
6fa4356451 handle sympy.oo in bitwise_and/or value_ranges (#141522)
An internal test is failing due to not handling `sympy.oo` properly in bitwise_and/or value_ranges: [T208684142](https://www.internalfb.com/intern/tasks/?t=208684142). I don't know how to repro this - seems like this requires inductor to trigger as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141522
Approved by: https://github.com/ezyang
ghstack dependencies: #138777
2024-11-26 20:01:31 +00:00
84f818f359 [DTensorTestbase] Fix TestFunc typing issue (#141513)
Summary: `TestFunc` is annotated as `Callable[[object], object]`, which represents a callable that takes a single argument of any type (`object`) and returns a value of any type (`object`). However, in reality, `TestFunc` can take any number of arguments, so the correct typing should be `Callable[..., object]`, which represents a callable that takes any number of arguments (including zero) and returns a value of any type (`object`).
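
A minimal illustration of the difference (a sketch, not the test-suite code):

```python
from typing import Callable

TestFunc = Callable[..., object]   # any number of arguments, any return type

def two_arg_test(a: int, b: str = "x") -> None:
    ...

func: TestFunc = two_arg_test      # OK; Callable[[object], object] would be flagged by a type checker
```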

Test Plan: Contbuild & OSS CI

Differential Revision: D66463705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141513
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-11-26 19:48:34 +00:00
893a4390c9 Use cuda 12.6 wheels with Manylinux 2.28. Use Manylinux2014 for CPU, CUDA11.8, CUDA12.4 (#141565)
For release 2.6 we will be using only CUDA 12.6 binaries on Manylinux 2.28.
Issue: https://github.com/pytorch/pytorch/issues/123649
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141565
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 19:36:42 +00:00
eqy
816ca98cd2 [cuDNN][SDPA] Update cuDNN grad output layout check (#141147)
Thanks to https://github.com/pytorch/pytorch/pull/137978 from @Skylion007, which bumps to cuDNN 9.5.1, the broken assumption that dO strides == O strides is fixed.

Note that there is still the restriction that the innermost stride of the grad output is 1 (this is almost always guaranteed because this condition is required of the input tensors). The main exception would be in test code that does e.g., `.sum().backward()` which yields grad output tensors with strides `[0, 0, 0, 0]`.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141147
Approved by: https://github.com/drisspg
2024-11-26 19:17:01 +00:00
a99332eb25 [ROCM] Support Multi-GPU offline tuning in TunableOp (#139673)
This PR enhances offline tuning to support multiple GPUs.

High-level description of algorithm:
- Duplicate GEMMs are first eliminated
- GEMMs are distributed to multi-GPUs for tuning
- Results are gathered into a file with `_full` in the filename

Also adding support for GemmAndBias and ScaledGemm
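
A minimal sketch of that flow (the signature strings and the merged-file naming are illustrative assumptions, not the TunableOp implementation):

```python
import torch

collected = ["GEMM_float_NN_1024_1024_1024",
             "GEMM_float_NN_2048_2048_2048",
             "GEMM_float_NN_1024_1024_1024"]                   # captured, with duplicates

unique_gemms = sorted(set(collected))                          # 1) duplicates eliminated
num_gpus = max(torch.cuda.device_count(), 1)
shards = [unique_gemms[i::num_gpus] for i in range(num_gpus)]  # 2) distributed to GPUs
# 3) each GPU tunes its shard; the per-GPU results are then gathered into a
#    single results file whose name carries the `_full` suffix.
print(shards)
```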

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139673
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang
2024-11-26 19:07:41 +00:00
5b4c864672 [c10d] Enable CudaEventCache by default and add multi device support (#140975)
We added `CudaEventCache` in https://github.com/pytorch/pytorch/pull/133727. It is a feature that reuses CudaEvents so that we don't call CudaEvent destroy, which has caused hangs in the past. We already have a bunch of tests and have tested on TorchTitan and internal workloads; so far no errors or crashes have been found, so we decided to roll it out to all OSS users. For internal workloads, this PR has no effect because of internal gating.

Also, we observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140975
Approved by: https://github.com/eqy, https://github.com/kwen2501
2024-11-26 18:42:45 +00:00
44186a0a4e Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
29ca44839e Add skip_first_wait to profiler.schedule (V2) (#141512)
Summary:
Another try for D66198138. Original diff had some weird issue with type checking. Setting everything to int this time to get around it.

Addresses https://github.com/pytorch/pytorch/issues/91888
We use `wait` as the number of steps to wait between profiling cycles and `skip_first` to delay the start of profiling. However, once the `skip_first` steps are completed, we immediately enter the wait phase. This is not a problem when `wait` is smaller than `skip_first`, because we can just lower `skip_first`, but when it is larger we end up starting the first profile much later than desired. For example, imagine `skip_first=1` and `wait=100` with `repeat=2`: we do want to wait 100 steps between cycles 1 and 2, but we may not want to start warmup of cycle 1 at step 101 (which is forced because the wait phase occurs directly after the first skipped steps). This diff addresses that by adding a flag to skip the first wait.
This adds a new flag but sets it to false by default, so the existing implementation is not affected.

Test Plan:
Got following traces with this schedule:
```python
schedule = torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
```

Differential Revision: D66465860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141512
Approved by: https://github.com/aaronenyeshi
2024-11-26 18:10:54 +00:00
809de05693 Update libgfortran version in aarch64 Docker (#141583)
From `libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb` to `libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb`, as the former is no longer available:
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb
HTTP/1.1 404 Not Found
Date: Tue, 26 Nov 2024 16:58:10 GMT
Server: Apache/2.4.29 (Ubuntu)
Content-Type: text/html; charset=iso-8859-1
```
vs
```
% curl --head http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb
HTTP/1.1 200 OK
Date: Tue, 26 Nov 2024 16:58:48 GMT
Server: Apache/2.4.29 (Ubuntu)
Last-Modified: Sun, 31 Mar 2024 10:51:08 GMT
ETag: "713d4-614f2a681d48b"
Accept-Ranges: bytes
Content-Length: 463828
Content-Type: application/x-debian-package
```

Here is the failure: https://github.com/pytorch/pytorch/actions/runs/12032016986/job/33542862322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141583
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/malfet
2024-11-26 17:49:34 +00:00
000d4e9d43 [hop][inductor] remove codegen_subgraph_suffix and directly assign call function result to outer outputs (#141181)
Before the PR: P1683356646
After the PR: P1683356585

Relevant changes:
```
@@ -231,7 +421,8 @@
             true_graph_0_args = [true_graph_0_arg0_1, true_graph_0_arg1_1]
             del true_graph_0_arg0_1
             del true_graph_0_arg1_1
+            (buf5[0],) = true_graph_0(true_graph_0_args)
-             (true_graph_0_buf0,) = true_graph_0(true_graph_0_args)
-             buf5[0] = true_graph_0_buf0
         else:
             # subgraph: false_graph_0
             false_graph_0_arg0_1 = buf4
@@ -239,7 +430,8 @@
             false_graph_0_args = [false_graph_0_arg0_1, false_graph_0_arg1_1]
             del false_graph_0_arg0_1
             del false_graph_0_arg1_1
+            (buf5[0],) = false_graph_0(false_graph_0_args)
-             (false_graph_0_buf0,) = false_graph_0(false_graph_0_args)
-             buf5[0] = false_graph_0_buf0
         del arg2_1
         del buf4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141181
Approved by: https://github.com/anijain2305
ghstack dependencies: #140334, #141172
2024-11-26 17:32:51 +00:00
aae581d921 [hop free symbols][inductor] remove un-used add_symbol_graph_inputs (#141172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141172
Approved by: https://github.com/Chillee
ghstack dependencies: #140334
2024-11-26 17:32:50 +00:00
45bc9165fe [hop] add discard_graph_changes to remove the empty calls before hop (#140334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140334
Approved by: https://github.com/zou3519
2024-11-26 17:32:43 +00:00
eecc8e362c [Inductor] Inplacing with Donated Buffer (#140113)
Currently, Inductor does not in-place update a buffer if it is an input buffer, because we don't know whether the input will be used by other functions.

A donated buffer provides the additional information that an input buffer will not be used by other functions, so we can in-place update a donated buffer when possible.

[Dashboard](https://hud.pytorch.org/benchmark/torchbench/inductor_dynamic?dashboard=torchinductor&startTime=Mon,%2011%20Nov%202024%2018:14:36%20GMT&stopTime=Mon,%2018%20Nov%202024%2018:14:36%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=bf/donated-buffer-inplace&lCommit=5df0769c00e6f9000caeb10fd5cbf0b165f69c2a&rBranch=main&rCommit=2b39a8db7741b816b03677a9c6fec1af05640dee)

![image](https://github.com/user-attachments/assets/f19d961f-7973-418e-9de8-5c2a97950478)
![image](https://github.com/user-attachments/assets/df3bd6a9-58b8-4e8a-8397-9e3b1de9adfe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140113
Approved by: https://github.com/eellison
2024-11-26 17:19:50 +00:00
3ef031909f [Donated Buffer] support metadata mutation ops (#141308)
### Background:

`set_(x, y)` changes the untyped storage of `x` to be the same as `y`'s.

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

x1 = torch.ones(2,3)
y1 = torch.ones(2,3)
z1 = torch.ops.aten.set_.source_Tensor(x1, y1)

fake_tensor_mode = FakeTensorMode()
x2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
y2 = fake_tensor_mode.from_tensor(torch.ones(2,3))
z2 = torch.ops.aten.set_.source_Tensor(x2, y2)

print(f"x1: {x1.untyped_storage()._cdata}, y1: {y1.untyped_storage()._cdata}, z1: {z1.untyped_storage()._cdata}")
print(f"x2: {x2.untyped_storage()._cdata}, y2: {y2.untyped_storage()._cdata}, z2: {z2.untyped_storage()._cdata}")
# x1: 99973024, y1: 99973024, z1: 99973024
# x2: 112107232, y2: 112107232, z2: 112107232
```

### Error before this diff

Consider this example:
```python
import torch

def fn(x):
    p = torch.nn.Parameter(x + 123)
    return p, p.sin()

opt = torch.compile(fn, fullgraph=True)
x = torch.ones(16, device="cuda", requires_grad=True)

p, r = opt(x)
r.sum().backward()
```

When running with `TORCH_LOGS=aot`, we have `set_` in the graph.
```
def forward(self, primals_1: "f32[16][1]cuda:0", primals_2: "f32[16][1]cuda:0"):
   # File: /home/boyuan/playground/inductor/donated_buffer.py:4 in fn, code: p = torch.nn.Parameter(x + 123)
  add: "f32[16][1]cuda:0" = torch.ops.aten.add.Tensor(primals_1, 123);  primals_1 = None

   # File: /home/boyuan/playground/inductor/donated_buffer.py:5 in fn, code: return p, p.sin()
  sin: "f32[16][1]cuda:0" = torch.ops.aten.sin.default(add)

  # No stacktrace found for following nodes
  set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add);  primals_2 = set_ = None
  return (sin, add)
```

`set_: "f32[16][1]cuda:0" = torch.ops.aten.set_.source_Tensor(primals_2, add)` should change the storage of `primals_2` to be the same as `add`. However, this is not true before this diff. We found different untyped_storage() for meta['val'] of `set_`, `add`, and `primals_2`.

This also leads to an error with donated buffers (#130580), which checks aliasing via untyped_storage. Since `add` and `primals_2` have different untyped_storage (which is wrong), `add` is wrongly marked as a donated buffer.

### Root Cause

During tracing, we have args, kwargs, out, and proxy_args, proxy_kwargs, proxy_out.

We use args and kwargs to compute `out = func(*args, **kwargs)` ([Here](https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/proxy_tensor.py#L912)). Later, we set out to its proxy, essentially calling `proxy_out.node.meta["val"] = out.detach()`.

Due to the detach, the storage change happens on args but not on proxy_args.node.meta["val"] when func is torch.ops.aten.set_. I repro'ed this behavior of detach in eager code.

```python
import torch

x = torch.ones(2,3)
x_detach = x.detach()
y = torch.ones(2,3)
z = torch.ops.aten.set_.source_Tensor(x_detach, y)

print(f"x: {x.untyped_storage()._cdata}, x_detach: {x_detach.untyped_storage()._cdata}, y: {y.untyped_storage()._cdata}, z: {z.untyped_storage()._cdata}")
# x: 97023632, x_detach: 97026480, y: 97026480, z: 97026480
```

To fix the issue, this PR manually resets node.meta["val"] if the storage has changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141308
Approved by: https://github.com/bdhirsh
2024-11-26 17:06:46 +00:00
99a0e2b1a1 [dynamo] Trace through dataclasses by removing it from BUILTIN_SKIPLIST (#141294)
Fixes #141261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141294
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-11-26 17:05:23 +00:00
2bbd984aa2 Fix typo in Reproducibility docs (#141341)
Fixes trivial issue in the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141341
Approved by: https://github.com/svekars
2024-11-26 16:53:26 +00:00
42ab61241e Add README for torch._inductor.runtime (#141492)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141492
Approved by: https://github.com/jansel
ghstack dependencies: #141491
2024-11-26 14:43:02 +00:00
94ff3985c9 AFAICT, compile workers never actually mocked torch (#141491)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141491
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-11-26 14:43:02 +00:00
9d4c0527b3 [Inductor][CPP] Modularize the CPP GEMM Template (#141006)
**Summary**
Move the common template code, which may be reused in subsequent group GEMM templates, into the standalone sub-templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141006
Approved by: https://github.com/jgong5
2024-11-26 14:32:40 +00:00
313c1b33c5 Update CUDA installation script to 12.6.3 (#141365)
related to https://github.com/pytorch/pytorch/issues/138440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141365
Approved by: https://github.com/atalman
2024-11-26 13:49:51 +00:00
9dd3b85d05 [Inductor XPU] Fix wrong device check before skip concat linear. (#140916)
Fix #140917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140916
Approved by: https://github.com/EikanWang, https://github.com/eellison
2024-11-26 13:30:26 +00:00
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
c418a9ac75 [Intel GPU] XPUInductorQuantizer for XPU int8 recipe customization (#139578)
# Motivation
This PR adds `XPUInductorQuantizer`, which defines the recipe for int8 quantization on the XPU backend.

# Details
`XPUInductorQuantizer` is a class derived from `X86InductorQuantizer`, as both quantizers take advantage of the highly optimized operators in the oneDNN library (qconv, qlinear, qconv/qlinear fusion).

We share the same recipe as `X86InductorQuantizer`, so we have the same `annotate_xxxx` methods. In the ideal situation, `XPUInductorQuantizer` would have an empty class body, since all of the implementation can be inherited from the base class.

In this PR, we override the `annotate_xxx` methods for operators that have NOT been implemented yet. Any operator the XPU backend does not implement falls back to the fp32 implementation, because its node in the graph remains a `dq-op-q` pair. This provides good out-of-the-box usability for the XPU backend. On the other hand, the implemented operators use the `annotate_op` methods implemented in the base class and can be lowered successfully (see the sketch below).
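
An illustrative sketch of that structure; the override name and its signature are assumptions for illustration, not the exact PR code:

```python
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

class XPUInductorQuantizer(X86InductorQuantizer):
    """Reuses the X86 int8 recipe; only ops the XPU backend has not
    implemented yet get their annotation overridden to a no-op, so they
    keep a dq-op-q pattern and fall back to fp32."""

    def _annotate_unsupported_op(self, node, quantization_config):  # hypothetical name
        # Skip annotation: without q/dq annotations the op is not lowered
        # to a quantized kernel and stays on the fp32 path.
        return None
```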

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139578
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/CuiYifeng, https://github.com/jerryzh168
ghstack dependencies: #133080
2024-11-26 09:44:14 +00:00
5318bf8baf Revert "[sparse] add extra options to _cslt_spare_mm (#137427)"
This reverts commit f1451163ecd2bd014cb80a40c41c9999fbc94af8.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to This looks like the test is still failing, plz do a rebase ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2499918590))
2024-11-26 08:01:24 +00:00
cyy
6d4cd3e5f2 Remove linking of private cuda targets (#141463)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141463
Approved by: https://github.com/malfet
2024-11-26 03:51:53 +00:00
648f5d9dd9 [Intel GPU] qconv at XPU backend (#133080)
# Motivation
This PR enables quantized convolution for the XPU backend. The operators it registers are `onednn::qconv_prepack`, `onednn::qconv1d_pointwise`, `onednn::qconv2d_pointwise`, and `onednn::qconv3d_pointwise`. We share the same operator schemas as the Intel CPU backend, as both call kernels implemented in the oneDNN library.

# Details

The implemented operators will be further integrated into the pt2e quantization flow. In this PR, we validated the kernel functionality via the UTs in `test/inductor/test_mkldnn_pattern_matcher.py`, where the CPU backend defines a series of UTs for quantized convolution. We also extend device support in the inductor lowering pass and the inductor IR defined in `torch/_inductor/fx_passes/quantization.py` and `torch/_inductor/mkldnn_ir.py`. The overall picture is that the CPU and GPU backends share the general optimization passes (op fusion) and the quantization inductor IR; after lowering, the final kernel is dispatched to the corresponding implementation in the oneDNN library.

In this PR, we share the same int8 quantizer as the CPU, namely `X86InductorQuantizer`. In the next PR, #139578, we will add an `XPUInductorQuantizer`, which will customize the pt2e behavior for the XPU backend. The capability of `XPUInductorQuantizer` will gradually grow along with the development of quantized operators on XPU.

# Validation
*  UT testing
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_xpu \
   -k test_qconv2d_silu_xpu \
   -k test_qconv2d_relu6_xpu \
   -k test_qconv2d_hardtanh_xpu \
   -k test_qconv2d_hardswish_xpu
```
* Runtime exemplification
```bash
#qconv2d
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:binary_add:f32:2+eltwise_linear:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0668945

#qconv2d_silu
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_u8::blocked:acdb::f0 wei_s8::blocked:acdb::f0 bia_undef::undef::: dst_u8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1+binary_add:f32:2+eltwise_linear:0.0124779:22,alg:convolution_direct,mb1_ic3oc128_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.0881348
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133080
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman
2024-11-26 02:24:30 +00:00
f2d388eddd [BE] Use torch.special.expm1 (#141518)
Instead of `torch.exp(x)-1`, as suggested by TorchFix
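
A small illustration of why TorchFix suggests this: `expm1` keeps precision for small inputs where `exp(x) - 1` cancels to zero.

```python
import torch

x = torch.tensor([1e-10, 1.0], dtype=torch.float32)
print(torch.special.expm1(x))  # ~1e-10 for the first element: accurate near zero
print(torch.exp(x) - 1)        # 0.0 for the first element: the small value is lost to rounding
```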

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141518
Approved by: https://github.com/kit1980
2024-11-26 01:47:11 +00:00
dcd16bdc21 [Dynamo][autograd.Function] Use fake tensor prop to infer fwd output (#136184)
Fixes #129963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136184
Approved by: https://github.com/zou3519
2024-11-26 01:10:08 +00:00
cyy
6b60f4bc91 Fix some typos in cuda.cmake (#141462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141462
Approved by: https://github.com/peterbell10
2024-11-26 01:08:25 +00:00
6a22cae436 [IntraNodeComm] fix a recent breakage (#141200)
- Pass `group_name` to `CUDASymmetricMemory::alloc()` instead of `CUDASymmetricMemory::rendezvous()`. We can only move the argument to rendezvous() once all the underlying operators do the same.
- Added `float` to the allowlist for intra-node all-reduces.
- Added a warning when `IntraNodeComm::rendezvous()` is performed with overlapping devices among participants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141200
Approved by: https://github.com/weifengpy, https://github.com/kwen2501
2024-11-26 00:46:38 +00:00
583484b726 [dynamo] Fix and simplify hanlding of Set.update method (#141286)
The old implementation of `SetVariable.call_method("update", ...)` was incorrect because it wouldn't handle iterable inputs. This patch removes the input type restriction altogether and implements the method as a polyfill (like most of the other set methods); see the sketch below.
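
A minimal sketch of what a `set.update` polyfill looks like; this is the assumed shape, not the exact Dynamo polyfill:

```python
def set_update(self, *iterables):
    # Mirrors set.update: consume any iterables, add each element, return None.
    for iterable in iterables:
        for item in iterable:
            self.add(item)
    return None
```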

Fixes #141283.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141286
Approved by: https://github.com/anijain2305
2024-11-26 00:41:50 +00:00
5d7c3701e4 fix non termination in unflatten + state (#141494)
With largish systems of nn modules with buffers, sinking params suffered from some kind of exponential blowup that is easily fixed by using a set instead of a list to keep track of unlifted buffer placeholders.

Test Plan: added random dag test that failed previously

Differential Revision: D66457661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141494
Approved by: https://github.com/angelayi
2024-11-26 00:17:56 +00:00
9ccbd84316 Upgrade ROCm wheels to manylinux2_28 - 1 of 2 (docker images) (#140681)
Fixes #140631

Highlights:
* Use `cpu_final` base for ROCm in `.ci/docker/manywheel/Dockerfile_2_28`
* Cleans up install_miopen.sh to remove old ROCm references
* Install `gcc-gfortran` package to build magma for ROCm on almalinux

Needs builder PR https://github.com/pytorch/builder/pull/2043 (merged) so that GCC_ABI expected value is updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140681
Approved by: https://github.com/jeffdaily
2024-11-26 00:10:40 +00:00
8f5ce865a4 [Build] Add COMMIT_SHA to caffe2::GetBuildOptions (#141313)
Using the same `tools/generate_torch_version.py` script

It's already available on Python level, but not on C++ one

Please note, that updating commit hash will force recompilation of less than 10 files according to
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```

Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
2024-11-26 00:09:36 +00:00
ad37afd590 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit ba5253da9b30ed4d998cee1d865f92b2c27d3086.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))
2024-11-26 00:03:03 +00:00
964655bf0c Revert "Remove THC from OSS build (#134969)"
This reverts commit 9c7660be0ee155baf0cb7e1e67708dd784ac5796.

Reverted https://github.com/pytorch/pytorch/pull/134969 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking the installation of https://github.com/facebookresearch/detectron2/blob/main/detectron2/layers/csrc/deformable/deform_conv_cuda_kernel.cu#L76 ([comment](https://github.com/pytorch/pytorch/pull/134969#issuecomment-2499275378))
2024-11-26 00:00:12 +00:00
f1451163ec [sparse] add extra options to _cslt_spare_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-25 23:45:41 +00:00
02990fe36b Populate nn.module.stack in _fuse_conv_bn_qat (#141400)
Summary:
Populate nn.module.stack in _fuse_conv_bn_qat for replacement nodes that correspond to a `get_attr` node in the original graph.

In the new training IR, `get_attr` nodes no longer have `nn_module_stack` in their node meta (because the `get_attr` nodes are de-duplicated, so one `get_attr` node can potentially have users in different module stacks).

We populate it by checking if "conv_input" or "conv_weight" replacement node has nn_module_stack. If not, we copy it from the conv node.
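
A minimal sketch of that idea, with a hypothetical helper name (not the actual PR code):

```python
def maybe_copy_nn_module_stack(replacement_node, conv_node):
    # If the replacement node lost its nn_module_stack (e.g. it came from a
    # de-duplicated get_attr), inherit it from the conv node it replaces.
    if "nn_module_stack" not in replacement_node.meta and "nn_module_stack" in conv_node.meta:
        replacement_node.meta["nn_module_stack"] = conv_node.meta["nn_module_stack"]
```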

Test Plan:
CI

```
buck run fbcode//caffe2/test:quantization_pt2e -- -r test_preserve_nn_module_stack
```

Differential Revision: D66393517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141400
Approved by: https://github.com/angelayi
2024-11-25 23:41:28 +00:00
851edf208b [ROCm] Remove gfx906 from CI docker build (#141523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141523
Approved by: https://github.com/jeffdaily
2024-11-25 22:23:28 +00:00
915625307e [PGNCCL] Record device index for GPU guarding during NCCLComm method calls (#141270)
### Motivation
`ncclCommInitRank` needs GPU guard (documented in NCCL).

`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need GPU guard (undocumented in NCCL); otherwise, extra CUDA context may be created (or worse, hang); both effects have been seen before in our tests.

### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard in the `NCCLComm` methods that call directly into the above NCCL APIs.

### Note
This is not a bug fix. Just a safety improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374
2024-11-25 21:31:21 +00:00
af4522b81c [c10d][CI] Use new store for PG restart tests (#141374)
A new Store is used to recreate PGs upon restart. The new Store is obtained by adding a prefix to the existing one.
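
A sketch of the idea, assuming the standard `PrefixStore` wrapper (the attempt-counter naming is illustrative, not the exact test code):

```python
import torch.distributed as dist

def store_for_restart(base_store: dist.Store, attempt: int) -> dist.PrefixStore:
    # Each restart gets a fresh key namespace on top of the same underlying store.
    return dist.PrefixStore(f"restart_{attempt}/", base_store)
```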

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141374
Approved by: https://github.com/fduwjj
2024-11-25 21:31:21 +00:00
b18bbc965c [dynamo] support list.sort sort non-constant iterable with constant keys (#141485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141485
Approved by: https://github.com/jansel
2024-11-25 21:06:11 +00:00
efec302dd0 cpp_wrapper tests: Fix tests assuming non-cpp_wrapper code (#141175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141175
Approved by: https://github.com/desertfire
2024-11-25 19:33:55 +00:00
78491d6afc Update triton wheel install script with new versioning (#141497)
This PR is a follow-on to #141410.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141497
Approved by: https://github.com/huydhn
2024-11-25 19:09:55 +00:00
91f7c547ec [FlexAttention] add support for learnable biases in Inductor (#137452)
# Summary

The follow-up PR to https://github.com/pytorch/pytorch/pull/137526. In this PR, we update the lowerings for the flex_attention backwards kernel to generate fused backward gradient calculations for any captured buffers that require grads.

We do this using `tl.atomic_add` to scatter the correct gradients into a zeroed-out buffer for any captured buffers that require grads. Added many test cases and, along the way, found some masking bugs.

There are likely some performance cliffs here, specifically with certain dtypes and on different GPUs. We plan to profile the current strategy in a follow-up. We are explicitly choosing reduced memory over increased performance right now.

By using atomics, we do not need to realize a full attention scores matrix. However, this comes with two downsides. One, this is potentially slower in some cases, and two, the gradient calculation for any captured buffers is non-deterministic.

## Worked Example

Let's take the case where you read from one bias that doesn't require grad and use it to index into another that does.

ScoreMod:
```Python
bias = torch.randn(
    params.seq_length,
    device=self.device,
    dtype=params.dtype,
    requires_grad=True,
)

offset = torch.randint(
    0,
    params.seq_length,
    (params.seq_length,),
    device=self.device,
)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[offset[q_idx]]

```

I am removing all but the new subgraph injected into the backwards:

``` Python
    dsT = pT * (dpT - Di[None, :])
    # ~~~~~~~~~~~~~~~~~~~ Apply joint modification  ~~~~~~~~~~~~~~~~~~~
    grad_scores = (dsT)

    # ~~~~~~~~~~~~~~~~~~~ Apply other buffer grad writes ~~~~~~~~~~~~~
    idx_b = off_z
    idx_h = off_hq
    idx_m = m
    idx_n = n
    scatter_mask = offs_m1[None, :] < Q_LEN and offs_n1[:, None] < KV_LEN
    tmp4 = (dsT).to(tl.float32)
    tl.atomic_add(out_ptr1 + (tl.broadcast_to(tl.load(in_ptr16 + idx_m), tmp4.shape)), tmp4, scatter_mask, sem='relaxed')

    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
## Key points
* We always accumulate to float32 grad buffers regardless of the type used in the forward. This is because we normally do all intra-kernel computation with fp32 accumulation, and we want the same behavior for atomic additions.
* We are currently restricted to 1 scatter in the kernel. I have some ideas on fx rewrites that would remove this restriction, but for now there is a nice error message with a workaround; removing the restriction is left as a follow-up.
* Will do more extensive performance/memory profiling in a follow-up.

### Toy E2E example
I have a toy E2E training example PR in the gym for now: https://github.com/pytorch-labs/attention-gym/pull/84/
I plan to update to a realistic learnable bias before landing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137452
Approved by: https://github.com/Chillee
2024-11-25 19:08:34 +00:00
de6d69ec78 [MPS] Make MetalShaderLibrary usable from C++ (#141477)
By guarding Metal framework include and defining all ObjC protocols to dummy `void*`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141477
Approved by: https://github.com/Skylion007
ghstack dependencies: #141474, #141475, #141476
2024-11-25 18:40:55 +00:00
953e5f9201 [MPS][BE] Add virtual destructor (#141476)
As classes with virtual methods must have virtual destructors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141476
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #141474, #141475
2024-11-25 18:40:55 +00:00
b532a84be5 [MPS] Move MetalShaderLibrary to its own header (#141475)
In preparation to be used from libtorch_python
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141475
Approved by: https://github.com/Skylion007
ghstack dependencies: #141474
2024-11-25 18:40:47 +00:00
1bca1220de [MPS][BE] Remove unused definitions (#141474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141474
Approved by: https://github.com/Skylion007
2024-11-25 18:40:40 +00:00
9a09011cd1 [inductor] Refactor dependencies.extract_loop_body_with_args (#141404)
I plan to reuse this in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141404
Approved by: https://github.com/yanboliang
2024-11-25 18:34:09 +00:00
8f5edcb75c [CUTLASS] Lift shape & stride information as kernel args (#138611)
Differential Revision: [D64773324](https://our.internmc.facebook.com/intern/diff/D64773324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138611
Approved by: https://github.com/chenyang78
2024-11-25 17:52:33 +00:00
2325749a89 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 7a9d0e3c06781dda04a9cc3dcf56ff09cf472235.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2498670406))
2024-11-25 17:51:53 +00:00
4a378d77d4 [AMD] Add ncclRemoteError back (#141461)
Summary:
It looks like RCCL does support those two error types, ncclRemoteError and ncclInProgress: https://github.com/ROCm/rccl/blob/develop/src/nccl.h.in#L57. And I do see my job throwing those errors. But pytorch just said:
```
RuntimeError: Unconvertible NCCL type
```

Even though nccl says:
```
develop/src/init.cc.hip:502 NCCL WARN Attempt to use communicator before the previous operation returned ncclSuccess
```

Therefore, this change just enables those error types.

Test Plan: CI

Differential Revision: D66434341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141461
Approved by: https://github.com/eqy
2024-11-25 17:49:34 +00:00
5ececd4caa [ROCm] Select gpu targets according to PYTORCH_ROCM_ARCH when building AOTriton from source (#139432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139432
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Vicky Tsang <vtsang@amd.com>
2024-11-25 17:33:57 +00:00
419b566e54 [ONNX] Use the torchlib opset number and fix opset import logic (#141413)
- Update the ONNX IR `add_opset_imports` pass to remove the heuristics of taking the `max` of the seen opsets. Instead, it uses the torchlib default opset version for the model's opset_import. The version converter is able to take the true opset versions in the nodes and convert the model to the correct version.
- Update all hard coding of opset 18 to instead query the default torchlib opset from onnxscript, introduced in https://github.com/microsoft/onnxscript/pull/1963

Fixes https://github.com/pytorch/pytorch/issues/141260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141413
Approved by: https://github.com/titaiwangms
2024-11-25 17:33:25 +00:00
4fa72168ea FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D
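
A rough sketch of the mechanism described above (a simplified standalone mode, not the actual `_DecomposedCounterMode`):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class DecomposeFirst(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Expand ops that have a registered decomposition so their constituent
        # ops get dispatched (and thus counted by an outer mode) instead.
        decomposed = func.decompose(*args, **kwargs)
        if decomposed is not NotImplemented:
            return decomposed
        return func(*args, **kwargs)
```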

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-25 16:53:10 +00:00
cffeb83f15 Revert "Forward / backward NJT support for several activation functions (#140736)"
This reverts commit daaecb96d6b8049f8ca95974cd8a45b2fb9d4e28.

Reverted https://github.com/pytorch/pytorch/pull/140736 on behalf of https://github.com/malfet due to Take 2, of stack revert your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498479702))
2024-11-25 16:27:00 +00:00
e0f9ec4a25 Revert "Initial NJT testing over dim type / views (#140161)"
This reverts commit 730caf0aed187ce5c1c36fae7e9ae1f700585280.

Reverted https://github.com/pytorch/pytorch/pull/140161 on behalf of https://github.com/malfet due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498358652))
2024-11-25 15:40:54 +00:00
58727b6f5f Revert "NJT unsqueeze() fixes (#141392)"
This reverts commit 48409a5cc6b14b6a5237beb6263a436d309afcd2.

Reverted https://github.com/pytorch/pytorch/pull/141392 on behalf of https://github.com/malfet due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2498358652))
2024-11-25 15:40:54 +00:00
07906f2f2b [logging] Move population of common MetricsContext fields to record_compilation_metrics (#141291)
Summary: Fix outstanding TODOs related to logging of CompilationMetrics by moving the population of common fields to record_compilation_metrics() instead of populating them independently wherever we use the metrics_context context manager:
* Keep track of start and end time in MetricsContext and pass those to record_compilation_metrics() and populate those fields in that function.
* Pass exception info to record_compilation_metrics() and populate those fields in that function.
* Add a new contextmanager, chromium_event_timed, to create the start/end "dynamo" event. This is important because I want this contextmanager to complete _after_ building the CompilationMetrics.
* Populate the compile_id field centrally in record_compilation_metrics().
* Populate the structured_logging_overhead centrally in record_compilation_metrics().
* Add the CompilationMetrics to the current chromium event in record_compilation_metrics(), after all common fields have been added. In a future diff, I can also add _all_ compilation metrics to the chromium event.

Test plan: Unit tests. Also see internal testing:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/jrascnf9
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/l3jnla06
* tlparse: https://fburl.com/bq5a9nqs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141291
Approved by: https://github.com/jamesjwu
2024-11-25 13:18:40 +00:00
a964f31d7b [inductor] modify the heuristic for loop split optimization (#137550)
### Summary

1. Improve the heuristic for loop split optimization: The divisor needs to be an integer and cannot be too small (needs to be greater than 8, this threshold has been tuned).
2. Improve the heuristic for disabling vectorization: add quantity_threshold and relax ratio_threshold for the number of non-contiguous load/store/index_expr in the loop body.

This PR will bring performance improvements for two torchbench models(functorch_dp_cifar10, opacus_cifar10) and one timm model(sebotnet33ts_256).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137550
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-25 09:16:30 +00:00
48409a5cc6 NJT unsqueeze() fixes (#141392)
This PR contains three `unsqueeze()`-related fixes for NJT:
1. Adjusts the output's `_ragged_idx` when `unsqueeze()` inserts a dim before the ragged dim
2. Corrects the unbind reference for `unsqueeze()` after the last input dim. For this case, the dim kwarg canonicalization logic needs to be applied wrt `inp.dim() + 1` to account for `dim=-1` properly
3. Adds ragged dim support to `unsqueeze()`, allowing for e.g. `(B, j1, D) -> (B, 1, j1, D)`. This is okay now after #137125

Note that `unsqueeze()` still doesn't support batch dim operation, and arguably should never support this.
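
A small usage sketch of the new ragged-dim support, assuming a build that includes this PR:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)  # logical shape (B, j1, D)
print(nt.unsqueeze(1).shape)  # (B, 1, j1, D): a dim inserted before the ragged dim
```
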
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141392
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #140736, #140161
2024-11-25 08:08:38 +00:00
730caf0aed Initial NJT testing over dim type / views (#140161)
This PR introduces `ExtraOpData`, a structure that contains op metadata regarding whether the op is a view and the dim-related args it accepts. It also populates a huge database for dim-wise / view ops with this info.

Test logic (sample input generation, references) has been updated to utilize this data. It allows for a fairly generic set of sample inputs and a reference for the class of ops that accept a single NJT and operate dim-wise (AKA "unary dimwise ops").

Testing is added over the following ops:
* `chunk()`
* `narrow()`
* `select()`
* `split()`
* `split_with_sizes()`
* `squeeze()`
* `unflatten()`
* `unsqueeze()`

Most of the above do not operate on the ragged / batch dims or on non-contiguous NJTs, so the proper xfails are added as needed.

I also slipped in a couple minor fixes (sorry):
1. The `_wrap_jagged_dim()` helper now avoids assuming the `nt._ragged_idx == 1` and allows for a batch dim to be a valid input, disambiguating the converted inner dim as necessary through an additional `operating_on_batch` return value (i.e. both dim=0 and dim=1 map to dim=0 on the inner values tensor, since that dim represents a packed ragged dim for all batch items)
2. Padded dense -> NJT conversion requires shape gymnastics to operate with the restrictive FBGEMM kernel. The gymnastics were slightly wrong for the transposed NJT case, and this PR fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140161
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
ghstack dependencies: #140736
2024-11-25 08:08:38 +00:00
daaecb96d6 Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-11-25 08:08:31 +00:00
d0fd42eb3a [inductor] refine loop split logic (#128812)
This PR aims to improve parallelization by collapsing the vectorized loop. https://github.com/pytorch/pytorch/issues/122281

For the case below, the parallelism level is only `2`, and the vectorized loop cannot be collapsed:
```
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199984L); x1+=static_cast<long>(16L))
    {
        auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
        tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
    }
    #pragma omp simd simdlen(8)
    for(long x1=static_cast<long>(199984L); x1<static_cast<long>(199985L); x1+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
        out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
    }
}
```
After this PR, we will generate code like this:
```
#pragma omp for collapse(2)
for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
{
    for(long x1=static_cast<long>(0L); x1<static_cast<long>(199985L); x1+=static_cast<long>(16L))
    {
        if (x1 >= 0 && x1 <199984) {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr0 + static_cast<long>(x1 + (199985L*x0)), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x1 + (209985L*x0)), 16);
        }
        if (x1 >= 199984 && x1 <199985) {
            auto tmp0 = in_ptr0[static_cast<long>(x1 + (199985L*x0))];
            out_ptr0[static_cast<long>(x1 + (209985L*x0))] = tmp0;
        }
    }
}
```

### Highlight
For the reduction case, there is a side effect. For the case below, we vectorize the `x1` dim and reduce along the `x2` dim:
```
#pragma omp for
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(16L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
            }
            [&]
            {
                __at_align__ std::array<float, 8> tmpbuf;
                tmp_acc0_vec.store(tmpbuf.data(), 8);
                #pragma GCC unroll 8
                for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                }
            }
            ()
            ;
        }
    }
    #pragma omp simd simdlen(4)
    for(int64_t x1=static_cast<int64_t>(16L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(1L))
    {
        {
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                auto tmp0 = in_ptr1[static_cast<int64_t>(x1 + (17L*x2) + (306L*x0))];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            out_ptr1[static_cast<int64_t>(x0 + (39L*x1))] = tmp_acc0;
        }
    }
}

```
After collapsing, the loop order becomes `x1 -> x2 -> x1_tail_part`, so we need a `tmp_acc_arr` to store the reduction result for `x1_tail_part`. For the `reduction_stores`, we also need to check `x1`'s value, as we do in the `loopbody`, since the `reduction_stores` happen between the `x1` and `x2` loops.
```
#pragma omp for collapse(2)
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(39L); x0+=static_cast<int64_t>(1L))
{
    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(17L); x1+=static_cast<int64_t>(8L))
    {
        {
            float tmp_acc0_arr[8];           ######### need an array to hold acc result for tail part
            for (int i = 0; i < 8; i++)
            {
                tmp_acc0_arr[i] = -std::numeric_limits<float>::infinity();
            }
            float tmp_acc0 = -std::numeric_limits<float>::infinity();
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity());
            for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(18L); x2+=static_cast<int64_t>(1L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x1 + (17L*x2) + (306L*x0)), 8);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp0);
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
                    {
                        for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                        {
                            auto tmp0 = in_ptr1[static_cast<int64_t>(x1_tail + (17L*x2) + (306L*x0))];
                            tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)] = max_propagate_nan(tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)], tmp0);
                        }
                    }
                }
            }

            ############### reduction stores
            if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L)))
            {
                [&]
                {
                    __at_align__ std::array<float, 8> tmpbuf;
                    tmp_acc0_vec.store(tmpbuf.data(), 8);
                    #pragma GCC unroll 8
                    for (long x1_inner = 0; x1_inner < 8; x1_inner++)
                    {
                        out_ptr1[static_cast<int64_t>(x0 + (39L*x1) + (39L*x1_inner))] = tmpbuf[x1_inner];
                    }
                }
                ()
                ;
            }
            if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L) && x1 < static_cast<int64_t>(17L)))
            {
                for (long x1_tail = static_cast<int64_t>(16L); x1_tail < static_cast<int64_t>(17L); x1_tail++)
                {
                    out_ptr1[static_cast<int64_t>(x0 + (39L*x1_tail))] = tmp_acc0_arr[x1_tail - static_cast<int64_t>(16L)];
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128812
Approved by: https://github.com/jgong5
2024-11-25 04:46:07 +00:00
2398e758d2 Fix access to _msvccompiler from newer distutils (#141363)
Newer versions of distutils no longer import `_msvccompiler` upon init on the Windows platform (that was not the case on other platforms even before version 74), but it's still accessible if one chooses to import it directly.
Test plan:
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.10.9 False
65.5.0 /usr/local/fbcode/platform010/Python3.10.framework/Versions/3.10/lib/python3.10/site-packages/setuptools/_distutils/_msvccompiler.py
```
and
```
% python -c 'from setuptools import distutils; print(distutils.__version__, hasattr(distutils, "_msvccompiler")); from distutils import _msvccompiler; import setuptools; print(setuptools.__version__, _msvccompiler.__file__)'
3.13.0 False
75.6.0 /Users/malfet/py312-venv/lib/python3.13/site-packages/setuptools/_distutils/_msvccompiler.py
```

Gave up trying to appease the linker, so rewrote it as the following function:
```python
def _get_vc_env(vc_arch: str) -> dict[str, str]:
    try:
        from setuptools import distutils  # type: ignore[import]

        return distutils._msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
    except AttributeError:
        from setuptools._distutils import _msvccompiler  #type: ignore[import]

        return _msvccompiler._get_vc_env(vc_arch)  # type: ignore[no-any-return]
```

This PR also undoes setuptools version restriction introduced by  https://github.com/pytorch/pytorch/pull/136489 as premise for restriction is incorrect

Fixes https://github.com/pytorch/pytorch/issues/141319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141363
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-11-25 01:50:47 +00:00
6ad0423758 [CI]Move inductor UT from avx512 runner to amx runner (#141206)
According to https://github.com/pytorch/pytorch/issues/140208#issuecomment-2477813174, we need to run inductor UT on Sapphire Rapids runner to cover AMX Micro GEMM tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141206
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-11-25 01:26:58 +00:00
cyy
9c7660be0e Remove THC from OSS build (#134969)
THC is not used in OSS version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134969
Approved by: https://github.com/albanD
2024-11-25 00:39:42 +00:00
6a096a0b96 [PT2] Fix callbacks to account for entire execution in compilation (#141323)
Summary:
In SJD, we register callbacks to get notified of an active compilation. Using this information, we can allow additional time for the training loop.

The callbacks currently do not account for the entire time, and in several cases the end callback is not called at all.

This leads to a bunch of APS jobs getting terminated incorrectly: https://fburl.com/scuba/mast_hpc_job_run_status/ondwzt2w

In this diff, we basically install a context manager which will call the start and end callbacks, similar to how we log counters and other information.
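
A minimal sketch of that approach; the handler object and its method names are assumptions, not the exact diff:

```python
import contextlib

@contextlib.contextmanager
def compilation_callbacks(handler):
    # Ensure the end callback fires even if compilation raises, so the
    # extended-lease window is always closed.
    handler.run_start_callbacks()
    try:
        yield
    finally:
        handler.run_end_callbacks()
```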

Test Plan:
```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc  launcher.hardware=TC_ANY_80G
```
Led to https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-atuljangra-ef2285ba9a?job_attempt=0&version=0&env=prod

https://fburl.com/ai_infra/sv0a213y confirms that callback was correctly called and a lease was properly installed, which takes over the training loop lease.

{F1965137027}

Differential Revision: D66347023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141323
Approved by: https://github.com/ezyang
2024-11-24 22:31:04 +00:00
cb8c956b5f Fix PyBind 2.10.4 compatibility issue in caffe2/torch/csrc/dynamo/guards.cpp +2 (#141456)
Summary: See D65023502 and [here](https://fb.workplace.com/groups/mldp.users/permalink/8706556336131960/) for details.

Test Plan: Sandcastle

Reviewed By: itamaro

Differential Revision: D66395491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141456
Approved by: https://github.com/Skylion007
2024-11-24 21:05:48 +00:00
675735cfc9 [dynamo] match implementation for sorted(...) with CPython (#141227)
```python
def sorted(iterable, /, *, key=None, reverse=False):
    seq = list(iterable)
    seq.sort(key=key, reverse=reverse)
    return seq
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141227
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #141224
2024-11-24 20:01:50 +00:00
cyy
259a00b727 [3/N] Replace at::detail::Array with std::array (#141324)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141324
Approved by: https://github.com/ezyang
2024-11-24 18:17:34 +00:00
e3cb167560 Revert "Add skip_first_wait to profiler.schedule (#141070)"
This reverts commit 9d83cab8a4f3a21a012303361bbee39318d241e0.

Reverted https://github.com/pytorch/pytorch/pull/141070 on behalf of https://github.com/izaitsevfb due to oops, it's actually reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141070#issuecomment-2496141168))
2024-11-24 18:03:50 +00:00
9d83cab8a4 Add skip_first_wait to profiler.schedule (#141070)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888

We use wait as the number of steps to wait in between cycles when profiling, and skip_first to delay the start of said profiling. However, once the skip_first steps are completed, we immediately go to the wait phase. This is not problematic if wait is smaller than skip_first, because we can just lower the value of skip_first, but if it is larger then we end up starting the first profile much later than desired. For example, imagine skip_first of 1 and wait of 100 with repeat of 2. We do want to wait 100 steps in between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (which is forced because the wait phase occurs directly after the first skipped steps). This diff addresses this by adding a flag to skip the first wait.

Adds new flag but sets to false by default so that existing impl is not affected.

Test Plan:
Got reasonable traces with this schedule:

schedule=torch.profiler.schedule(
            wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
        )
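
A usage sketch, assuming a build that includes this PR (the `skip_first_wait` kwarg is new here):

```python
import torch

schedule = torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
with torch.profiler.profile(schedule=schedule) as prof:
    for _ in range(30):
        torch.randn(128, 128) @ torch.randn(128, 128)
        prof.step()  # advance the schedule once per training step
```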

D66198138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
2024-11-24 17:54:49 +00:00
e34ff2cb4b remove allow-untyped-defs from _inductor/bounds.py (#141440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141440
Approved by: https://github.com/Skylion007
2024-11-24 16:23:31 +00:00
3614d130dd [XPU] Update XPU C Shim Header (#141086)
Fixes https://github.com/pytorch/pytorch/issues/141268

Caused by these commits: 34b2165bdb and 34e420519d

The windows XPU builds are failing: https://github.com/pytorch/pytorch/actions/runs/11922274722/job/33228175750
due to recent PR merge with changes in fallback ops: 34e420519d

This PR updates the XPU C Shim header file to overcome these build failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141086
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel, https://github.com/malfet, https://github.com/dvrogozh, https://github.com/desertfire
2024-11-24 12:24:35 +00:00
a87925cc7e Fix AttributeError: 'int' object has no attribute 'node' due to constant prop (#141250)
Fixes https://github.com/pytorch/pytorch/issues/140625

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141250
Approved by: https://github.com/bobrenjc93
2024-11-24 08:20:04 +00:00
51b6126f54 Bump onnxscript version in CI (#141412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141412
Approved by: https://github.com/titaiwangms
2024-11-24 06:51:48 +00:00
af47e05a96 [fx] make split_module work with keep_original_order=True and no-op graph (#141340)
Fixes https://github.com/pytorch/pytorch/issues/140014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141340
Approved by: https://github.com/ezyang
2024-11-24 06:41:30 +00:00
cyy
4c1f50af5f Modernize C++ code in aten/src/ATen/ (#141424)
Clang-tidy modernize checkers were applied, and most changes were concatenation of namespaces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141424
Approved by: https://github.com/eqy
2024-11-24 02:15:19 +00:00
ba5253da9b Always unspecialize float in OSS (#138922)
Fixes https://github.com/pytorch/pytorch/issues/107277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-24 01:58:13 +00:00
11c786dcb5 [BE] Make maybe_aliasing_or_mutating proper tag (#131990)
For better tracking, we need to mark maybe-aliasing/mutating ops with a proper tag. We need to special-case native_batch_norm because it is not a CIA but has a wrong schema. I guess native_batch_norm will be removed at some point, so until then we just keep it around.

D60347117
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131990
Approved by: https://github.com/bdhirsh
2024-11-24 00:12:49 +00:00
c513f01516 Revert "Add skip_first_wait to profiler.schedule (#141070)"
This reverts commit 8b13ed594a2b9b0a994e8efd42b8f1e59372e499.

Reverted https://github.com/pytorch/pytorch/pull/141070 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141070#issuecomment-2495671689))
2024-11-23 22:22:24 +00:00
995e3079c9 [inductor] Fix for "Failed to find static RBLOCK" (#141434)
Summary: I expect this to fix https://fb.workplace.com/groups/1075192433118967/permalink/1547962839175255/

Test Plan: Ask poster to confirm fix

Differential Revision: D66413828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141434
Approved by: https://github.com/ezyang
2024-11-23 22:08:56 +00:00
f6eeab7ea8 [export] Make unflattened module compileable (#141249)
Test Plan: Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1091988579177957/

Differential Revision: D66302806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141249
Approved by: https://github.com/avikchaudhuri
2024-11-23 18:46:01 +00:00
83116ec90c [dynamo] Fix fbcode flakey test from asyncio warning (#141399)
Summary: This was failing with a `/usr/local/fbcode/platform010/lib/python3.10/asyncio/events.py:666: DeprecationWarning` that seems unrelated.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_numpy_readonly_inline_inbuilt_nn_modules' --run-disabled
```

Differential Revision: D66394773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141399
Approved by: https://github.com/yanboliang
2024-11-23 18:16:50 +00:00
3473dfa698 Add triton_op test for user defined triton caching (#141407)
Fix failing internal codecache test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141407
Approved by: https://github.com/aorenste
2024-11-23 07:54:39 +00:00
8b4ae29b1b misc. fixes to unflatten (#141066)
Handling of nested modules in unflatten had several bugs, which were caught by trying to preserve module call signatures for nested modules.
* A module `k` encountered when calling `k.n()` before `k()` used to become an empty nn module. This caused some information to be dropped when `k()` was eventually called. Relatedly, we would also lose call counts for `k.n()` through different paths (say, when `k()` calls `n()`).
* Deleting call-indexed modules and patching up their call sites was broken for nested modules when creating dispatcher modules, because of silliness when handling their fqns.

An interesting aside is that we used random graph generation for testing some of these changes. A future PR will add the infra to create tests using these random graphs.

Differential Revision: D66192799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141066
Approved by: https://github.com/angelayi
2024-11-23 07:31:51 +00:00
5268754ebd [inductor] Default impl refactors to IRNode (#141321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141321
Approved by: https://github.com/yanboliang
2024-11-23 06:25:59 +00:00
bae9510307 Fix pytorch-triton nightly checksum shorthash (#141410)
The binary build is failing in trunk after https://github.com/pytorch/pytorch/pull/139206 landed, for example, https://github.com/pytorch/pytorch/actions/runs/11981181986/job/33410250461#step:17:539. It's a bit tricky to spot the issue, but the difference is between `3.2.0+35c6c7c628` set by PyTorch and `3.2.0+git35c6c7c6` from triton (look closely: one hash has 10 characters, the other 8).

Triton now has its own nightly build logic in https://github.com/triton-lang/triton/pull/4812 that takes only 8 characters by default while the original logic from PT took 10. So, PT nightly couldn't find the dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141410
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-23 04:56:40 +00:00
1f734bc90c Add bfloat16 support to torch.bmm(NST, NST) (#141380)
Adds bfloat16 support to torch.bmm(NST, NST) where NST is NestedTensor with the torch.strided (default) layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141380
Approved by: https://github.com/jbschlosser
2024-11-23 04:18:48 +00:00
66f2550328 Revert "Fix pytorch-triton nightly checksum shorthand (#141410)"
This reverts commit 9f8a19172d3ec417f8a6dce57d62d2aacc36c07c.

Reverted https://github.com/pytorch/pytorch/pull/141410 on behalf of https://github.com/huydhn due to There is still a small tweak that I need to do 35c6c7c628 is now git35c6c7c6 so a prefix is needed, going to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/141410#issuecomment-2495291851))
2024-11-23 04:16:39 +00:00
6cc22976a0 [ROCm][CI] upgrade CI and manywheel docker images to ROCm 6.2.4 (#140851)
Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](924c1fe3f3/.github/workflows/docker-builds.yml (L50)) run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](9abd4d95bb/.github/actions/calculate-docker-image/action.yml (L171)): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140851
Approved by: https://github.com/jeffdaily
2024-11-23 03:36:27 +00:00
9f8a19172d Fix pytorch-triton nightly checksum shorthand (#141410)
The binary build is failing in trunk after https://github.com/pytorch/pytorch/pull/139206 landed, for example, https://github.com/pytorch/pytorch/actions/runs/11981181986/job/33410250461#step:17:539. It's a bit tricky to spot the issue, but the difference is between `3.2.0+35c6c7c628` set by PyTorch and `3.2.0+git35c6c7c6` from triton (look closely: one hash has 10 characters, the other 8).

Triton now has its own nightly build logic in https://github.com/triton-lang/triton/pull/4812 that takes only 8 characters by default while the original logic from PT took 10. So, PT nightly couldn't find the dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141410
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-23 03:25:52 +00:00
a8ab6b0938 Fix failing internal codecache test (#141405)
When the internal remote cache version was bumped to 11, this test started failing; apparently no one noticed, and it got disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141405
Approved by: https://github.com/aorenste
2024-11-23 02:01:02 +00:00
1aea642393 pytorch/feature: Record if inductor fx cache is enabled (#141059)
This uses the underlying infrastructure and records if the fx cache is
enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141059
Approved by: https://github.com/masnesral
2024-11-23 01:55:27 +00:00
68be990519 Flag TORCH_SDT_SEMAPHORE as being name resovable (#141191)
Summary:
Mirroring changes in D64604573, it appears this code in libcaffe2 is
mostly a copy of folly's one. Copy of the original diff summary:

This particular inline assembly use cannot be converted to a constraint template parameter (until llvm 18 / gcc 14), as there is no way (until those versions) to specify that a non-pic relocation is needed when compiling under pic. This inline assembly requires a non-pic relocation because it is being written to the .notes section, which is not .text.

Differential Revision: D66038989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141191
Approved by: https://github.com/dcci
2024-11-23 01:39:44 +00:00
eb954ef3f2 [pipelining] allow multiple backward grads (#140981)
fixes https://github.com/pytorch/pytorch/issues/139404. The input grads get saved in a new `self.bwd_cache` container and get popped off after they are used in `backward_one_chunk`

`python test/distributed/pipelining/test_schedule_multiproc.py -k test_pipeline_schedule_runtime_custom_sched`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140981
Approved by: https://github.com/wconstab
2024-11-23 00:35:08 +00:00
2e7ba0b194 Revert "Switch to using Python nested int (#141166)"
This reverts commit e2e8a7fa2e519433a4ec1071f80d2f6f843c6300.

Reverted https://github.com/pytorch/pytorch/pull/141166 on behalf of https://github.com/clee2000 due to broke docs [GH job link](https://github.com/pytorch/pytorch/actions/runs/11980936976/job/33406870951) [HUD commit link](e2e8a7fa2e) ([comment](https://github.com/pytorch/pytorch/pull/141166#issuecomment-2495112297))
2024-11-22 23:54:36 +00:00
ee7eaad5c3 [dynamo] add SymNode bitwise and/or (#138777)
Fixes [T203472723](https://www.internalfb.com/intern/tasks/?t=203472723)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138777
Approved by: https://github.com/ezyang
2024-11-22 23:36:16 +00:00
a8c90e5140 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit 6d779d05492813da1c19ac0c562d0d5f8473f27e.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is some slow tests failing after this land ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2495076878))
2024-11-22 23:18:36 +00:00
c328d200ff [SDPA][CUDA] resync sm90+ priority order for SDPA with test_export.py (#141274)
Since we deprioritized cuDNN SDPA, this test fails on `sm90+`. This PR just changes the expected backend for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141274
Approved by: https://github.com/drisspg
2024-11-22 23:16:41 +00:00
0be0c944b1 Revert "Forward / backward NJT support for several activation functions (#140736)"
This reverts commit af70f5e04c69839a1a0e08942254c170dc4c3d61.

Reverted https://github.com/pytorch/pytorch/pull/140736 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/140736#issuecomment-2495075871))
2024-11-22 23:15:55 +00:00
4acc988630 Add ciflow/inductor-cu126 label (#141377)
No op to unblock the testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141377
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-11-22 23:14:24 +00:00
2aac2ec664 [dynamo] fix sorted(...) when key function is explicitly passed with key=None (#141224)
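A minimal repro-style sketch of the pattern being fixed (illustrative only; not taken from the PR's tests):

```python
import torch

@torch.compile
def sort_list(xs):
    # Explicitly passing key=None used to trip Dynamo's handling of sorted().
    return sorted(xs, key=None)

print(sort_list([3, 1, 2]))  # [1, 2, 3]
```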
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141224
Approved by: https://github.com/jansel
2024-11-22 22:42:46 +00:00
57eea3f8e2 Fix a -Wshadow warning in ATen/native/Math.h (#141361)
Move declaration down to point where it's needed, don't redeclare.

Differential Revision: [D66376820](https://our.internmc.facebook.com/intern/diff/D66376820/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141361
Approved by: https://github.com/Skylion007
2024-11-22 22:33:04 +00:00
0ce0e44237 Add workaround for potential runners issue on s390x (#141239)
More information is at
https://gitlab.com/qemu-project/qemu/-/issues/2600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141239
Approved by: https://github.com/huydhn
2024-11-22 22:17:55 +00:00
e2e8a7fa2e Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-11-22 22:12:25 +00:00
af70f5e04c Forward / backward NJT support for several activation functions (#140736)
Several activation functions were unimplemented due to missing `pointwise` tags. This PR adds them and corresponding backwards implementations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140736
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2024-11-22 22:05:53 +00:00
5062bbcd86 [inductor] Add missing get_reads() method (#141310)
Summary: This is a possible fix for https://fb.workplace.com/groups/1075192433118967/permalink/794017756161443/

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test -- --exact 'ai_infra/distributed_ai/pyper_test_framework/pt2_staging_tests/sw_v2:smallworld_cmf_test - test_train (ai_infra.distributed_ai.pyper_test_framework.pt2_staging_tests.sw_v2.smallworld_cmf_test.CmfTest)'
```

Differential Revision: D66340927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141310
Approved by: https://github.com/ezyang
2024-11-22 22:00:18 +00:00
d16aa566ea [FlexAttention] Speed up gradcheck tests (#141356)
# Summary
### Before
```Shell
48.71s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_captured_score_mod_aot_eager_gradcheck_score_mod_name__head_offset_mode_aot_eager
```
### After
Speeds up grad check tests by 10x
```Shell
4.74s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_captured_score_mod_aot_eager_gradcheck_score_mod_name__head_offset_mode_aot_eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141356
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #141164, #141185
2024-11-22 21:18:21 +00:00
32583d915e [export] Improve stacktrace filtering (#141285)
Differential Revision: [D66321127](https://our.internmc.facebook.com/intern/diff/D66321127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141285
Approved by: https://github.com/yushangdi
ghstack dependencies: #141071, #141072
2024-11-22 20:55:04 +00:00
53df1c11cd [export] Add custom op guards (#141072)
For custom ops that do not have a meta kernel, draft export automatically creates a meta kernel based on the tracing example inputs. To ensure that the assumptions made during tracing are clear to the user, we add assertions into the traced exported program:

An example graph:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, a: "f32[s0, s1]", b: "f32[s2, s3]"):
             # File: /data/users/angelayi/pytorch/test/export/test_draft_export.py:172 in forward, code: res1 = torch.ops.mylib.foo4(a, b)
            _assert_tensor_metadata = torch.ops.aten._assert_tensor_metadata(a, dtype = torch.float32, device = device(type='cpu'));  _assert_tensor_metadata = None
            _assert_tensor_metadata_1 = torch.ops.aten._assert_tensor_metadata(b, dtype = torch.float32, device = device(type='cpu'));  _assert_tensor_metadata_1 = None
            foo4: "f32[u2, u3]" = torch.ops.mylib.foo4.default(a, b);  a = b = None
            return (foo4,)
```

Differential Revision: [D66321129](https://our.internmc.facebook.com/intern/diff/D66321129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141072
Approved by: https://github.com/pianpwk
ghstack dependencies: #141071
2024-11-22 20:55:04 +00:00
0fbc0830ba [export] Add device and dtype fields to assert_tensor_metadata (#141071)
Differential Revision: [D66321128](https://our.internmc.facebook.com/intern/diff/D66321128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141071
Approved by: https://github.com/yushangdi, https://github.com/zou3519
2024-11-22 20:54:55 +00:00
45d62d6fc5 [dynamo] Added cuda and triton versions to dynamo_compile (#141290)
Opening another PR since #141140 was reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141290
Approved by: https://github.com/masnesral
2024-11-22 20:04:42 +00:00
2a6eaa2e6f Refactor nightly pull tool to use venv and pip (#141281)
Resolves #141238

- #141238

Example output:

```console
$ python3.12 tools/nightly.py checkout -b my-nightly-branch -p my-env --python python3.10
log file: /Users/PanXuehai/Projects/pytorch/nightly/log/2024-11-22_04h15m45s_63f8b29e-a845-11ef-bbf9-32c784498a7b/nightly.log
Creating virtual environment
Creating venv (Python 3.10.15): /Users/PanXuehai/Projects/pytorch/my-env
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cpu): pip, setuptools, wheel
Installing packages took 5.576 [s]
Creating virtual environment took 9.505 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cpu): torch
Downloaded 9 file(s) to /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/pip-download-lty5dvz4:
  - mpmath-1.3.0-py3-none-any.whl
  - torch-2.6.0.dev20241121-cp310-none-macosx_11_0_arm64.whl
  - jinja2-3.1.4-py3-none-any.whl
  - sympy-1.13.1-py3-none-any.whl
  - MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl
  - networkx-3.4.2-py3-none-any.whl
  - fsspec-2024.10.0-py3-none-any.whl
  - filelock-3.16.1-py3-none-any.whl
  - typing_extensions-4.12.2-py3-none-any.whl
Downloading packages took 7.628 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cpu): numpy, cmake, ninja, packaging, ruff, mypy, pytest, hypothesis, ipython, rich, clang-format, clang-tidy, sphinx, mpmath-1.3.0-py3-none-any.whl, jinja2-3.1.4-py3-none-any.whl, sympy-1.13.1-py3-none-any.whl, MarkupSafe-3.0.2-cp310-cp310-macosx_11_0_arm64.whl, networkx-3.4.2-py3-none-any.whl, fsspec-2024.10.0-py3-none-any.whl, filelock-3.16.1-py3-none-any.whl, typing_extensions-4.12.2-py3-none-any.whl
Installing packages took 42.514 [s]
Installing dependencies took 42.515 [s]
Unpacking wheel file
Unpacking wheel file took 3.223 [s]
Checking out nightly PyTorch
Found released git version ac47a2d9714278889923ddd40e4210d242d8d4ee
Found nightly release version e0482fdf95eb3ce679fa442b50871d113ceb673b
Switched to a new branch 'my-nightly-branch'
Checking out nightly PyTorch took 0.198 [s]
Moving nightly files into repo
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/_C.cpython-310-darwin.so -> /Users/PanXuehai/Projects/pytorch/torch/_C.cpython-310-darwin.so
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/lib/libtorch_python.dylib -> /Users/PanXuehai/Projects/pytorch/torch/lib/libtorch_python.dylib
...
Linking /var/folders/sq/7sf73d5s2qnb3w6jjsmhsw3h0000gn/T/wheel-dljxil5i/torch-2.6.0.dev20241121/torch/include/c10/macros/Macros.h -> /Users/PanXuehai/Projects/pytorch/torch/include/c10/macros/Macros.h
Moving nightly files into repo took 11.426 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.036 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /Users/PanXuehai/Projects/pytorch/my-env/bin/activate
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141281
Approved by: https://github.com/seemethere
2024-11-22 20:03:55 +00:00
75cecba164 [inductor] Move fusion heuristics to V.choices (#141108)
This is a refactor to enable out of tree autotuners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141108
Approved by: https://github.com/yanboliang
2024-11-22 19:53:07 +00:00
d8c14838f1 [ca] dead code elimination for compile time (#141289)
Although these nodes are eventually inlined away, they increase compile time, especially when initial CA graph capture treats all shapes as dynamic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141289
Approved by: https://github.com/jansel
ghstack dependencies: #141152, #141153
2024-11-22 19:26:27 +00:00
db4e8a1d8a [ca] expose option to collect sizes as dynamic (#141153)
This is to address recompiles from eager nodes that saved dynamic activations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141153
Approved by: https://github.com/jansel
ghstack dependencies: #141152
2024-11-22 19:26:27 +00:00
1024a1c3d1 [ca] fix dynamic shape logging (#141152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141152
Approved by: https://github.com/jansel
2024-11-22 19:26:27 +00:00
7c5c38da23 Fix constant lifting pass when there is no user input (#141157)
Differential Revision: [D66253854](https://our.internmc.facebook.com/intern/diff/D66253854/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141157
Approved by: https://github.com/zhxchen17
2024-11-22 19:08:25 +00:00
40d0740e73 [PT2][Optimus] Fix a corner case in merge splits (#141194)
Summary:
We found another corner case in merge splits where the first split node does not have consecutive getitem indices; such cases need to be skipped.

{F1964255863}

Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split  --flow_id 666002198 2>&1 | tee ~/cmf.txt
```

P1683429791

Differential Revision: D66275387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141194
Approved by: https://github.com/jackiexu1992
2024-11-22 19:04:40 +00:00
e54538afc8 [export] fix sympy.expr roundtrippability for serialization (#141284)
Summary:
Latest attempt after [136802](https://github.com/pytorch/pytorch/pull/136802) and [140084](https://github.com/pytorch/pytorch/pull/140084) got shelved.

This keeps the string format for `expr_str`, but calls `sympy.printing.repr.srepr(s)` instead of `str(s)`, which prints expressions more explicitly, e.g.
```
((2*x)//(3*y + 4)) -> "FloorDiv(Mul(Integer(2), Symbol('x')), Add(Mul(Integer(3), Symbol('y')), Integer(4)))"
```

This is nice because:
- we have better roundtrippability for deserialization, robust to pretty printing changes like [this](6c9bfd52b6/torch/utils/_sympy/functions.py (L208)) that caused the issue in the first place.
- this preserves the BC surface for both 1) sigmoid thrift serialization, by keeping the string format, and 2) deserialization for old IRs, since `sympy.sympify(...)` still handles the old `str(s)` format.
- more memory efficient than storing ASTs; the [AST attempt](https://github.com/pytorch/pytorch/pull/140084) increased artifact size by 20% on some toy programs.
- doesn't even require a schema version bump.

Additionally, to push some test cases over the line, this redoes expression processing (handling ranges, symbol caching) by doing bottom-up processing instead of the current hacky-ish workflow.
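As a small illustration of the round-trip property, here is a plain-sympy sketch (using `sympy.floor` in place of torch's `FloorDiv`; not code from this diff):

```python
import sympy

x, y = sympy.symbols("x y")
expr = sympy.floor(2 * x / (3 * y + 4))  # stands in for the FloorDiv example above

s = sympy.srepr(expr)            # explicit, unambiguous string form
roundtripped = sympy.sympify(s)  # sympify reconstructs the expression from srepr output
assert roundtripped == expr
```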

Test Plan: test_serdes, test_serialize, internal tests broken by AST PR

Differential Revision: D66283208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141284
Approved by: https://github.com/zhxchen17
2024-11-22 18:47:04 +00:00
e6962f8f19 [c10d] Relax CUDA context test criteria (#141298)
After `destroy_process_group`, it may be possible that the CUDA context finishes its job and exits, thus NVML detects 0 processes on the device. This PR relaxes the current check condition (there must be exactly 1 active process on that device) to cover this possibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141298
Approved by: https://github.com/eqy
2024-11-22 18:38:25 +00:00
57fc070e08 [triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)
Bump the Triton pin to the release candidate commit for Triton 3.2.

A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-11-22 18:34:32 +00:00
313dac6c1c [export] Fix name inconsistency between thrift and schema.py (#141151)
Summary: The struct type is named "InputToConsantInputSpec" in thrift, which causes an inconsistency with the schema. Renaming the type from one name to another is acceptable because it doesn't change the on-wire format.

Test Plan: CI

Differential Revision: D66240951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141151
Approved by: https://github.com/yiming0416
2024-11-22 18:04:23 +00:00
44d5012a80 Revert "[triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)"
This reverts commit c93e57efac091f246b599b4fcdc189ed94753b43.

Reverted https://github.com/pytorch/pytorch/pull/139206 on behalf of https://github.com/atalman due to Will revert and reland skipping xpu builds ([comment](https://github.com/pytorch/pytorch/pull/139206#issuecomment-2494437857))
2024-11-22 18:01:18 +00:00
6d779d0549 Always unspecialize float in OSS (#138922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-22 17:54:42 +00:00
2239d1a7a3 Revert "[CI, 3.13] enable 3.13 CI (#139533)"
This reverts commit b7a25c1ee7cdb559516db2b10279c996742a1708.

Reverted https://github.com/pytorch/pytorch/pull/139533 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing test_cpp_extensions_open_device_registration. The test was wrongly excluded by TD ([comment](https://github.com/pytorch/pytorch/pull/139533#issuecomment-2494328806))
2024-11-22 17:18:49 +00:00
cf1d95a965 Revert "Add option to split Linear gates for Quantizable LSTM into separate ops (#140868)"
This reverts commit 3fcf66f61fbc8f760fc0d34356a60b76c3f2e27c.

Reverted https://github.com/pytorch/pytorch/pull/140868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think lint is failing on this in trunk ([comment](https://github.com/pytorch/pytorch/pull/140868#issuecomment-2494076202))
2024-11-22 15:54:05 +00:00
080f992d68 Revert "[CI] Reduce distributed test timeout to 60s (#141168)"
This reverts commit e8de8f3969bf935442378efd125442de90e78431.

Reverted https://github.com/pytorch/pytorch/pull/141168 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think we missed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/141168#issuecomment-2494060624))
2024-11-22 15:46:37 +00:00
f23621ec56 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit c25b201583fc28243b87c460a2f18e2531a676e7.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978))
2024-11-22 15:43:39 +00:00
cc90ba8924 Revert "[sparse] add extra options to _cslt_spare_mm (#137427)"
This reverts commit 45b30a5aecf31ec26d9b2dc86d5170f9618a7766.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_sparse_semi_structured is failing in trunk after it lands ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2494047577))
2024-11-22 15:40:21 +00:00
7a9d0e3c06 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-11-22 15:07:46 +00:00
c93e57efac [triton] Update pin for PyTorch 2.6/Triton 3.2 (#139206)
Bump the Triton pin to the release candidate commit for Triton 3.2.

A few changes beyond the pin bump itself are needed:
* Remove the script that adds a git version hash suffix to the Triton wheel, since as of https://github.com/triton-lang/triton/pull/4812 Triton adds that itself
* Add `pybind11` to the Triton build setup, since Triton now depends on it
* Use manylinux-2.28 for the Triton wheel builder, and use clang+lld for building to pick up the right glibc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139206
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-11-22 14:50:22 +00:00
b7a25c1ee7 [CI, 3.13] enable 3.13 CI (#139533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139533
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-22 14:43:02 +00:00
e0d97e936a OpenReg: Fix releasing tensor issue when exiting process (#140936)
When executing the following code:

```
import pytorch_openreg

import torch

if __name__ == "__main__":
    a = torch.tensor(1, device="openreg")

```
Sometimes releasing tensor `a` fails after the process finishes executing the `main` function. The release path for `a` is `~Tensor()` -> ... -> `OpenRegMem.cpp` -> `OpenRegHooks.cpp` -> `_aten_impl.py`.

There are two failed scenarios I've found:

1. Segmentation fault: Before executing `~Tensor()`, the process has released global variables in `_aten_impl.py`, which causes the issue.
2. Waiting indefinitely: The main process passes the `free ptr` command to the daemon process, but the daemon processes have already shut down.

The fix is to ignore the del-ptr operation when the process is shutting down.
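A minimal sketch of that guard (hypothetical names; `_send_free_to_daemon` stands in for the real IPC call in `_aten_impl.py`):

```python
import sys

def _send_free_to_daemon(ptr):
    # Hypothetical stand-in for the real message sent to the OpenReg daemon.
    print(f"free {ptr:#x}")

def free_ptr(ptr):
    if sys.is_finalizing():
        # Interpreter is shutting down: module globals and the daemon process
        # may already be gone, so drop the free instead of crashing or hanging.
        return
    _send_free_to_daemon(ptr)

free_ptr(0xDEADBEEF)
```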

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140936
Approved by: https://github.com/ezyang
2024-11-22 13:50:35 +00:00
4009d15412 Optimize hook description of register_module_forward_hook (#140379)
Fixes #74024

Optimize description as the issue suggested

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140379
Approved by: https://github.com/mikaylagawarecki
2024-11-22 13:40:45 +00:00
1af69eee4a Solid XPU UT test_memory_allocation (#141325)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/141326

# Additional Context
We use the previous value queried by these APIs as the reference value rather than 0. With this PR, we don't depend on the Python garbage collection mechanism anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141325
Approved by: https://github.com/EikanWang
2024-11-22 13:14:49 +00:00
f497a0039c API to retrieve default distributed backend from device (#140536)
# Motivation
The distributed APIs rely on backend names for the creation of a process group.
To abstract references to these names out of PG creation, an API is added to get the default distributed backend for a device.
The device code would need to register its device and backend via ```torch.distributed.Backend.register_backend``` or update the map ``` torch.distributed.Backend.default_device_backend_map["device"] = "distributed_backend" ``` prior to using the API.

An example of use is added in the test file (which can be used to check the abstracted APIs).
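A minimal sketch of the registration side described above (the device and backend names are placeholders, and the direct map lookup merely stands in for the new accessor, whose exact name is not quoted here):

```python
import torch.distributed as dist

# Register a default backend for a (hypothetical) device type.
dist.Backend.default_device_backend_map["fake_device"] = "fake_backend"

# The new API would consult this mapping; shown here as a plain lookup.
print(dist.Backend.default_device_backend_map.get("fake_device"))  # fake_backend
```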

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140536
Approved by: https://github.com/kwen2501
2024-11-22 11:01:53 +00:00
7d89a8d385 Add ExportedProgram type annotation (#141247)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141247
Approved by: https://github.com/Skylion007
2024-11-22 10:40:42 +00:00
a6344c8bcd Throw an error if args contain reserved python keywords (#135357)
This PR adds a check for reserved python keywords in the `torchgen/gen.py/error_check_native_functions`  function.
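A hypothetical sketch of that kind of check (the real validation lives in `torchgen/gen.py` and operates on parsed native-function schemas):

```python
import keyword

def error_check_arg_names(op_name: str, arg_names: list[str]) -> None:
    # Reject schema argument names that collide with reserved Python keywords.
    for name in arg_names:
        if keyword.iskeyword(name):
            raise AssertionError(
                f"{op_name}: argument name '{name}' is a reserved Python keyword"
            )

error_check_arg_names("myop", ["input", "out"])      # passes
# error_check_arg_names("myop", ["input", "from"])   # would raise AssertionError
```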

Fixes #135127
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135357
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-22 07:44:50 +00:00
bd971cc395 safer check for isatty in fx/_utils.py (#140876)
if no isatty method is defined, it's probably not a tty
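A minimal sketch of the safer check (illustrative; not the exact code in `fx/_utils.py`):

```python
import sys

def _is_tty(stream) -> bool:
    # If the object doesn't define isatty at all, assume it isn't a terminal
    # instead of raising AttributeError.
    isatty = getattr(stream, "isatty", None)
    return callable(isatty) and bool(isatty())

print(_is_tty(sys.stdout))  # True in a terminal, False when piped
print(_is_tty(object()))    # False: no isatty method at all
```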

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140876
Approved by: https://github.com/ezyang
2024-11-22 07:27:28 +00:00
1bdb92cbff [2/N] Use thread-safe strerror (#141011)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141011
Approved by: https://github.com/ezyang
2024-11-22 07:02:30 +00:00
8b13ed594a Add skip_first_wait to profiler.schedule (#141070)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/91888

We use `wait` as the number of steps to wait between cycles when profiling, and `skip_first` to delay the start of profiling. However, once the `skip_first` steps are completed, we immediately enter the wait phase. This is not problematic if `wait` is smaller than `skip_first`, because we can just lower `skip_first`, but if it is larger we end up starting the first profile much later than desired. For example, imagine `skip_first=1` and `wait=100` with `repeat=2`: we do want to wait 100 steps between cycles 1 and 2, but we may not want to start the warmup of cycle 1 at step 101 (which is forced, because the wait occurs directly after the skipped first steps). This diff addresses that by adding a flag to skip the first wait.

Adds the new flag but sets it to False by default so that the existing implementation is not affected.

Test Plan:
Got reasonable traces with this schedule:

```
schedule = torch.profiler.schedule(
    wait=10, warmup=3, active=1, repeat=1, skip_first=1, skip_first_wait=1
)
```
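A minimal usage sketch (assuming the new `skip_first_wait` keyword shown above; the CPU activity and toy workload are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

sched = schedule(wait=10, warmup=3, active=1, repeat=1,
                 skip_first=1, skip_first_wait=1)

with profile(activities=[ProfilerActivity.CPU], schedule=sched) as prof:
    for _ in range(20):
        torch.randn(64, 64) @ torch.randn(64, 64)
        # With skip_first_wait=1, warmup starts right after the skipped step
        # instead of after an additional 10-step wait.
        prof.step()
```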

Differential Revision: D66198138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141070
Approved by: https://github.com/aaronenyeshi, https://github.com/briancoutinho
2024-11-22 06:40:58 +00:00
a3e516d165 [aoti] Split custom ops tests (#140977)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140977
Approved by: https://github.com/desertfire
2024-11-22 06:18:25 +00:00
3acc6eac49 [inductor] Add typing to ir.py 2 (#140915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140915
Approved by: https://github.com/aorenste
2024-11-22 04:56:54 +00:00
35ecca735e [2/N] Replace at::detail::Array with std::array (#141205)
Follows #122064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141205
Approved by: https://github.com/ezyang
2024-11-22 04:44:40 +00:00
3fcf66f61f Add option to split Linear gates for Quantizable LSTM into separate ops (#140868)
Summary:
For LSTM, the input and hidden state are projected with Linear layers to construct the 4 gates. This is typically performed together as a single Linear (for each state) with output channel count `4 * hidden_dim` for efficiency.
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=52-58
The output is then ultimately split into 4:
https://www.internalfb.com/code/fbsource/[ebef7c4238aa55948b2b444044f2c8ed2040de55]/fbcode/caffe2/torch/ao/nn/quantizable/modules/rnn.py?lines=83-87

For on-device latency (and possibly memory) considerations, we want to avoid constructing the intermediate `gates` tensor (which can be relatively large), by splitting `igates` and `hgates` first (as 4x `Linear(hidden_dim, hidden_dim)` each), applying add separately, then proceeding as usual.

This functionality can be enabled by specifying `split_gates=True` (default False is original behavior) at any entry point (directly with `torch.ao.nn.quantizable.LSTM`  or via `_get_lstm_with_individually_observed_parts`).
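A shape-only sketch of the idea (illustrative; not the `torch.ao.nn.quantizable.LSTM` code itself):

```python
import torch
import torch.nn as nn

hidden_dim = 16
x = torch.randn(2, hidden_dim)

# Fused: one Linear produces all 4 gates, materializing a (2, 4*hidden_dim) tensor.
fused_ih = nn.Linear(hidden_dim, 4 * hidden_dim)
gates = fused_ih(x).chunk(4, dim=-1)

# Split: four per-gate Linears, so the large intermediate is never built.
split_ih = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(4))
gates_split = [proj(x) for proj in split_ih]
```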

Test Plan:
piggy back on existing test to check for correct swap handling, numerics, and jit.script during prepare/convert
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_custom_module_lstm (caffe2.test.quantization.core.test_quantized_op.TestQuantizedOps)'
```
https://www.internalfb.com/intern/testinfra/testrun/11540474102848372

This test is quite long running now (more than double original).

Reviewed By: Ninja91

Differential Revision: D65283170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140868
Approved by: https://github.com/jerryzh168
2024-11-22 04:10:26 +00:00
150ffb6e07 [flight recorder] Updated MatchState to have a member variable (#141297)
Summary: Without this change, calling `str(MatchState.SOMETHING)` will raise an exception.

Test Plan:
Can we add unittest somewhere?
Ensure `str(MatchState.FULLY_MATCHED)` and `str(MatchState.FULLY_MATCHED())` won't raise an exception.

Differential Revision: D66321609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141297
Approved by: https://github.com/fduwjj
2024-11-22 03:14:34 +00:00
3bec67b8e5 Fix tests in test/test_serialization that were failing if run individually (#141300)
#140739 and #140740 made it such that `get_safe_globals` no longer returns an empty list by default

This caused some tests that check the contents of `get_safe_globals` to fail when run individually (they didn't fail in the full test suite because other tests that ran before them called `clear_safe_globals`) [T208186010](https://www.internalfb.com/intern/tasks/?t=208186010)

test_safe_globals_for_weights_only
test_safe_globals_context_manager_weights_only

This PR fixes that and also makes most tests calling `clear_safe_globals` use the `safe_globals` context manager rather than try/finally.
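A small sketch of the context-manager pattern (using the helpers named above):

```python
from torch.serialization import get_safe_globals, safe_globals

class MyClass:
    pass

with safe_globals([MyClass]):
    assert MyClass in get_safe_globals()
# On exit the allowlist is restored, so MyClass does not leak into later tests.
assert MyClass not in get_safe_globals()
```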

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141300
Approved by: https://github.com/awgu
2024-11-22 02:40:37 +00:00
dbe6fce185 [CUDA][Nightly Binary] Remove PTX from cuda 12.4 Nightly (#141142)
- Separate cuda 12.4 | 12.6 logic
- Remove PTX from cuda 12.4
- Remove deprecated cuda 11.[6/7]

Discussed in https://github.com/pytorch/pytorch/issues/137374#issuecomment-2489200733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141142
Approved by: https://github.com/atalman
2024-11-22 02:34:59 +00:00
c25b201583 Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
c83b739f14 Migrate pull jobs cuda12.1->cuda12.4 (#141271)
CUDA 12.1 nightly builds were deprecated, hence there is no reason to keep testing CUDA 12.1 in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141271
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-22 01:52:38 +00:00
f28bac76f5 [AOTI Minifier] Save EP instead of graphs (#141159)
Summary:
`repro.py` can have nested graph modules, e.g.

```
class Repro(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.true_graph_0 = GraphModule()

    def forward(self):
        true_graph_0 = self.true_graph_0
        return (true_graph_0,)
```

So dumping the string doesn’t always work.

So,
1) we use exported program in repro.py instead
2) we still dump the graph module string, but only put it in comments

We also added two flags to `minifier_launcher.py`
- `minifier-export-mode`: whether strict or non-strict export is used in the minifier
- `skip-export-error`: intermediate graphs that cannot be exported will be skipped.

Test Plan:
```
buck2 run  fbcode//caffe2/test/inductor:minifier_utils_cpu  -- -r string
python test/inductor/test_minifier.py
```

Differential Revision: D66175257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141159
Approved by: https://github.com/henrylhtsang
2024-11-22 01:51:10 +00:00
ca9813ea14 Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case (#140258)
As suggested by @leslie-fang-intel in 4c83e4e751 (diff-139642bd981df977f70f4c18c1c34bd1a85c1d6b9ffa06aaa98426ed83942a31R537) - all elements of `B` tiles (not referring to AMX tiles, but the tiles at the granularity of the micro-kernel) have contiguous elements since `B` matrix is pre-packed, so dequantized buffer loading logic can be simplified. While the previous approach kept elements to be loaded into a B AMX tile contiguous, the new approach doesn't entail any performance penalty either because that data is already in L1D, so loading AMX tiles from non-contiguous dequantized B elements doesn't adversely affect performance.

Also rectified the size of the dequantized B buffer.

Fixes #140208.

A subsequent PR will factor out caching of dequantized int8 weights into a separate codegen function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140258
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-11-22 01:34:06 +00:00
f5d00f1456 pytorch/features: Make a feature logger and record triton bundling (#141056)
This modifies metrics_context to allow us to store whether a feature was
used or not.

This also starts recording this for triton bundling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141056
Approved by: https://github.com/masnesral
2024-11-22 01:31:08 +00:00
0155a112fd [export] avoid name collision when inlining node (#141169)
Summary:
When we have both `set_grad` and `autocast` HOPs, a name collision might happen when we try to inline a node.

For example, for a GraphModule like this:

```
GraphModule(
  (submod_0): GraphModule(
    (submod_1): GraphModule()
  )
  (submod_1): GraphModule()
  (submod_2): GraphModule()
)

```

when we inline `submod_0`, we might accidentally overwrite `submod_1`.

In this PR, we fix this by checking whether the graph module already has an attribute with the same name; if so, we try the next "submod_{i}" until there is no name collision.
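A minimal sketch of that collision-avoiding lookup (illustrative helper; not the exact code in the PR):

```python
class FakeOwner:
    # Pretend the owning module already has two submodules.
    submod_0 = object()
    submod_1 = object()

def next_free_submod_name(gm, i: int = 0) -> str:
    # Bump the index until the attribute name is unused on gm.
    while hasattr(gm, f"submod_{i}"):
        i += 1
    return f"submod_{i}"

print(next_free_submod_name(FakeOwner()))  # submod_2
```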

Partially fixes https://github.com/pytorch/pytorch/issues/140589.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_predispatch_autocast_and_set_grad
```

Differential Revision: D66200994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141169
Approved by: https://github.com/angelayi
2024-11-22 01:08:22 +00:00
d8b4406e12 [MPS] Expand fused forloop to bfloat16 (#141104)
For MacOS14+

Running following script (adapted from one mentioned in https://github.com/pytorch/pytorch/pull/127242 )
```python
import torch
from torch.optim import adam, adamw
import torch.utils.benchmark as benchmark
import itertools

def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
    fn(
        params,
        grads,
        exp_avgs,
        exp_avg_sqs,
        max_exp_avg_sqs,
        state_steps,
        foreach=False,
        capturable=False,
        fused=fused,
        amsgrad=amsgrad,
        beta1=0.9,
        beta2=0.99,
        lr=1e-3,
        weight_decay=.0,
        eps=1e-5,
        maximize=False,
        grad_scale=None,
        found_inf=None,
    )
    torch.mps.synchronize()

device, dtype = "mps", torch.bfloat16

results = []

for num_tensors, numel, adamWflag, amsgrad in itertools.product([10, 50, 100], [1024, 65536, 1048576], [True, False], [True, False]):
    print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
    params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=dtype, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
    max_exp_avg_sqs = [torch.arange(numel, dtype=dtype, device=device) for _ in range(num_tensors)] if amsgrad else []
    state_steps = [torch.tensor([5], dtype=dtype, device=device) for _ in range(num_tensors)]
    fn = adamw.adamw if adamWflag else adam.adam

    for fused in [True, False]:

        t = benchmark.Timer(
                stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                label=f'Fused Adam on {device} using {dtype}',
                sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                globals=locals(),
                description= f"Fused: {fused}",
            ).blocked_autorange(min_run_time=5)
        results.append(t)

compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.colorize(rowwise=True)
compare.print()
```

Produces following results on M4Pro running MacOS 15
```
[-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------]
                                                                          |  Fused: True  |  Fused: False
1 threads: ----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10        |       283     |      2810
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10       |       277     |      2430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10       |       285     |      2400
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10      |       278     |      2250
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10       |       504     |      2700
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10      |       478     |      2600
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10      |       506     |      2500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10     |       482     |      2300
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10     |      2089     |      4190
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10    |      1940     |      3800
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10    |      2100     |      3770
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10   |      1950     |      3600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50        |       842     |     14000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50       |       835     |     11800
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50       |       845     |     11700
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50      |       855     |     11000
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50       |      1410     |     14000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50      |      1350     |     12000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50      |      1400     |     12000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50     |      1340     |     11000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50     |      9767     |     20400
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50    |      8991     |     18600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50    |      9803     |     18300
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50   |      9070     |     17600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100       |      1600     |     27000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100      |      1600     |     24100
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100      |      1600     |     23500
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100     |      1600     |     21800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100      |      2740     |     26000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100     |      2580     |     24000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100     |      2730     |     25000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100    |      2600     |     23000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100    |     19350     |     39000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100   |     17780     |     37300
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100   |     19400     |     37000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100  |     17900     |     35500
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141104
Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #141089, #141090, #141092, #141103
2024-11-22 01:07:15 +00:00
989888236e Revert "[MPS] Expand fused forloop to bfloat16 (#141104)"
This reverts commit 9a729390420570cd2528ce2e9947e3eab209660b.

Reverted https://github.com/pytorch/pytorch/pull/141104 on behalf of https://github.com/malfet due to Want to add test script to the commit message ([comment](https://github.com/pytorch/pytorch/pull/141104#issuecomment-2492659931))
2024-11-22 01:03:43 +00:00
e8de8f3969 [CI] Reduce distributed test timeout to 60s (#141168)
Pulling a PR to test viability.
Today's timeout is 300s, which could waste quite some machine time if a hang happens in CI.

Differential Revision: [D66275756](https://our.internmc.facebook.com/intern/diff/D66275756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141168
Approved by: https://github.com/clee2000
2024-11-22 00:59:55 +00:00
65166d86a3 [MPS] Add regression test for sync deadlock (#141296)
See https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492434870
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141296
Approved by: https://github.com/huydhn
2024-11-22 00:56:33 +00:00
25c0b91dbb [Docs] Make links to source link to source (#141186)
Rewrite [SOURCE] links in the API docs to point to the source file in github repo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141186
Approved by: https://github.com/malfet, https://github.com/msaroufim

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-22 00:50:19 +00:00
f708e92ba1 [Inductor] support Conv/Linear + broadcast add fusion (#138201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138201
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-22 00:47:25 +00:00
5ab5a61671 Revert "[ROCm][CI] upgrade CI to ROCm 6.2.4 (#140851)"
This reverts commit 6c9bfd52b6a76ddff053bcff4d23ea7f4c280e9a.

Reverted https://github.com/pytorch/pytorch/pull/140851 on behalf of https://github.com/jithunnair-amd due to Need to upgrade libtorch images to ROCm 6.2.4 as well ([comment](https://github.com/pytorch/pytorch/pull/140851#issuecomment-2492641342))
2024-11-22 00:44:34 +00:00
612122af8f Fix type-safety of torch.nn.Module instances (#141240)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141240
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-22 00:05:05 +00:00
f869a0ffe1 Fix the use of fsspec transactions (#135541)
fsspec transactions do not support concurrency and assume that there is at most one running transaction per filesystem. This is *not* true in our usage, where, because of multi-threading, we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file. This is done "the old-fashioned way" instead of using `fsspec`'s commit/rollback behavior, because the internal PathManagerFileSystem used for `iopath` does not properly support that behavior.
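A rough sketch of that manual commit/discard pattern (a hypothetical helper, not the code from this PR):

```python
import fsspec

def atomic_write(fs: fsspec.AbstractFileSystem, path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with fs.open(tmp, "wb") as f:
        f.write(data)
    try:
        fs.mv(tmp, path)   # "commit": move the finished file into place
    except Exception:
        fs.rm(tmp)         # "rollback": discard the partial write
        raise

fs = fsspec.filesystem("memory")
atomic_write(fs, "/ckpt/shard0.bin", b"\x00" * 16)
print(len(fs.cat_file("/ckpt/shard0.bin")))  # 16
```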

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
2024-11-22 00:03:19 +00:00
e894219504 [export] fix loss_output in joint graph signature (#140974)
Summary: joint-graph export was marking all outputs as LOSS_OUTPUT; fix it so only the correct one is marked.

Test Plan: test_experimental

Differential Revision: D66117412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140974
Approved by: https://github.com/JacobSzwejbka
2024-11-21 23:57:07 +00:00
f044c1a7c8 Fixes #140986, improves wording and grammar of nn/module.py (#140987)
Fixes #140986

This includes several improvements to the grammar and wording of nn/module.py, mostly simple one-word fixes, but also some slightly more elaborate ones.

It addresses about half of the docs for module.py but I would be glad to cover the rest of it if required to do so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140987
Approved by: https://github.com/mikaylagawarecki
2024-11-21 23:40:43 +00:00
45b30a5aec [sparse] add extra options to _cslt_spare_mm (#137427)
Summary:

Splitting this PR into two, one for the cuSPARSELt improvements, and one
for the inductor lowering.

This PR adds in the additional cuSPARSELt bindings into pytorch.

* `torch._cslt_sparse_mm_search` will be deprecated in a future PR,
  so a warning has been added

* Added a header file for cuSPARSELtOps.cpp

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch, https://github.com/eqy
2024-11-21 23:37:36 +00:00
9a72939042 [MPS] Expand fused forloop to bfloat16 (#141104)
For MacOS14+

Running following script
```python
```

Produces following results on M4Pro running MacOS 15
```
[-------------------------------- Fused Adam on mps using torch.bfloat16 -------------------------------]
                                                                          |  Fused: True  |  Fused: False
1 threads: ----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 10        |       283     |      2810
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 10       |       277     |      2430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 10       |       285     |      2400
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 10      |       278     |      2250
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 10       |       504     |      2700
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 10      |       478     |      2600
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 10      |       506     |      2500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 10     |       482     |      2300
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 10     |      2089     |      4190
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 10    |      1940     |      3800
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 10    |      2100     |      3770
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 10   |      1950     |      3600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 50        |       842     |     14000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 50       |       835     |     11800
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 50       |       845     |     11700
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 50      |       855     |     11000
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 50       |      1410     |     14000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 50      |      1350     |     12000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 50      |      1400     |     12000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 50     |      1340     |     11000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 50     |      9767     |     20400
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 50    |      8991     |     18600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 50    |      9803     |     18300
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 50   |      9070     |     17600
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100       |      1600     |     27000
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100      |      1600     |     24100
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100      |      1600     |     23500
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100     |      1600     |     21800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100      |      2740     |     26000
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100     |      2580     |     24000
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100     |      2730     |     25000
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100    |      2600     |     23000
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100    |     19350     |     39000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100   |     17780     |     37300
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100   |     19400     |     37000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100  |     17900     |     35500
Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141104
Approved by: https://github.com/qqaatw, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #141089, #141090, #141092, #141103
2024-11-21 23:30:37 +00:00
740d1eb030 Fix test_out when run on CPU with CUDA available (#137140)
Ever since #135140, this test will fail if run with CPU parameterization (e.g. test_out__refs_logical_or_cpu_float32) and CUDA available - as far as I can tell, the PyTorch CI isn't currently checking for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137140
Approved by: https://github.com/ezyang
2024-11-21 23:10:07 +00:00
37fe8015ac softshrink nan fixes (#138421)
Fixes #138385 .

Currently contains fixes for CPU and CUDA. Will add fixes for MPS as well soon if my Mac can build it from source. (Had some issues with building it on my Linux PC due to limited memory.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138421
Approved by: https://github.com/mikaylagawarecki
2024-11-21 23:06:08 +00:00
3b84fb26d0 Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)
It should help with triaging ROCm-inductor-related breakages and surfacing them in the PRs itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138623
Approved by: https://github.com/huydhn
2024-11-21 22:51:49 +00:00
ba5c4a727f Upload sccache stats into benchmark database with build step time (#140839)
Guinea pig benchmark database
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140839
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-21 22:38:45 +00:00
7b2138b864 [inductor] fix uncaught exception when checking for openmp on macos (#141208)
Based on #133776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141208
Approved by: https://github.com/Skylion007
2024-11-21 22:17:52 +00:00
e908f9278f [ONNX] Remove test_save_with_without_initializer test (#141263)
The test is flaky and obsolete, so remove it.

Fixes https://github.com/pytorch/pytorch/issues/125020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141263
Approved by: https://github.com/titaiwangms
2024-11-21 22:06:15 +00:00
e28b09517f [miniz] Make sure miniz extra_size_remaining doesn't go off bound (#141266)
#140041 added some logic to fix a zip64 header error. This PR makes sure `extra_size_remaining` doesn't overflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141266
Approved by: https://github.com/angelayi
2024-11-21 22:02:28 +00:00
5e54cf3687 Revert "Fix MPS synchronize by waiting for root buffer to complete (#140725)"
This reverts commit 9bc9d4cdb4355a385a7d7959f07d04d1648d6904.

Reverted https://github.com/pytorch/pytorch/pull/140725 on behalf of https://github.com/malfet due to It causes deadlocks when I try to run something benchmark from  https://github.com/pytorch/pytorch/pull/127242 ([comment](https://github.com/pytorch/pytorch/pull/140725#issuecomment-2492416501))
2024-11-21 21:56:22 +00:00
cc36d039d4 [FlexAttention] Rename zeros_and_scatter library (#141185)
# Summary
The previous custom op library name was a little verbose and didn't really align with how we typically name our libraries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141185
Approved by: https://github.com/Chillee
ghstack dependencies: #141164
2024-11-21 21:35:48 +00:00
073cbf2c9d [FlexAttention] Fix another IMA with captured buffers (#141164)
# Summary
We have another IMA for captured buffers when the sequences are not divisible.

Running test before this commit:
```Shell
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 447 errors
========= ERROR SUMMARY: 347 errors were not printed. Use --print-limit option to adjust the number of printed errors
```

And After
```Shell
❯ CUDA_LAUNCH_BLOCKING=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck pytest test/inductor/test_flex_attention.py -k "test_non_divisible_with_captured_buffer"
========= COMPUTE-SANITIZER
====================================================== test session starts =======================================================
platform linux -- Python 3.12.7, pytest-7.4.0, pluggy-1.5.0
rootdir: /home/drisspg/meta/pytorch
configfile: pytest.ini
plugins: hypothesis-6.115.5, typeguard-4.3.0
collected 518 items / 517 deselected / 1 selected
Running 1 items in this shard

test/inductor/test_flex_attention.py .                                                                                     [100%]

=============================================== 1 passed, 517 deselected in 13.31s ===============================================
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141164
Approved by: https://github.com/Chillee
2024-11-21 21:35:48 +00:00
a0e84ff5c6 [inductor] Check Triton Autotuner.__init__ for pre_hook/post_hook (#141040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141040
Approved by: https://github.com/aakhundov
ghstack dependencies: #140982
2024-11-21 21:30:01 +00:00
fa63276691 [user empathy day][dynamo] Support get on subclassed dicts (#141214)
Fixes https://github.com/pytorch/pytorch/issues/141138, but we need to do
a more exhaustive job of going through dict methods and checking each one
of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141214
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #141209
2024-11-21 21:18:42 +00:00
d7402cd196 [user-empathy-day][dynamo] Remove speical casing for torch.nn.Parameter tracing (#141209)
This was done to reduce compile time earlier, but I have seen two cases in the past
month where this code falters, one from the user empathy day -
https://docs.google.com/document/d/1nEX1GtKhNzid6NvNg5CaVamO6JrJoKPuJ2iueWUYFWc/edit?tab=t.0

So removing this code. It can affect compile time for a few models by a
few seconds, but it's way less code to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141209
Approved by: https://github.com/jansel
2024-11-21 21:18:42 +00:00
6c9bfd52b6 [ROCm][CI] upgrade CI to ROCm 6.2.4 (#140851)
Fixes issue of long docker build times in PRs which trigger the docker build in regular PyTorch build jobs eg. https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198. These docker builds take a long time for ROCm6.2 because:
1. They are run on less capable machines (`c5.2xlarge`) instead of the beefier ones on which [docker-build workflows](924c1fe3f3/.github/workflows/docker-builds.yml (L50)) run (`c5.12xlarge`)
2. ROCm6.2 docker builds enabled building of MIOpen from source, which runs into [timeout of 90mins](9abd4d95bb/.github/actions/calculate-docker-image/action.yml (L171)): https://github.com/pytorch/pytorch/actions/runs/11751388838/job/32828886198#step:7:160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140851
Approved by: https://github.com/jeffdaily
2024-11-21 21:12:48 +00:00
04f569a524 [ROCm] AMDSMI memory usage unification (#139900)
Fixes https://github.com/pytorch/pytorch/issues/140638

The old implementation used vram_used, which is not the correct equivalent of the pynvml memory-utilization API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139900
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-11-21 21:11:39 +00:00
614e727191 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit cd942d00dde73dbf9d7c5f89fdd7152f3440c4ca.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/izaitsevfb due to causes crash internally during test listing ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2492328790))
2024-11-21 21:05:22 +00:00
6ba5fa47ea Add reference to pad_packed_sequence in pack_padded_sequence doc (#137294)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137294
Approved by: https://github.com/mikaylagawarecki
2024-11-21 21:01:17 +00:00
4e34fbdcbc Add inductor_fx_graph_cache stats to dynamo_utils (#141190)
Summary:
Add the following inductor fx graph cache stats to dynamo compile

- inductor_fx_cache_hit_count
- inductor_fx_cache_miss_count
- inductor_fx_cache_backend_type
- inductor_fx_cache_hit_keys
- inductor_fx_cache_miss_keys
- remote_cache_version

Test Plan: Run local tests and staging logger: P1683061460

Differential Revision: D66232206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141190
Approved by: https://github.com/masnesral
2024-11-21 20:59:10 +00:00
149677e30c Revert "[dynamo] Added cuda and triton versions to dynamo_compile" (#141280)
Reverts pytorch/pytorch#141140

reason: conflicts with https://github.com/pytorch/pytorch/pull/141190 and wasn't merged using mergebot

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141280
Approved by: https://github.com/clee2000, https://github.com/kit1980
2024-11-21 20:50:06 +00:00
11d0ba068f [dynamo] Added cuda and triton versions to dynamo_compile (#141140)
[dynamo] Added cuda and triton versions to dynamo_compile (#141140)

Summary:

Add cuda and triton versions to dynamo_compile logging site.

Test Plan:
$ buck2 run mode/opt //scripts/oulgen:runner
File changed: fbcode//caffe2/torch/_dynamo/convert_frame.py
Buck UI: https://www.internalfb.com/buck2/1a8ada1f-d54e-44b2-a368-b2ff2030e113
Network: Up: 65KiB  Down: 0B  (reSessionID-8f4d1d6d-a680-4ecc-8e73-c29c932d824b)
Jobs completed: 2166. Time elapsed: 7.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
BUILD SUCCEEDED
...
Cuda: 12.4.0
Triton: 3.0.0

Reviewed By: masnesral

Differential Revision: D66181508
2024-11-21 12:20:02 -08:00
4fb4aa3e70 Updated docstrings referring to torch.expand to point to torch.Tensor.expand (#140045)
`torch.expand` was moved to `torch.Tensor.expand` but some docstrings still refer to `torch.expand`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140045
Approved by: https://github.com/mikaylagawarecki
2024-11-21 20:13:41 +00:00
d3c8f1af8d Revert "[export] serialize sympy.Exprs as ASTs instead of strings (#140084)"
This reverts commit d869344bc00bf7de815a2b69fb0909e7229bc5bf.

Reverted https://github.com/pytorch/pytorch/pull/140084 on behalf of https://github.com/izaitsevfb due to reverted internally in D66253238 ([comment](https://github.com/pytorch/pytorch/pull/140084#issuecomment-2492165667))
2024-11-21 20:09:54 +00:00
da94ab0b66 [inductor] Add typing to ir.py 1 (#140912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140912
Approved by: https://github.com/aorenste
ghstack dependencies: #140895, #140910
2024-11-21 20:01:57 +00:00
6eca0aee76 [inductor] Refactor ir.Layout into ir.OutputSpec (#140910)
This separates the concepts of a Layout (size/stride/etc.) and an OutputSpec (which includes multiple outputs), which should make typing easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140910
Approved by: https://github.com/ezyang
ghstack dependencies: #140895
2024-11-21 20:01:57 +00:00
827f2f749e [CUTLASS] Raise NotImplementedError if X & W aren't FixedLayout (#140985)
Summary: title

Differential Revision: D66131402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140985
Approved by: https://github.com/Skylion007
2024-11-21 19:59:19 +00:00
a847790400 [inductor] reset to zero support for user defined Triton kernels (#140982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140982
Approved by: https://github.com/aakhundov
2024-11-21 18:53:23 +00:00
723498aab8 Gaussian nll loss scalar variance support (#138931)
Fixes #138747

Adds support for `variance` being a Tensor or a float in `gaussian_nll_loss` to avoid a cpu-gpu sync point in the loss function, when the variance is a static tensor like `<scalar>*torch.ones_like(input)`
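
A brief usage sketch of what this enables (hedged: the argument is named `var` in `torch.nn.functional.gaussian_nll_loss`, and float support is as described in this PR):

```python
import torch
import torch.nn.functional as F

input = torch.randn(16, 4, requires_grad=True)
target = torch.randn(16, 4)

# Before: a static variance had to be materialized as a tensor,
# e.g. a Python scalar times torch.ones_like(input).
loss_tensor_var = F.gaussian_nll_loss(input, target, var=2.0 * torch.ones_like(input))

# After this PR (per the description): a plain float is accepted directly,
# avoiding the tensor construction and the potential CPU-GPU sync.
loss_scalar_var = F.gaussian_nll_loss(input, target, var=2.0)
```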

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138931
Approved by: https://github.com/mikaylagawarecki
2024-11-21 18:20:09 +00:00
e39955e82f Avoid some max constructor optimizations when known not needed. (#139741)
Summary:
around 10% with 1K nodes
more than that with 2K features. 414.5735 -> 333 (20%)

This target optimizing patterns like this
```
 sym_max: "Sym(Max(u31 + u32, u33 + u34))" = torch.sym_max(sym_sum_6, sym_sum_7);  sym_sum_6 = sym_sum_7 = None
        sym_max_1: "Sym(Max(u31 + u32, u33 + u34, u35 + u36))" = torch.sym_max(sym_max, sym_sum_8);  sym_max = sym_sum_8 = None
        sym_max_2: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38))" = torch.sym_max(sym_max_1, sym_sum_9);  sym_max_1 = sym_sum_9 = None
        sym_max_3: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40))" = torch.sym_max(sym_max_2, sym_sum_10);  sym_max_2 = sym_sum_10 = None
        sym_max_4: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42))" = torch.sym_max(sym_max_3, sym_sum_11);  sym_max_3 = sym_sum_11 = None
        sym_max_5: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44))" = torch.sym_max(sym_max_4, sym_sum_12);  sym_max_4 = sym_sum_12 = None
        sym_max_6: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46))" = torch.sym_max(sym_max_5, sym_sum_13);  sym_max_5 = sym_sum_13 = None
        sym_max_7: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46, u47 + u48))" = torch.sym_max(sym_max_6, sym_sum_14);  sym_max_6 = sym_sum_14 = None
        sym_max_8: "Sym(Max(u31 + u32, u33 + u34, u35 + u36, u37 + u38, u39 + u40, u41 + u42, u43 + u44, u45 + u46, u47 + u48, u49 + u50))" = torch.sym_max(sym_max_7, sym_sum_15);  sym_max_7 = sym_sum_15 = sym_max_8 = None
```

<img width="496" alt="Screenshot 2024-11-05 at 11 00 35 AM" src="https://github.com/user-attachments/assets/455c06a3-e1bf-43cb-b880-9470ae6fb07f">
<img width="511" alt="Screenshot 2024-11-05 at 11 00 57 AM" src="https://github.com/user-attachments/assets/ff0d4236-9b5c-4a9a-8520-47b005bb3cb0">

Differential Revision: D65354971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139741
Approved by: https://github.com/ezyang
2024-11-21 16:50:52 +00:00
75bbad4768 Unbreak CUDA 11.4 build of Half.h (#141173)
`__CUDACC__` is needed to detect CUDA builds on that platform.

Differential Revision: [D66262774](https://our.internmc.facebook.com/intern/diff/D66262774/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66262774/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141173
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-21 16:36:38 +00:00
8e359a65f3 [ONNX] Use IRv10 (#141207)
Update to use IRv10 to support INT4 types and ValueInfo in functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141207
Approved by: https://github.com/titaiwangms
2024-11-21 16:34:35 +00:00
41f315417c Fix NJT linear_backward() memory usage (#141163)
Fixes #141112

The formula we're using for `linear_backward()` is inefficient for higher dim input sizes, even if the input is trivially higher dim (e.g. via use of `unsqueeze()`). This PR updates the formula to match the more efficient version employed by NST. Specifically, note the leading dim collapse for `grad_output`'s values before we compute the various matmuls.
d5ee1d1b58/aten/src/ATen/native/nested/NestedTensorBackward.cpp (L37-L70)

Testing for correctness is done via existing gradcheck tests (e.g. `test_backward_nn_functional_linear`). I added a memory usage test but I think it's likely there's a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141163
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-21 15:22:45 +00:00
f2f7ef9d59 Fix stride in TensorMetadata to always be a Tuple[int, ...] (#141106)
Test Plan: CI

Differential Revision: D66204410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141106
Approved by: https://github.com/Skylion007, https://github.com/evanleed
2024-11-21 14:52:36 +00:00
b25c291563 [C10D] Support group ranks in P2POp and batch_isend_irecv (#141054)
Changes the semantics of P2POp's __repr__: s, d are now group ranks instead
of global ranks. I think this is OK since I also updated the field names
to make this obvious.

Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141054
Approved by: https://github.com/kwen2501
2024-11-21 14:51:56 +00:00
3b67d4d687 [inductor] Don't clamp on split operation. (#141078)
This PR turns clamping off for the `split` operation. By doing so, we generate fewer bound
guards and reduce the number of recompilations when varying the input size.

```python
@torch.compile(dynamic=True)
def f(x):
    return x.chunk(4)

>>> f(torch.arange(12))
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([ 9, 10, 11]))

>>> f(torch.arange(11))

(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([ 9, 10]))

>>> f(torch.arange(10))
(tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9]))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141078
Approved by: https://github.com/ezyang
ghstack dependencies: #141077
2024-11-21 13:53:38 +00:00
154f90f026 [inductor] Don't specialize split on sizes parameter. (#141077)
Fix: #139936

This PR modifies the lowering of the `split` operation so that it won't generate guards
specializing on the sizes parameter. Instead, it specializes on the number of output
tensors being generated (i.e. a function of the size of the base tensor and the sizes
parameter).

As a result, operations such as `chunk` (whose number of output tensors usually is
constant given a static chunk number) won't trigger recompiles when varying the size of
the base tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141077
Approved by: https://github.com/ezyang
2024-11-21 13:53:38 +00:00
dcf7728fd6 Update submodule ideep for ideep conv changes (#141101)
Summary:
Update submodule ideep to include ideep conv changes: modify convolution_forward to support broadcast add fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141101
Approved by: https://github.com/Skylion007, https://github.com/jgong5
2024-11-21 12:26:24 +00:00
ecf3bae40a [dynamo] support operator.methodcaller (#141137)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141137
Approved by: https://github.com/jansel
ghstack dependencies: #141122
2024-11-21 09:13:23 +00:00
1132b6764a [draft export] generate fake outputs when real tensor prop finds mismatches (#139766)
Currently real tensor tracing raises MetadataMismatchErrors if registered fake kernels don't match the real kernels (e.g. shape, aliasing, dtype, etc.). This adds an option to use fake kernel inference to bypass mismatches - this option defaults to False for real tensor tracing, but is on for draft export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139766
Approved by: https://github.com/angelayi, https://github.com/zou3519
2024-11-21 08:01:09 +00:00
66476617bf [Dist][CI] Easier override of destroy-upon-exit setting (#141192)
Adding a `destroy_pg_upon_exit` property to allow derived test classes to control whether auto-destroy is desired.
(Otherwise, derived test classes would need to rewrite the `_run()` method, leading to duplicated `_run()` code, and any future addition to `_run` would require more code changes.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192
Approved by: https://github.com/wconstab
2024-11-21 07:32:56 +00:00
d65f194ab9 [dynamo] support operator.attrgetter and operator.itemgetter (#141122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141122
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-11-21 06:48:33 +00:00
fb529c2c84 [dynamo] skip_guard_eval_unsafe stance for power users (#140251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140251
Approved by: https://github.com/jansel
ghstack dependencies: #140223, #140250
2024-11-21 06:28:58 +00:00
7392e88219 Instead of using node.meta read from side table directly (#141146)
When a transformation phase copies/modifies a node, it might drop node.meta (and likewise graph.meta), so they are not good storage locations. Instead, read directly from the side table.

Differential Revision: [D66249968](https://our.internmc.facebook.com/intern/diff/D66249968/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141146
Approved by: https://github.com/ezyang
2024-11-21 06:19:12 +00:00
0a4bcbf39c [ONNX] Add support for torch.cond/HOP in onnx exporter (#137428)
This PR implements the framework for supporting HOP in the ONNX exporter. Refer to https://github.com/pytorch/pytorch/issues/140995 for the design.

- Implement support for torch.cond (see the usage sketch after this list)
- Refactor `_add_nodes` into `_translate_fx_graph` to handle nested subgraphs. To support building subgraphs as functions using the same logic, new handlers for `placeholder` and `output` nodes are added to register inputs and outputs on the onnx function.
- Functions are created under the domain of `pkg.torch.__subgraph__`
- Updated the type promotion pass to run on nested subgraphs.
- Implement torch.cond in `_torchlib/ops/hop.py`. Updated the registry to discover these ops.
- Improve opset_import handling robustness with `add_opset_imports` IR pass. To achieve this, we added opset version to all Nodes. Fixes https://github.com/pytorch/pytorch/issues/139503

Fixes #117655 Fixes #123972 Fixes #93743 Closes https://github.com/pytorch/pytorch/issues/140995
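
A minimal sketch of the torch.cond pattern this adds exporter support for (hedged: it assumes the dynamo-based export path described above; exact export options may differ):

```python
import torch

class CondModel(torch.nn.Module):
    def forward(self, x):
        def true_fn(x):
            return x.sin()

        def false_fn(x):
            return x.cos()

        # torch.cond is the higher-order op that the exporter now translates,
        # with the branch subgraphs built as ONNX functions (per this PR).
        return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

# Assumption: the dynamo-based exporter path.
onnx_program = torch.onnx.export(CondModel(), (torch.randn(3),), dynamo=True)
```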

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137428
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-11-21 03:02:43 +00:00
e0482fdf95 Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions by allocating special memory using MemPool and registering it with the nccl buffer registration APIs.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-21 01:40:11 +00:00
b44ecd91ba [c10d] Switch all timer logging in c10d to wait_counter (#141154)
Summary: The original decorator-based time logger has poor performance and capacity, so we want to replace it with PyTorch's `_WaitCounter` now.

Test Plan: Tested on workload and no regression has been seen: https://fburl.com/scuba/aps_instrumentation_components/mskj73ea

Differential Revision: D66218675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141154
Approved by: https://github.com/wz337
2024-11-21 01:10:11 +00:00
225d3f4495 [subclasses] Subclass parameterization to not wrap-unwrap on forward (#140632)
One of the common use cases of tensor Subclasses is to replace all model Parameters with a Subclass that provides an alternative implementation of common ops. E.g. quantization replaces weights with a QuantizedSubclass.

AotAutograd lifts Parameters up to graph arguments and wraps graph execution at runtime with wrapping/unwrapping of those subclasses.

Even if one unwrapping is not critically big (~14us), when we have to unwrap/wrap all linear weights, that could result in a substantial addition to runtime, which can be more than the compiled region's execution time. E.g. 20 weights * 14us = 0.3ms.

This is the parametrization to unwrap tensor subclasses that is used in torch.ao: https://github.com/pytorch/ao/blob/main/torchao/utils.py#L294

It adds a parametrization to unwrap tensor subclasses to plain tensors.
As a result, the registered parameters are changed (all registered parameters become plain tensors) and the state_dict is not compatible before/after the transformation.

This transformation is used before dynamo and makes breaking changes, so we leave it for the user to apply explicitly.

Testing:
```
TORCH_LOGS="graph_code,aot" python test/functorch/test_aotdispatch.py -k test_subclass_parameters
```
```
TORCH_LOGS="graph_code,aot,export" python test/dynamo/test_export.py -k test_subclass_parameters
```

```
TRACED GRAPH
  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
  <eval_with_key>.0 class GraphModule(torch.nn.Module):
     def forward(self, L_self_modules_parametrizations_modules_p1_parameters_original0_: "f32[3, 4]", L_x_: "f32[3, 4]", L_self_modules_parametrizations_modules_p2_parameters_original0_: "f32[3, 4]", L_self_modules_parametrizations_modules_p2_parameters_original1_: "f32[3, 4]"):
         l_self_modules_parametrizations_modules_p1_parameters_original0_ = L_self_modules_parametrizations_modules_p1_parameters_original0_
         l_x_ = L_x_
         l_self_modules_parametrizations_modules_p2_parameters_original0_ = L_self_modules_parametrizations_modules_p2_parameters_original0_
         l_self_modules_parametrizations_modules_p2_parameters_original1_ = L_self_modules_parametrizations_modules_p2_parameters_original1_

          # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/subclasses.py:42 in __tensor_unflatten__, code: return WrapperSubclass(a, outer_size, outer_stride)
         rebuilt: "f32[3, 4]" = torch.testing._internal.subclasses.WrapperSubclass(l_self_modules_parametrizations_modules_p1_parameters_original0_, None, None);  l_self_modules_parametrizations_modules_p1_parameters_original0_ = None

          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
         mul: "f32[3, 4]" = 2 * rebuilt;  rebuilt = None
         add: "f32[3, 4]" = l_x_ + mul;  l_x_ = mul = None

          # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/two_tensor.py:58 in __tensor_unflatten__, code: return TwoTensor(a, b, outer_size, outer_stride)
         rebuilt_1: "f32[3, 4]" = torch.testing._internal.two_tensor.TwoTensor(l_self_modules_parametrizations_modules_p2_parameters_original0_, l_self_modules_parametrizations_modules_p2_parameters_original1_, None, None);  l_self_modules_parametrizations_modules_p2_parameters_original0_ = l_self_modules_parametrizations_modules_p2_parameters_original1_ = None

          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
         add_1: "f32[3, 4]" = add + rebuilt_1;  add = rebuilt_1 = None
         return (add_1,)

ACED GRAPH
==== __compiled_fn_1 =====
data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
  def forward(self, L_self_modules_parametrizations_modules_p1_parameters_original0_: "f32[3, 4][4, 1]cpu", L_x_: "f32[3, 4][4, 1]cpu", L_self_modules_parametrizations_modules_p2_parameters_original0_: "f32[3, 4][4, 1]cpu", L_self_modules_parametrizations_modules_p2_parameters_original1_: "f32[3, 4][4, 1]cpu"):
      l_self_modules_parametrizations_modules_p1_parameters_original0_ = L_self_modules_parametrizations_modules_p1_parameters_original0_
      l_x_ = L_x_
      l_self_modules_parametrizations_modules_p2_parameters_original0_ = L_self_modules_parametrizations_modules_p2_parameters_original0_
      l_self_modules_parametrizations_modules_p2_parameters_original1_ = L_self_modules_parametrizations_modules_p2_parameters_original1_

       # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/subclasses.py:42 in __tensor_unflatten__, code: return WrapperSubclass(a, outer_size, outer_stride)
      rebuilt: "f32[3, 4][4, 1]cpu" = torch.testing._internal.subclasses.WrapperSubclass(l_self_modules_parametrizations_modules_p1_parameters_original0_, None, None);  l_self_modules_parametrizations_modules_p1_parameters_original0_ = None

       # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
      mul: "f32[3, 4][4, 1]cpu" = 2 * rebuilt;  rebuilt = None
      add: "f32[3, 4][4, 1]cpu" = l_x_ + mul;  l_x_ = mul = None

       # File: /data/users/ivankobzarev/a/pytorch/torch/testing/_internal/two_tensor.py:58 in __tensor_unflatten__, code: return TwoTensor(a, b, outer_size, outer_stride)
      rebuilt_1: "f32[3, 4][4, 1]cpu" = torch.testing._internal.two_tensor.TwoTensor(l_self_modules_parametrizations_modules_p2_parameters_original0_, l_self_modules_parametrizations_modules_p2_parameters_original1_, None, None);  l_self_modules_parametrizations_modules_p2_parameters_original0_ = l_self_modules_parametrizations_modules_p2_parameters_original1_ = None

       # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
      add_1: "f32[3, 4][4, 1]cpu" = add + rebuilt_1;  add = rebuilt_1 = None
      return (add_1,)

.py:381] [0/0] [__aot_joint_graph] TRACED GRAPH
.py:381] [0/0] [__aot_joint_graph]  ===== Joint graph 0 =====
.py:381] [0/0] [__aot_joint_graph]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class joint_fn(torch.nn.Module):
.py:381] [0/0] [__aot_joint_graph]     def forward(self, primals, tangents):
.py:381] [0/0] [__aot_joint_graph]         primals_1: "f32[3, 4][4, 1]cpu"; primals_2: "f32[3, 4][4, 1]cpu"; primals_3: "f32[3, 4][4, 1]cpu"; primals_4: "f32[3, 4][4, 1]cpu"; tangents_1: "f32[3, 4][4, 1]cpu"; tangents_2: "f32[3, 4][4, 1]cpu";
.py:381] [0/0] [__aot_joint_graph]
.py:381] [0/0] [__aot_joint_graph]         primals_1, primals_2, primals_3, primals_4, tangents_1, tangents_2, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
.py:381] [0/0] [__aot_joint_graph]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
.py:381] [0/0] [__aot_joint_graph]         mul: "f32[3, 4][4, 1]cpu" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
.py:381] [0/0] [__aot_joint_graph]         add: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
.py:381] [0/0] [__aot_joint_graph]         add_1: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
.py:381] [0/0] [__aot_joint_graph]         add_2: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
.py:381] [0/0] [__aot_joint_graph]         return pytree.tree_unflatten([add_1, add_2, None, None, None, None], self._out_spec)
.py:381] [0/0] [__aot_joint_graph]
.py:381] [0/0] [__aot_joint_graph]
graph_code] TRACED GRAPH
graph_code]  ===== tensorify_python_scalars =====
graph_code]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class joint_fn(torch.nn.Module):
graph_code]     def forward(self, primals, tangents):
graph_code]         primals_1: "f32[3, 4]"; primals_2: "f32[3, 4]"; primals_3: "f32[3, 4]"; primals_4: "f32[3, 4]"; tangents_1: "f32[3, 4]"; tangents_2: "f32[3, 4]";
graph_code]
graph_code]         primals_1, primals_2, primals_3, primals_4, tangents_1, tangents_2, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
graph_code]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
graph_code]         mul: "f32[3, 4]" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
graph_code]         add: "f32[3, 4]" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
graph_code]         add_1: "f32[3, 4]" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
graph_code]         add_2: "f32[3, 4]" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
graph_code]         return pytree.tree_unflatten([add_1, add_2, None, None, None, None], self._out_spec)
graph_code]
graph_code]
.py:463] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch.testing._internal.subclasses.WrapperSubclass'>, base_idx=None, dynamic_dims=set(), requires_grad=True, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))))], subclass_inp_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None), PlainTensorMeta(unwrapped_idx=2, memory_format=None), PlainTensorMeta(unwrapped_idx=3, memory_format=None)], subclass_fw_graph_out_meta=[SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=True, attrs={'a': SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=True, attrs={'a': PlainTensorMeta(unwrapped_idx=1, memory_format=None), 'b': PlainTensorMeta(unwrapped_idx=2, memory_format=None)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))), original_subclass_type=None, memory_format=None)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4)))), original_subclass_type=None, memory_format=None)], subclass_tangent_meta=[SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=False, attrs={'a': SubclassCreationMeta(flat_tensor_start_idx=0, arg_count=2, included_subclass_symints=False, attrs={'a': PlainTensorMeta(unwrapped_idx=1, memory_format=torch.contiguous_format), 'b': PlainTensorMeta(unwrapped_idx=2, memory_format=torch.contiguous_format)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4))), original_subclass_type=None, memory_format=torch.contiguous_format)}, outer_size=torch.Size([3, 4]), outer_stride=(4, 1), meta=None, original_subclass=WrapperSubclass(TwoTensor(FakeTensor(..., size=(3, 4)), FakeTensor(..., size=(3, 4)))), original_subclass_type=None, memory_format=torch.contiguous_format)], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, 
deterministic=False, static_input_indices=[0, 2, 3], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=[], num_backward_tokens=0), inner_meta=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True), InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=True, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None), OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None), PlainTensorMeta(unwrapped_idx=2, memory_format=None), PlainTensorMeta(unwrapped_idx=3, memory_format=None)], subclass_fw_graph_out_meta=[PlainTensorMeta(unwrapped_idx=0, memory_format=None), PlainTensorMeta(unwrapped_idx=1, memory_format=None)], subclass_tangent_meta=[], is_train=True, traced_tangent_metas=None, num_symints_saved_for_bw=0, grad_enabled_mutation=None, deterministic=None, static_input_indices=[0], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=[], num_backward_tokens=0)
.py:569] [0/0] [__aot_graphs] TRACED GRAPH
.py:569] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
.py:569] [0/0] [__aot_graphs]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
.py:569] [0/0] [__aot_graphs]     def forward(self, primals_1: "f32[3, 4][4, 1]cpu", primals_2: "f32[3, 4][4, 1]cpu", primals_3: "f32[3, 4][4, 1]cpu", primals_4: "f32[3, 4][4, 1]cpu"):
.py:569] [0/0] [__aot_graphs]          # File: /data/users/ivankobzarev/a/pytorch/test/functorch/test_aotdispatch.py:6301 in forward, code: return x + 2 * self.p1 + self.p2
.py:569] [0/0] [__aot_graphs]         mul: "f32[3, 4][4, 1]cpu" = torch.ops.aten.mul.Tensor(primals_1, 2);  primals_1 = None
.py:569] [0/0] [__aot_graphs]         add: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(primals_2, mul);  primals_2 = mul = None
.py:569] [0/0] [__aot_graphs]         add_1: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_3);  primals_3 = None
.py:569] [0/0] [__aot_graphs]         add_2: "f32[3, 4][4, 1]cpu" = torch.ops.aten.add.Tensor(add, primals_4);  add = primals_4 = None
.py:569] [0/0] [__aot_graphs]         return (add_1, add_2)
.py:569] [0/0] [__aot_graphs]
.py:569] [0/0] [__aot_graphs]
.py:580] [0/0] [__aot_graphs] TRACED GRAPH
.py:580] [0/0] [__aot_graphs]  ===== Backward graph 0 =====
.py:580] [0/0] [__aot_graphs]  /data/users/ivankobzarev/a/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
.py:580] [0/0] [__aot_graphs]     def forward(self, tangents_1: "f32[3, 4][4, 1]cpu", tangents_2: "f32[3, 4][4, 1]cpu"):
.py:580] [0/0] [__aot_graphs]         return (None, None, None, None)
.py:580] [0/0] [__aot_graphs]
.py:580] [0/0] [__aot_graphs]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140632
Approved by: https://github.com/bdhirsh
2024-11-21 01:09:33 +00:00
5c45984cce skip complex logaddexp tests in 3.12+ (#140731)
This test is failing locally in 3.12 and 3.13 and is blocking 3.13 CI enablement.

It may have to do with scipy version, see .ci/docker/requirements-ci.txt (3.12+ has scipy 1.12.0/1.14.1, where as < 3.12 requires scipy 1.10.1).

Wanted to xfail these tests, but they somehow pass sometimes on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140731
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-21 01:08:07 +00:00
6882b398a4 [Doc] Remove mention of Intel Macs (#141182)
As we are no longer supporting those.
Also mention that MPS support needs Ventura+.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141182
Approved by: https://github.com/clee2000, https://github.com/atalman
2024-11-21 01:05:12 +00:00
2d52f7946b [BE] Use torch.log1p(x) instead of torch.log(1+x) (#141167)
To fix TOR107 linter violations
Found while trying to migrate PyTorch to latest torchfix
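
For context, a small numerical illustration of why `log1p` is preferred for small arguments:

```python
import torch

x = torch.tensor([1e-8], dtype=torch.float32)

# 1 + 1e-8 rounds to exactly 1.0 in float32, so log(1 + x) collapses to 0.
print(torch.log(1 + x))  # tensor([0.])
print(torch.log1p(x))    # tensor([1.0000e-08]) -- precision preserved
```
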
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141167
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-11-21 00:36:20 +00:00
cd942d00dd [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-21 00:25:20 +00:00
c7d072db99 [AOTAutogradCache] Allowlist various ops found from models to safe list (#140825)
From running internal models, I found a bunch of AOTAutogradCache ops that seem safe to cache.

Would appreciate any suggestions for how to allowlist these in a more general way, but starting with these for now.

Differential Revision: [D66010326](https://our.internmc.facebook.com/intern/diff/D66010326/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D66010326/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140825
Approved by: https://github.com/bdhirsh
2024-11-21 00:04:17 +00:00
1d6ca50c5b config: Throw if justknobs value is not a boolean (#139488)
This helps avoid an issue where someone uses a mutable type that
justknobs does not support within the code, and it then gets overridden
to a different type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139488
Approved by: https://github.com/ezyang
2024-11-20 23:52:17 +00:00
040af3053a [AOTI] Fix a two-pass kernel mismatch (#141041)
Summary: Fixes https://github.com/pytorch/pytorch/issues/140766. In AOTI's two-pass codegen, the first pass generates triton_per_fused_add_native_layer_norm_4, and the second pass generates triton_red_fused_add_native_layer_norm_4. While this problem will go away with the incoming one-pass implementation, further debugging reveals there is a mismatch in has_non_contiguous_pw_in_reduction_kernel between the two passes, due to a symbol comparison problem in stride1_for_last_dim.

Differential Revision: [D66203298](https://our.internmc.facebook.com/intern/diff/D66203298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141041
Approved by: https://github.com/shunting314
2024-11-20 23:34:24 +00:00
ed9135a732 add jk for unspecialize float killswitch (#141143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141143
Approved by: https://github.com/c00w
2024-11-20 23:20:52 +00:00
765a347d21 [BE]: Update CUDNN for Linux to 9.5.1.17 for 12.6 only (#137978)
* Significantly faster, better CUDNN Attention especially on Hopper (FA3 implementation?)
* Lots of bugfixes
* Better performance
* More numerically stable / fixed heuristics
* More functionality for SDPA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137978
Approved by: https://github.com/eqy, https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2024-11-20 23:11:39 +00:00
93efddc67a Use pip corresponding to python executable (#141165)
Sometimes `python3` and `pip` are aliased to different runtimes, so it's better to always use `pip3`; but since the linter should install packages into the same Python environment, it's even better to just call sys.executable with `-mpip install XYZ` arguments.
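
A minimal sketch of the recommended pattern (the package name is illustrative):

```python
import subprocess
import sys

# Install into the exact environment running this interpreter, rather than
# whatever `pip`/`pip3` happen to resolve to on PATH.
subprocess.check_call([sys.executable, "-m", "pip", "install", "expecttest"])
```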

Fixes regression introduced by https://github.com/pytorch/pytorch/pull/124033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141165
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-11-20 22:58:33 +00:00
a82bab6419 Run only listed tests on s390x (#140265)
Skip tests that are failing

This was previously part of https://github.com/pytorch/pytorch/pull/125401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140265
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 22:53:09 +00:00
701e06b643 Revert "Move Sympy printers to torch/utils/_sympy/printers.py (#140597)"
This reverts commit aefcdb3c9fa787f9d43864f6f99a3590c914324a.

Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it fails inductor/test_padding in trunk. This is a target determination miss and that failed test was not run in your PR ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2489641453))
2024-11-20 22:13:57 +00:00
abaab5da05 Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 4c9e77d71e3f4ff9bec6fb5de98789f041f70a61.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/izaitsevfb due to breaking typechecks in meta code ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2489638528))
2024-11-20 22:11:19 +00:00
5c37b20d13 Fix autocast HOP pass for nested autocast (#141065)
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
```

Differential Revision: D65970066

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141065
Approved by: https://github.com/angelayi
2024-11-20 21:57:11 +00:00
87f9c1abe5 Change export IR to non-functional pre-dispatch IR (#139511)
Differential Revision: [D65362160](https://our.internmc.facebook.com/intern/diff/D65362160)

State after this IR change:
1. For the tests that require the inference IR, they are replaced with ep.run_decomp({}), so export_for_training_run_decomp is sort of redundant, but I guess it is still nice that multiple rounds of retracing still work. In general, we need some auditing to reduce our redundant testing coverage.
2. After this PR has landed and not been reverted for a week or so, I will replace the export_for_training calls with export as they are the same thing now.
3. Added more tests to also cover the now-"deprecated" old IR by patching export to use the old export. For reviewers, please look at the internal version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139511
Approved by: https://github.com/ydwu4, https://github.com/angelayi, https://github.com/avikchaudhuri
2024-11-20 21:47:55 +00:00
f3f7ba5a69 Restart dynamo analysis when we fail to tensorify away all symfloat inputs (#140346)
Fixes a bunch of benchmarks that failed with cudagraph errors including `tlp python benchmarks/dynamo/timm_models.py --device cuda --inductor --accuracy --amp --training --only resmlp_12_224` when `specialize_float=False`

Also brings down number of overall failures (with keep-going) from 108 => 62. I'd estimate >80% of those 62 are wobbly expect tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140346
Approved by: https://github.com/ezyang
ghstack dependencies: #140983, #141003
2024-11-20 21:20:41 +00:00
4b3ce62946 [while_loop] support pytree inputs (#140059)
Previously, we only supported carries that are a tuple of tensors. This PR enables us to support a pytree of tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140059
Approved by: https://github.com/zou3519
2024-11-20 21:12:29 +00:00
2ee2dcb736 [Device] Add mps as device type in torch._utils._get_available_device_type() (#141098)
As the title states

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141098
Approved by: https://github.com/malfet
2024-11-20 20:45:59 +00:00
2e3c0c489d Continuous job for pulling artifacts and doing upload (#140453)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140453
Approved by: https://github.com/huydhn
2024-11-20 20:41:52 +00:00
d5ee1d1b58 Remove capture_pre_autograd_graph in test_aot_inductor (#141064)
Summary: as title

Test Plan: CI

Differential Revision: D66191296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141064
Approved by: https://github.com/zhxchen17
2024-11-20 20:34:46 +00:00
aefcdb3c9f Move Sympy printers to torch/utils/_sympy/printers.py (#140597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-11-20 20:26:49 +00:00
161425ff9f Added aten.bernoulli.p and aten.bernoulli.default decompositions (#139141)
Fixes #105519

Added the aten.bernoulli.p decomposition and moved/rewrote aten.bernoulli.default so that they are included in the core aten decompositions.

Tested the sample code in [105519](https://github.com/pytorch/pytorch/issues/105519); torch.bernoulli could be decomposed by the code snippet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139141
Approved by: https://github.com/eellison
2024-11-20 19:52:57 +00:00
bc69a19139 [MPS] Add support for bf16 autocast (#139390)
This PR adds support for bf16 autocast. Most of the code and ideas are copied from #99272.
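
A minimal usage sketch (hedged: it assumes an MPS-capable machine with this change; the CPU fallback path also supports bf16 autocast):

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(8, 8, device=device)
w = torch.randn(8, 8, device=device)

# bf16 autocast on MPS, as added by this PR.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    y = x @ w

print(y.dtype)  # typically torch.bfloat16 under autocast
```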

Most of the heavy lifting was done by AI.

Fixes #139386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139390
Approved by: https://github.com/malfet

Co-authored-by: Kulin Seth <kulin_seth@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-20 19:52:28 +00:00
808f0f656d [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-20 19:50:46 +00:00
a8794fd7df [MPS] Fix conv backward pass for channels last (#141009)
Looks like a regression caused by the use of the strided API, but adding the test revealed (at least in CI) that on Ventura it worked but returned garbage results. So this is fixed by removing all the logic about channels last (it's irrelevant for the strided API case, and the placeholder already turns the tensor into a correct one).

This also allows one to remove `mem_format_key` and `ns_shape_key` (it was redundant even back then, as `mem_format_key` + `getTensorsStringKey(grad_output_t)` already uniquely identified the operation)

Fixes https://github.com/pytorch/pytorch/issues/140902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141009
Approved by: https://github.com/manuelcandales
2024-11-20 19:50:31 +00:00
c9db2c6328 [ROCm] cudagraph explicit sync only after capture_begin() (#138722)
Since ROCm 6.2, hipGraphExecDestroy doesn't immediately free memory.
The memory is freed at the next sync point; this ensures that all hipGraphLaunch calls are finished before any memory is released.
We need to ensure all async operations finish before deleting the object.

The capture_dev_ variable is used to save the device number when the capture_begin() method is called.
But a CUDAGraph can be created and destroyed without calling the capture_begin() method. `capture_dev_ = UNDEFINED_DEVICE;` allows detecting such a case and skipping the sync.

Tests impacted:
test_cuda.py::TestCuda::test_graph_make_graphed_callables_*
distributed/test_c10d_nccl.py::ProcessGroupNCCLTest::test_allreduce_in_cudagraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138722
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/jeffdaily
2024-11-20 19:37:22 +00:00
caa3a3e12c Only compute new_untracked_symbols and new_unbacked_bindings if needed. (#140083)
Summary:
237s -> 198..
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Test Plan: NA

Differential Revision: D65638637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140083
Approved by: https://github.com/ezyang, https://github.com/isuruf, https://github.com/anijain2305
2024-11-20 19:28:18 +00:00
4ffce45100 AOTInductor: properly generate cpp_wrapper runtime assertions (#141050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141050
Approved by: https://github.com/desertfire
ghstack dependencies: #141058
2024-11-20 19:17:47 +00:00
5c684503a6 cpp_wrapper: Fix dtype_view wrapping reinterpret_view (#141058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141058
Approved by: https://github.com/desertfire
2024-11-20 19:17:47 +00:00
d3902b5e20 [dynamo][CI] Add numpy-2.X shard (follow up) (#140586)
Fixes #107302

This is a clone and fix for #139199.

This PR is a small step for the overall NumPy 2 support.
It adds a new CI job for testing with NumPy 2 with one test file only.
More tests to be fixed and added later in follow-up pull requests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140586
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2024-11-20 19:11:28 +00:00
b5db3cb61c Skip uploading benchmark records when there is no model name (#141145)
A small fix I just realized after https://github.com/pytorch/pytorch/pull/141087.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141145
Approved by: https://github.com/malfet
2024-11-20 19:05:47 +00:00
1a7055cb73 Record PR time benchmark results in JSON format (#140493)
I'm trying to make these benchmark results available on the OSS benchmark database, so that people can query them from outside.  The first step is to also record the results in a JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 18:54:01 +00:00
4acd56eb53 Upload MPS benchmark results (#141087)
This uploads the MPS benchmark results to the benchmark database. The data can then be queried, for example:

```
select benchmark, model, metric from oss_ci_benchmark_v3 where head_sha = '99a133116fee15aa1467165f2b209b37da53f189' and metric.name in ['eager_peak_mem', 'dynamo_peak_mem', 'speedup'] and model.name = 'BERT_pytorch'
```

I'm documenting the JSON format at https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database

### Testing

Locally,

```
PYTHONPATH=/Users/huydo/Storage/mine/benchmark python benchmarks/dynamo/torchbench.py --performance --only resnet152 --backend eager --training --devices mps --output test/test-reports/torchbench_training.csv
```

Workflow dispatch https://github.com/pytorch/pytorch/actions/runs/11927990520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141087
Approved by: https://github.com/malfet
2024-11-20 18:18:21 +00:00
1d8318df98 [BE][Ez]: Reserve vector for NT GEMM Matmul (#141130)
Easy fix to missing reserve calls in NT Matmul CUDA kernel to improve perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141130
Approved by: https://github.com/malfet
2024-11-20 18:12:51 +00:00
9d229f08f4 [dynamo][guards] Introduce a diff_guard_manager (#140250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140250
Approved by: https://github.com/jansel
ghstack dependencies: #140223
2024-11-20 17:59:30 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule PERF401. Turns loops into equivalent list comprehensions, which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
8d708090c0 Optimize increment summations [Latest Nov 15] (#140822)
Summary:
**wins**
On the torchrec benchmark, for 2K nodes it saves 40 seconds.
With the recent sympy changes (https://www.internalfb.com/diff/D65883538) we save around 13 seconds (with the max opt on).
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=200
```
This diff optimizes constructing expressions of the form
a+b+c... (all unique symbols),
which are very common in torchrec models.

**How**
Expressions of the form a+b+c are not optimized by add; the only needed optimization is sorting them.
If we have a+b+c and we are adding (d) to it, we can do a binary search to find
the position of (d) and avoid re-optimizing the new expression by passing the new order.
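
A toy sketch of the idea (purely illustrative: the names and data structures are hypothetical, not the actual sympy/symbolic-shapes code):

```python
import bisect

def add_unique_term(sorted_terms, new_term):
    # Terms of a+b+c+... are kept sorted; a new unique term is placed with a
    # binary search, so the combined expression never has to be re-sorted.
    pos = bisect.bisect_left(sorted_terms, new_term)
    return sorted_terms[:pos] + [new_term] + sorted_terms[pos:]

terms = ["u31 + u32", "u33 + u34", "u37 + u38"]
print(add_unique_term(terms, "u35 + u36"))
# ['u31 + u32', 'u33 + u34', 'u35 + u36', 'u37 + u38']
```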

**Extensions**:
1. Support constant terms.
2. Support 10a+10b+... (this will give even more wins; the support will be extended in a second PR)

Differential Revision: D66008482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140822
Approved by: https://github.com/ezyang
2024-11-20 16:48:20 +00:00
a440a01832 [MPS][BE] Let preprocessor do preprocessing (#141103)
Instead of calling the `REGISTER_FUSED_ADAM_OP` macro with 7 parameters 16 times, define 4 type-parameter macros for each op and then one macro to define the quartet of ops: Adam, AdamW and their grad functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141103
Approved by: https://github.com/kulinseth
ghstack dependencies: #141089, #141090, #141092
2024-11-20 14:03:17 +00:00
b0deddde46 [MPS][BE] Move FusedOptimizerOps to its own shader (#141092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141092
Approved by: https://github.com/Skylion007, https://github.com/kulinseth
ghstack dependencies: #141089, #141090
2024-11-20 14:03:17 +00:00
446ea2aea5 pow: fix meta function output argument dtype check. (#140287)
Tracking issue: #138399

This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:

```python
def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```

**~Update:~**

~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~

- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~

**Update:**

Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:

- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
2024-11-20 13:28:47 +00:00
a9e54f64ee Remove unused Python API named _set_torch_function_mode (#141023)
Detailed description:

As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141023
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-20 09:48:03 +00:00
ffb305d3a6 Fix bugs about torch.fx.experimental.proxy_tensor.make_fx (#141022)
Detailed description:

The code below will raise an error
```Python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def func(a):
    b = a + 1
    c = b.view(-1)
    c.add_(1)
    return b

input = torch.randn(2)
out = make_fx(func)(input)
```

The error info are like below:
```Python
...
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/codegen.py", line 34, in <module>
    from .variables.torch_function import TensorWithTFOverrideVariable
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 185, in <module>
    populate_builtin_to_tensor_fn_map()
  File "/root/Git.d/pytorch/pytorch/torch/_dynamo/variables/torch_function.py", line 146, in populate_builtin_to_tensor_fn_map
    inp0 = torch.ones(1)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1240, in __torch_function__
    return func(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/utils/_stats.py", line 21, in wrapper
    return fn(*args, **kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 1342, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/root/Git.d/pytorch/pytorch/torch/fx/experimental/proxy_tensor.py", line 907, in proxy_call
    name=proxy_mode.tracer.graph._target_to_str(func.overloadpacket.__name__),
AttributeError: 'PythonKeyTracer' object has no attribute 'graph'
...
```

Solution:
Import torch._dynamo before dispatch_trace is called, so that the context set before dispatch_trace does not affect the torch._dynamo import.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141022
Approved by: https://github.com/ezyang
2024-11-20 09:42:32 +00:00
c9c8370feb Openreg: Add RNG Generator (#138449)
Implement RNG Generator by falling back to CPUGeneratorImpl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138449
Approved by: https://github.com/ezyang
2024-11-20 09:27:55 +00:00
54f380f64a Also check for attention mask shape for _sfdp_params_check (#141003)
Fixes `python test/inductor/test_fused_attention.py SDPAPatternRewriterCpuTests.test_pattern_fails_with_unsupported_mask_cpu` when `specialize_float=False`. You might wonder how it's related: it's because there is a "negative" test that expects us not to match. Previously it would fail on isinstance(param, Tensor), but now that we tensorify the float, it did match and caused a failure. This check ensures the mask has the same shape, so that this negative test case actually fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141003
Approved by: https://github.com/ezyang
ghstack dependencies: #140983
2024-11-20 08:37:28 +00:00
d869344bc0 [export] serialize sympy.Exprs as ASTs instead of strings (#140084)
Summary: The way we've been de/serializing sympy.Exprs is not roundtrippable in all cases (serialize by calling `str(expr)`, and deserialize by calling `sympy.sympify(expr_str)`). This has led to expressions being mathematically equivalent but structurally different, causing issues in ValueRanges. Example issue: https://github.com/pytorch/pytorch/issues/136797

This starts to deprecate the use of `expr_str` and stores expressions in AST format instead. For BC purposes, `expr_str` deserialization is still supported, but we will always serialize to `expr_ast`. We'll kill this once the serialization upgrader design is finalized and implemented.

Test Plan: test_export

Differential Revision: D65638757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140084
Approved by: https://github.com/angelayi
2024-11-20 07:44:25 +00:00
7e9e83a8c6 [inductor] force contiguous layout for implicit fallback (#140996)
Fix https://github.com/pytorch/pytorch/issues/140462 .

Horace found that when we implicitly fall back to eager, some eager kernels may not work correctly if Inductor provides non-contiguous inputs (due to padding, etc.). The original issue was found for the backward op of weight_norm. The fix in this PR is a general one: we force inputs to all implicit fallback kernels to be contiguous.

I have to refactor the code a bit to make it work. Previously we applied layout constraints in `GraphLowering.run_node`, while implicit fallbacks are looked up in `call_function`. The problem is that when we set up an implicit fallback with a layout constraint in `call_function`, we never get a chance to apply that constraint. The refactor moves the code that applies layout constraints into `call_function`.
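
A minimal sketch of the idea (assumed helper name, not Inductor's actual code): before calling an eager kernel via the implicit fallback, force tensor inputs to a contiguous layout so padded or otherwise strided Inductor buffers cannot break it.
```python
import torch


def call_eager_fallback(kernel, *args, **kwargs):
    # Force every tensor argument to be contiguous before handing it to
    # the eager kernel, mirroring the constraint this PR applies.
    contig_args = tuple(
        a.contiguous() if isinstance(a, torch.Tensor) else a for a in args
    )
    return kernel(*contig_args, **kwargs)
```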

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140996
Approved by: https://github.com/jansel
2024-11-20 06:41:17 +00:00
8f3c71ad27 Add torch.sum dtype promotion description (#140939)
Fixes #82159

Add note description about type promotion of `torch.sum`.

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/fb952676-f190-4680-9e15-ea8c99d33c67)

**After**
![image](https://github.com/user-attachments/assets/ee0d46a6-5053-46d5-b412-5c919a40965a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140939
Approved by: https://github.com/zou3519
2024-11-20 06:20:01 +00:00
93e3c91679 [inductor] support linear+binary foldinig for freezing path (#138807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138807
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-11-20 05:34:09 +00:00
a864c42781 [dynamo][guards] Support cloning of Guard Manager (#140223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140223
Approved by: https://github.com/jansel
2024-11-20 05:28:45 +00:00
4c9e77d71e Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-11-20 04:57:19 +00:00
5ab1c51f0f [Easy] Use nested namespaces in aten (#141012)
Change files with nested namespaces in aten
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141012
Approved by: https://github.com/Skylion007
2024-11-20 04:05:23 +00:00
cyy
d91484509a [1/N] Apply bugprone-unchecked-optional-access (#140679)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140679
Approved by: https://github.com/ezyang
2024-11-20 04:04:41 +00:00
a4e8ca789a Revert "Record PR time benchmark results in JSON format (#140493)"
This reverts commit 783cd9c8dd8a57d58ac0260ce18253e0cc6a69b7.

Reverted https://github.com/pytorch/pytorch/pull/140493 on behalf of https://github.com/huydhn due to I think I missed something in the workflow setup as the test is failing in non-test CI jobs ([comment](https://github.com/pytorch/pytorch/pull/140493#issuecomment-2487360455))
2024-11-20 04:04:07 +00:00
84d86e3767 [numeric_debugger] guard the input generate_numeric_debug_handle as GraphModule type (#140742)
Summary: Support ExportProgram type in generate_numeric_debug_handle, to better meet the requirement

Test Plan: ci

Differential Revision: D65920529

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140742
Approved by: https://github.com/tarun292, https://github.com/jerryzh168
2024-11-20 03:40:04 +00:00
c05813d2a9 [AOTI Minifier] Exclude illegal graphs from minifier search (#140999)
Summary:
Some graphs produced by the minifier graph cutter cannot be used for AOTI/export (illegal graphs). These should be treated as graphs that don't fail in the minifier, so the minifier keeps searching.

One example is the following graph, where `true_graph_0` is an fx.GraphModule. Here, export.export() would give a `UserError` with `ErrorType = UserErrorType.INVALID_OUTPUT`.

```
      # graph():
        #     %true_graph_0 : [num_users=1] = get_attr[target=true_graph_0]
        #     return (true_graph_0,)
```

This graph could be obtained from the module below:

```python
    class M(torch.nn.Module):
        def forward(self, x, flag):
            flag = flag.item()

            def true_fn(x):
                return x.clone()

            return torch.cond(flag > 0, true_fn, true_fn, [x])
 ```

So we detect such errors and exclude them from the minifier's search (treating these graphs as not having failed).

This is OK and won't miss any actual errors: the AOTI minifier is only designed to catch errors in the AOTI phase; it is not responsible for catching export bugs.

Test Plan:
```
buck2 run  fbcode//caffe2/test/inductor:test_minifier_utils  -- -r invalid_output
```

Differential Revision: D66143487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140999
Approved by: https://github.com/henrylhtsang
2024-11-20 03:20:06 +00:00
f0f9393779 add serialized_type_name to torch.size register_pytree_node (#141047)
Summary: We are working on onboarding legokit modules to ModuleStability and this is needed to fix the serialization issue found in P1680200613

Test Plan:
`buck2 test //torchrec/fb/legokit/module_stability_tests/layer_norm_stability_test:layer_norm_stability_test -- --env ADD_NEW_STABILITY_CONFIGS=True`

serialization succeeds when the above command is run on top of this diff.

Differential Revision: D66034492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141047
Approved by: https://github.com/angelayi
2024-11-20 03:14:10 +00:00
fc905d92c5 [MPS][BE] Do not create 4 instances of FUSED_ADAM_OPS (#141090)
Defining `static char shaderSource[]` in the header instantiates it in every translation unit that includes it.
Solved the problem by renaming `static auto getCPLState(const std::string&)` to `auto getFusedAdamCPLState(const std::string&)` and instantiating it only once, which resulted in a 500K reduction in binary size (and perhaps even more in runtime footprint).

I.e. before
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x  1 malfet  staff  183357744 Nov 19 17:58 lib/libtorch_cpu.dylib
```
and after
```
% ls -lak lib/libtorch_cpu.dylib
-rwxr-xr-x  1 malfet  staff  183357120 Nov 19 17:57 lib/libtorch_cpu.dylib
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141090
Approved by: https://github.com/Skylion007
ghstack dependencies: #141089
2024-11-20 03:04:33 +00:00
a8a428df3b [MPS][BE] Use nested namespace (#141089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141089
Approved by: https://github.com/Skylion007
2024-11-20 03:04:33 +00:00
da115eff86 [dynamic] Reduce stack trace logs in symbolic_shape (#141068)
Motivation: https://github.com/pytorch/pytorch/issues/139408

To reduce excessive warning logs. You can get back previous behavior by prepending `TORCH_LOGS="dynamic" `

repro: https://github.com/pytorch/pytorch/issues/139408

after:
```
/torch/fx/experimental/symbolic_shapes.py:6452] runtime_asserts_frozen but then got 3*TruncToInt(IntTrueDiv(s0, 1))*TruncToInt(IntTrueDiv(s1, 1)) < 2147483648
/torch/fx/experimental/symbolic_shapes.py:6032] Ignored guard 3*TruncToInt(IntTrueDiv(s0, 1))*TruncToInt(IntTrueDiv(s1, 1)) < 2147483648 == True, this could result in accuracy problems
/torch/fx/experimental/symbolic_shapes.py:6452] runtime_asserts_frozen but then got 2*s0*s1 + s1*(s0 - 1) + s1 < 2147483648
/torch/fx/experimental/symbolic_shapes.py:6032] Ignored guard 2*s0*s1 + s1*(s0 - 1) + s1 < 2147483648 == True, this could result in accuracy problems
```

before: 174 lines

Differential Revision: D66196982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141068
Approved by: https://github.com/ezyang
2024-11-20 03:00:53 +00:00
32094626f2 [fr] fix OSS broken flight recorder (#140973)
Summary:
OSS flight recorder does not work because we renamed `trace_dir` to `folder` in the internal version to reuse the loader.

Fixes item #2 in reported issue:
https://github.com/pytorch/pytorch/issues/140879

Test Plan:
BEFORE:
```
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node1_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 44, in main
    details, version = read_dir(args)
  File "/home/cpio/local/pytorch/tools/flight_recorder/components/loader.py", line 89, in read_dir
    assert len(details) > 0, f"no files loaded from {args.folder} with prefix {prefix}"
AttributeError: 'Namespace' object has no attribute 'folder'
```

AFTER:
```
python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
❯ git status
fatal: not a git repository (or any parent up to mount point /data/users/cpio)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
❯ python ./tools/flight_recorder/fr_trace.py ~/fr/140563/nccl_trace_logs --prefix nccl_trace_rank_container-node17_
tabulate is not installed. Proceeding without it.
Traceback (most recent call last):
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 52, in <module>
    main()
  File "/data/users/cpio/fbsource/fbcode/caffe2/./tools/flight_recorder/fr_trace.py", line 45, in main
    db = build_db(details, args, version)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/builder.py", line 446, in build_db
    check_no_missing_dump_files(entries, memberships)
  File "/home/cpio/local/fbsource/fbcode/caffe2/tools/flight_recorder/components/utils.py", line 267, in check_no_missing_dump_files
    dumps_ranks == all_ranks
AssertionError: Missing dump files from ranks {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119}
```

Differential Revision: D66117013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140973
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-20 02:58:11 +00:00
241d2259d3 torch/config: fix mock behaviour (#140779)
Mock only replaces the removed value if, after deletion, it does not see the attribute.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140779
Approved by: https://github.com/ezyang
2024-11-20 02:57:16 +00:00
878a849c92 [aoti] Remove example inputs from aoti_compile_and_package (#140991)
Differential Revision: [D66136724](https://our.internmc.facebook.com/intern/diff/D66136724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140991
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #140990
2024-11-20 02:49:47 +00:00
cb6a21b033 [export] Add setattr for ep.example_inputs (#140990)
Differential Revision: [D66136725](https://our.internmc.facebook.com/intern/diff/D66136725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140990
Approved by: https://github.com/yushangdi, https://github.com/ydwu4
2024-11-20 02:49:20 +00:00
ff17d2b83e [easy][logging] Remove dynamo_timed fwd_only param (#140993)
Summary: It's ignored; remove it

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140993
Approved by: https://github.com/ezyang
2024-11-20 02:31:51 +00:00
5e0c009a5a Forward fix lint after #140443 (#141088)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141088
Approved by: https://github.com/atalman
2024-11-20 02:21:24 +00:00
f23d034826 [PyTorch Decomp] Allow native_layernorm decomp to align [mean, rstd] output dtypes with input dtype for MTIA backend (#141025)
Summary: As title

Test Plan: CI

Differential Revision: D66169328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141025
Approved by: https://github.com/bdhirsh
2024-11-20 01:58:08 +00:00
783cd9c8dd Record PR time benchmark results in JSON format (#140493)
I'm trying to make this benchmark results available on OSS benchmark database, so that people can query it from outside.  The first step is to also record the results in the JSON format compatible with the database schema defined in https://github.com/pytorch/test-infra/pull/5839.

Existing CSV files remain unchanged.

### Testing

The JSON results are uploaded as artifacts to S3 https://github.com/pytorch/pytorch/actions/runs/11809725848/job/32901411180#step:26:13, for example https://gha-artifacts.s3.amazonaws.com/pytorch/pytorch/11809725848/1/artifact/test-jsons-test-pr_time_benchmarks-1-1-linux.g4dn.metal.nvidia.gpu_32901411180.zip

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140493
Approved by: https://github.com/laithsakka
2024-11-20 01:48:00 +00:00
eff22171d2 Add Current Mask Var To CSE Cache Key (#140838)
This torch.cat kernel has multiple subblocks which load from the same input. We were incorrectly reusing the mask vars from the first load for the second load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140838
Approved by: https://github.com/jansel
ghstack dependencies: #140841
2024-11-20 00:55:56 +00:00
b740a1b96c [user triton] Ignore backend-specific args in the TTIR analysis (#141062)
Fixes #140800.

On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, even though they don't exist as kernel parameters. Native Triton code handles those excess args [here](a6bb57d628/python/triton/runtime/jit.py (L594-L596)). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out.
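
A rough sketch (assumed names, not the actual PR code) of the handling described above: when binding call arguments against the kernel's parameter list for TTIR analysis, silently drop known backend-specific launch options instead of bailing out on the mismatch.
```python
# Backend-specific launch options that are not kernel parameters.
BACKEND_LAUNCH_OPTIONS = {"matrix_instr_nonkdim", "waves_per_eu", "kpack"}


def bind_kernel_args(param_names, kwargs):
    bound, ignored = {}, []
    for name, value in kwargs.items():
        if name in param_names:
            bound[name] = value
        elif name in BACKEND_LAUNCH_OPTIONS:
            ignored.append(name)  # known launch option; safe to skip
        else:
            raise KeyError(f"unexpected kernel argument: {name}")
    return bound, ignored
```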
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141062
Approved by: https://github.com/oulgen
2024-11-20 00:37:34 +00:00
7c7c34693d disable tensorify floats when cuda graphs is on (#140983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140983
Approved by: https://github.com/ezyang
2024-11-20 00:33:09 +00:00
cyy
0fca51bcc4 [11/N] Fix Wextra-semi warning (#140926)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140926
Approved by: https://github.com/ezyang
2024-11-20 00:32:45 +00:00
0443398f5b Implement deterministic scan (#140887)
Fixes #89492
Uses block-wise cub primitives
On large inputs, this implementation is approximately 25% slower than the device cub implementation, so it's turned on only in cases where cub would have been used (floating-point inputs, cumsum that is effectively 1D)
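
A toy, pure-Python sketch of the block-wise idea (not the CUDA kernel): scan fixed-size blocks and carry the running offset forward, so the reduction order, and therefore the floating-point result, is the same on every run.
```python
import torch


def blockwise_cumsum(x: torch.Tensor, block: int = 1024) -> torch.Tensor:
    """1-D inclusive scan computed block by block with a carried offset."""
    out = torch.empty_like(x)
    offset = x.new_zeros(())
    for start in range(0, x.numel(), block):
        chunk = x[start:start + block]
        scanned = chunk.cumsum(0) + offset
        out[start:start + block] = scanned
        offset = scanned[-1]
    return out
```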

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140887
Approved by: https://github.com/ezyang, https://github.com/kurtamohler
2024-11-19 23:43:26 +00:00
6ccd35ccb8 cpp_wrapper: Fix searchsorted fallback ops (#140817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140817
Approved by: https://github.com/desertfire
ghstack dependencies: #140624, #140634
2024-11-19 23:34:20 +00:00
ce15d1ebc2 Narrow the scope of several cpp_wrapper test skips (#140634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140634
Approved by: https://github.com/desertfire
ghstack dependencies: #140624
2024-11-19 23:34:20 +00:00
34b2165bdb Insert aten.add into fallback_ops, and fix Tensor -> Scalar conversion in ir.FallbackKernel (#140624)
The code in ir.FallbackKernel will long-term be obviated by the solution for #90923.

Closes #131334.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140624
Approved by: https://github.com/desertfire
2024-11-19 23:34:20 +00:00
9bc9d4cdb4 Fix MPS synchronize by waiting for root buffer to complete (#140725)
Makes https://github.com/pytorch/pytorch/issues/139550#issuecomment-2468860559 work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140725
Approved by: https://github.com/malfet, https://github.com/kulinseth
2024-11-19 23:10:24 +00:00
780c580d68 General per-SampleInput xfail / skip system (#140443)
### Background
This PR adds the functionality to xfail / skip on a per-`SampleInput` basis for `OpInfo` tests. See #89354 and #82669 for some requests asking for this type of functionality.

This was originally landed for NJT in #138370 and is generalized and slightly tweaked here.

### Design
#### Principles
* Clean separation among `SampleInput` generation logic, test logic that uses the `SampleInput`s, and xfail / skip logic (which will change as bugs are addressed).
* Flexibility in xfail / skip predicate specification - ideally each bug can be handled by a single skip / xfail, even if it surfaces across a specific class of ops.
    * This is important in practice for NJT, where it's common to have a bug that affects all binary ops, for example.
* Opt-in with minimal test logic changes + no substantial impact on other tests.

#### Details
The core new concept is a `SampleRule`, which can be either an `XFailRule` or `SkipRule`.

```python
@dataclass
class SampleRule(ABC):
    # function to indicate whether the rule applies to this op; return True if so
    # NB: str arg of callable is device_type
    op_match_fn: Callable[[str, OpInfo], bool] = None
    # function to indicate whether the rule applies to this sample; return True if so
    sample_match_fn: Callable[[torch.device, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

@dataclass
class XFailRule(SampleRule):
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"

@dataclass
class SkipRule(SampleRule):
    ...
```

* See below for example usage details, but at a high level: each test should have a corresponding list of `sample_skips_and_xfails`.
    * The list of `sample_skips_and_xfails` is traversed in order, and the first rule that matches (if any) is applied, so order can matter.
    * The PR includes a logging mechanism for matched rules accessible by setting the loglevel to `DEBUG`.
* The split between `op_match_fn` and `sample_match_fn` is made to allow pre-filtering of the list of rules to get only those that apply to the op under test.
* Each `SampleInput` is run within a subtest context so they can be individually skipped / xfailed as needed. This also means that a test will no longer stop after the first erroring `SampleInput`; all samples will be run through test logic.

### Example Usage
Consider the following OpInfo test:
```python
class MyTestCase(TestCase):
    @ops(op_db)
    def test_foo(self, device, dtype, op):
        for sample in op.sample_inputs(device, dtype, requires_grad=False):
            # do some SampleInput-based test logic
            output = op.op(sample.input, *sample.args, **sample.kwargs)
            ...
```

This is a common pattern for such tests; simply generate a list of `SampleInputs` and run them through the op. Now say you want to xfail one of these `SampleInput`s for a given op. Today, you have to xfail the entire test or hack around this in the test logic.

This PR lets you do this to get very flexible xfail / skips based on op / sample input properties:
```python
# NB: Define rules for per-SampleInput xfails / skips. These can also be defined in-line in the @ops decorator, but
# it can be more readable to maintain these somewhere else. These are attempted to be matched in order and
# the first one that matches applies, so order can matter.
FOO_SKIPS_AND_XFAILS = [
    XFailRule(
        error_type=ValueError,
        error_msg="2D inputs not supported",
        op_match_fn=lambda device, op: (
            # NB: logic for which ops this rule applies to goes here
            op.full_name == "add"
        ),
        sample_match_fn=lambda device, sample: (
            # NB: logic which samples this rule applies to goes here
            sample.input.dim() == 2
        ),
        # NB: optional rule identifier can help with debugging matched rules
        name="add_with_2D_inputs_not_supported",
    ),
    # NB: This follows a similar structure as XFailRule but without error_type / error_msg. Obviously
    # this skips a particular SampleInput instead of xfailing :)
    SkipRule(...),
    ...
]

class MyTestCase(TestCase):
    @ops(op_db)
    @sample_skips_and_xfails(FOO_SKIPS_AND_XFAILS)
    # NB: the @ops decorator automatically filters out any rules that don't apply to this op
    def test_foo(self, device, dtype, op):
        for sample, subtest_ctx in op.sample_inputs(
            # NB: use_subtests=True is required for skips / xfails to work. If skips / xfails are defined and use_subtests != True,
            # an informative error will be thrown.
            device, dtype, requires_grad=False, use_subtests=True
        ):
            # NB: this subtest context manager runs each sample input as a "subtest" and handles skips / xfails appropriately
            with subtest_ctx(self):
                # do some SampleInput-based test logic
                output = op.op(sample.input, *sample.args, **sample.kwargs)
                ...
```

More examples can be seen in `test/test_nestedtensor.py`, where this system is used in practice.

I also demonstrate usage of syntactic sugar over this system in `test/functorch/test_vmap.py`. Here, a skip for the `to()` operator is replaced with a granular xfail for `test_vmap_exhaustive()`:
```python
...
# pre-existing xfail
xfail("item"),
# new granular xfail using syntactic sugar over the general system
xfailIf(
    "to",
    lambda sample: (
        sample.kwargs["memory_format"] == torch.channels_last
    ),
),
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140443
Approved by: https://github.com/janeyx99, https://github.com/zou3519
ghstack dependencies: #140160, #138370
2024-11-19 23:09:38 +00:00
cee3f8541e [MPS][BE] Use mtl_setBytes to upload bools as is (#141037)
But add a static assert that the size of bool is a single byte, to guard against hard-to-debug corruptions if someone decides to typedef it to int.

Fixes https://github.com/pytorch/pytorch/issues/140971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141037
Approved by: https://github.com/qqaatw, https://github.com/Skylion007
2024-11-19 23:08:43 +00:00
9fac5a16fd Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087)"
This reverts commit 80aa19a622bc6b159f7cf07b3501269f3356d752.

Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))
2024-11-19 22:53:46 +00:00
da069af0d4 [Easy] Refactor rsqrt lowering (#139944)
The bool/int casting is equivalent to register_pointwise_numeric

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139944
Approved by: https://github.com/shunting314, https://github.com/blaine-rister
2024-11-19 22:51:42 +00:00
496c1e78c5 Revert "Implements user buffer registration using MemPool (#133603)"
This reverts commit 25d9be37bef949c675e42b4929ddcb6997af2a7b.

Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))
2024-11-19 22:42:26 +00:00
32e93dfa92 [pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Output Tensor addrs (#140637)
Summary: Studying memory access patterns is the primary use case.

Differential Revision: D65918359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140637
Approved by: https://github.com/briancoutinho
2024-11-19 22:22:16 +00:00
08cb5160b2 Extract reusable portions of GeluKernel into header (#140425)
Makes the implementation reusable via header-only code sharing. (no diff for that yet, but we can commit the refactor regardless.)

Testing: existing correctness tests should cover.

Differential Revision: [D65608800](https://our.internmc.facebook.com/intern/diff/D65608800/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140425
Approved by: https://github.com/ezyang
2024-11-19 22:00:01 +00:00
34e420519d [Reland] dont decompose baddbmm (#141045)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and only upcasting for the epilogue, but that led to worse numerics, because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Reland of https://github.com/pytorch/pytorch/pull/137904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141045
Approved by: https://github.com/BoyuanFeng
2024-11-19 21:07:58 +00:00
f30f43f594 Use std::bit_cast as c10::bit_cast if available (#141035)
Make what we're doing as obvious as possible to the compiler.

Differential Revision: [D66108811](https://our.internmc.facebook.com/intern/diff/D66108811/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141035
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567, #140720, #140994
2024-11-19 20:43:45 +00:00
f4ce9ac29d [dynamo] Dont erase the cache line on invalidation (#140821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140821
Approved by: https://github.com/jansel
2024-11-19 19:11:10 +00:00
efed02b990 Fix Half X86_F16 CUDA build failure (#140994)
It passed PyTorch CI, but internally we saw failures from this.

Differential Revision: [D66137897](https://our.internmc.facebook.com/intern/diff/D66137897/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140994
Approved by: https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567, #140720
2024-11-19 19:02:21 +00:00
4f2543c31d [logs] Add dynamo_timed to get better compilation time breakdown for AOTI (#140198)
Adding some dynamo_timed annotations to better understand AOTI compilation time.

This will probably require a few more passes; a lot of time is spent in Scheduler.__init__, and not enough annotations are there yet.

run_command_and_check takes a lot of time as well, but there is probably not much we can do about that. Maybe we can add a config to tune the C++ optimization level?

traces:
<img width="1205" alt="Screenshot 2024-11-08 at 4 41 10 PM" src="https://github.com/user-attachments/assets/61645264-b3af-4d4a-804d-700b0f831c7c">

Differential Revision: D65554141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140198
Approved by: https://github.com/desertfire
2024-11-19 18:54:17 +00:00
7f10351ba0 Revert "Implement deterministic scan (#140887)"
This reverts commit 4eed438a42a054a63b5e0a7225dd0e84cf488a96.

Reverted https://github.com/pytorch/pytorch/pull/140887 on behalf of https://github.com/ngimel due to breaks with 11.4 ([comment](https://github.com/pytorch/pytorch/pull/140887#issuecomment-2486409438))
2024-11-19 18:08:48 +00:00
d276688da6 Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit b09eb6ed6a22476746d8b7d5f6e464e34f89747a.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/anijain2305 due to internal test failures ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2486344859))
2024-11-19 17:37:44 +00:00
7ced49d2cc Raise exception if vmap (eager) calls compiled function (#140439)
Fixes #138422

This is not a proper fix for #140439, but more of a way to prevent a user from seeing a nasty error inside the C++ code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140439
Approved by: https://github.com/zou3519
2024-11-19 16:27:48 +00:00
99a03211cb Deprecate conda nightly builds (#141024)
Removing CD as per https://github.com/pytorch/pytorch/issues/138506

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141024
Approved by: https://github.com/malfet
2024-11-19 16:09:54 +00:00
2b21a653d8 Register CIA ops to FakeTensorMode directly in export (#140465)
During export, we stub out most CIA ops to return NotImplemented to avoid decomposing them during tracing. To recover the existing shape propagation behavior, we register these CIA decomps directly as FakeTensorMode rules as well. The reason we have to do this is that when we return NotImplemented, FakeTensor falls back to running these CIAs with the Meta backend, causing device-branching CIA ops to fail (because the device is now Meta; one example is sdpa). If we register a kernel directly to FakeTensorMode, we won't fall back to the Meta backend.

Differential Revision: [D65716260](https://our.internmc.facebook.com/intern/diff/D65716260/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140465
Approved by: https://github.com/bdhirsh
2024-11-19 15:00:35 +00:00
93aef684d9 fix typo in torch.compiler_dynamo_deepdive.rst (#140871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140871
Approved by: https://github.com/zou3519
2024-11-19 14:42:36 +00:00
260d1dcef4 Check torch.linalg.qr differentiability as documented (#135097)
Expands the `test_linalg_qr_autograd_errors` unit test to check all cases of differentiablity/non-differentiability as given in the docs https://pytorch.org/docs/stable/generated/torch.linalg.qr.html:

- mode= ‘reduced’ (default): Returns (Q, R) of shapes (*, m, k), (*, k, n) respectively. It is always differentiable.
- mode= ‘complete’: Returns (Q, R) of shapes (*, m, m), (*, m, n) respectively. It is differentiable for m <= n.
- mode= ‘r’: Computes only the reduced R. Returns (Q, R) with Q empty and R of shape (*, k, n). It is never differentiable.

(in particular, the happy paths are added)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135097
Approved by: https://github.com/IvanYashchuk, https://github.com/nikitaved
2024-11-19 12:25:39 +00:00
0c7c5d78fa [inductor] add support for TRITON_INTERPRET (#140841)
Was debugging the issue lower in the stack and found this to be helpful / quick enough to add.

Fix for https://github.com/pytorch/pytorch/issues/123956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140841
Approved by: https://github.com/exclamaforte
2024-11-19 11:24:13 +00:00
f0f6144381 [EZ][BE] Update googletest submodule (#140988)
From v1.11.0 (released in Jun 2021) to v1.15.2 (release in Jul 2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140988
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-11-19 07:49:16 +00:00
808da50c2d create a new torch.cuda.device_memory_used api (#140870)
Summary:
The current torch.cuda.memory_usage returns memory utilization; more specifically, on NVIDIA it is the percent of time over the past sample period during which global memory was being read or written.
see more details in https://github.com/pytorch/pytorch/issues/140638
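
Usage sketch, assuming the API name from the title (the exact signature may differ): report device memory currently in use, as opposed to the time-based utilization percentage returned by torch.cuda.memory_usage.
```python
import torch

if torch.cuda.is_available():
    print("utilization (% of time):", torch.cuda.memory_usage())
    print("memory in use:", torch.cuda.device_memory_used())
```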

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel, https://github.com/eqy
2024-11-19 06:36:30 +00:00
7156d0824d [ROCm] Fix largeIndexBlockSize (#139087)
On ROCm, hipification converts std::min to ::min, but ::min does not return the right result. This impacts the index_add_ operation on a large tensor: we end up picking the large value instead of the max supported block size (128), which leads to the GPU accessing memory out of bounds.

While we wait for ::min to be fixed, we can use < operator to compare instead of relying on ::min.

Example Code w/ failure:
```
D=6144
hidden_states = torch.zeros([16384, 6144],           device="cuda:0", dtype=torch.bfloat16)
index         = torch.randint(0, 16384, (1, 32, 16384), device="cuda:0", dtype=torch.int64)
output        = torch.empty([1, 32, 16384, 6144],    device="cuda:0", dtype=torch.bfloat16)
hidden_states.index_add_(0, index.view(-1), output.view(-1, D))
```

```
Traceback (most recent call last):
RuntimeError: HIP error: invalid configuration argument
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2024-11-19 06:29:48 +00:00
115f15a255 [PGNCCL][EZ] Do not use same name as NCCL API (#140997)
`ncclCommAbort` is an API name of NCCL. Do not use the same name for `NCCLComm`'s method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140997
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-11-19 05:40:39 +00:00
1bdb9ddc70 [CD] Upgrade XPU support packages version to 2025.0 (#140373)
Depends on https://github.com/pytorch/pytorch/pull/139775
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140373
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-19 05:16:46 +00:00
8bc4033814 [fr][ez] better log messages + minor fixups (#140969)
Summary:
1. Clearly specify error messages that we are refering to a collective_sequence_id and an internal_record id for entry.
The entry id is semi-useless for the end consumer so at least let them know that this is an internal record id.
2. Add some missing fields in types.py.
  self.missing_ranks = set()
  self.input_numel = tuple()
  self.output_numel = tuple()
  self.errors = set()

These were showing up as linter errors when I opened the file in vs-code

Test Plan:
```
buck2 run //caffe2/fb/flight_recorder:fr_trace -- -m f665492593-nerf_training-96ab95e0 -w 8 --mast_job_version 0 -a 0
Buck UI: https://www.internalfb.com/buck2/2cac9273-1b7b-47bf-867f-82f9a4c1d581
Network: Up: 0B  Down: 0B
Not all ranks joining collective: sequence number: 31117
internal record id: 31116
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {3, 4, 5, 6, 7}
input sizes: [[1571911]]
output sizes: [[1571911]]
world size: 8
expected ranks: {0, 1, 2, 3, 4, 5, 6, 7}
collective state: scheduled
collective stack trace:
 all_reduce at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/distributed_c10d.py:2707
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/c10d_logger.py:81
sync_buffers at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/models/gaussian_splatting.py:650
decorate_context at /packages/fblearner.flow.canary/workflow#link-tree/torch/utils/_contextlib.py:116
step at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/training/training_manager/splatting.py:356
main at /packages/fblearner.flow.canary/workflow#link-tree/xri_mapsr/neural_fields/nerf_training.py:260
main_impl at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:57
main at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:34
wrapper at /packages/fblearner.flow.canary/workflow#link-tree/torch/distributed/elastic/multiprocessing/errors/__init__.py:355
<module> at /packages/fblearner.flow.canary/workflow#link-tree/rl_aiep/mast/endpoint.py:118
_run_code at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:86
_run_module_as_main at /packages/fblearner.flow.canary/workflow#link-tree/runtime/lib/python3.10/runpy.py:196
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/bootstrap.py:69
run_as_main at /packages/fblearner.flow.canary/workflow#link-tree/__par__/meta_only/bootstrap.py:98
__invoke_main at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:28
<module> at /packages/fblearner.flow.canary/workflow#link-tree/__run_lpar_main__.py:31

...

Differential Revision: D66018461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140969
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-11-19 04:39:16 +00:00
51d4338716 fix test_save_load_transform. (#140494)
Fixes `test_save_load_transform` in [test_transforms.py](https://github.com/pytorch/pytorch/blob/main/test/distributions/test_transforms.py).

`pytest test_transforms.py -k test_save_load_transform`

error message:

```
.
.
.
  File "/workspace/pytorch/test/distributions/test_transforms.py", line 555, in test_save_load_transform
    other = torch.load(stream)
            ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/serialization.py", line 1444, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
	(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
	(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
	WeightsUnpickler error: Unsupported global: GLOBAL torch.distributions.transformed_distribution.TransformedDistribution was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TransformedDistribution])` or the `torch.serialization.safe_globals([TransformedDistribution])` context manager to allowlist this global if you trust this class/function.

```
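
A sketch of the fix suggested by the error message (assumes the checkpoint is trusted; depending on the distribution, more classes may need to be allowlisted):
```python
import io

import torch
from torch.distributions import ExpTransform, Normal
from torch.distributions.transformed_distribution import TransformedDistribution

dist = TransformedDistribution(Normal(0.0, 1.0), [ExpTransform()])
stream = io.BytesIO()
torch.save(dist, stream)
stream.seek(0)

# Allowlist the classes the checkpoint references before loading with
# weights_only=True.
with torch.serialization.safe_globals([TransformedDistribution, Normal, ExpTransform]):
    other = torch.load(stream, weights_only=True)
```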
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140494
Approved by: https://github.com/mikaylagawarecki
2024-11-19 04:36:06 +00:00
d472a5f680 Revert "[inductor] Refactor MutableBox to make IRNode typing easier (#140895)"
This reverts commit c79e78b5034198f9d6801b4fef710b9b9b0e9193.

Reverted https://github.com/pytorch/pytorch/pull/140895 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think test_torchbind_inductor is failing in trunk after this lands ([comment](https://github.com/pytorch/pytorch/pull/140895#issuecomment-2484679319))
2024-11-19 04:25:41 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
a10ce22577 [BE] Update bazelisk and bazel versions (#140992)
bazelisk from 1.16 to 1.23
bazel from 6.1.1 to 6.5.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140992
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn
2024-11-19 03:40:53 +00:00
0fcd024f59 [hop] refactor only_consist_of with find_mismatched_vars (#140105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140105
Approved by: https://github.com/zou3519
2024-11-19 03:21:16 +00:00
70a0906f24 [c10d] Support optional backend if device_id provided (#140963)
Citing @malfet's [comment](https://github.com/pytorch/pytorch/pull/136343#pullrequestreview-2318792396) in https://github.com/pytorch/pytorch/pull/136343
> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.

This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (the acceptance of `device_id` has been previously supported for the purpose of eager init).

New user experience:
```
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The `device = torch.device(...)` line is needed anyway, because the user would use it for tensor creation, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140963
Approved by: https://github.com/wconstab
2024-11-19 03:17:29 +00:00
37959c554d Add small test case for #140230 (#140850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140850
Approved by: https://github.com/malfet
ghstack dependencies: #140739, #140740
2024-11-19 02:44:54 +00:00
f3f305ef3e Fix condition for weights_only unpickler for DTensor (#140740)
Same as #140739 but for DTensor (move safe globals for DTensor to `torch.distributed.tensor.__init__` and update error message to let user know `torch.distributed.tensor` must be imported to load DTensor)

Differential Revision: [D65961690](https://our.internmc.facebook.com/intern/diff/D65961690)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140740
Approved by: https://github.com/malfet
ghstack dependencies: #140739
2024-11-19 02:44:53 +00:00
b63a84804c Allow NJT by default for weights_only torch.load (take 2) (#140739)
Per discussion with @malfet, only allow weights_only unpickler to load NJT if `torch.nested` and `torch._dynamo` are imported

(this is slightly weird as technically `torch.nested` is actually imported by default and `torch._dynamo.decorators._DimRange` is actually what needs to be imported)

we can't import this from `torch.nested` as this would
- undo dynamo lazy import
- cause circular import

===========================
Redo of https://github.com/pytorch/pytorch/pull/140304 caused issues as `torch.nested._internal.foo` needs to be imported, which causes issues like

```python
torch/_weights_only_unpickler.py", line 339, in load
    if full_path in _get_allowed_globals():
torch/_weights_only_unpickler.py", line 188, in _get_allowed_globals
    torch.nested._internal.nested_tensor.NestedTensor
AttributeError: module 'torch.nested' has no attribute '_internal'
```

**This likely wasn't caught in our CI because imports are global during unit tests(?), so we use subprocess to properly test this time**

Differential Revision: [D65961691](https://our.internmc.facebook.com/intern/diff/D65961691)

@jbschlosser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140739
Approved by: https://github.com/malfet
2024-11-19 02:44:53 +00:00
1e234e63b3 [pytorch][dynamo_compile] Log inductor config to dynamo_compile (#140790)
Summary:
Scrubbed inductor config logging to dynamo_compile as json:str.

Scrub RE: `r'((^TYPE_CHECKING$)|(.*_progress$)|(.*TESTING.*)|(.*(rocm|halide).*)|(^trace\..*)|(^_))'` to save some space.

Test Plan:
Staging logger: https://fburl.com/data/ltkt08zm

P1679697917

{F1958428018}

Differential Revision: D65806399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140790
Approved by: https://github.com/masnesral
2024-11-19 02:39:33 +00:00
9ae19ffbed fix layer_norm decomp precision for cpu (#140557)
xref: https://fb.workplace.com/groups/1075192433118967/posts/1540519826586223/?comment_id=1543752356262970&reply_comment_id=1544425069529032

The issue is that our decomp needs to branch on device (it only upcasts for CPU), but the device shows up as "meta" because the decomp is registered as a meta tensor rule.
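
A minimal sketch of the problematic pattern (assumed helper name, not the actual decomposition): branching on the tensor's device gives the wrong answer when the rule runs on meta tensors, because the device it sees is "meta" rather than the eventual "cpu".
```python
import torch


def upcast_for_layer_norm(x: torch.Tensor) -> torch.Tensor:
    # Intended: upcast to fp32 only on CPU. Under a meta tensor rule,
    # x.device.type == "meta", so the CPU branch is silently skipped.
    if x.device.type == "cpu":
        return x.to(torch.float32)
    return x
```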

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140557
Approved by: https://github.com/ezyang
2024-11-19 02:31:31 +00:00
240aa77ad0 [Quantizer][XNNPACK] Fix ReLU fusion when conv/linear has > 1 user (#140846)
Summary:
Bug in the quantizer: Conv + ReLU is fused even when the preceding conv has more than one user. Conv and ReLU cannot be fused in this case because the result of Conv must also be used elsewhere.

XNNPACK Delegate naturally handles this by inserting a clamp node for ReLU.
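
A minimal sketch of the pattern the fix guards against: the conv output feeds both the ReLU and another consumer, so Conv + ReLU must not be fused.
```python
import torch


class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3)

    def forward(self, x):
        y = self.conv(x)          # the conv result has two users below
        return torch.relu(y) + y  # both the ReLU output and the raw conv output are used
```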

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D65989599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140846
Approved by: https://github.com/digantdesai
2024-11-19 02:29:45 +00:00
2673a440d0 [distributed] add PG APIs and general doc cleanups (#140853)
Doc updates:

* This adds documentation for the object-oriented ProcessGroup APIs that are being used in torchft as well as in https://github.com/pytorch/rfcs/pull/71.
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis

Test plan:

```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
2024-11-19 02:06:32 +00:00
5b326d6b61 Add gdb print methods support same as pytorch-lldb (#140935)
`pytorch-lldb` supports pretty printing of a tensor's size and key_set via #97101.

Add the same pretty printing for gdb debugging.

**Test Result**

```bash
$ gdb python
(gdb) break at::native::negative
(gdb) r
>>> import torch
>>> t = torch.tensor([1, 2, 3, 4], dtype=torch.float64)
>>> t.negative()
Thread 1 "python" hit Breakpoint 1, at::native::negative (self=...) at /home/zong/code/pytorch/aten/src/ATen/native/UnaryOps.cpp:854
854	Tensor negative(const Tensor& self) { return self.neg(); }
```

**Before**
```bash
(gdb) p self.key_set()
$2 = {repr_ = 1271310352385}

(gdb) p self.sizes()
$3 = {Data = 0x9cb488, Length = 1}

```

**After**
```bash
(gdb) torch-int-array-ref-repr self.sizes()
[4]
(gdb) torch-dispatch-keyset-repr self.key_set()
DispatchKeySet(CPU, ADInplaceOrView, AutogradCPU, AutocastCPU)
```

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/b720e284-13b1-4581-ae3a-963f6482fdb2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140935
Approved by: https://github.com/drisspg
2024-11-19 01:28:30 +00:00
98e6e69b1b [C10D] Support group_dst/group_src in c10d send/recv object_list (#140847)
Also add mypy annotations

Partially addresses RFC 0042 (https://github.com/pytorch/rfcs/pull/71)
See more details/motivation in https://github.com/pytorch/pytorch/pull/140460
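
Usage sketch, assumed from the title (the process group is expected to be initialized already, e.g. via torchrun): address peers by their rank within the subgroup via group_dst / group_src instead of their global rank.
```python
import torch.distributed as dist


def exchange(subgroup):
    objs = [{"step": 1}]
    if dist.get_rank(subgroup) == 0:
        dist.send_object_list(objs, group=subgroup, group_dst=1)
    elif dist.get_rank(subgroup) == 1:
        dist.recv_object_list(objs, group=subgroup, group_src=0)
    return objs
```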

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140847
Approved by: https://github.com/H-Huang
ghstack dependencies: #140843
2024-11-19 01:23:08 +00:00
c82c46ccc7 [C10D] support group_src/dst in broadcast/reduce ops (#140843)
Also add mypy annotations

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140843
Approved by: https://github.com/kwen2501
2024-11-19 01:23:08 +00:00
efe8482c0d Add prepare_obs_or_fq_callback to quantizer (#140863)
Test Plan: CI.

Differential Revision: D65982003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140863
Approved by: https://github.com/jerryzh168
2024-11-19 01:13:38 +00:00
c79e78b503 [inductor] Refactor MutableBox to make IRNode typing easier (#140895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140895
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-11-19 00:24:35 +00:00
98e441f00b [dynamo] Simplify ConstantVariable.create and ConstantVariable.__init__ (#140745)
This patch removes some redundant code paths in
`ConstantVariable.create` and` ConstantVariable.__init__`.

Closes #110871.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140745
Approved by: https://github.com/jansel
2024-11-19 00:22:50 +00:00
2da98d9757 [dynamo] Support is comparison for symnodes (#140754)
Fixes #109504.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140754
Approved by: https://github.com/williamwen42
2024-11-19 00:19:33 +00:00
175ba9fed6 [Utilization Monitor] input to disable utilization monitor (#140857)
# Overview
Currently monitor.py produces error-only results; this PR introduces a disable-monitor option to all *-test.yml workflows. We would also like to explore how the monitor code affects benchmark results.

# Next steps
- fix the monitor.py
- enable non-benchmark tests with monitor
- investigate benchmark test behavior with monitor background job

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140857
Approved by: https://github.com/huydhn
2024-11-18 23:26:03 +00:00
48a276c5a0 log_softmax: fix meta function output argument dtype check. (#140289)
Tracking issue: #138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140289
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286, #140288
2024-11-18 23:05:29 +00:00
435286e985 Fix unary references' out dtype check. (#140288)
Tracking issue: #138399

This PR fixes a number of reference implementations (which are also used as meta
functions), making them more consistent with the CPU device. More specifically, it fixes those
operations that use the `_make_elementwise_unary_reference` decorator and don't error on a
mismatching out argument dtype, even though they do error on concrete devices (e.g. CPU); see the sketch after the list below.

The fixed operations are:

- `abs`
- `ceil`
- `floor`
- `frac`
- `isneginf`
- `isposinf`
- `sgn`
- `sign`
- `signbit`
- `trunc`
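
A minimal example of the behavior the references are being aligned with: on a concrete device, these ops reject an `out` tensor whose dtype the result cannot be cast to, so the meta/reference path should reject it as well.
```python
import torch

x = torch.randn(4)
out = torch.empty(4, dtype=torch.int64)
try:
    torch.floor(x, out=out)  # errors on CPU: Float can't be cast to Long
except RuntimeError as e:
    print("CPU rejects the mismatched out dtype:", e)
```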
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140288
Approved by: https://github.com/ezyang
ghstack dependencies: #140186, #140286
2024-11-18 23:05:29 +00:00
727f1a6da9 Revert "FlopCounterMode: Decompose ops for inference mode (#138508)"
This reverts commit f915409c26c0ba38b286c7b617880af61a6b08ba.

Reverted https://github.com/pytorch/pytorch/pull/138508 on behalf of https://github.com/jamesjwu due to Failing internal jobs ([comment](https://github.com/pytorch/pytorch/pull/138508#issuecomment-2484310587))
2024-11-18 22:59:36 +00:00
8d5b3eeaa6 Remove __start__ stack, log backward compile to empty stack (#140431)
Summary:
This diff removes "__start__" from all stacks in Pt2 Compile Events, as it's unnecessary.

It also starts logging events for backward compile, because otherwise we have no toplevel event representing full backward compilation. This gives us a toplevel event outside of the inductor compile.

Test Plan:
New chromium events:

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fstuff4%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fstuff4%2Fchromium_events.json&local_cache_key

New tlparse:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/jjwu/custom/stuff4/index.html

New scuba icicle view, still good: https://fburl.com/scuba/pt2_compile_events/z6gr3z53

Differential Revision: D65832045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140431
Approved by: https://github.com/masnesral
2024-11-18 22:48:31 +00:00
8e439021c1 [ONNX] Support from dynamic_shapes to dynamic_axes when torch.onnx.export(fallback=True) is triggered (#139532)
Fixes #139320

### Summary:
#### (1) Add  `_rename_dynamic_shapes_with_model_inputs` for dynamic_shapes to play along with input_names

* Use the model forward signature to rename dynamic_shapes when dynamic_shapes is not nested and directly uses customized names. This solves the issue that torch.export.export expects dynamic_shapes to use only the model input names.
* If the dynamic_shapes is nested, we do nothing.

#### (2) Add `_from_dynamic_shapes_to_dynamic_axes` for fallback

* We flatten dynamic_shapes with leaves as defined by _pytree.tree_leaves()
~~* If a dynamic_shapes is not nested, and defined in dict. We can use the key as the input_names, since it should be renamed by `_rename_dynamic_shapes_with_model_inputs` already.~~
* If a dynamic_shapes is provided, input_names is required to assign the names, because dynamic_axes needs it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139532
Approved by: https://github.com/justinchuby
2024-11-18 22:35:21 +00:00
72943ba823 [3.13] deal with exec() semantic change in test_cond_no_dynamo_cache_limit (#140401)
https://peps.python.org/pep-0667/ changed the semantics of `eval/exec` in 3.13 so that changes to locals no longer propagate (but globals do). This is to make the behavior predictable since in the past, the locals may or may not update based on various mysterious conditions. Other test sites may need updating too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140401
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-11-18 22:06:47 +00:00
e445239bb4 [ONNX] Fix 2GB exporting crash during onnx shape type inference (#140962)
Fixes https://github.com/pytorch/pytorch/issues/132205

A regression happened after https://github.com/pytorch/pytorch/pull/128675: an ONNX shape type inference error now stops the exporting process during shape type inference. ONNX shape type inference during export only does its best to fill in the information, and should not crash the export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140962
Approved by: https://github.com/justinchuby
2024-11-18 21:50:23 +00:00
cyy
8cd7ad8b48 [Reland][Environment Variable][5/N] Use thread-safe getenv functions (#140594)
Reland of #139762 with no bug found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140594
Approved by: https://github.com/ezyang
2024-11-18 21:45:35 +00:00
c62da98c1a Upload all run attempts when in upload_test_stats_intermediate (#140459)
Upload all run attempts, since it can be hard to determine which run attempt to use from HUD, which shows everything together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140459
Approved by: https://github.com/huydhn
2024-11-18 21:40:10 +00:00
17bb78a3d3 Port X86_F16 from executorch half to PyTorch half (#140720)
This was added in https://github.com/pytorch/executorch/pull/1789. I'm working on sharing Half.h with ExecuTorch, and this is a missing feature.

Differential Revision: [D65949409](https://our.internmc.facebook.com/intern/diff/D65949409/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140720
Approved by: https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566, #140567
2024-11-18 21:32:44 +00:00
43de32d948 Revert "create a new torch.cuda.device_memory_used api (#140870)"
This reverts commit 478204cad68651960a979ca109e2bd4a219b0f1a.

Reverted https://github.com/pytorch/pytorch/pull/140870 on behalf of https://github.com/yuguo68 due to the test is still flaky on ROCm, test_cuda.py::TestCudaMallocAsync is not skipped with the unittest.skipIf(TEST_CUDAMALLOCASYNC ([comment](https://github.com/pytorch/pytorch/pull/140870#issuecomment-2484161914))
2024-11-18 21:26:25 +00:00
4bb1bf0573 [Docs] Remove duplicate declaration of double_tensor (#140927)
Fixes #140920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140927
Approved by: https://github.com/malfet
2024-11-18 21:22:30 +00:00
e46af7de0c [MPS] [BE] Use direct call vs virtual (#140950)
I.e. replace `at::detail::getMPSHooks().isOnMacOSorNewer` with `is_macos_13_or_newer`, which is a direct function call instead of going through a virtual method call.
Hooks are only needed to provide a feature-agnostic interface to query something even on platforms that might not have support for the feature, while functions implemented in `ATen/native/xxx` should be able to call those platform-specific methods directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140950
Approved by: https://github.com/Skylion007
ghstack dependencies: #140896
2024-11-18 21:01:52 +00:00
4eed438a42 Implement deterministic scan (#140887)
Fixes #89492
Uses block-wise cub primitives
On large inputs, this implementation is approximately 25% slower than the device cub implementation, so it's turned on only in cases where cub would otherwise have been used (floating-point inputs, cumsum that is effectively 1d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140887
Approved by: https://github.com/ezyang, https://github.com/kurtamohler
2024-11-18 20:56:14 +00:00
00c829876c Log Full Knapsack Problem Information (#140757)
Summary: When AOT_PARTITIONER_DEBUG is set to 1 and debug logging is turned on, we can now log the full input and output for each knapsack problem.

Differential Revision: D65633086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140757
Approved by: https://github.com/jansel
2024-11-18 20:36:32 +00:00
408ad45014 [MPS][BE] Introduce mtl_setArgs (#140896)
Which is a variadic template that automates the tedious (and error-prone) process of passing the arguments via a series of
```cpp
  mtl_setBuffer(encoder, b1, 0);
  mtl_setBuffer(encoder, b2, 1);
  mtl_setBytes(encoder, param, 2);
```
into a compact
```
  mtl_setArgs(encoder, b1, b2, param);
```

Introduce a few more specializations of `mps_setArg`, such as:
 - Call `setBuffer` for `id<MTLBuffer>`
 - Copy double as float (as MPS does not support double-precision types)
 - Accept `std::optional<at::Tensor>` that will not call setBuffer if the optional is empty

Also, re-metaprogram `mtl_setBytes` to make it usable with any trivially copyable struct, but keep a separate implementation for containers, as uploading `c10::SmallVector` (which is trivially copyable) would overwrite the next arguments, which luckily resulted in test failures of `test_cross_entropy_label_smoothing_weight_ignore_indices_mps`

Introduce `has_size_type_v`, which could be used to differentiate between trivially copyable `std::array`/`c10::ArrayRef` and other trivially copyable structs.
```cpp
template <typename T>
class has_size_type {
  template <typename U>
  static constexpr std::true_type check(typename U::size_type*);
  template <typename>
  static constexpr std::false_type check(...);

 public:
  static constexpr bool value = decltype(check<T>(nullptr))::value;
};

template <typename T>
constexpr bool has_size_type_v = has_size_type<T>::value;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140896
Approved by: https://github.com/Skylion007
2024-11-18 20:35:01 +00:00
e80b1b2870 Flex + NJT: cross attention support (#140723)
Fixes #140598

Allows ragged structures for query and key+value sequence lengths to differ (i.e. supports cross attention for Flex + NJT).

Technically, this is BC-breaking thanks to arg renaming and positional arg reordering in `create_nested_block_mask()`, but Flex + NJT support isn't in a major release yet so I'm hoping we can just do it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140723
Approved by: https://github.com/drisspg
2024-11-18 19:49:45 +00:00
478204cad6 create a new torch.cuda.device_memory_used api (#140870)
Summary:
the current torch.cuda.memory_usage returns the memory utilization; more specifically, the percent of time over the past sample period during which global memory was being read or written, for NVIDIA GPUs.
see more details in https://github.com/pytorch/pytorch/issues/140638
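
A hedged usage sketch; only the API name comes from the commit title, and the no-argument call shape is an assumption:

```python
import torch

if torch.cuda.is_available():
    # Assumed to mirror other torch.cuda memory queries and report bytes of
    # global memory currently in use on the current device (assumption).
    used = torch.cuda.device_memory_used()
    print(f"{used / 2**20:.1f} MiB of device memory in use")
```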

Test Plan: added a new unittest

Differential Revision: D65960134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140870
Approved by: https://github.com/ngimel
2024-11-18 19:13:43 +00:00
081c1687c8 Remove UB type punning from c10/util/floating_point_utils.h (#140567)
Accessing the inactive member of a union is undefined behavior. Fortunately, we have c10::bit_cast.

Differential Revision: [D65888680](https://our.internmc.facebook.com/intern/diff/D65888680/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140567
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #140564, #140565, #140566
2024-11-18 18:41:34 +00:00
f59ec98ceb Add C10_EMBEDDED to gate ostream usage in Half/BFloat16 (#140566)
We want to use Half/BFloat16 in ExecuTorch to support shared kernel code. They will need to be used in ExecuTorch core, so they can't have streams. This diff introduces a macro to gate the stream code off.

Differential Revision: [D65888035](https://our.internmc.facebook.com/intern/diff/D65888035/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140566
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564, #140565
2024-11-18 18:41:34 +00:00
0f1a88cfba Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-18 18:21:17 +00:00
cca34be584 Update XNNPACK Version (#139913)
Updating XNNPACK Version to 4ea82e595b36106653175dcb04b2aa532660d0d8

submodule update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139913
Approved by: https://github.com/digantdesai, https://github.com/huydhn
2024-11-18 18:16:31 +00:00
e429a3b72e Move complex<Half> from Half.h to complex.h (#140565)
Executing on old TODO on the way to sharing Half.h with ExecuTorch.

Differential Revision: [D65888037](https://our.internmc.facebook.com/intern/diff/D65888037/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140565
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #140564
2024-11-18 15:56:21 +00:00
f630799587 move c10::overflows to its own header (#140564)
Working on moving `complex<Half>` to complex.h instead of Half.h; this depends on complex and isn't used particularly widely.

Differential Revision: [D65888038](https://our.internmc.facebook.com/intern/diff/D65888038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140564
Approved by: https://github.com/ezyang, https://github.com/Skylion007, https://github.com/malfet
2024-11-18 15:56:21 +00:00
b379a28a95 Generalization of distributed test cases for non-CUDA devices (#138216)
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784)

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out the process group creation/deletion and other functionality for a given device.

New generalized content can be added by deriving from this base class.
It also includes other misc changes for Gaudi support.

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:
- Functionality has been added to test for non-CUDA devices, using Intel HPU as an example
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out
- Misc adaptations to allow general calls to accelerators, adding test skips instead of explicitly skipping for multiple GPUs
- skipIfHPU flags have been added to enable skipping a few multithreaded test cases which are not yet supported on HPUs

NOTE: Within test_functional_api, there are tests which require the use of some multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.

I will be raising a separate PR to improve usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2024-11-18 09:38:00 +00:00
cyy
06dde8c157 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h largely overlaps with std::array, and it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 08:50:28 +00:00
6c6f745fa7 Revert "[1/N] Remove inclusion of ATen/core/Array.h (#122064)"
This reverts commit 486b9aaa67a02807aea06f33c009b5311caab337.

Reverted https://github.com/pytorch/pytorch/pull/122064 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but lots of compilation errors show up after this lands ([comment](https://github.com/pytorch/pytorch/pull/122064#issuecomment-2482263396))
2024-11-18 08:31:38 +00:00
43edb94f8a [Quantization][PrivateUse1] Adding more support QuantizedPrivateuse1 backends (#139860)
Here are some explanations of this PR.

1. Changes in `aten/src/ATen/core/Tensor.cpp` and `c10/core/DispatchKey.cpp`: support the toString method for the `QuantizedPrivateUse1` backend, making PyTorch print the correct backend string for it.
2. Add the header `DispatchStub.h` in `aten/src/ATen/native/quantized/IndexKernel.h`: if this header is not included, we can't utilize `masked_fill_kernel_quantized_stub` even when we include the `IndexKernel.h` header, and it would throw an error during compilation.
3. Add multiple `TORCH_API`s in `aten/src/ATen/native/quantized/AffineQuantizer.h`: these functions are useful for other privateuse1 backends supporting quantization functions; if these `TORCH_API`s are missing, it would throw an error at runtime (undefined symbol)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139860
Approved by: https://github.com/bdhirsh
2024-11-18 05:09:59 +00:00
1d5a8ee8fb [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.
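
A minimal sketch of the idea, assuming a MultiProcessTestCase-derived test; this is illustrative, not the exact code added by the PR:

```python
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase

class MyDistTest(MultiProcessTestCase):
    def tearDown(self):
        super().tearDown()
        # A test may or may not have destroyed its group already, so only
        # destroy if a default process group is still initialized.
        if dist.is_initialized():
            dist.destroy_process_group()
```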

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
2024-11-18 04:26:21 +00:00
a1327fac45 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [5/N] (#140663)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
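
A minimal sketch of the substitution this series applies across the test suite (illustrative function, not taken from the PR):

```python
import torch

def fn(x):
    return torch.sin(x) + 1

# Before (deprecated style): compiled = torch._dynamo.optimize("inductor")(fn)
compiled = torch.compile(fn, backend="inductor")
print(compiled(torch.randn(4)))
```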

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140663
Approved by: https://github.com/williamwen42
2024-11-18 04:11:56 +00:00
16bc82a015 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [6/N] (#140688)
related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140688
Approved by: https://github.com/williamwen42
2024-11-18 04:09:09 +00:00
62d2c5b667 Revert "Enable XPUEvent elapsed_time function (#134666)" (#140872)
# Motivation
The PR being reverted (#134666) causes an internal UT failure on XPU.
This reverts commit 4bbd6da33101a8d709f1d2921ad8ae6f9b0dc166.
# Additional Context
refer to https://github.com/pytorch/pytorch/issues/140814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140872
Approved by: https://github.com/EikanWang
2024-11-18 02:58:05 +00:00
3d26c08dda Fix unintended deprecation warning in torch.distributed.optim (#140889)
We have a deprecation warning for scripted functional optimizers at module level in `torch/distributed/optim/__init__.py`. However, not all optimizers exposed by the module are scripted functional optimizers, causing some false deprecation warnings (e.g. https://github.com/pytorch/pytorch/issues/139661).

This PR moves the deprecation warning to the `__init__` functions of the deprecated scripted functional optimizers.
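
An illustrative sketch of the pattern, assuming a placeholder optimizer class and warning text (not the actual code):

```python
import warnings

class _ScriptedFunctionalOptimizer:
    def __init__(self, *args, **kwargs):
        # Warn only when a deprecated scripted functional optimizer is actually
        # constructed, instead of warning on `import torch.distributed.optim`.
        warnings.warn(
            "TorchScript-based functional optimizers are deprecated",
            FutureWarning,
            stacklevel=2,
        )
```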

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140889
Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/XilunWu
2024-11-18 02:34:51 +00:00
137554c943 [CI] Upgrade XPU support packages version to 2025.0 (#139775)
Works for #139722 and #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139775
Approved by: https://github.com/atalman
2024-11-18 02:26:13 +00:00
cyy
486b9aaa67 [1/N] Remove inclusion of ATen/core/Array.h (#122064)
The functionality of Array.h largely overlaps with std::array, and it should be safe to use std::array instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122064
Approved by: https://github.com/ezyang
2024-11-18 01:31:39 +00:00
c3fbec74bd [PT2][Optimus] Fix a corner case in merge splits (#140788)
Summary:
We observed another corner case where not all split items are used, see the screenshot

{F1960315622}

We thus skip such cases by checking the getitem indices.

Test Plan:
# local reproduce
```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split  --flow_id 663157369 2>&1 | tee ~/cmf.txt
```
P1679677122

# E2E

before fix
f663157369

after fix

Differential Revision: D65990213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140788
Approved by: https://github.com/jackiexu1992
2024-11-18 01:27:43 +00:00
625c24a7f9 [C10D] Support group_dst in scatter/gather (+object) ops (#140827)
Also add missing mypy typing and a few asserts to make mypy happy

Partially addresses RFC 0042 (pytorch/rfcs#71)
See more details/motivation in #140460

Note: the object collective version canonicalizes to global instead of group
rank, simply because this left more of the original code intact and
required fewer conversions overall.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140827
Approved by: https://github.com/kwen2501
2024-11-17 22:19:58 +00:00
99014a297c [BE][MPS] Apply clang-format to mps headers (#140906)
It was a mistake to have missed them in the past

All changes in this PR except ones to .lintrunner.toml are generated by running
`lintrunner -a --take CLANGFORMAT --all-files`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140906
Approved by: https://github.com/Skylion007
2024-11-17 21:06:27 +00:00
5a7e147ef3 [SymmetricMemory] introduce user-facing APIs empty() and rendezvous() (#139677)
Previously `SymmetricMemory` only had private pybind APIs:
```python
from torch.distributed._symmetric_memory import _SymmetricMemory
t = _SymmetricMemory.empty_strided_p2p(
    size=(64,),
    stride=(1,),
    dtype=torch.float32,
    device=device,
)
symm_mem_hdl = _SymmetricMemory.rendezvous(t, group_name=group.group_name)
```

This PR introduces user-facing APIs empty() and rendezvous():
```python
import torch.distributed._symmetric_memory as symm_mem
t = symm_mem.empty(64, device="cuda")
symm_mem_hdl = symm_mem.rendezvous(t, group_name=group.group_name)
```

Notable differences compared to the pybind APIs:
- `empty()` now resembles `torch.empty()`:
  - shape can be either an integer sequence or a variadic argument pack
  - no need to/can't specify stride anymore
  - device can either be `torch.device` or string
- `group_name` needs to be specified at rendezvous time as opposed to allocation time. See https://github.com/pytorch/pytorch/pull/139529 for the rationales. I feel the new semantic is superior, hence enforcing it in the public API.
  - Currently, the pybind API still support specifying `group_name` at rendezvous time.

This PR does not change the behavior of the pybind APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139677
Approved by: https://github.com/lw
ghstack dependencies: #139529
2024-11-17 20:51:50 +00:00
9f4af6b4e6 Add trunc to z3 validator (#140886)
Fixes vision_maskrcnn benchmark when validation is turned on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140886
Approved by: https://github.com/ezyang
ghstack dependencies: #140830, #140832, #140828
2024-11-17 18:38:30 +00:00
9005156004 don't specialize when grad tracking tensors are activated (#140828)
Fixes `python test/dynamo/test_inline_inbuilt_nn_modules.py
InlineInbuiltNNModulesFuncTorchHigherOrderOpTests.test_grad_non_tensor_input_inline_inbuilt_nn_modules`
when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140828
Approved by: https://github.com/ezyang
ghstack dependencies: #140830, #140832
2024-11-17 18:28:47 +00:00
e1d6c08f3d Specialize symfloats when getting fake value involves complex args (#140832)
Fixed `PYTORCH_TEST_WITH_DYNAMO=1 tlp python test/test_sparse_csr.py TestSparseCSRCPU.test_sampled_addmm_cpu_complex64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140832
Approved by: https://github.com/ezyang
ghstack dependencies: #140830
2024-11-17 18:17:54 +00:00
24be47f0c7 [MPS] Allow >2**32 metal dispatches (#140862)
By passing length as `NSUInteger` which should be a 64-bit value on all 64-bit systems according to https://developer.apple.com/documentation/objectivec/nsuinteger?language=objc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140862
Approved by: https://github.com/Skylion007
2024-11-17 18:05:44 +00:00
4269250a30 [BE][EZ] Use nested namespaces (#140905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140905
Approved by: https://github.com/Skylion007
2024-11-17 17:53:00 +00:00
cyy
73602873c9 [10/N] Fix Wextra-semi warning (#140880)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140880
Approved by: https://github.com/ezyang
2024-11-17 16:12:28 +00:00
2c6bd9f6f6 [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-17 16:10:37 +00:00
318eaa2be7 [inductor] Refactor reduction type choices into V.choices (#139585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139585
Approved by: https://github.com/shunting314
2024-11-17 16:10:37 +00:00
44afaac9fd [MPS][BE] Fix non-portable path warning (#140891)
I.e. fixes
```
1082/1084] Building OBJCXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mps/operations/UpSample.mm.o
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/UpSample.mm:224:10: warning: non-portable path to file '<ATen/native/mps/UpSample_metallib.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
  224 | #include <ATen/native/mps/Upsample_metallib.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |          <ATen/native/mps/UpSample_metallib.h>
```
as the generated header name should have the same capitalization as the respective shader file, i.e. `kernels/UpSample.metal`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140891
Approved by: https://github.com/Skylion007
2024-11-17 15:14:05 +00:00
90d3584147 [dyanmo] support subclasses of namedtuple type (#140534)
Allow subclassing namedtuple types, and allow assigning attributes to instances of these subtypes.
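
A minimal sketch of the now-supported pattern (illustrative example, not from the PR):

```python
import collections
import torch

class Point(collections.namedtuple("PointBase", ["x", "y"])):
    pass  # subclass without __slots__, so instances accept new attributes

@torch.compile
def fn(p):
    p.visited = True  # attribute assignment on a namedtuple-subclass instance
    return p.x + p.y

print(fn(Point(torch.ones(2), torch.ones(2))))
```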

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140534
Approved by: https://github.com/jansel
2024-11-17 14:13:40 +00:00
ab5c8857ef [SymmetricMemory] support specifying group_name at rendezvous time (#139529)
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```

Users can now omit `group_name` at allocation time and specify it later at rendezvous time:

```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```

Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
2024-11-17 09:31:17 +00:00
602ae9cbcf Specialize symfloats during equality checks (#140830)
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python
    test/inductor/test_torchinductor_opinfo.py
    TestInductorOpInfoCPU.test_comprehensive_nn_functional_local_response_norm_cpu_float32`
    when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140830
Approved by: https://github.com/ezyang
2024-11-17 06:35:22 +00:00
6094f17ada Revert "revert test repro logging" (#140749)
This reverts commit 6323fa673279eac9f2292b9b7790f621a4649af8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140749
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138634
2024-11-17 06:25:54 +00:00
62fb6fd8bd Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:17:07 +00:00
83e62cbc18 Enable all fixed cpp_wrapper tests (#139412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139412
Approved by: https://github.com/desertfire
ghstack dependencies: #139411

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:17:07 +00:00
819b0ebd94 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire

Co-authored-by: Bin Bao <binbao@meta.com>
2024-11-17 04:16:59 +00:00
2fc692b3dd [audio hash update] update the pinned audio hash (#140860)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140860
Approved by: https://github.com/pytorchbot
2024-11-17 03:34:54 +00:00
c1f21bf2b6 Made FlexAttention error on subgraph lowering failure (#140331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140331
Approved by: https://github.com/drisspg
2024-11-17 02:43:58 +00:00
b86b5349cb Ignore eager profiling code in training IR (#140826)
Differential Revision: [D66010452](https://our.internmc.facebook.com/intern/diff/D66010452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140826
Approved by: https://github.com/zhxchen17
2024-11-16 20:31:17 +00:00
bf8709b08a Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820)"
This reverts commit 77d1f076dadec7a77c4bcf807c4efbef6ca5a8f1.

Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))
2024-11-16 16:32:14 +00:00
ce77409647 Upgrade to fbscribelogger 0.1.7 (#138634)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138634
Approved by: https://github.com/huydhn
2024-11-16 14:33:34 +00:00
77d1f076da [C10D] call destroy_process_group after MultiProcess tests (#140820)
Faced with an annoying string of warnings like this when running tests,
<img width="1644" alt="Screenshot 2024-11-15 at 11 23 21 AM" src="https://github.com/user-attachments/assets/91ff4e1d-3c29-4510-9a61-46e7df68a212">

My choices seem to be (1) call destroy_process_group() at the end of
each test fn, (2) do this in some wrapper, (3) do it in the base test
class.

Since tests in MultiProcessTestCase are responsible for calling
init_process_group themselves, they should also be responsible for
calling destroy (or at least method (3) would be asymmetric and may
result in double-destroy).

But it doesn't feel worth it to go add a destroy call manually to each
test, and try/except for a possible second destroy call seems like a
happy middle ground.

Note: tests that want to ensure that destroy runs cleanly can and should
still call destroy _inside_ the test, and this change does not affect
that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820
Approved by: https://github.com/fegin
ghstack dependencies: #140460, #140815
2024-11-16 14:24:52 +00:00
f8891a764d [C10D] dedup send/recv impls (#140815)
Avoid copypaste of send/isend and recv/irecv impl.

This does change the warning issued from send to include the identifier
"isend" instead of "send", but I think that's not a big deal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140815
Approved by: https://github.com/fegin
ghstack dependencies: #140460
2024-11-16 14:24:52 +00:00
3d4e68fad3 [C10D] Support group_dst/group_src in c10d send/recv (#140460)
Partly addressing RFC 0042 (https://github.com/pytorch/rfcs/pull/71)

It's annoying that 'dst' (for send) must be a global rank even when a
group is passed in. But we can't easily change 'dst' without breaking
existing cases.

Furthermore, requiring use of 'global' dst breaks the less common usage
pattern of creating a new ProcessGroup object that is not connected to
the 'default group' and thus has no logical 'global' ranks.
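
A hedged sketch of the new group-relative addressing; the keyword names follow the commit title, and the exact call shape is an assumption:

```python
import torch
import torch.distributed as dist

# assumes init_process_group() has already run and `pg` is a subgroup containing this rank
t = torch.ones(4)
if dist.get_rank(pg) == 0:
    dist.send(t, group=pg, group_dst=1)   # rank 1 *within* pg, not a global rank
elif dist.get_rank(pg) == 1:
    dist.recv(t, group=pg, group_src=0)
```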
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140460
Approved by: https://github.com/d4l3k, https://github.com/kwen2501, https://github.com/fduwjj
2024-11-16 14:24:45 +00:00
2b39a8db77 Refactor UnflattenedModule's adapt flat args (#140840)
Test Plan: unblocks model launch

Differential Revision: D66014709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140840
Approved by: https://github.com/pianpwk
2024-11-16 05:09:37 +00:00
0f9eea1329 [FlexAttention] Fix multiple calls to flex bug (#140761)
# Summary
Fixes a long-standing bug we've had in the backward pass for flex attention. See https://github.com/pytorch/pytorch/issues/135161 for details

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140761
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-11-16 04:57:04 +00:00
a173186566 [RFC] Implement caching for user defined triton kernels (#140326)
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.

One HUGE hack we do here is a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.

Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
2024-11-16 02:37:16 +00:00
48a55b8623 [c10d][fr] wait counter for dump function (#140823)
Summary:
Add a wait counter for the dump function.
This is useful to see if we get stuck in the dump function and never return for a particular job.

Test Plan: Tested locally I and see `pytorch.wait_counter.NCCLTraceBuffer__dump.busy_time_us.sum.60` in ODS.

Differential Revision: D65823433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140823
Approved by: https://github.com/fduwjj
2024-11-16 02:22:08 +00:00
be90d3ce86 [IG] Avoid generation of empty merge cpu submodule by splitter v2 (#140794)
Summary:
Customize splitter behavior to mark `get_attr` nodes as acc supported.
Currently these nodes are excluded by `FxNetAccNodesFinder` which marks all nodes with op not in `CALLABLE_NODE_OPS` ("call_module", "call_function", "call_method") as unsupported.

Before this change, merge-net is split into an almost empty cpu submodule with a single empty output node:
```
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:###### debug_print nodes for _run_on_cpu_0
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:Found output node: n.name='output', n.target='output', n.args=((),), n.kwargs={}, n.meta={}
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:return ()
INFO:caffe2.torch.fb.model_transform.experimental.prepare_fx_model:
_run_on_cpu_0 stats for merge:
[output] output: 1
```
full log: P1678727348 (generated using same command as below)

Test Plan:
Tested by lowering `ig_organic_feed_cn_v2_mtml` using cmd:
```
buck run mode/opt-split-dwarf //tgif/cli:cli -- --model-name=ig_organic_feed_cn_v2_mtml --model-type ig_organic_feed_cn_v2_mtml --world-size=1 --storage-mode 1 --inference-dtype=FP16 --meta-transform=False --use-random-weights=True --accelerator-arch=3 --enable-input-dist=True --embedding-tables-dtype=FP16 --mtia-use-torch-export=True embedding-quantization-pass torchrec-sharding-pass tgif-split-pass gen-app-graph-pass tgif-mtia-lowering-pass dense-quantization-pass save-torch-package-pass generate-model-package-pass pack-weights-and-save-pass 2>&1 | tee /tmp/publish_ig_organic_feed_cn_v2_mtml_mtia_export_20241114_splitter_2.log
```
Output shows only 1 acc submodule is generated for merge:
```
INFO 18:33:15.951 1735650 utils.py:235: [TGIF] num of acc submodules: 1
INFO 18:33:15.952 1735650 utils.py:236: [TGIF] num of cpu submodules: 0
INFO 18:33:16.534 1735650 logging_utils.py:53: [TGIF] _run_on_acc_0 graph module debug info: https://www.internalfb.com/intern/everpaste/?color=0&handle=GK4VKhWsDKF9VdsDAKxhR6KAlhJ0br0LAAAz
INFO 18:33:16.534 1735650 utils.py:257: [TGIF] Start MTIA lowering _run_on_acc_0 in merge, device ordinal: -1
```
full log: P1679596796

Differential Revision: D65983916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140794
Approved by: https://github.com/ezyang
2024-11-16 01:49:03 +00:00
bf78a0fa96 Add dim to logging to help debug (#140445)
Differential Revision: D65839759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140445
Approved by: https://github.com/ljyuva83, https://github.com/ColinPeppler
2024-11-16 01:33:29 +00:00
5df9207ba9 Don't go through dispatch for *_dot_with_fp32_arith (#140834)
We don't need to dispatch for these because they're only used from within ATen/native/cpu, which is rebuilt per-CPU_CAPABILITY anyway.

Differential Revision: [D66012283](https://our.internmc.facebook.com/intern/diff/D66012283/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140834
Approved by: https://github.com/malfet
2024-11-16 00:30:25 +00:00
baf756a785 [reland] [aoti] Selectively package AOTI generated files (#140675)
Summary: Reland  https://github.com/pytorch/pytorch/pull/140022

Test Plan: CI

Differential Revision: D65929964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140675
Approved by: https://github.com/desertfire
2024-11-15 23:48:34 +00:00
109f8274a8 Revert "Add NHWC support for group normalization (#126635)"
This reverts commit ed0e63e938317fd254a705f00580caeb68768f9c.

Reverted https://github.com/pytorch/pytorch/pull/126635 on behalf of https://github.com/kit1980 due to Reverted internally at Meta, see D65979564 ([comment](https://github.com/pytorch/pytorch/pull/126635#issuecomment-2480130943))
2024-11-15 23:38:15 +00:00
0aed13437e remove typo in UninitializedParameter docstring (#140197)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140197
Approved by: https://github.com/Skylion007
2024-11-15 23:26:23 +00:00
41bb1539d3 Fix get_unsafe_globals_in_checkpoint to account for user allowed globals per docstring (#140738)
Bugfix: this function did not account for the user-allowed globals :(
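
A hedged sketch of the behavior being fixed; the checkpoint path and class below are placeholders:

```python
import torch
from torch.serialization import add_safe_globals, get_unsafe_globals_in_checkpoint

class MyConfig:
    pass

add_safe_globals([MyConfig])
# With this fix, globals allow-listed via add_safe_globals should no longer be
# reported as unsafe for a checkpoint that pickles MyConfig.
print(get_unsafe_globals_in_checkpoint("ckpt.pt"))
```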

Differential Revision: [D65960696](https://our.internmc.facebook.com/intern/diff/D65960696)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140738
Approved by: https://github.com/malfet
2024-11-15 22:47:35 +00:00
fc813df120 Benchmarks dynamo update script to use ClickHouse instead of Rockset (#140574)
The query works, but the part that parses the job name is broken

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140574
Approved by: https://github.com/huydhn
2024-11-15 22:17:35 +00:00
d64827dc35 [ROCm][Inductor][CK] Enable scaled mm with bias in gemm max autotune with CK backend (#140674)
## Testing
```
pytest test/inductor/test_ck_backend.py -k scaled_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140674
Approved by: https://github.com/chenyang78
2024-11-15 22:08:38 +00:00
ffd5197138 Ensure index for state guard construction is a source (#140515)
Fixes https://github.com/pytorch/pytorch/issues/140393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140515
Approved by: https://github.com/anijain2305, https://github.com/vmoens
2024-11-15 22:02:50 +00:00
1fd4757fdc Support tensor betas in Adam and AdamW (#134171)
Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW.
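
A minimal usage sketch (illustrative model; the tensor-wrapped betas are the new part, and any constraints on their device placement follow the PR):

```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(
    model.parameters(),
    betas=(torch.tensor(0.9), torch.tensor(0.999)),  # betas may now be 0-dim tensors
)
model(torch.randn(2, 4)).sum().backward()
opt.step()
```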

Fixes https://github.com/pytorch/pytorch/issues/133898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171
Approved by: https://github.com/janeyx99
2024-11-15 21:55:55 +00:00
924c1fe3f3 [CP] Enable CP + compiler tests when there are more than 2 GPUs (#133736)
https://github.com/pytorch/pytorch/pull/132755 makes c10d_functional.wait_tensor an effectful ORDERED op, which should resolve any issues due to dangling waits for CP ring attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133736
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-11-15 20:42:51 +00:00
476e0697f5 Fix for split gates enabled quantizable LSTM subclass (#140818)
Summary:
### Motivation
In D65283170, we need a subclass of quantizable LSTM to enable split_gates. It is also required for tests.

### What's the change?
As the subclass is not part of the no_observer() set, an improper observer is added after the quantizable LSTM module. Here, we switch the class check to an issubclass check on the no_observer set.

Test Plan:
- N6206576
- CI.

Reviewed By: andrewor14

Differential Revision: D65989314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140818
Approved by: https://github.com/andrewor14
2024-11-15 20:15:52 +00:00
03b7ec9237 Revert "create a new torch.cuda.memory_usage_in_bytes api (#140719)"
This reverts commit 9febc476372e25f65cfcd642bf49625db10f0f0b.

Reverted https://github.com/pytorch/pytorch/pull/140719 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is flaky on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140719#issuecomment-2479832082))
2024-11-15 20:05:32 +00:00
210de39872 Revert "[FlexAttention] Fix multiple calls to flex bug (#140761)"
This reverts commit b506d1cc8aee0d17cb72c2be0bc03361d4023698.

Reverted https://github.com/pytorch/pytorch/pull/140761 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/140761#issuecomment-2479819212))
2024-11-15 19:58:37 +00:00
47f44303ff Add ciflow/inductor automatically in more cases (#140824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140824
Approved by: https://github.com/malfet
2024-11-15 19:54:20 +00:00
80d63e7dd9 Fix softmax_backward_data cpu implementation error when argument output is noncontiguous (#139740)
Implementation of the `softmax_backward_data` operator for the CPU backend produces incorrect results when the `output` argument is non-contiguous.

Here is a test case that demonstrates this issue:

```python
torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]
out = out.contiguous()
print(out.is_contiguous())
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)
```

In this test case, the variable `grad_input` yields incorrect results if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` produces the same results regardless of whether `output` is contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
2024-11-15 19:53:20 +00:00
9602f56979 Fix misuse of offset param in seek (#140633)
Fixes #115630.

The size of BufferAdapter was calculated incorrectly due to a misuse of the Python `seek` method, causing the miniz reader to be initialized with the wrong size.
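
For context, a small sketch of the seek-based size computation the adapter relies on (illustrative; "archive.pt" is a placeholder path, not the fixed code):

```python
import io

with open("archive.pt", "rb") as f:
    size = f.seek(0, io.SEEK_END)   # seek() returns the new absolute offset, i.e. the file size
    f.seek(0, io.SEEK_SET)          # rewind before handing the buffer to the reader
    print(size)
```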

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140633
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
2024-11-15 19:07:52 +00:00
500ce29e4c Use has_free_unbacked_symbols instead of bool(free_unbacked_symbols) (#140027)
With 20K features this saves ~20 seconds: 257.021589517593 -> 237.8304626941681.
buck2 run @fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=2000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140027
Approved by: https://github.com/ezyang
2024-11-15 19:01:06 +00:00
4caf6a1fc8 [ROCm] Bug fix for flex attention configs avoiding ROCm path (#140270)
Fixes https://github.com/pytorch/pytorch/issues/139755 https://github.com/pytorch/pytorch/issues/139621

Follow-up fix to https://github.com/pytorch/pytorch/pull/139883, which made the bulk of the required changes, but a logic error resulted in ROCm still using H100 configurations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140270
Approved by: https://github.com/bertmaher
2024-11-15 17:52:56 +00:00
8e1f96469b [dynamo] Remove the name_stack code paths in symbolic_convert.py (#140155)
This is no longer needed now that we've replaced `ClosureVariable` with
`NewCellVariable`, i.e., Dynamo now treats `LOAD_CLOSURE` the same as
`LOAD_FAST`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140155
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152, #140436, #140435, #140153, #140154
2024-11-15 17:17:30 +00:00
54dde12c37 [dynamo] Remove closure_cells and merge/remove code paths (#140154)
Now that all cells are modeled as `NewCellVariable` in Dynamo, we no
longer need to put cell variables into this special `closure_cells`,
rather we just merge `closure_cells` with `symbolic_locals`.

This allows us to merge and remove some code paths, notably make
`LOAD_CLOSURE` the same as `LOAD_FAST`, and `LOAD_DEREF` & `STORE_DEREF`
the same for inlining or regular `InstructionTranslator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140154
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435, #140153
2024-11-15 17:17:30 +00:00
ea1d11cf74 [dynamo] Represent all cells as NewCellVariable (#140153)
In addition to `NewCellVariable`, Dynamo has 3 ways of modeling cell objects:
1. For cells captured and created by the root frame, represent them as
   their contents in `root_tx.symbolic_locals`, which `LOAD_DEREF` and
   `STORE_DEREF` update directly, without going through `SideEffects`.
2. `ClosureVariable`: this is created when cells from (1) are captured
   by a newly created function Dynamo is about to inline. It's a handle
   with a name that redirects `LOAD_DEREF` and `STORE_DEREF` back to (1),
   to make `root_tx.symbolic_locals` up-to-date.
3. For cells that are captured by both the root frame and some
   pre-existing function Dynamo is about to inline, represent those
   cells as contents, and do not allow writes to them.

Note that (2) and (3) are mainly to conform with (1) -- to make sure
Dynamo has a consistent modeling of cells for the same cell objects.

In this patch, we represent all of these cells as `NewCellVariable`. The
main new code paths introduced are:
- using `NewCellVariable` to model cell objects created by the root
  frame (the cells are passed in as input to `InstructionTranslator`),
  this is what allows us to get rid of all 3 legacy paths above.
- adding a new `AutoDerefLocalSource` to deal with the python-code
  level (guards) and bytecode level (codegen) auto-dereferencing
  behavior, when accessing pre-existing python cells. This also
  involves a tiny update to guard manager generation.
- plumbing some extra info into `LocalSource` and `CellVariable` so that
  we can still emit `LOAD_DEREF`, `STORE_DEREF`, `LOAD_CLOSURE` (instead
  of `make_cell`, `cell_contents` attribute access, and `LOAD_FAST`),
  which is important for readability, performance, and some
  assumptions `bytecode_transformation.py` makes.

As a result, this patch removes a lot of the now-dead code paths and
TODOs. Notably, it significantly simplified the `prune_dead_locals`
function, which was duplicating a lot of the logic from
`prune_dead_object_new`; this conveniently closes #137123.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140153
Approved by: https://github.com/jansel
ghstack dependencies: #140330, #140152, #140436, #140435
2024-11-15 17:17:30 +00:00
7faee6bf15 [dynamo] Track from registered tensor hooks in prune_dead_object_new (#140435)
Registered tensor hooks contain `NestedUserFunctionVariable`, which might
capture a `NewCellVariable` for cell objects created during Dynamo
tracing, so we must make sure it doesn't get pruned away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140435
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #140330, #140152, #140436
2024-11-15 17:17:30 +00:00
ac6684ebbc [dynamo] Identify pre-existing captured cells by cell id rather than content id (#140436)
In `match_nested_cell`, Dynamo tried to identify pre-existing captured
cells by `(cell_name, id(cell_contents))`. This works in most cases, but
as the test added in this patch shows, it's not a complete solution.

This patch
1. changes `match_nested_cell` to `lookup_variable_for_captured_cell`,
   and does the lookup based on id of cell objects, not their contents.
   This requires plumbing a tuple of captured cell objects from
   different CPython versions all the way to
   `InstructionTranslator.__init__`, where we store a mapping from the
   ids of these cell objects, and use it later in
   `UserFunctionVariable.bind_args` to look for these unboxed cells.
2. builds off (1) -- rather than using a `VariableTracker` that
   represents the content of the unboxed cells, use `ClosureVariable`,
   which enables codegen in case these cells escape as closure of a
   `NestedUserFunctionVariable`.

The patch adds a regression test for each of the scenarios above:
1. `test_write_to_cells_with_name_shadowing`, where Dynamo mistakenly
   thought the program was writing to a cell captured by the root frame (which
   it doesn't support atm), which resulted in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/symbolic_convert.py", line 3340, in STORE_DEREF
    unimplemented("write to __closure__ while inlining")
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/exc.py", line 313, in unimplemented
    raise Unsupported(msg, case_name=case_name)
torch._dynamo.exc.Unsupported: write to __closure__ while inlining
```
2. `test_existing_func_that_creates_capturing_nested_func` where Dynamo
   ended up trying to codegen a `NestedUserFunctionVariable` that
   captures a cell which was also captured by the root frame, so it was
   unboxed and ends up emitting `LOAD_DEREF` rather than
   `LOAD_FAST/LOAD_CLOSURE` during codegen, resulting in
```
  File "/Users/ryanguo99/Documents/work/pytorch/torch/_dynamo/variables/functions.py", line 105, in _create_nested_fn
    func = FunctionType(code, f_globals, name, defaults, closure)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: arg 5 (closure) expected cell, found int
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140436
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #140330, #140152
2024-11-15 17:17:30 +00:00
a4032d8396 [dynamo] Use ExecutionRecorder only in root frame InstructionTranslator (#140152)
As title. This is effectively what ended up happening anyway, since we
always overwrite the record with the current frame's record while propagating
the exception upward in `InstructionTranslatorBase.run`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140152
Approved by: https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #140330
2024-11-15 17:17:30 +00:00
85dd7b84cf [dynamo] Add a DynamoFrameType type above Python frame object (#140330)
This patch introduces a `DynamoFrameType` to serve as a layer between
Dynamo and different versions of Python frame object. In
`DynamoFrameType`, we only register attributes Dynamo cares about (e.g.,
`f_code`, `f_locals`, etc.).

This will be helpful when it comes to adding new attributes to this
`DynamoFrameType`, or dealing with Python version changes.
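
An illustrative sketch of the idea; the attribute set and constructor below are assumptions, not the actual Dynamo class:

```python
class DynamoFrameType:
    """Thin adapter exposing only the frame attributes Dynamo reads."""

    def __init__(self, frame):
        self.f_code = frame.f_code
        self.f_locals = frame.f_locals
        self.f_globals = frame.f_globals
        self.f_builtins = frame.f_builtins
        self.f_lasti = frame.f_lasti
```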

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140330
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-15 17:17:30 +00:00
c05eff278a [BE][Ez]: Update ruff to 0.7.4 (#140806)
Updates ruff to 0.7.4, mainly updates false pos/negatives for rules and fixes some bad autofixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140806
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-11-15 17:04:32 +00:00
de34f581f1 Revert "Made FlexAttention error on subgraph lowering failure (#140331)"
This reverts commit e68bc76c28934561e336f0fba8ef71bcea401701.

Reverted https://github.com/pytorch/pytorch/pull/140331 on behalf of https://github.com/malfet due to Looks like it regressed trunk, see 55f1959fc1/1 ([comment](https://github.com/pytorch/pytorch/pull/140331#issuecomment-2479435705))
2024-11-15 17:00:21 +00:00
cyy
55f1959fc1 [12/N] Fix extra warnings brought by clang-tidy-17 (#140801)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140801
Approved by: https://github.com/Skylion007
2024-11-15 16:54:30 +00:00
e2e67a010a [logging] Add dynamo_compile fields for pre-dispatch/joint/post-dispatch times (#140306)
Tested internally: P1679622670

Differential Revision: [D65986059](https://our.internmc.facebook.com/intern/diff/D65986059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140306
Approved by: https://github.com/ezyang
2024-11-15 15:02:08 +00:00
cyy
1b95ca904f [9/N] Fix Wextra-semi warning (#140803)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140803
Approved by: https://github.com/lw
2024-11-15 14:01:43 +00:00
25d9be37be Implements user buffer registration using MemPool (#133603)
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions using a combination of allocating special memory via MemPool and registering it with the NCCL buffer registration APIs.
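
A hedged sketch of the allocation side, assuming the torch.cuda.MemPool / use_mem_pool APIs are available in this build; the NCCL registration step itself is not shown:

```python
import torch

pool = torch.cuda.MemPool()  # dedicated allocation pool (assumption: available here)
with torch.cuda.use_mem_pool(pool):
    # Allocated from `pool`; such buffers can then be registered with the NCCL
    # buffer registration APIs to enable NVLS reductions.
    buf = torch.empty(1 << 20, device="cuda")
```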

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-11-15 12:47:49 +00:00
ae7f809bfc Update torch-xpu-ops commit pin (#140782)
Update the torch-xpu-ops commit to [bf4bab1](bf4bab1fff), which includes:

- Fix `-Werror=terminate`-related build issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140782
Approved by: https://github.com/EikanWang
2024-11-15 10:10:52 +00:00
ee3a4f068c [FSDP2] privateuse1 support fsdp2. (#139539)
We are looking forward to supporting FSDP2 with devices other than CUDA. Please give me some coding suggestions. Thank you very much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139539
Approved by: https://github.com/kwen2501
2024-11-15 06:34:35 +00:00
b506d1cc8a [FlexAttention] Fix multiple calls to flex bug (#140761)
# Summary
Fixes a long-standing bug we've had in the backward pass for flex attention. See https://github.com/pytorch/pytorch/issues/135161 for details

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140761
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-11-15 06:28:20 +00:00
9febc47637 create a new torch.cuda.memory_usage_in_bytes api (#140719)
Summary:
the current torch.cuda.memory_usage returns the memory utilization; more specifically, the percent of time over the past sample period during which global memory was being read or written, for NVIDIA GPUs.

see more details in https://github.com/pytorch/pytorch/issues/140638

Test Plan: added a new unittest

Differential Revision: D65928031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140719
Approved by: https://github.com/xw285cornell, https://github.com/hongxiayang
2024-11-15 05:59:40 +00:00
6c0a2d8bbf Fix the check for can_use_expanded_index_path (#140351)
Fixes #129093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140351
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2024-11-15 05:52:23 +00:00
8043e67026 catch tensor.numel() == 0 in nan detector (#140741)
Context: we are trying to pass an empty tensor through the system now (sometimes... it's an edge case), and it seems to cause all_reduce to segfault, which is unexpected to me

Deep Shah and Pavan identified the issue, I'm just pushing for a fix :)

Test Plan: idk what i'm doing here, someone help

Reviewed By: shuqiangzhang

Differential Revision: D65956095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140741
Approved by: https://github.com/shuqiangzhang
2024-11-15 05:03:20 +00:00
865a7c5238 [ONNX] Improve the conversion of from dynamic axes to shapes (#140488)
Features:
(1) Add support for tree structure.
(2) Add a user warning before the axes-to-shapes conversion
(3) Add a suggestion to provide `dynamic_shapes` when the conversion fails

Notes:
(1) `input_names` is crucial to the conversion, as we don't know the ONNX graph inputs.
(2) min and max are set to defaults, so LLMs have a higher chance of failing if users use `dynamic_axes`, due to the min/max constraint dependency between `attention_mask` and `sequence_length`, etc. (found in llama-3.2-1B_Instruct)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140488
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-11-15 04:26:45 +00:00
94824766e6 [ONNX] Separate decomp into single step and add to the report (#140767)
1. Fix the ordering of the error report entries so non-strict entries show on top
2. Isolate run_decomposition into a separate step because it sometimes fails. This makes it easier for users to understand what failed

Fix https://github.com/pytorch/pytorch/issues/140762 Fix https://github.com/pytorch/pytorch/issues/137638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140767
Approved by: https://github.com/titaiwangms
2024-11-15 04:26:16 +00:00
e68bc76c28 Made FlexAttention error on subgraph lowering failure (#140331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140331
Approved by: https://github.com/drisspg
2024-11-15 04:26:01 +00:00
80aa19a622 [PGNCCL] Add an API to get the status/error code of each PG (#140087)
Summary:
If unhealthy, the user should be able to get the type of error, e.g.,
timeout, NCCL error, or remote error.

This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach work specific context and debug the RC of the specific failing work/collective

Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast is currently done through TCPStore.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
2024-11-15 04:11:00 +00:00
9c88b08ac9 [BE] Replace skipIfMPS with expectedFailureMPS (#139940)
Functionally the two decorators are very similar, but one should rely on expectedFailure as much as possible to get a signal when something is fixed.
- Move the `product_version` variable from `test_mps` to common_utils, but call it `MACOS_VERSION`
- Introduce `skipIfMPSOnMacOS13` to decorate the hard crashes that happen only on macOS 13 (which at this point will not get any fixes and will be deprecated soon)
- Add `device_type='mps'` to all `skipIfMPS` per https://github.com/pytorch/pytorch/issues/140560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139940
Approved by: https://github.com/janeyx99, https://github.com/huydhn
2024-11-15 03:48:37 +00:00
1c1d06a22c [ROCm] remove size restrictions in gemm_and_bias (#140724)
This aligns hipblaslt behavior with CUDA_VERSION >= 12010.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140724
Approved by: https://github.com/pruthvistony, https://github.com/eqy
2024-11-15 02:23:27 +00:00
baf8686aec [BE][MPS] Remove extra semicolons (#140776)
Fixes following warnings:
```
In file included from /Users/malfet/git/pytorch/pytorch/torch/csrc/Generator.cpp:25:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:40:63: warning: extra ';' after member function definition [-Wextra-semi]
   40 |   void set_engine(at::Philox4_32 engine) { engine_ = engine; };
      |                                                               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:41:46: warning: extra ';' after member function definition [-Wextra-semi]
   41 |   at::Philox4_32 engine() { return engine_; };
      |                                              ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/mps/MPSGeneratorImpl.h:43:62: warning: extra ';' after member function definition [-Wextra-semi]
   43 |   static DeviceType device_type() { return DeviceType::MPS; };
      |                                                              ^
3 warnings generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140776
Approved by: https://github.com/Skylion007
2024-11-15 01:47:55 +00:00
cec82c3aed Use Manylinux 2.28 for aarch64 CPU workflows (#140743)
Use https://hub.docker.com/r/pytorch/manylinux2_28_aarch64-builder/tags

Similar to https://github.com/pytorch/pytorch/pull/138732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140743
Approved by: https://github.com/malfet
2024-11-15 01:46:29 +00:00
33191bb664 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/mcr229

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-15 00:25:14 +00:00
14ecbfe184 Add kwen2501 to CODEOWNERS of c10d backend APIs (#140231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140231
Approved by: https://github.com/shuqiangzhang
2024-11-14 23:58:51 +00:00
217d328764 OpenReg: Support autograd (#140662)
Add some previously unfinished implementations to support autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140662
Approved by: https://github.com/ezyang
2024-11-14 23:47:56 +00:00
02d0c43c32 [SymmetricMemory] fix a bug in symm_mem::memset32_ where the ops fails when offset=0 (#140129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140129
Approved by: https://github.com/lw
ghstack dependencies: #140127, #140128
2024-11-14 23:29:16 +00:00
684db9beb2 [SymmetricMemory] fix a bug where get_signal_pad() returns a tensor backed by a buffer ptr instead of a signal_pad ptr (#140128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140128
Approved by: https://github.com/lw
ghstack dependencies: #140127
2024-11-14 23:29:16 +00:00
c3d61bd367 [SymmetricMemory] allow overlapping devices for testing (#140127)
When `TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES` is set, the check for overlapping devices and multicast support will be disabled. This is useful for testing with a single device.

Making this an env var instead of an API argument, since this is likely only useful for testing.
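A minimal sketch of how a test might opt in (the env var name comes from the description above; it must be set before the symmetric-memory rendezvous runs):
```python
import os

# Disable the overlapping-device / multicast checks so a single device can
# stand in for multiple ranks during testing.
os.environ["TORCH_SYMM_MEM_ALLOW_OVERLAPPING_DEVICES"] = "1"
```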

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140127
Approved by: https://github.com/lw
2024-11-14 23:29:16 +00:00
Aki
9c818c880f [torchgen] Improve schema parsing with regex for numeric ranges (#140210)
Replaces the hardcoded string replacement for numeric ranges with a more robust regex pattern that handles any combination of positive and negative numbers in default value ranges.
Fixes #135470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140210
Approved by: https://github.com/ezyang
2024-11-14 23:28:27 +00:00
cyy
e90888a93d [8/N] Fix Wextra-semi warning (#140697)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140697
Approved by: https://github.com/ezyang
2024-11-14 23:08:04 +00:00
05c3330893 use more elements per thread for narrow dtypes (#139449)
Fix perf issue for narrow type by accessing more elements per thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-14 22:50:16 +00:00
7621fc5dad Add missing boundary checks to cunn_SoftMaxForward (#140682)
This fixes OOB memory access for following code
```python
import torch
qk = torch.randn((1024,587), dtype=torch.float64, device='cuda')
smqk = torch.softmax(qk, dim=-1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140682
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-11-14 22:49:06 +00:00
c1fe6be202 Revert "[dynamo] add SymNode bitwise and/or (#138777)"
This reverts commit c98ef0279e6eb968f5f9d22e1f193e7064594152.

Reverted https://github.com/pytorch/pytorch/pull/138777 on behalf of https://github.com/ezyang due to triggering AssertionError: Guard check failed: 14/2: name 'BitwiseFn_bitwise_or' is not defined ([comment](https://github.com/pytorch/pytorch/pull/138777#issuecomment-2477477776))
2024-11-14 21:52:40 +00:00
d751b271b5 Torchbench nightly MPS runs (#135386)
Add a workflow to run TorchBench with nightly mps builds & upload performance data to the HUD

Solves: https://github.com/pytorch/pytorch/issues/115201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135386
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-14 21:50:23 +00:00
f57ef5ddf2 Update Kineto Submodule (#140629)
Summary: Update Submodule from Oct 10, 2024 to Nov 13, 2024

Test Plan: CI Passes

Differential Revision: D65915865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140629
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/briancoutinho
2024-11-14 21:23:59 +00:00
2ea2c89675 Fixes the manylinux_2_28 docker image to build PyTorch on Aarch64 (#137696)
This change adds OpenBLAS support to the manylinux_2_28 Docker image.

- It allows us to build PyTorch using manylinux_2_28.
- Using this image in PyTorch builds provides major perf improvements on the tested TorchBench models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137696
Approved by: https://github.com/snadampal, https://github.com/atalman
2024-11-14 21:09:53 +00:00
3424ca378f [Inductor efficiency] Move less critical Inductor jobs to periodic (#140466)
Moves jobs that don't have to be run as frequently to the inductor-periodic workflow, based on the priorities given by @desertfire
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140466
Approved by: https://github.com/huydhn, https://github.com/zxiiro, https://github.com/desertfire
2024-11-14 21:09:06 +00:00
27c7caf745 [ROCm] TunableOp fix for batched MM with views. (#140673)
Fixes #140278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140673
Approved by: https://github.com/jeffdaily
2024-11-14 20:22:12 +00:00
8094b19620 Fix _out_spec (#140608)
Summary: The gm_torch_level can be a _LazyGraphModule(GraphModule) instead of a GraphModule. When we call .recompile(), GraphModule populates the self._out_spec, but _LazyGraphModule(GraphModule).recompile() doesn't populate it.

Test Plan: CI

Differential Revision: D65902135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140608
Approved by: https://github.com/tugsbayasgalan
2024-11-14 20:09:30 +00:00
b0d681417c [MPS] Reintroduce support for convolutions with output_channels > 65536 (#140726)
This reintroduces support for high channel sizes for convs. The guard for macOS versions < 15.1 is still present to prevent reintroducing #129207.

I'm unsure about the specific macOS version support, but I'm assuming this was fixed in 15.1, and I'm relying on signals from ci for verification. I'm expecting the new test will fail for macOS versions < 15.1, and the old test will start failing for > 15.0. I've added xfails for this and extended the version helpers to support 15.1+.

Fixes #140722
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140726
Approved by: https://github.com/malfet
2024-11-14 20:09:01 +00:00
cd6ace1d15 [EZ] Delete unused xfailIfMacOS14_4Plus (#140735)
Issue was fixed by https://github.com/pytorch/pytorch/pull/130038 but decorator remained in place

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140735
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-11-14 20:08:48 +00:00
65518fd9ef Turn on triton bundler in OSS (#140600)
Its been enabled internally, lets also push it out to OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140600
Approved by: https://github.com/masnesral
2024-11-14 20:02:15 +00:00
c536903c3f revert test repro logging (#140717)
@ezyang noticed this exercises a multithreading bug that is causing tests to become disabled:

```
2024-11-13T21:05:55.8363582Z inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCPU::test_comprehensive_fft_ihfftn_cpu_int32 /opt/conda/envs/py_3.9/lib/python3.9/site-packages/_pytest/threadexception.py:73: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-3
2024-11-13T21:05:55.8364857Z
2024-11-13T21:05:55.8364974Z Traceback (most recent call last):
2024-11-13T21:05:55.8365491Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 980, in _bootstrap_inner
2024-11-13T21:05:55.8366003Z     self.run()
2024-11-13T21:05:55.8366371Z   File "/opt/conda/envs/py_3.9/lib/python3.9/threading.py", line 917, in run
2024-11-13T21:05:55.8366858Z     self._target(*self._args, **self._kwargs)
2024-11-13T21:05:55.8367518Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 176, in _run_event_loop
2024-11-13T21:05:55.8368189Z     self.loop.run_until_complete(self.task)
2024-11-13T21:05:55.8368774Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2024-11-13T21:05:55.8369348Z     return future.result()
2024-11-13T21:05:55.8369980Z   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py", line 214, in _worker
2024-11-13T21:05:55.8370603Z     message = await asyncio.wait_for(
2024-11-13T21:05:55.8371090Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
2024-11-13T21:05:55.8371573Z     return await fut
2024-11-13T21:05:55.8372156Z   File "/opt/conda/envs/py_3.9/lib/python3.9/asyncio/queues.py", line 166, in get
2024-11-13T21:05:55.8372613Z     await getter
2024-11-13T21:05:55.8374010Z RuntimeError: Task <Task pending name='Task-1' coro=<FbScribeLogger._worker() running at /opt/conda/envs/py_3.9/lib/python3.9/site-packages/fbscribelogger/__init__.py:214> cb=[_run_until_complete_cb() at /opt/conda/envs/py_3.9/lib/python3.9/asyncio/base_events.py:184]> got Future <Future pending> attached to a different loop
2024-11-13T21:05:55.8375366Z
2024-11-13T21:05:55.8375603Z   warnings.warn(pytest.PytestUnhandledThreadExceptionWarning(msg))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140717
Approved by: https://github.com/ezyang, https://github.com/zxiiro
2024-11-14 19:51:52 +00:00
f6ba95a76f [inductor] PyCodeCache: only delete on-disk artifacts if purge=True (#140216)
Summary: https://github.com/pytorch/pytorch/pull/136505 changed the cache_clear operation to remove loaded modules from disk. That change caused some problems with TORCHINDUCTOR_FORCE_DISABLE_CACHES=1: there are some code paths (coordinate descent tuning, at least) that call `PyCodeCache.load_by_key_path` and expect the files to still be on disk. (But when caches are disabled, we call cache_clear before every inductor compile.) It seems we probably have a shortcoming in the disable-cache logic, but since we also have flaky test failures with the same `'could not get source code'` error, let's restore the previous functionality until I can investigate further.

Since some tests actually _DO_ want to delete on-disk artifacts (e.g., to test remote caching), I added a `purge` param to optionally delete the files.
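A hedged sketch of the two behaviors (the `purge` parameter is described above; the exact import path is an assumption for illustration):
```python
from torch._inductor.codecache import PyCodeCache

# Default: drop in-memory module references but leave generated files on disk,
# so later load_by_key_path calls can still find their source.
PyCodeCache.cache_clear()

# Opt-in: also delete the on-disk artifacts (used e.g. by remote-caching tests).
PyCodeCache.cache_clear(purge=True)
```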

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140216
Approved by: https://github.com/eellison
2024-11-14 19:34:57 +00:00
7702da9ce6 ci: Remove --progress-bar fallback for pip (#140189)
All versions of pip that we currently support should have this flag so removing this should essentially be a no-op.

Also put the actual command into a variable so we only have to change it once next time instead of changing it in 3 places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140189
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-11-14 19:26:41 +00:00
222d4b48b1 Revert "cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)"
This reverts commit 761b42bc085190e272a930847694e872d92a1255.

Reverted https://github.com/pytorch/pytorch/pull/139411 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
25048e5381 Revert "Enable all fixed cpp_wrapper tests (#139412)"
This reverts commit fef16fe254da2f9598c6f8bb19fdd883e5a54971.

Reverted https://github.com/pytorch/pytorch/pull/139412 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
14641c0393 Revert "Fix broken AOTInductor node and kernel counts (#139435)"
This reverts commit 8cb0b932a16ee69137287b4e3872ffd39a79a8d4.

Reverted https://github.com/pytorch/pytorch/pull/139435 on behalf of https://github.com/kit1980 due to breaking internal inductor test ([comment](https://github.com/pytorch/pytorch/pull/139411#issuecomment-2477235367))
2024-11-14 19:25:46 +00:00
b69282c98c Enable opting out of experiments even when they're being rolled out (#140433)
Enables opting out of specific experiments in the runner determinator

To opt out:
1. Go to the tracking issue: https://github.com/pytorch/test-infra/issues/5132
2. In the entry by your name, enter the experiment name, prefixed with a `-`.  For example, to opt out of the LF fleet you could enter `@ZainRIzvi,-lf`

This lets you simultaneously be opted into some experiments and opted out of others.

While the `disable-runner-experiments` label offers an option to disable all experiments on a given PR, this one lets you disable a selected set of experiments across all your PRs.

Fixes https://github.com/pytorch/pytorch/issues/138099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140433
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-11-14 19:18:24 +00:00
b11ff3cf60 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
2024-11-14 19:11:20 +00:00
ea7d1826a2 [ez] Make merge blocking sevs be based on label instead of string (#140636)
Sev issues are now merge blocking if they are labeled merge blocking, instead of simply having the merge-blocking string in the body. This makes it easier to default to non-merge-blocking when creating a sev.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140636
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-11-14 19:02:27 +00:00
sdp
83b6d91d08 [Intel GPU] Add NestedTensorXPU to parseDispatchKey and codegen (#140461)
Add `NestedTensorXPU` dispatch key.
```
>>> nt = torch.nested.nested_tensor([]).to("xpu")
>>> nt
nested_tensor([

], device='xpu:0')
>>> nt.is_xpu
True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140461
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-11-14 18:54:41 +00:00
9ff368c270 [pytorch] Add logger for pt2 compile chromium events to hive (#139941)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2535

Logging raw chromium events to hive per job run enables us to build combined rank perfetto traces without having to depend on Logarithm and deal with things like rate limits etc.

We can easily build a utility to query hive and upload traces to manifold and view them on perfetto

Test Plan:
Launch a job

```
buck2 run mode/opt //aps_models/examples/dlrm:dlrm_train_app -- --config-name train_mast_fsdp_torchdynamo launcher.data_project=apf_ai_infra launcher.fbl_entitlement=ai_infra_training_rnd_tc  launcher.hardware=TC_ANY_80G
```

Local run
```
Perfetto: ['https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https://interncache-all.fbcdn.net/manifold/pt2_compile_traces_test/tree/pt2_trace_files/aps-ppanchalia-426838c277/0/0/2bc9975d-921c-4766-9cb2-e7ce9833ae96.json']
```

{F1954710538}

Differential Revision: D65525513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139941
Approved by: https://github.com/jamesjwu
2024-11-14 18:27:38 +00:00
50ab68fa22 [EZ] Make lintrunner usable with Python-3.12 and 3.13 (#140721)
By installing numpy-2.1, as 1.26 is only available up to Python-3.11, and restricting TorchFix to Python older than 3.13, as TorchFix depends on libcstd-1.2 and therefore cannot be installed on 3.13; see https://github.com/pytorch-labs/torchfix/issues/84

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140721
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/ZainRizvi
2024-11-14 17:52:05 +00:00
879e273601 fix: Add type annotation to _record_memory_history (#140545)
Pylance infers the type of the first argument (`enabled`) to `_record_memory_history` as `str` even though the function accepts `Literal[None, "state", "all"]`.

This raises an issue when passing `None`, even though it is a legitimate argument.

This PR addresses the issue by adding the type annotation in the doc string.
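For context, a minimal usage sketch of the argument in question (requires a CUDA device; the literals come from the description above):
```python
import torch

# "all" starts recording allocation history for every alloc/free event.
torch.cuda.memory._record_memory_history("all")

x = torch.randn(1024, 1024, device="cuda")

# Passing None stops recording -- the legitimate argument Pylance flagged.
torch.cuda.memory._record_memory_history(None)
```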

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140545
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-14 17:44:46 +00:00
adcff4bff0 Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit d3fc13a9dd186ceb8d1b56b0968a41686ea645cd.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/ngimel due to breaks tests ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2477012582))
2024-11-14 17:28:32 +00:00
f4008a5ce4 [AOTI XPU] Remove workarounds after update torch-xpu-ops that extend c_shim_xpu layer with out-of-tree ATen OPs. (#139026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139026
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-11-14 17:14:58 +00:00
add6bb2e96 [aps] skip version check for export IR. (#140573)
Summary: mitigating potential export compatibility issue for production (temporarily).

Test Plan: CI

Differential Revision: D65890958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140573
Approved by: https://github.com/desertfire
2024-11-14 17:13:42 +00:00
dcf22fa58c [AOTI][refactor] Add sizes and strides util functions (#140449)
Summary: Similar to https://github.com/pytorch/pytorch/pull/139895, add sizes and strides methods to RAIIAtenTensorHandle and ConstantHandle, to increase the code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140449
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447, #140448
2024-11-14 16:48:43 +00:00
3ef2dfc1ba [export] Implement cpp deserializer. (#136398)
Differential Revision: D63206258

This diff introduces a mechanism to generate a json-compatible deserializer in cpp using nlohmann json (already being used by AOTI).

Why do we need this? Because there will be a lot of cases where people don't want to use Python to load the graph (e.g. a cpp runtime); instead, they can use this header to deserialize the JSON graph.

Every time we call update_schema.py to update the schema, the header will be auto generated and included into the source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136398
Approved by: https://github.com/angelayi
2024-11-14 16:34:59 +00:00
f98c601efe Avoid logging zeros (#139968)
Summary: title

Test Plan: NA

Differential Revision: D65582953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139968
Approved by: https://github.com/zou3519
2024-11-14 15:46:49 +00:00
216b6a952c triangular_solve: fix meta function output argument dtype check. (#140286)
Tracking issue: #138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140286
Approved by: https://github.com/ezyang
ghstack dependencies: #140186
2024-11-14 15:25:14 +00:00
72c6d13cea [BE]: Use proper logger in torch.distributed.run (#140547)
`torch.distributed.run` was improperly using the root logger, ignoring all logging settings and useful debugging info. It now properly uses the correct logger. Will be added to ruff as part of LOG015 soon.
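As a general illustration of the fix pattern (not the exact diff from the PR), a module-level logger respects the application's logging configuration, unlike calls routed through the root logger:
```python
import logging

logger = logging.getLogger(__name__)  # named logger instead of the root logger

def launch():
    # Honors handlers/levels configured by the embedding application.
    logger.info("starting launch")  # illustrative message, not taken from the PR
```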

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140547
Approved by: https://github.com/XuehaiPan, https://github.com/fegin
2024-11-14 14:49:17 +00:00
1c669e7c4e Document the parameter (hx) that RNN actually uses (#140575)
Fixes https://github.com/pytorch/pytorch/issues/136925

This PR updates the docs to use `hx`, which is the parameter actually used by `RNN`:

629c243c82/torch/nn/modules/rnn.py (L650)
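For reference, a small sketch showing the `hx` parameter in use:
```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)        # (batch, seq, feature)
hx = torch.zeros(1, 2, 16)      # (num_layers, batch, hidden): the initial hidden state
output, hn = rnn(x, hx)         # `hx` is the name used in the forward signature
```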
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140575
Approved by: https://github.com/ezyang
2024-11-14 14:45:17 +00:00
ebeab262d9 Refine XPU device prop and fix typo (#140661)
# Motivation
`architecture` is an experimental attribute that might be used by triton AOT codegen. It should not be in `__repr__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140661
Approved by: https://github.com/EikanWang
2024-11-14 11:18:01 +00:00
9a051f6ee0 OpenReg: Fix issue when creating empty tensor (#140496)
On the executor side, tensor creation fails when meta.data_ptr is found not to be in the allocated memory, but there is no need to allocate memory when creating an empty tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140496
Approved by: https://github.com/ezyang
2024-11-14 11:10:37 +00:00
aaefa48441 reduce the threshold to change existing data suggestion to noise/3 (#140623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140623
Approved by: https://github.com/bobrenjc93
2024-11-14 06:29:25 +00:00
62eea62493 [Quant][Onednn] add linear_dynamic_fp16 ops (#140376)
**About this PR**
This PR adds the following ops for `linear_dynamic_fp16` in the onednn namespace. These ops are intended for PT2E quantization eager mode.
- `onednn::linear_prepack_fp16`: packs an fp32 weight into an fp16 MkldnnCPU tensor.
- `onednn::linear_dynamic_fp16`: takes an fp32 CPU tensor and an fp16 MkldnnCPU tensor and computes linear in fp32
- `onednn::linear_relu_dynamic_fp16`: similar to the former, applying relu on the output.

**Test plan**
`python test/test_quantization.py -k test_linear_dynamic_fp16_onednn`

**Implementation**
These ops call oneDNN lib under the hood. It's worth noting that oneDNN does not support f32 * f16 -> f32 computation, so we have to convert fp16 weight to fp32 before computation. And weight is still in plain format after packing.

**Correctness and performance**
Correctness is guaranteed by UT.
Performance of the new ops may be better than the FBGEMM implementation when weight shape is small but worse when weight shape is large. It's because weight dtype conversion and computation are not fused.
For example, I ran benchmarks on an Intel(R) Xeon(R) Platinum 8490H machine with different cores and shapes. When using 1 core per instance, the new implementation generally is faster for weight shape < 1024 * 1024. When using more cores, the threshold will increase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140376
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2024-11-14 05:19:18 +00:00
99c8d5af27 Don't pass credentials explicitly to sccache (#140611)
sccache-0.2.14 can query it through IMDSv1 and sccache-0.8.2 can do it through v2 (or maybe just use the trust relationship between host and bucket).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140611
Approved by: https://github.com/wdvr
2024-11-14 04:44:55 +00:00
e6083016b3 fix test_float_to_int_conversion_nonfinite for NumPy 2 (#138131)
Related to #107302

We saw `test_float_to_int_conversion_nonfinite` fail as we upgraded to NumPy 2.

It is caused by the undefined behavior of `numpy` casting `inf`, `-inf` and `nan` from `np.float32` to other dtypes.
The test is using NumPy as reference for the ground truth. (see line 1013-1015)
However, these behaviors are undefined in NumPy.
If you do `np.array([float("inf")]).astype(np.uint8, casting="safe")`, it results in an error `TypeError: Cannot cast array data from dtype('float64') to dtype('uint8') according to the rule 'safe'`.
The undefined behaviors are always subject to change.

This PR addresses the issue by passing concrete values as the ground-truth references.
In the future, even if NumPy changes its behavior, the test will remain stable.
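A small demonstration of why NumPy cannot serve as the reference here:
```python
import numpy as np

vals = np.array([np.inf, -np.inf, np.nan], dtype=np.float32)
# Casting non-finite floats to an integer dtype is undefined behavior in NumPy:
# the result is platform- and version-dependent, so it is not a stable ground truth.
print(vals.astype(np.int32))
```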

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138131
Approved by: https://github.com/drisspg
2024-11-14 04:19:19 +00:00
d32eac86f3 Put a compile lock around backward compile (#140626)
Summary: https://fb.workplace.com/groups/1286739428954016/posts/1370274947267130

Test Plan:
```
hg up b5b5adce34
vizard_projects/ml_depth/scripts/run_mld.sh
```

used to crash, no longer crashes

Differential Revision: D65913100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140626
Approved by: https://github.com/ezyang
2024-11-14 04:07:46 +00:00
3ce75e7ea6 [Inductor UT] Fix duplicate registration of custom ops amount test cases (#140540)
Fix #140537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140540
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #140517
2024-11-14 03:36:20 +00:00
8d3a07e321 [Inductor UT] Skip test_decompose_mem_bound_mm.py for XPU since we have not enabled decompose_mem_bound_mm for XPU. (#140517)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140517
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-11-14 03:36:20 +00:00
b1d6250028 [ONNX] Use TracedONNXFunction op signature to promote inputs to tensors (#138770)
Previous to this PR, in torchlib TracedONNXFunction, the inputs could be python constants even if the annotation sets to TensorTypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138770
Approved by: https://github.com/justinchuby
2024-11-14 03:15:07 +00:00
77da0509c4 [executorch hash update] update the pinned executorch hash (#139588)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139588
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-14 02:10:37 +00:00
c6c0554394 [EZ] Delete linux-focal-cuda12_1-py3_10-gcc9-bazel-test (#140659)
Because there is `linux-focal-cuda12_1-py3_10-gcc9-bazel-test`. Not sure what the purpose of testing against 2 CUDA versions is, as only very basic things are tested right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140659
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-11-14 02:00:45 +00:00
80870f62f0 [AOTI][refactor] Switch remaining aoti_torch_get_data_ptr (#140448)
Summary: https://github.com/pytorch/pytorch/pull/139895 added data_ptr(), but there was a remaining place in cpp_wrapper_gpu.py that didn't switch over. Also moved a few AtenTensorHandle-related utility functions from arrayref_tensor.h to utils.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140448
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447
2024-11-14 01:40:59 +00:00
85deef9ede [AOTI][refactor] Rename generate_extern_kernel_alloc_and_find_schema_if_needed (#140447)
Summary: Rename generate_extern_kernel_alloc_and_find_schema_if_needed to better reflect its meaning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140447
Approved by: https://github.com/chenyang78
2024-11-14 01:40:58 +00:00
e2b7f0bfd2 clarifies the wording in the main README to make it clearer that visu… (#140442)
…al studio build tool is only needed for Windows

I created no issue since the suggested change is actually very small. This is my very first PR, so partly I am creating it just to dip my toes in the water. In fact, I would understand if the change does not get accepted, since it's a simple modification to part of the wording in the README. The wording as it currently stands is probably clear enough for most people, but I still missed the fact that the Visual Studio build tool must only be installed for Windows (even though that is stated there), and I thought that by adding some parentheses this might become even clearer, especially since elsewhere in the README the formatting makes it more explicit that some steps must only be run for Windows/Linux/MacOS.

As I said, it's a trivial change so I'd understand if it's not accepted, and I am looking forward to making more meaningful contributions as time goes on.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140442
Approved by: https://github.com/soulitzer
2024-11-14 00:35:55 +00:00
70acf02116 Use Manylinux2_28 for wheel builds (#138732)
Fixes https://github.com/pytorch/pytorch/issues/123649
Use Manylinux 2_28 Docker builds for PyTorch Nightly builds

This moves the wheels to a Docker image that uses : ``quay.io/pypa/manylinux_2_28_x86_64`` as a base rather then ``centos:7`` which is EOL on June 30, 2024.

Information:
https://github.com/pypa/manylinux#manylinux_2_28-almalinux-8-based

manylinux_2_28 (AlmaLinux 8 based)
Toolchain: GCC 13
Built wheels are also expected to be compatible with other distros using glibc 2.28 or later, including:
Debian 10+
Ubuntu 18.10+
Fedora 29+
CentOS/RHEL 8+

This migration should enable us to migrate to latest CUDNN version, and land this PR: https://github.com/pytorch/pytorch/pull/137978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138732
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-14 00:25:47 +00:00
f85e4338d4 [ONNX] Remove the contiguous patch (#140428)
Remove the contiguous patch because it is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140428
Approved by: https://github.com/titaiwangms
2024-11-14 00:03:17 +00:00
9c75475c77 Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)
When investigating the burst of 429 rate limit failures from docker.io yesterday, I found out that ` pytorch-linux-jammy-py3.12-triton-cpu` hasn't been added to docker build workflow at all.  The bad effect is that the image is rebuilt on every job https://github.com/pytorch/pytorch/actions/runs/11808772774/job/32900628381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140571
Approved by: https://github.com/seemethere, https://github.com/wdvr
2024-11-13 23:49:31 +00:00
f1e045eb75 Update torch-xpu-ops commit pin (#140277)
Update the torch-xpu-ops commit to [01f4e29](01f4e293fa), includes:
- Improve XPU operator coverage
- Fix `Werror=comments`-related build issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140277
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-13 23:38:51 +00:00
2f1dbfea02 Logging Refactor - Remove Print Statements (#139782)
Summary:
Removes print statements and implements logging via the logging library.

Hopefully this will allow more control on the level of logging when running models.

Test Plan:
```
AOT_PARTITIONER_DEBUG=1 buck2 run @mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```

Resulting output paste: P1674535630
* Full logs paste: P1674535621

```
pastry P1674535621 | grep "functorch/partitioners.py" | pastry
```

Logging results: P1674549514

Differential Revision: D61678215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139782
Approved by: https://github.com/paryxyt, https://github.com/jansel
2024-11-13 23:09:18 +00:00
b34bb1f562 Add support for parsing torch.Generator in JIT (#140489)
Fixes #140420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140489
Approved by: https://github.com/davidberard98
2024-11-13 23:06:57 +00:00
70060b0927 Add proper parse_tensor_constants support (#140558)
Fixes #140422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140558
Approved by: https://github.com/davidberard98
2024-11-13 23:06:26 +00:00
9d93c27025 Implement unfold_backward on MPS (#135411)
This PR adds a native implementation of unfold_backward as a Metal shader, mostly a copy-n-paste of the algorithms used in the CUDA and CPU implementations. Considering `out = in.unfold(dim, size, step)`, the following holds true:
* `out.shape[dim] == (in.shape[dim] - size) / step + 1`
* `out.shape[-1] == size`
* `out.ndim == in.ndim + 1`
`unfold_backward` Metal kernel  receives `grad_in` and returns `grad_out` such that:
* `grad_in.shape == out.shape`
* `grad_out.shape == in.shape`

For each index in `grad_out`, find the elements contributing to it and sum them up. Such an algorithm requires no synchronization between threads.
That is, `grad_out[...,out_dim_idx,...]` accumulates all values `grad_in[...,in_dim_idx,...,in_last_idx]`, where `in_dim_idx` is in the range [`(out_dim_idx - size) / step`, `out_dim_idx / step`] clamped to (0, `in_dim_size`) and `in_last_idx` equals `out_dim_idx - in_dim_idx * step`. The accumulation step is skipped if `in_last_idx` is outside the [0, size] range.
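A quick sanity check of the shape relations stated above (device-agnostic; the backward call exercises unfold_backward):
```python
import torch

x = torch.randn(2, 10, requires_grad=True)
dim, size, step = 1, 4, 2
out = x.unfold(dim, size, step)

assert out.shape[dim] == (x.shape[dim] - size) // step + 1
assert out.shape[-1] == size
assert out.ndim == x.ndim + 1

out.sum().backward()  # grad_in has out's shape, grad_out (x.grad) has x's shape
```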

This operator has been requested 16 times on https://github.com/pytorch/pytorch/issues/77764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135411
Approved by: https://github.com/manuelcandales

Co-authored-by: Manuel Candales <42380156+manuelcandales@users.noreply.github.com>
2024-11-13 23:04:15 +00:00
08acfcddc4 [ez] Fix check labels error when deleting comment (#140578)
Re make of https://github.com/pytorch/pytorch/pull/140587
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140578
Approved by: https://github.com/huydhn
2024-11-13 23:00:58 +00:00
274f4cfacb [3/x][fx minimizer] Support all_outputs in minimizer (#139774)
Summary: Output nodes may be eliminated into the input nodes if only some of the output nodes are specified. Add an option to check results for all output nodes in the partitioned graph.

Test Plan: see D65367305

Reviewed By: qcyuan

Differential Revision: D65367305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139774
Approved by: https://github.com/jfix71
2024-11-13 22:56:42 +00:00
26fde110db Refactor user-defined triton kernel source code collection (#140577)
Differential Revision: [D65895743](https://our.internmc.facebook.com/intern/diff/D65895743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140577
Approved by: https://github.com/zou3519
2024-11-13 22:12:17 +00:00
a8de84998d OpenReg: Export the number of devices (#140492)
Export the number of devices so that it can be used in ut.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140492
Approved by: https://github.com/ezyang
2024-11-13 22:08:37 +00:00
c1bf714d76 [Profiler] Fix ASAN Overflow Issues (#140441)
Summary:
It seems like this issue is due to leftover cupti events from warmup persisting in the queue during profiling. These events start before our actual time window and therefore have a timestamp lower than our base time. This makes the delta negative, which results in unsigned overflow; the resulting large number later gets a sign added, which creates the signed overflow.

Solution: If a raw timestamp is less than the base timestamp, just mark the process timestamp as -1 so we can mark these events as "to ignore". In Kineto, add a special case to ignore timestamps that are negative.
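A minimal sketch of the clamping logic described above, written in Python purely for illustration (the actual fix lives in the C++ profiler/Kineto code):
```python
def to_relative_timestamp(raw_ts: int, base_ts: int) -> int:
    # Warmup-era events can predate the profiling window; without this guard,
    # raw_ts - base_ts underflows when computed in unsigned arithmetic.
    if raw_ts < base_ts:
        return -1  # sentinel meaning "ignore this event"
    return raw_ts - base_ts
```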

Test Plan: Test with ASAN

Differential Revision: D65835650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140441
Approved by: https://github.com/davidberard98
2024-11-13 21:30:32 +00:00
ba8568f7fb [c10d][logging] Add wait counter for time spent in object to tensor and tensor to object (#140414)
Originally we wanted to leverage the timer logger to measure the time spent in object-to-tensor and tensor-to-object conversion (https://github.com/pytorch/pytorch/pull/139757), but it got reverted (internally) because of a performance regression. We now use a wait counter instead, which is more lightweight.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140414
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu, https://github.com/wz337
2024-11-13 21:10:43 +00:00
49c124fe1b dynamo: guard on FSDP module parameters (#138819)
Fixes https://github.com/pytorch/pytorch/issues/138715

It looks like we were previously ignoring guards on FSDP module parameters. In the issue linked above, this was causing inductor size/stride asserts to fire. The root cause is that for some code like this:
```
m = FSDP(
    torch.nn.Sequential(
        torch.compile(torch.nn.Linear(1024, 1024)),
        torch.compile(torch.nn.Linear(1024, 4096))
    )
)
```

We need to generate two different graphs for the two linear layers, and it looks like without a `TENSOR_MATCH` guard on the linear parameters, dynamo would think that it could re-use the same graph across both layers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138819
Approved by: https://github.com/anijain2305
2024-11-13 20:46:46 +00:00
c8be6f1196 [codemod] Remove unused-variable in pytorch (#140569)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

#buildsonlynotests - Builds are sufficient

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D65833225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140569
Approved by: https://github.com/Skylion007
2024-11-13 20:38:03 +00:00
82597d07aa type annotations for meta_utils (#140203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140203
Approved by: https://github.com/ezyang
2024-11-13 20:07:47 +00:00
c25999bdc0 Revert "Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)"
This reverts commit 51e0996d58e6fa40a8d255a26b767c3f3e035943.

Reverted https://github.com/pytorch/pytorch/pull/140571 on behalf of https://github.com/huydhn due to Not sure why lint fails, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/140571#issuecomment-2474627883))
2024-11-13 19:54:11 +00:00
0f739b8f66 [Codemod] skipIfMps->skipIfMPS (#140562)
As `MPS` is an acronym that stands for Metal Performance Shaders
Also to closer align with `skipCUDAIf` not `skipCudaIf`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562
Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes
2024-11-13 19:45:08 +00:00
f3a6832b09 [inductor] Skip autotuning config on ptxas error (#140495)
Currently, when ptxas errors occur in one of the autotuning configs, we error out. This doesn't match the newly introduced behavior of native Triton ([here](915c149978/python/triton/runtime/autotuner.py (L164))). In this PR, we match Inductor's autotuning behavior to native Triton's by ignoring the ptxas errors and the configs that trigger them.

This unblocks PT2 compilation of an internal model.

Differential Revision: D65861236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140495
Approved by: https://github.com/chenyang78
2024-11-13 19:45:00 +00:00
51e0996d58 Add missing pytorch-linux-jammy-py3.12-triton-cpu Docker image (#140571)
When investigating the burst of 429 rate limit failures from docker.io yesterday, I found out that ` pytorch-linux-jammy-py3.12-triton-cpu` hasn't been added to docker build workflow at all.  The bad effect is that the image is rebuilt on every job https://github.com/pytorch/pytorch/actions/runs/11808772774/job/32900628381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140571
Approved by: https://github.com/seemethere, https://github.com/wdvr
2024-11-13 19:08:14 +00:00
d63eb3c46c Revert "[logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)"
This reverts commit cb15c1515778499ae801dcf67d55c8bdab4724ef.

Reverted https://github.com/pytorch/pytorch/pull/139849 on behalf of https://github.com/kit1980 due to Breaking an internal tests + there is a bug according to the author ([comment](https://github.com/pytorch/pytorch/pull/139849#issuecomment-2474459094))
2024-11-13 18:47:51 +00:00
42622cf7d5 enable concat linear with mkldnn linear by flag (#139048)
Enable concat linear for CPU mkldnn path.
Previously, we had a concat linear pass among the freezing passes, but it did not work on CPU.
This is because the `concat_linear` pattern ran after `mkldnn_weight_prepack`, and `concat_linear` only handles `addmm/mm` etc.

```
addmm -> mkldnn linear
addmm -> mkldnn linear -> cannot concat

# only worked when disable mkldnn
addmm ->
addmm -> concat linear
```
Now we make the `mkldnn linear` related pass numbers larger than the `concat_linear` pass number.

```
addmm -> concat linear -> mkldnn linear
addmm ->

```
So it can work fine with mkldnn linear now.

Also, since concat linear does not always have benefits, we add a flag `config.cpp.enable_concat_linear` with a default value of False. Users can enable it as needed.
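A usage sketch, assuming the flag lives in the standard Inductor config module (the attribute path follows the flag name quoted above):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True                   # the pass runs during freezing
inductor_config.cpp.enable_concat_linear = True   # opt in; defaults to False

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 64)).eval()
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(8, 64))
```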

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139048
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-13 18:43:37 +00:00
c98ef0279e [dynamo] add SymNode bitwise and/or (#138777)
Fixes [T203472723](https://www.internalfb.com/intern/tasks/?t=203472723)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138777
Approved by: https://github.com/ezyang
2024-11-13 18:31:06 +00:00
22dfb5b6cf [dynamo, 3.13] replace deprecated PyWeakref_GetObject (#140187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140187
Approved by: https://github.com/jansel
2024-11-13 17:57:28 +00:00
03cccaa76a Doc: Rewrite the storage.rst file to emphasize untyped storages (#140145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140145
Approved by: https://github.com/janeyx99
2024-11-13 17:40:16 +00:00
1a8752bc7d [TorchScript] bindings for torch._C.ClassType.method_names() (#140444)
I used this for debugging, figured I'd upstream it.

This gives you a list of the method names provided by the given ClassType.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140444
Approved by: https://github.com/eellison
2024-11-13 17:23:23 +00:00
2675ef8758 Revert " [Environment Variable][5/N] Use thread-safe getenv functions (#139762)"
This reverts commit 43f0fe60a36dc7e3bd8f77a2451bde81496679b0.

Reverted https://github.com/pytorch/pytorch/pull/139762 on behalf of https://github.com/malfet due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/139762#issuecomment-2474174813))
2024-11-13 16:50:00 +00:00
3d618019fb Fix RMSNorm Notation: Parentheses, Indices, Comma (#140215)
Fixes #140165

* fixed mathematical notation for RMSNorm:
  * changed the RMS function from brackets `[x]` to parentheses `(x)` for consistency and to align with mathematical notation standards for functions
  * added indices (e.g. `y_i`) for element-wise operations for correctness in the context of tensor operations
  * added comma `,` before $$\text{where}$$

![grafik](https://github.com/user-attachments/assets/47368625-d97a-43de-8b90-17b2c01cbe2f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140215
Approved by: https://github.com/mikaylagawarecki
2024-11-13 15:33:50 +00:00
a58a565819 Revert "[Environment Variable][6/N] Use thread-safe getenv functions (#140200)"
This reverts commit 7d4f5f7508d3166af58fdcca8ff01a5b426af067.

Reverted https://github.com/pytorch/pytorch/pull/140200 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140200#issuecomment-2473956859))
2024-11-13 15:33:23 +00:00
5dc6b8c19e Revert "Allow NJT by default for weights_only torch.load (#140304)"
This reverts commit 1f28235ee2984dbad45b55aa65358b59a7aeea33.

Reverted https://github.com/pytorch/pytorch/pull/140304 on behalf of https://github.com/mikaylagawarecki due to Breaking internal tests due to missing torch.nested._internal ([comment](https://github.com/pytorch/pytorch/pull/140304#issuecomment-2473928461))
2024-11-13 15:24:00 +00:00
b4cc5d38b4 Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit ba136a78ba613d3c7f5d2de53b9fff556e04cfba.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/angelayi due to sorry I realized I need to land from internal ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2473814720))
2024-11-13 14:43:15 +00:00
a8a1e58e24 [inductor] Log how compile_threads is set (#139771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139771
Approved by: https://github.com/eellison
2024-11-13 14:17:10 +00:00
c6a29fc3d8 Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 82eb09aafd7e4ee6e4fb0580f2221ea6253d218b.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2473709760))
2024-11-13 14:06:52 +00:00
4a18e26ff5 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit a3cff4bbd4130d36b188dbe101a790e6d7da644f.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2473709246))
2024-11-13 14:05:01 +00:00
34743d8a16 Support dlpack for privateuse1 (#135331)
Fixes #129652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135331
Approved by: https://github.com/shink, https://github.com/FFFrog, https://github.com/ezyang

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2024-11-13 13:13:14 +00:00
97d995a0d3 Revert "[pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Ouput Tensor addrs (#139837)"
This reverts commit 3e277eb9febbbdd435e6a07a3f0750d4e362625a.

Reverted https://github.com/pytorch/pytorch/pull/139837 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139837#issuecomment-2473466607))
2024-11-13 12:26:43 +00:00
ba136a78ba [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-13 12:17:19 +00:00
e754611d19 [aoti] Add error msg if we can't find a proxy executor (#140308)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140308
Approved by: https://github.com/desertfire
2024-11-13 09:10:54 +00:00
c61ccaf10e [FR] Polish the log message for dtype mismatch and don't exit when too many mismatch (#140451)
Summary:
1. We don't want to exit with exceptions when there are so many mismatches. We should just break and return.
2. Polish the dtype mismatch message. The dtype of input/output is actually a list, not a string, so we don't want to show a list like ['double'] in the output message.

Test Plan:
Testing on the case when we see too many collective dtype mismatch

 {F1958467224}

Differential Revision: D65841830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140451
Approved by: https://github.com/c-p-i-o
2024-11-13 07:24:53 +00:00
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As stated in the issue, replace `clone.detach` with `detach.clone`.
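The difference in a nutshell:
```python
import torch

x = torch.randn(3, requires_grad=True)

# Preferred: detach first, then clone -- the clone is never recorded by autograd.
y = x.detach().clone()

# Same result, but the clone is recorded by autograd before being detached,
# which is slightly more work for no benefit.
z = x.clone().detach()
```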

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
f06ee3e546 [pt2] Add meta for _add_relu (#140009)
aten._add_relu doesn't have a meta function registered, so in the dynamic-shape case it throws an error in the dynamo logs:
Error:
`V1107 11:25:32.344000 140481543555072 torch/_dynamo/symbolic_convert.py:534] [0/1] [__graph_breaks] NotImplementedError: aten::_add_relu.Tensor: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered. You may have run into this message while using an operator with PT2 compilation APIs (torch.compile/torch.export); in order to use this operator with those APIs you'll need to add a fake impl.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140009
Approved by: https://github.com/ezyang
2024-11-13 06:30:58 +00:00
8a80cee2f3 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [3/N] (#140247)
related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140247
Approved by: https://github.com/soulitzer
2024-11-13 05:51:42 +00:00
5b1c67cc60 [Intel GPU] Avoid atomic add for XPU device in satter_add by deterministic mode (#137966)
The "scatter_add" op with the deterministic mode in XPU device is not implemented, it will report that "scatter_add_kernel" does not have a deterministic implementation in UT.

Just like the CUDA implementation, we need to check _deterministic_algorithms in the scatter_add op for the XPU device.
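For illustration, the user-visible behavior the check provides (generic sketch; on a backend whose scatter_add kernel has no deterministic implementation, the call raises instead of silently using atomic adds):
```python
import torch

torch.use_deterministic_algorithms(True)

index = torch.tensor([[0, 1, 2, 0, 0]])
src = torch.ones(1, 5)
out = torch.zeros(3, 5)

# Succeeds deterministically on CPU; on backends without a deterministic
# scatter_add kernel (the XPU case above, and CUDA in some configurations),
# deterministic mode makes this raise a RuntimeError instead.
out.scatter_add_(0, index, src)
```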

The UT is in: https://github.com/intel/torch-xpu-ops/blob/main/test/xpu/test_scatter_gather_ops_xpu.py. We reused [PyTorch UT code]( 96b30dcb25/test/test_scatter_gather_ops.py (L233)).
Now the UT case is [skipped in torch-xpu-ops test](4fa7921f1e/test/xpu/skip_list_common.py (L731)). Will open it when this PR is merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:46:54 +00:00
79fb7416e7 [Intel GPU] Add device guard for XPU structured operator in torchgen (#138802)
This PR is a supplement to https://github.com/pytorch/pytorch/pull/133980. The previous PR fulfills the basic functionality of the XPU device guard, but we found that it fails to handle structured operators.

With the current PR, the code snippet in RegisterXPU.cpp is as follows, where we can see the device guard is successfully generated.

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        auto current_device = guard_.current_device();
        if (C10_UNLIKELY(current_device.has_value())) {
          TORCH_INTERNAL_ASSERT(*current_device == options.device(),
            "structured kernels don't support multi-device outputs");
        } else {
          guard_.reset_device(options.device());
        }
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
    c10::OptionalDeviceGuard guard_;
};

```

However, without the current change, the generated code is:

```c++
struct structured_exp_out_functional final : public at::native::structured_exp_out {
    void set_output_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    void set_output_raw_strided(
        int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
        TensorOptions options, DimnameList names
    ) override {
        outputs_[output_idx] = create_out(sizes, strides, options);
        if (!names.empty()) {
          namedinference::propagate_names(outputs_[output_idx], names);
        }
        // super must happen after, so that downstream can use maybe_get_output
        // to retrieve the output
        at::native::structured_exp_out::set_output_raw_strided(output_idx, sizes, strides, options, names);
    }
    const Tensor& maybe_get_output(int64_t output_idx) override {
      return outputs_[output_idx];
    }
    std::array<Tensor, 1> outputs_;
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138802
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/ezyang
2024-11-13 05:40:38 +00:00
7b0d199471 [doc] fix grammar in "Extending Torch" (#140209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140209
Approved by: https://github.com/soulitzer
2024-11-13 05:34:43 +00:00
1886e33f60 Use device-agnostic runtime API in distributed DDP/FSDP instead of cuda device specific. (#137678)
# Motivation
This PR aims to use the device-agnostic runtime API in distributed DDP/FSDP instead of `cuda`-device-specific calls.

cc [@jgong5](https://github.com/jgong5) [@gujinghui](https://github.com/gujinghui) [@EikanWang](https://github.com/EikanWang) [@fengyuan14](https://github.com/fengyuan14) [@guangyey](https://github.com/guangyey)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137678
Approved by: https://github.com/kwen2501, https://github.com/guangyey, https://github.com/jgong5
2024-11-13 05:32:19 +00:00
4c6eebf4e2 [doc] improve code in fake tensor doc (#140329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140329
Approved by: https://github.com/soulitzer
2024-11-13 05:14:56 +00:00
d6b3ad4de2 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [2/N] (#140238)
related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140238
Approved by: https://github.com/soulitzer
2024-11-13 05:13:39 +00:00
42ad54c71b [Intel GPU] Allow XPU device in LSTMCell operators (#140246)
Refine device check logic for LSTMCell.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140246
Approved by: https://github.com/soulitzer
2024-11-13 05:13:07 +00:00
3e277eb9fe [pytorch/profiler] Profiler NCCL metadata can now contain collective Input and Output Tensor addrs (#139837)
Studying memory access patterns is the primary use case.

Internal: The data may be used to find the % of operators that may cause alignment-related overhead.
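A generic trace-collection sketch for context (this uses the real `torch.profiler` API, but no distributed collective is launched here, and the exact metadata field names carrying the collective tensor addresses are not assumed):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Collect a trace; in a distributed run, NCCL collective ops in the exported
# trace would carry the input/output tensor address metadata described above.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
prof.export_chrome_trace("trace.json")
```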

Differential Revision: [D64413699](https://our.internmc.facebook.com/intern/diff/D64413699/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139837
Approved by: https://github.com/sraikund16
2024-11-13 04:57:16 +00:00
4bbd6da331 Enable XPUEvent elapsed_time function (#134666)
# Motivation
This PR aims to enable the `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on the oneAPI 2025.0 toolchain.
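A minimal timing sketch, assuming `torch.xpu.Event` mirrors the `torch.cuda.Event` API; the `enable_timing` flag and the millisecond semantics of `elapsed_time` are assumptions based on that analogy:

```python
import torch

# Assumes an XPU build (oneAPI 2025.0 toolchain) with at least one device.
start = torch.xpu.Event(enable_timing=True)
end = torch.xpu.Event(enable_timing=True)

a = torch.randn(2048, 2048, device="xpu")
start.record()
b = a @ a
end.record()
torch.xpu.synchronize()
print(f"matmul took {start.elapsed_time(end):.3f} ms")
```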

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134666
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-11-13 04:32:50 +00:00
e9fb2c6abe Add some error messages for flexattention (#138891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138891
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-11-13 04:05:29 +00:00
659d2132be Add architecture to XPU device property (#138186)
# Motivation
Add `architecture` to XPU device property.
In some cases, low-level application code can use special features or do specific optimizations depending on the device architecture, and this PR enables such applications.
Modified from https://github.com/pytorch/pytorch/pull/129675/files
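A hedged usage sketch; the attribute name `architecture` on the device-properties object is taken from this PR's description, and its exact type/encoding is an assumption:

```python
import torch

# Only meaningful on an XPU build with at least one device.
if torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    # Fall back gracefully on builds where the field isn't exposed.
    print(props.name, getattr(props, "architecture", "unknown"))
```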

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138186
Approved by: https://github.com/ezyang
2024-11-13 03:35:13 +00:00
39d1c91c33 [dynamo] Restrict support for out= variants of torch operators (#140202)
There has been a series of attempts to provide support for resizing in
torch operators like `torch.sigmoid(x, out=y)`, i.e., `y` would have a
different shape before and after this expression. Prior to this patch,
we have some checks to graph break if the shape changed.

This patch:
1. extends the existing check and graph break for any shape change, not
   just for `TensorVariable` with source field.
2. removes an old code path which was introduced to address the shape
   change, but became obsolete in that regard because we added extra
   checks to graph break upon shape change. Moreover, this old code path
   is unsound: it tries to replace references to the old
   `TensorVariable` with the new one returned by `wrap_fx_proxy`, but it only
   does the replacement in `symbolic_locals`, which breaks when cells
   are involved. In general the old `TensorVariable` could be _anywhere_,
   think the `replace_all` we had for immutable VTs.
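An illustrative repro sketch of the out= resizing scenario described above; the shapes are arbitrary, and whether this warns or graph-breaks depends on the Dynamo version:

```python
import torch

def fn(x, y):
    # y has the wrong shape, so the out= write would resize it; with this
    # patch Dynamo graph-breaks here instead of mis-tracing the resize.
    return torch.sigmoid(x, out=y)

x = torch.randn(4, 4)
y = torch.empty(2)  # deliberately wrong shape
compiled = torch.compile(fn, backend="eager")
print(compiled(x, y).shape, y.shape)
```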

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140202
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151, #140201
2024-11-13 03:14:23 +00:00
65615915ed [dynamo] Fix bugs in side-effect pruning and codegen (#140201)
This patch fixes 2 things which are exposed if we have `NewCellVariable`
rather than `ClosureVariable` to model python cells:
1. `codegen_save_tempvars` must run first, to establish `source` for
   objects, otherwise they can't reconstruct.
2. `prune_dead_object_new` must account for `OutputGraph.backward_state`
   as well, since it also contains variables that must live.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140201
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150, #140151
2024-11-13 03:14:23 +00:00
3a622c5685 [dynamo] Refine LocalSource.cell_or_freevar to LocalSource.is_input (#140151)
The `cell_or_freevar` was added in #106403 to help us ensure
Dynamo-export only allows graph input that depends on the frame input
(rather than a captured cell, for instance).

However, when taken literally, the `cell_or_freevar` condition is
actually not accurate, because for frame inputs that are also cells
(i.e., captured by some inner function), we actually set the
`cell_or_freevar` flag to false. This makes sense, because otherwise the
existing implementation would prevent Dynamo-export to add any of these
inputs to the graph.

To help with reasoning, this patch refines the `cell_or_freevar` flag to
what we really want to check -- `is_input`, and updates the relevant use
sites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140151
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149, #140150
2024-11-13 03:14:23 +00:00
d34d5ccec5 [dynamo] Fix some corner cases for modeling pre-existing cells (#140150)
In `UserFunctionVariable.bind_args`, there's a rare case when the
underlying function satisfies all conditions below
1. The function captures a pre-existing cell
2. The cell isn't captured by root frame
3. `UserFunctionVariable.source` is `None`

In such cases, Dynamo would model the cell as its content (just like
what we do for cells in the root frame). However, this could break in
two cases:
- We could have multiple instances of `UserFunctionVariable`, where some
  have source and others don't. This means sometimes we'll model the
  cell as a `NewCellVariable`, and sometimes as its content. This
  causes issues because writes to the `NewCellVariable` would be
  buffered in `SideEffects` and never get picked up by the other
  modeling.
- Only when `UserFunctionVariable` has a source, do we check whether we
  already had a `NewCellVariable` for the captured cell. This again causes
  Dynamo to potentially have multiple representations for the same cell
  object, resulting in a similar "buffered writes not reflected" issue
  as above.

This patch fixes the above 2 issues by
1. modeling captured cells of sourceless `UserFunctionVariable` as
   immutable `NewCellVariable`, and adds a few lines in `SideEffects` to
   account for its immutability.
2. always checking whether we already had a `NewCellVariable` for the
   captured cell, before constructing a new one.

Tests are added for each aforementioned case.

I also left a TODO to investigate why exactly we would lose source
information for `UserFunctionVariable`. Some cases are easily fixable,
but others not so much.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140150
Approved by: https://github.com/jansel
ghstack dependencies: #140035, #140036, #140149
2024-11-13 03:14:23 +00:00
6a821c9e6a [dynamo] Remove cell unboxing/restart optimization (#140149)
We added an unboxing optimization to avoid writes to cells that existed
before Dynamo tracing (such writes interfere with HOPs). However, the
avoided write shouldn't be there in the first place, since we were
basically creating an empty `NewCellVariable` and then writing the
pre-existing content into the variable.

This patch
1. adds logic to bypass the initial write for pre-existing cells
   without undermining correctness.
2. removes the unboxing optimization and the restart code path.

Fixes #137456, #138491; also see those issues for more historical
context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140149
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #140035, #140036
2024-11-13 03:14:23 +00:00
698ff07323 [dynamo] Fix name collision bug for captured cells and locals (#140036)
The `export_freevars` method was introduced very early on, for
propagating writes to unboxed cells from child to parent frame, see
https://github.com/pytorch/torchdynamo/commit/d0c10341.

However, it's no longer needed after we started to modify root tracer's
`symbolic_locals` directly for the unboxed cells, see
https://github.com/pytorch/torchdynamo/commit/663e4d92.

As a result, we no longer need `export_freevars`. In fact, it can cause
a very subtle bug when name collision happens across the parent and
child frames during inlining, because the parent frame isn't necessarily
the frame that defined the cell captured by child frame.

In summary, this patch removes the `export_freevars` bits, and adds a
regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140036
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #140035
2024-11-13 03:14:23 +00:00
8dc3cb043c [dynamo] Put cells into closure_cells and document relevant parts (#140035)
This patch establishes the invariant that `ClosureVariable` and
`NewCellVariable` are always in `closure_cells`, never in
`symbolic_locals`, and therefore removes some duplicated code paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140035
Approved by: https://github.com/jansel
2024-11-13 03:14:23 +00:00
d3da6d49df Add cmake to requirements.txt (#140491)
As one cannot build PyTorch in a clean venv if cmake is not installed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140491
Approved by: https://github.com/yangw-dev, https://github.com/huydhn
2024-11-13 02:53:25 +00:00
953286b850 [DTensorTestbase] Fix @with_comms inactive problem (#139637)
Summary:
`with_comms()` is mostly used as a decorator with an optional input argument `eager_init`. The problem with a decorator that takes an input argument is that it always has to be invoked, i.e., you have to write `with_comms()` rather than `with_comms`, which is how the majority of existing usages call it.

This diff tries to provide a solution such that we could use `with_comms`, `with_comms()`, `with_comms(eager_init=False)`, and `with_comms(eager_init=True)`.
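A generic sketch of the "decorator usable with or without parentheses" pattern described above; `with_comms` here is a stand-in for illustration, not the actual DTensorTestbase implementation:

```python
import functools

def with_comms(func=None, *, eager_init=False):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            # set up the process group (eagerly if eager_init), run the test, tear down
            return fn(self, *args, **kwargs)
        return wrapper
    if func is not None:
        # Used bare, as @with_comms: func is the decorated test function.
        return decorator(func)
    # Used with arguments, as @with_comms() or @with_comms(eager_init=True).
    return decorator
```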

Test Plan: Contbuild & OSS CI

Differential Revision: D65385700

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139637
Approved by: https://github.com/wz337
2024-11-13 02:45:02 +00:00
cyy
40fb738197 Use Wextra-semi (#140236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140236
Approved by: https://github.com/ezyang
2024-11-13 02:15:16 +00:00
fb7148d05d Fix split decomp returning self (#140065)
Previously the split decomp would return the input when there were no splits. This errors in torch.compile (or FakeTensorMode) with:

> RuntimeError: View operation returned a tensor that is the same as the input base tensor.  This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.

Fix for https://github.com/pytorch/pytorch/issues/133394
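A hedged repro sketch of the "no actual split" case: a single chunk covering the whole tensor, which is where the decomposition used to return the input itself:

```python
import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    # chunk size == full length along dim 0, so there is nothing to split
    return torch.split(x, x.shape[0], dim=0)

chunks = f(torch.randn(8, 4))
print(len(chunks), chunks[0].shape)  # 1, torch.Size([8, 4])
```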

Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
2024-11-13 01:58:02 +00:00
4906413b70 [Intel GPU] Support RegisterSparseXPU.cpp codegen. (#139267)
This PR is to support code generation for sparse operations on Intel GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139267
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-13 01:41:43 +00:00
891ba2ec8a Fix xpu cmake typo (#140374)
# Motivation
This PR aims to fix a typo in the CMake build. The typo impacts the XPU Windows build and results in PyTorch being built without XPU, which is unexpected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140374
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman
2024-11-13 00:26:35 +00:00
3d2dd14217 [BE][Bugfix]: Add rad2deg to pointwise ops (#140290)
Adds missing pointwise tags. Apparently this allows NestedTensor to properly generate a function for OpInfo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140290
Approved by: https://github.com/jbschlosser
2024-11-13 00:02:00 +00:00
3e82b1f6c0 Build magma tarball for cuda 126 (#140143)
Now that manylinux 2.28 is available with CUDA 12.6 (https://github.com/pytorch/pytorch/pull/139909),

we can build the magma tarball for CUDA 12.6.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140143
Approved by: https://github.com/atalman
2024-11-12 23:42:26 +00:00
d48ea29b9a Revert "[aoti] Remove dir after packaging (#140022)"
This reverts commit 8c6abe5a8c42be3909496d2cd3d1f194a8493460.

Reverted https://github.com/pytorch/pytorch/pull/140022 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit ([comment](https://github.com/pytorch/pytorch/pull/140022#issuecomment-2471847439))
2024-11-12 23:35:27 +00:00
1f28235ee2 Allow NJT by default for weights_only torch.load (#140304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140304
Approved by: https://github.com/jbschlosser
2024-11-12 23:34:27 +00:00
096929c1e8 Add safe.directory to Almalinux docker image (#140454)
Something that was accidentally dropped by: https://github.com/pytorch/pytorch/pull/140157
Needs to be re-added. I believe it's part of our Docker images. Please see: https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L21

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140454
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-11-12 23:28:12 +00:00
70a223cce6 [aotinductor] fix a few issues in bandwidth profiler (#139607)
Summary:
The recent runs of the bandwidth profiler did not behave as expected. I observed a few issues and tried to fix them in this diff:
1. The return value of the DebugAutotuner class
2. Profiling results show really large overhead.
DebugAutotuner.run() returns a benchmark time of around 45 ms, while CachingAutotuner.run() returns around 0.45 ms.
The `_find_names` and `re.match` calls take 45 ms: P1669186358
After commenting out the above `_find_names` and `re.match`, the benchmark time becomes consistent with non-profiling mode: P1669185589
3. Introduce a variable `bandwidth_info` to control the path in DebugAutotuner.run(). During benchmarking for configuration selection, `bandwidth_info` should be turned off.

After applying this diff, the profiling issues mentioned above are fixed: P1669273172

Test Plan:
```
TORCHINDUCTOR_FORCE_DISABLE_CACHES=1   TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/tmp/profile.txt TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_MAX_AUTOTUNE=1 CUDA_VISIBLE_DEVICES=5  buck run mode/{opt,inplace} scripts/wwei6/triton_examples:test_mat 2>&1 | tee profiling-5.log
```
If we want to disable the Aten backend, just add TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON"

Differential Revision: D64883079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139607
Approved by: https://github.com/chenyang78
2024-11-12 23:26:47 +00:00
267641f6f1 [Profiler] Add More Logging for Dynamic Collection API (#140285)
Summary: Add a log warning users that disabling only CUDA events can cause incorrect correlation IDs

Test Plan: Log was printed in the correct scenario

Differential Revision: D65762576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140285
Approved by: https://github.com/sanrise
2024-11-12 22:59:04 +00:00
7578a0b268 [pipelining] clean up stage functions (#140418)
Clean up methods related to stage input/output shape verification which are no longer needed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140418
Approved by: https://github.com/wconstab
ghstack dependencies: #140019
2024-11-12 21:42:08 +00:00
2ac71a5771 [pipelining] add type checking to _backward functions (#140019)
fix https://github.com/pytorch/pytorch/issues/139405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140019
Approved by: https://github.com/wconstab
2024-11-12 21:42:08 +00:00
1f590feaf7 [AOTI][refactor] Update codegen_int_array_var API (#140299)
Summary: codegen_int_array_var and codegen_reinterpret_view need to call different writeline functions depending on which part of the code they're writing. Previously their APIs took a writer and implicitly assigned a default writer if needed, which was not intuitive. Update their APIs to explicitly take a writeline function.

Differential Revision: [D65774584](https://our.internmc.facebook.com/intern/diff/D65774584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140299
Approved by: https://github.com/frank-wei, https://github.com/chenyang78
2024-11-12 21:39:41 +00:00
8c6abe5a8c [aoti] Remove dir after packaging (#140022)
Update AOTI to return a list of files that it generates when `aot_inductor.package=True`. Then we will only package the files that are in that list.

This should fix the [caching issue](https://fb.workplace.com/groups/1028545332188949/permalink/1081702043539944/) and hopefully https://github.com/pytorch/pytorch/issues/140053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140022
Approved by: https://github.com/larryliu0820, https://github.com/desertfire, https://github.com/malfet
2024-11-12 21:36:24 +00:00
0db21a6b23 Remove most rockset references (#139922)
Remove most references to rockset:
* replace comments and docs with a generic "backend database"
* Delete `upload_to_rockset`, so we no longer need to install the package.
* Do not upload perf stats to rockset as well (we should be completely on DynamoDB now right @huydhn?)

According to VSCode, it went from 41 -> 7 instances of "rockset" in the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139922
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-11-12 21:17:43 +00:00
4675875d16 Fix lint after #138899 (#140446)
Fixes Lint after: https://github.com/pytorch/pytorch/pull/138899
Due to landrace.
Run ``./regenerate.sh``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140446
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2024-11-12 20:53:58 +00:00
1172a10574 [Build] Do not regenerate code endlessly without XPU (#140438)
Before this change, if one builds PyTorch without XPU, the build process will
perpetually regenerate code because of a reference to a non-existing
file, which makes the autograd codegened files always out of date; see part of the `ninja -d explain torch_cpu` output:
```
ninja explain: output ../torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp doesn't exist
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/csrc/autograd/generated/Functions.cpp is dirty
```

This is a regression introduced by https://github.com/pytorch/pytorch/pull/139025.

After this change, incremental rebuilds with no changes cause no build actions:
```
% ninja -j1 -v -d explain -n torch_cpu
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja: no work to do.
```

Test plan: Wait for at least one XPU build to finish...

Fixes https://github.com/pytorch/pytorch/issues/140432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140438
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-11-12 20:19:28 +00:00
14bb49fe98 Add CUDA 12.6 Linux Builds to Binaries Matrix (#138899)
Related to #138440

Issue tracker: https://github.com/pytorch/pytorch/issues/138609

Version based on https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138899
Approved by: https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-12 19:52:31 +00:00
034b105d53 [BE][Ez]: Add NT unary op macro (#140213)
* Adds a macro to simplify adding more unary ops to NT.
* Adds sqrt support to NT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140213
Approved by: https://github.com/jbschlosser
2024-11-12 19:50:06 +00:00
069a71023b Revert "[inductor] Refactor reduction type choices into V.choices (#139585)"
This reverts commit 6438c8637a7e28b676a1ccfe942dc37375d0cb14.

Reverted https://github.com/pytorch/pytorch/pull/139585 on behalf of https://github.com/kit1980 due to breaking internal builds, see D65800124 ([comment](https://github.com/pytorch/pytorch/pull/139585#issuecomment-2471392822))
2024-11-12 19:32:14 +00:00
c0ddd10f6d Revert "[inductor] Support fixed triton configs defined at compile time (#140217)"
This reverts commit 29114e44fa7a17a3a2112d76937ae3b4cf9d33ce.

Reverted https://github.com/pytorch/pytorch/pull/140217 on behalf of https://github.com/kit1980 due to breaking internal builds, see D65800124 ([comment](https://github.com/pytorch/pytorch/pull/139585#issuecomment-2471392822))
2024-11-12 19:32:14 +00:00
8304a1faad OpenReg: Fix issue when casting tensor on the executor side (#140255)
Previously we assumed that the number of tensor elements multiplied by the type size is not greater than the allocated memory size. However, in some scenarios such as `tensor.expand`, the stride can be zero, which invalidates that assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140255
Approved by: https://github.com/ezyang
2024-11-12 19:29:21 +00:00
cc8e832066 [AMD] use DC method for linalg.eigh (#140327)
Summary: Jacobi method has larger numerical errors, see D64997718, use divide-and-conquer method instead.

Test Plan: CI

Differential Revision: D65786796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140327
Approved by: https://github.com/jianyuh
2024-11-12 19:17:25 +00:00
726424f4de Use base32 triton cache function if base64 is not found (#140297)
In #140190 the base64 function is imported from triton.

But since triton-lang/triton#5088,
the base64 function was replaced with a base32 one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140297
Approved by: https://github.com/davidberard98
2024-11-12 19:05:21 +00:00
c182c7ccfc Fix triangular_solve meta function out parameter names. (#140186)
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) with those written in the
_native_functions.yaml_ file.

This name mismatch caused the operation to fail when using the meta device (see error
below):

```python
Traceback (most recent call last):
  File "examples/test.py", line 23, in <module>
    torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
  File "torch/_decomp/__init__.py", line 100, in _fn
    return f(*args, **kwargs, out=None if is_none else out_kwargs)
  File "torch/_prims_common/wrappers.py", line 289, in _fn
    result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
2024-11-12 19:04:34 +00:00
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
cyy
7624d625c0 [Reland][7/N] Fix Wextra-semi warning (#140342)
Reland of #140225 to fix a change in FBCODE_CAFFE2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140342
Approved by: https://github.com/kit1980
2024-11-12 18:55:31 +00:00
e4195f8060 Revert "[logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)"
This reverts commit 41e4d88584c4ed0708cd1d93c71cd4ee2e1bbbb5.

Reverted https://github.com/pytorch/pytorch/pull/139757 on behalf of https://github.com/izaitsevfb due to reverted internally, see D65682470 ([comment](https://github.com/pytorch/pytorch/pull/139757#issuecomment-2471316405))
2024-11-12 18:53:37 +00:00
cyy
a3cff4bbd4 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-11-12 18:49:51 +00:00
928b8ec633 [BE]: Add pointwise tag to isfinite (#140291)
Adds pointwise tag to isfinite
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140291
Approved by: https://github.com/jbschlosser
2024-11-12 18:02:07 +00:00
5aadaaf2b5 [Dynamo] Allow filter() to handle infinite iterator (#138305)
Fixes #137380

```python
import itertools
import torch

def filt(x):
    return x < 10

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    x = x + 1
    return zip(range(3), filter(filt, itertools.count()))

print(list(f(torch.ones(3)))) # [(0, 0), (1, 1), (2, 2)]

@torch.compile(backend="eager")
def g(x):
    x = x + 1
    return filter(filt, [1, 2, 3])

res = g(torch.ones(3))
assert isinstance(res, filter)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138305
Approved by: https://github.com/williamwen42
2024-11-12 17:32:56 +00:00
7a02457053 [BE] Fix error message in torch._scaled_mm (#140343)
Followup after https://github.com/pytorch/pytorch/pull/140307 that fixes error message for mat1, but not for mat2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140343
Approved by: https://github.com/kit1980
2024-11-12 17:13:41 +00:00
60db702a42 Noop m.set_python_module on C10_MOBILE builds (#140273)
Summary:
This was causing issues. Since Python isn't available on C10_MOBILE anyways,
it's OK to noop the call to m.set_python_module. We no-op it by just never
calling registerPythonModule.

This is a fix only for C10_MOBILE, there's likely a corresponding issue for
regular PyTorch that we need to work through
(https://github.com/pytorch/pytorch/issues/140272)

Test Plan: - tests

Differential Revision: D65758016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140273
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-12 16:35:01 +00:00
d723abf686 [CI]Move CPU inductor test runners and cases to save cost (#136313)
For CPU, only SPR has native support for AMP BF16.

Ref: pytorch/pytorch#138476
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136313
Approved by: https://github.com/jgong5, https://github.com/zxiiro, https://github.com/chuanqi129, https://github.com/desertfire
2024-11-12 16:15:20 +00:00
faef1510f8 Add batch rule for native_dropout_backward (#140140)
Fixes: #122432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140140
Approved by: https://github.com/zou3519
2024-11-12 16:14:49 +00:00
213b8ef163 [BE] add empty tensor testing for _foreach_addcmul/div (#140276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140276
Approved by: https://github.com/jbschlosser
ghstack dependencies: #140191
2024-11-12 15:35:06 +00:00
92fb1f79b8 [BE] Test interspersed empty tensors for _foreach_norm test parity (#140191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191
Approved by: https://github.com/jbschlosser
2024-11-12 15:35:06 +00:00
71d8bb7ede implement torch._foreach_rsqrt (#134574)
Related:
- #133367
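An assumed usage example, mirroring the other `_foreach_*` ops (elementwise `1/sqrt` applied across a list of tensors):

```python
import torch

xs = [torch.rand(3) + 0.1, torch.rand(5, 2) + 0.1]  # keep values positive
ys = torch._foreach_rsqrt(xs)
# Should match the per-tensor eager op:
assert all(torch.allclose(y, x.rsqrt()) for x, y in zip(xs, ys))
```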

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-11-12 15:34:35 +00:00
8cb0b932a1 Fix broken AOTInductor node and kernel counts (#139435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139435
Approved by: https://github.com/desertfire
ghstack dependencies: #139411, #139412
2024-11-12 15:22:46 +00:00
fef16fe254 Enable all fixed cpp_wrapper tests (#139412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139412
Approved by: https://github.com/desertfire
ghstack dependencies: #139411
2024-11-12 15:22:46 +00:00
761b42bc08 cpp_wrapper_cpu: Ensure reinterpret_view results in RAIIAtenTensorHandle (#139411)
Fixes segfaults caused by views being implicitly converted to AtenTensorHandle, then being destroyed before use.

Closes #135559.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139411
Approved by: https://github.com/desertfire
2024-11-12 15:22:38 +00:00
057f0dca78 Don't use sudo to checkout sources (#140263)
Move this part out of https://github.com/pytorch/pytorch/pull/125401 and try using it for all architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140263
Approved by: https://github.com/zxiiro, https://github.com/huydhn
2024-11-12 14:29:17 +00:00
78a8f7f5c3 [FSDP2] Fix CUDA sync for bf16 HSDP AR, fp32 params (#140044)
Differential Revision: [D65621037](https://our.internmc.facebook.com/intern/diff/D65621037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140044
Approved by: https://github.com/weifengpy
2024-11-12 13:31:40 +00:00
51e8a13d00 CD Enable Python 3.13 on windows (#138095)
Adding CD windows. Part of: https://github.com/pytorch/pytorch/issues/130249
Builder PR landed with smoke test: https://github.com/pytorch/builder/pull/2035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138095
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-12 12:28:10 +00:00
ff91fcc991 Refactor device index bound check for xpu code (#120768)
# Motivation
Referring to [Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex.](https://github.com/pytorch/pytorch/pull/119639), we use `c10::Device::MAX_NUM_DEVICES` to make sure the number of XPU devices is valid in PyTorch.

# Solution
Use `TORCH_CHECK` to check if the number of XPU devices exceeds `c10::Device::MAX_NUM_DEVICES` when enum XPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120768
Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/tringwald
2024-11-12 12:09:11 +00:00
f77eb07662 Split int4wo weight packing (#139611)
Fixes https://github.com/pytorch/ao/issues/1117.

This PR separates int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on https://github.com/pytorch/ao/issues/1117#issuecomment-2451252756.

Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32, output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139611
Approved by: https://github.com/jerryzh168
2024-11-12 10:12:50 +00:00
7691064768 dispatcher module for multiple graphs (#139439)
Differential Revision: [D65307961](https://our.internmc.facebook.com/intern/diff/D65307961/)

This PR introduces the concept of a "dispatcher" module `n` that carries multiple interpreter modules `n`, `n@1`, `n@2`, etc., each corresponding to a particular call of `n` and thus might carry a different specialized graph. We only do this when we're preserving module call signatures for `n`. The carried modules have the same number and order of calls to `n` appearing in the original module / exported program. In the unflattened module, all those calls go to the "dispatcher" module which internally tracks how many calls have been made so far and invokes the corresponding interpreter module. We reset this tracking after a successful or unsuccessful run of the unflattened module.

Overall this makes swapping easier when module call signatures are preserved.
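
A conceptual sketch of the call-dispatching idea described above; the class name and structure are illustrative only, not the actual unflattener code:

```python
import torch

class CallDispatcher(torch.nn.Module):
    # Routes the i-th call to the i-th specialized interpreter module
    # (e.g. the graphs for n, n@1, n@2, ...).
    def __init__(self, variants):
        super().__init__()
        self.variants = torch.nn.ModuleList(variants)
        self._call_idx = 0

    def forward(self, *args, **kwargs):
        mod = self.variants[self._call_idx % len(self.variants)]
        self._call_idx += 1
        return mod(*args, **kwargs)

    def reset(self):
        # Mirrors the "reset tracking after a run" behavior described above.
        self._call_idx = 0
```
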
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139439
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #139438
2024-11-12 09:53:40 +00:00
9a5175e836 fix shared submodule module call signature (#139438)
Differential Revision: [D65308061](https://our.internmc.facebook.com/intern/diff/D65308061/)

When a shared submodule is called multiple times with different aliases, e.g., `self.a` and `self.b` are both `C()` under the hood and we have calls to both `self.a(...)` and `self.b(...)`, we wrap `C()` to emit as many export tracepoints as there are aliases. This caused us to compute module call signatures that conflated information: we'd add inputs and outputs of one call to inputs and outputs of a different call. Overall preserving module call signatures in the presence of shared submodules was borked because of this bug.

The fix is to pay attention to the nn module stack, which accurately tracks individual calls, thus allowing us to ignore some export tracepoints that get the module correct but not the alias through which the call was made.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139438
Approved by: https://github.com/zhxchen17
2024-11-12 09:53:40 +00:00
a104b560d8 fix trace nn.parameters() (#138149)
Fixes #137764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138149
Approved by: https://github.com/anijain2305
2024-11-12 09:43:45 +00:00
330c9577a3 [Inductor] make decompose_mm_pass support cpu case (#139696)
Summary: Previously, decompose_mm_pass only worked for the GPU case. This diff makes it support some CPU cases as well for performance optimization.

Differential Revision: D65226131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139696
Approved by: https://github.com/eellison
2024-11-12 06:22:23 +00:00
965555d1fd [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

This patch also exposed and fixes a bug in
`NNModuleVariable.var_getattr`, where Dynamo wasn't propagating source
correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-12 05:54:35 +00:00
09bab7566a Revert "Allow NJT by default for weights_only torch.load (#140304)"
This reverts commit 455dc4c14264a0cd7d70ba5328382a9fb7769094.

Reverted https://github.com/pytorch/pytorch/pull/140304 on behalf of https://github.com/huydhn due to A bunch of failure shows up in trunk after this lands, so probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/140304#issuecomment-2469602096))
2024-11-12 04:53:10 +00:00
469eae2ba2 [inductor][invoke_subgraph] Fix SDPA seed/offset issue (#140070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140070
Approved by: https://github.com/eellison
2024-11-12 04:40:03 +00:00
23db92bad2 [FR] refactor build collective and return more info to db (#140082) (#140303)
Summary:

This change is trying to return the result of analysis with more details. Internally the contract is listed in https://docs.google.com/document/d/19ON5jKlYirT76D4Q-OoGMgD-U2L_sCDnUd_RE1gfiLE/edit?tab=t.0. For OSS, this change is BC to the current behavior.

Also create a new state object which handle logging and convert to object to Collective and NCCLCall.

Test Plan: CI and more thorough testing is on the way.

Reviewed By: VieEeEw

Differential Revision: D65612448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140303
Approved by: https://github.com/c-p-i-o
2024-11-12 03:43:02 +00:00
455dc4c142 Allow NJT by default for weights_only torch.load (#140304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140304
Approved by: https://github.com/jbschlosser
2024-11-12 02:04:18 +00:00
19eff28ff3 [Intel GPU] Extract common utils for conv&qconv (#139580)
# Motivation
This PR is a precursor to #133080. It extracts the common logic of convolution and quantized convolution into `Utils.cpp`. With this modification, the two operators can share code such as input-format querying and op-layout querying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139580
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/malfet
ghstack dependencies: #139721
2024-11-12 02:00:33 +00:00
e21ee6327d [Intel GPU] format XPU oneDNN integration codes (#139721)
# Motivation
This PR adds the XPU oneDNN integration code to the lintrunner config `.lintrunner.toml`, which formats the C++ sources and headers under `aten/src/ATen/native/mkldnn/xpu/` and `aten/src/ATen/native/mkldnn/xpu/detail/`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139721
Approved by: https://github.com/guangyey, https://github.com/cyyever, https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/malfet
2024-11-12 01:52:06 +00:00
4e487eda7a Add linters for C10_UNUSED and C10_NODISCARD (#140302)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140302
Approved by: https://github.com/Skylion007
2024-11-12 01:50:11 +00:00
263a5bf95e [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For CPU inductor path, remove -ftree-loop-vectorize from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-12 01:26:18 +00:00
29114e44fa [inductor] Support fixed triton configs defined at compile time (#140217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140217
Approved by: https://github.com/shunting314
ghstack dependencies: #139585
2024-11-12 00:56:02 +00:00
6438c8637a [inductor] Refactor reduction type choices into V.choices (#139585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139585
Approved by: https://github.com/shunting314
2024-11-12 00:56:02 +00:00
e76f57d54e add missing bracket in error message (#140307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140307
Approved by: https://github.com/kit1980
2024-11-12 00:45:14 +00:00
dbb55b448b Revert "[7/N] Fix Wextra-semi warning (#140225)"
This reverts commit ffb979032dc149b4c895526fe5b92d713ed7b1e1.

Reverted https://github.com/pytorch/pytorch/pull/140225 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/140225#issuecomment-2469312229))
2024-11-12 00:02:06 +00:00
0af38b1034 Remove temp table to post autograd IR (#140085)
This table is not needed

Differential Revision: [D64553397](https://our.internmc.facebook.com/intern/diff/D64553397/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140085
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-11-11 23:59:09 +00:00
c223e0642c Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-11 23:55:27 +00:00
a96aadf0a0 fix specialization logic in Scalar.h (#140280)
Fixes `test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_linalg_norm_subgradients_at_zero_cuda_float64` when `specialize_float=False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140280
Approved by: https://github.com/ezyang
2024-11-11 23:51:15 +00:00
222175b3d5 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 2ede4c9a3858d6b97e2ba5156add0134b6765474.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/kit1980 due to breaking internal ExecuTorch tests ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2469294995))
2024-11-11 23:42:51 +00:00
412df50454 Revert "[dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)"
This reverts commit de40a23f6c02fd8d2b5046b5cab04582dc4ebc4e.

Reverted https://github.com/pytorch/pytorch/pull/140034 on behalf of https://github.com/kit1980 due to breaking internal tests, see D65755044 ([comment](https://github.com/pytorch/pytorch/pull/140034#issuecomment-2469290205))
2024-11-11 23:38:00 +00:00
2817fe8bef Add unaligned attributes to q8gemm/4x4c2-sse2.c (#140188)
Summary:
UBSan hits undefined behavior in this file. This fixes it by marking these pointers as unaligned.

```
caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5: runtime error: store to misaligned address 0x62900313891f for type 'uint32_t' (aka 'unsigned int'), which requires 4 byte alignment
0x62900313891f: note: pointer points here
 be be be be be  be be be be be be be be  be be be be be be be be  be be be be be be be be  be be be
             ^
UndefinedBehaviorSanitizer: undefined-behavior buck-caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/__ukernels_sse2__/buck-private-headers/q8gemm/4x4c2-sse2.c:325:5 in
```

The fix is to mark these variables as unaligned following D42179009's example

q8gemm.cc + internal integration test

Differential Revision: D65637959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140188
Approved by: https://github.com/digantdesai
2024-11-11 23:28:07 +00:00
5eb1ccadc2 [dynamo][user-defined] Walk __mro__ to get the member descriptor source (#140300)
Fixes https://github.com/pytorch/pytorch/issues/140266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140300
Approved by: https://github.com/williamwen42
2024-11-11 23:16:48 +00:00
a290c1d748 Fix building with system GLOO (#140275)
Leverage the existing FindGloo CMake module to locate the system's library and headers. Add the system's gloo headers to the include path rather than the gloo from third_party when USE_SYSTEM_GLOO is specified.

Fixes #140274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
2024-11-11 22:58:39 +00:00
b742d11b1c [TD] Filepath heuristic also looks at file name (#140170)
Filepath heuristic also now takes into account the file name, not just directories

A bit of refactoring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140170
Approved by: https://github.com/huydhn
2024-11-11 22:55:54 +00:00
5f7ea7ca6a [invoke_subgraph] Support symint/int as inputs (#140058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140058
Approved by: https://github.com/ydwu4, https://github.com/eellison
ghstack dependencies: #139162
2024-11-11 22:26:43 +00:00
d4cdc09881 ILP for auto FSDP wrapping (#140298)
This PR presents a mixed integer linear programming (MILP) formulation that can be utilized to determine, under a memory budget, which modules to wrap as FSDP units. Similar to the auto SAC MILP introduced in https://github.com/pytorch/pytorch/pull/137908, the MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.fsdp_ilp import fsdp_milp, CommType, CommParams
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> (
    Tuple[torch.nn.Module, torch.optim.Optimizer, torch.Tensor]
):
    bsz = 2
    model_args = ModelArgs(
        n_layers=6,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=6144,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT FSDP ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        fsdp_decisions, exposed_comm_time, peak_mem = fsdp_milp(
            g,
            world_size=8,
            memory_budget=15,
            comm_params={
                CommType.ALL_GATHER: CommParams(latency=0.01, bandwidth=2 * 1e8),
                CommType.REDUCE_SCATTER: CommParams(latency=0.01, bandwidth=2 * 1e8),
            },
        )
        print("=== WITH FSDP on 8 ranks ===")
        print(f"fsdp units: {sorted(fsdp_decisions)}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"exposed communication time: {round(exposed_comm_time, 2)} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT FSDP ===
peak_mem: 20.92 GiB
compute_time: 1375.49 ms
=== WITH FSDP on 8 ranks ===
fsdp units: ['Transformer', 'Transformer.layers.0.attention.wk', 'Transformer.layers.0.attention.wo', 'Transformer.layers.0.attention.wq', 'Transformer.layers.0.attention.wv', 'Transformer.layers.0.feed_forward.w1', 'Transformer.layers.0.feed_forward.w2', 'Transformer.layers.1', 'Transformer.layers.2', 'Transformer.layers.3', 'Transformer.layers.4', 'Transformer.layers.5', 'Transformer.output', 'Transformer.pos_embeddings']
peak_mem: 13.63 GiB
exposed communication time: 1.02 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140298
Approved by: https://github.com/weifengpy
2024-11-11 22:02:39 +00:00
2c77352fe2 [AOTI][refactor] Clean up call chain in wrapper codegen (#136531)
Summary: For cpp wrapper, generate_kernel_call and define_kernel need to handle both cpu and gpu kernels. Refactor the code to remove nested super() calls.

Differential Revision: [D65639095](https://our.internmc.facebook.com/intern/diff/D65639095)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136531
Approved by: https://github.com/frank-wei
2024-11-11 22:00:42 +00:00
115c58c52a Update ET pin for #6744 (#140199)
This will be updated to an ET trunk commit after https://github.com/pytorch/executorch/pull/6744 lands. I also moved ET back from unstable and installed the llama3 dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140199
Approved by: https://github.com/kit1980
2024-11-11 21:40:12 +00:00
780b28f67e [ONNX] Update docstring typo in building (#140281)
The oprecorder docstring mistakenly referred to torchscript when it should say ONNX IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140281
Approved by: https://github.com/titaiwangms
2024-11-11 21:01:27 +00:00
001f7366a7 [ROCm] Correct numerical issues in layer norm backwards kernel (#140259)
It was raised that the backwards layer norm on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel `cuLoadWriteStridedInputs` which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (https://github.com/pytorch/pytorch/pull/87635) we truncated `mean` and `rstd` from T_ACC type to T which causes numerical issues in the warp buffers created in this kernel. This PR will use the correct accumulator type for mean and rstd.

Note: Only AMD calls into this call stack for backwards layer norm, so this was not an issue for NV.
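A toy illustration (not the kernel code) of the class of precision issue described above: a statistic kept in a truncated type drifts away from the fp32-accumulated value.

```python
import torch

x = torch.randn(4096, dtype=torch.float32) * 1e3
mean_acc = x.mean()                   # statistic kept in accumulator precision
mean_trunc = x.half().float().mean()  # statistic routed through a truncated type
print((mean_acc - mean_trunc).abs().item())  # nonzero: the truncation is visible
```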

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140259
Approved by: https://github.com/jianyuh
2024-11-11 20:44:18 +00:00
10e40dd5ca [aoti][tooling] Add support to debug printing for all AOTI model run input args (#140064)
Summary:
Add debug printing around: `void AOTInductorModel::run_impl()`

Example:
```
void AOTInductorModel::run_impl(
    AtenTensorHandle*
        input_handles, // array of input AtenTensorHandle; handles
                        // are stolen; the array itself is borrowed
    AtenTensorHandle*
        output_handles, // array for writing output AtenTensorHandle; handles
                        // will be stolen by the caller; the array itself is
                        // borrowed
    DeviceStreamType stream,
    AOTIProxyExecutorHandle proxy_executor
) {

    auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 3);
    auto arg0_1 = std::move(inputs[0]);
    auto arg1_1 = std::move(inputs[1]);
    auto arg2_1 = std::move(inputs[2]);
    aoti_torch_print_tensor_handle(arg0_1, "aoti_model_inputs - arg0_1");
    aoti_torch_print_tensor_handle(arg1_1, "aoti_model_inputs - arg1_1");
    aoti_torch_print_tensor_handle(arg2_1, "aoti_model_inputs - arg2_1");
```

Differential Revision: D65616590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140064
Approved by: https://github.com/chenyang78
2024-11-11 20:10:35 +00:00
7f1e248b50 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [1/N] (#139706)
``torch._dynamo.optimize()`` is wrapped for convenience by ``torch.compile()``.

related commits:

- #139706
- #140238
- #140247
- #140253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139706
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-11-11 20:04:08 +00:00
e7ec294c10 NJT OpInfo tests v2 (#138370)
This PR updates OpInfo-based tests for NJTs:
* Adds extensive coverage across non-contiguous NJTs (both non-contiguous transposed and non-contiguous with holes)
    * The `_sample_njts()` helper that `sample_input_func`s utilize now produces non-contig NJTs as well
* Utilizes a `SampleInput`-based xfail system for granular classification of bugs. For example, it's possible to indicate that a class of ops is expected to fail only on non-contig with holes NJT inputs.
    * I decided on adding `SampleInput`s and utilizing this system over using test parametrization for two reasons:
        * Test perf - adding `SampleInput`s is faster than generating entire new tests
        * Avoiding the possibility of `sample_input_func`s not respecting the non-contig test parameter - this would result in silently incorrect passing of these tests. Keeping the responsibility for `SampleInput` generation firmly within each `OpInfo`'s `sample_input_func` means weirdness like this isn't possible
* Improves `SampleInput` naming for a bunch of `sample_input_func`s. This makes it easier to xfail them as needed. For example, binary / unary / other ops now use the new `_describe_njt()` helper to get a string repr that uniquely defines the type of NJT being passed to the op
* Adds appropriate `XFailRule`s to get tests passing for forward / backward / forward compile / backward compile. In general, each xfail corresponds to some bug that needs to be fixed

```python
# Represents a rule indicating how to xfail a particular test. It allows granularity
# at the device, dtype, op, and individual sample levels. This flexibility allows entire
# bugs to be represented by a single rule, even if this corresponds with multiple conceptual
# test cases across multiple ops.
@dataclass
class XFailRule:
    # expected error type
    error_type: TypeVar = Exception
    # expected error message
    error_msg: str = ".*"
    # function to indicate whether the rule applies; return True if so
    match_fn: Callable[[torch.device, torch.dtype, OpInfo, SampleInput], bool] = None
    # optional name for identifying the rule
    name: str = ""

    def match(self, device, dtype, op, sample) -> bool:
        return self.match_fn(device, dtype, op, sample)
```

Example:
```python
    # Bug when broadcasting a binary op with non-contiguous with holes NJT + dense
    # tensor with 1 in ragged dim.
    XFailRule(
        error_type=RuntimeError,
        error_msg="cannot call binary pointwise function .* with inputs of shapes",
        match_fn=lambda device, dtype, op, sample: (
            isinstance(op, BinaryUfuncInfo)
            and "noncontig_holes" in sample.name
            and "broadcasting 1 over ragged" in sample.name
        ),
        name="binary_noncontig_holes_broadcasting_1_over_ragged",
    ),
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138370
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
ghstack dependencies: #140160
2024-11-11 19:35:24 +00:00
0a0915fb5e [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 18:49:22 +00:00
96b64182de Delete Buck1 as it is no longer supported (#140067)
Buck1 is no longer supported in favor of buck2. This CI tests the old buck1 flow; however, it is difficult to maintain, especially since buck1 doesn't support aarch64 Mac.

I am suggesting that this CI be deprecated until a decision on buck2 is made, and buck2 support is added. As of now, there seems to be no push towards adding buck2 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140067
Approved by: https://github.com/huydhn
2024-11-11 18:49:18 +00:00
5f4a21dc58 Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934)"
This reverts commit 2f3a5a15ef701ffab9a880cf822ff8e5224a4b33.

Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))
2024-11-11 17:02:07 +00:00
2fe110ff3a [BE][MPS] Standardize indexing shader compilation (#140271)
It was wrong to add it to MPSDevice in the first place, as in the end it's just a regular shader, like all others.
I.e. this PR:
 - Moves contents of `at::mps::indexing_metal_shaders` into `kernels/Indexing.metal`
 - Deletes `MPSDevice::getMetalIndexingLibrary()` and `MPSDevice::metalIndexingPSO` methods
 - Moves `at::native::mps::generateKernelDataOffsets` implementation from `OperationUtils.mm` to `Indexing.mm`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140271
Approved by: https://github.com/Skylion007
2024-11-11 17:00:49 +00:00
f5ffd55a32 [MPS] Add torch.special.i1 op (#140196)
By more-or-less copy-n-pasting 58b661cda2/aten/src/ATen/native/cuda/Math.cuh (L576)

Enable respective tests in test_mps.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140196
Approved by: https://github.com/Skylion007
2024-11-11 16:57:53 +00:00
63715f6567 S390x update builder image (#132983)
Publish current state of s390x builder image to allow reproducing worker setup.
Also, if this image gets published to a docker repository later, it'd be possible to download the published image instead of building it into the worker image in https://github.com/pytorch/pytorch/blob/main/.github/scripts/s390x-ci/self-hosted-builder/actions-runner.Dockerfile#L66, which should improve restart time at the cost of additional runtime overhead.

Compared to first attempt to merge:
- default docker repository settings are added to all runners. Changes are mirrored in this PR.
- job is moved into separate workflow file.
- updating limits on s390x is no longer attempted. Limits should be properly set up on the host; it's not possible to update them from the worker since it runs in a container, and the worker container currently doesn't have sudo installed or configured, or any systemd running.
- the github token is now passed once via a named pipe instead of an environment variable, which should improve token security.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-11-11 16:14:06 +00:00
04b5b4a94e Add base class for single-subgraph inductor HOPs (#139898)
This PR adds "PrimHOPBase", which is intended to be a base class that
one can extend to create new HOPs that match some criteria:
- they take one subgraph as input, and their semantics are running the
  subgraph on some operands
- the HOP stays alive until Inductor

The motivation is that we are seeing a lot more HOPs (invoke_subgraph,
invoke_quant) that have this property and there can be a lot of shared
code between them.

Future:
- Migrate invoke_subgraph to use this
- There are some TODOs in the code

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139898
Approved by: https://github.com/anijain2305, https://github.com/ydwu4
2024-11-11 16:12:35 +00:00
d4b8857e51 [codecache][triton 3.2] hash -> base64 conversion for triton 3.2 (#140190)
In old triton versions, you take the hash of the triton kernel and use it in the filepath for the cached kernel. In Triton 3.2 (after https://github.com/triton-lang/triton/pull/4553), the filepath will use the base-64-encoded representation of the hash in the path.

This PR checks whether the `_base64` function exists in triton, and if so, uses the base-64-encoded representation in the path.
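As a generic illustration of the difference (standard-library Python only, not the actual Triton caching code), the same digest rendered as hex vs. URL-safe base64 looks like:

```python
import base64
import hashlib

digest = hashlib.sha256(b"triton kernel source").digest()
hex_component = digest.hex()                               # old-style path component
b64_component = base64.urlsafe_b64encode(digest).decode()  # base64-style component
print(hex_component)
print(b64_component)
```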

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140190
Approved by: https://github.com/ezyang
2024-11-11 15:32:28 +00:00
ceb44b22dc [FR] Enable best effort partial analysis and verbose mode for trace printing (#139853)
Based on user feedback, we want to enable two things for FR analysis script:
1. Print out more information when verbose is specified.
2. Perform best-effort analysis when not all ranks have the FR trace dumped.

Differential Revision: [D65516081](https://our.internmc.facebook.com/intern/diff/D65516081/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139853
Approved by: https://github.com/c-p-i-o
2024-11-11 14:38:32 +00:00
cb15c15157 [logging] Overhaul dynamo_timed and CompilationMetrics logging. (#139849)
Here's the overview:

There's a new contextmanager singleton called MetricsContext. Entering the MetricsContext is how we demarcate the boundary on which we'll create a single CompilationMetrics object, and therefore, a single dynamo_compile log entry. While we're inside the MetricsContext, we can update/set many different metrics. Most importantly: `dynamo_timed` can also update the in-progress MetricsContext. In the proposal here, we tell `dynamo_timed` that we want it to do so by providing the name of the MetricsContext field to increment. There can be many `dynamo_timed` calls in different parts of the code updating different fields. Then when the MetricsContext exits, that's when the logging of everything gathered finally happens. One potential footgun is trying to use `dynamo_timed` when we haven't entered the MetricsContext, but we assert on that problem. Another problem is that we re-enter the context recursively, but we watch for that and do the logging only when the outermost exits.
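A minimal sketch of that lifecycle, with assumed names and a stand-in `print` in place of the real CompilationMetrics logging (this is not the actual torch._dynamo implementation):

```python
# Minimal sketch only; field names and the logging step are assumptions.
class MetricsContextSketch:
    def __init__(self):
        self._depth = 0
        self._metrics = {}

    def __enter__(self):
        if self._depth == 0:
            self._metrics = {}              # fresh metrics at the outermost entry
        self._depth += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._depth -= 1
        if self._depth == 0:
            # Only the outermost exit records/logs everything gathered.
            print("dynamo_compile:", self._metrics)

    def increment(self, field, value):
        # Footgun guard: updates are only allowed while inside the context.
        assert self._depth > 0, "not inside MetricsContext"
        self._metrics[field] = self._metrics.get(field, 0) + value

ctx = MetricsContextSketch()
with ctx:
    ctx.increment("inductor_compile_time_s", 1.5)
    with ctx:                               # recursive re-entry is tolerated
        ctx.increment("code_gen_time_s", 0.3)
# prints: dynamo_compile: {'inductor_compile_time_s': 1.5, 'code_gen_time_s': 0.3}
```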

Some specifics:
* Introduce MetricsContext - a context manager that on exit, records the CompilationMetrics (which also logs to dynamo_compile).
* Completely remove the concept of frame_phase_timing. Instead, update the MetricsContext during compilation, either directly or via dynamo_timed.
* Remove some globals we previously used to accumulate counters to later populate a CompilationMetrics. We use CompilationMetrics set/update/increment APIs instead.
* `record_compilation_metrics` is now called on exit from MetricsContext.
* Populate legacy CompilationMetrics fields right before logging, inside `record_compilation_metrics`.
* Remove the one-off `add_remote_cache_time_saved` helper; capture that timing directly into the MetricsContext.

And specifically, several changes to dynamo_timed:
* "Modernize" the parameters and update all callsites accordingly.
* Move the backwards logging of the CompilationMetrics to the backwards compile location.
* Add a parameter for which CompilationMetrics field to update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139849
Approved by: https://github.com/ezyang
ghstack dependencies: #140094
2024-11-11 14:24:23 +00:00
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.

After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison treats them as signed. With lhs=255 and rhs=0, lhs < rhs is equivalent to -1 < 0, which is true (but it's supposed to be false).

Therefore we add an explicit type cast.
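A small Python illustration of the arithmetic described above (the actual fix is a cast in the C++ sort kernel; this only mirrors the signed reinterpretation):

```python
# The byte 255 viewed as a signed 8-bit value is -1, so "255 < 0" comes out true
# unless both sides are first cast to bool.
import ctypes

lhs_byte, rhs_byte = 255, 0
lhs_as_signed = ctypes.c_int8(lhs_byte).value   # -1
print(lhs_as_signed < rhs_byte)                 # True  -- wrong ordering for bools
print(bool(lhs_byte) < bool(rhs_byte))          # False -- explicit cast fixes it
```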

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
2f3a5a15ef [SymmetricMemory] improve the API for stream_write_value32 (#139934)
This PR updates the binding for `stream_write_value32` to be consistent with `memset32`, which IMO makes more sense for this type of utility:
- Changed the API to take a uint32 tensor as the argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of an object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
2024-11-11 01:54:35 +00:00
cyy
ffb979032d [7/N] Fix Wextra-semi warning (#140225)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140225
Approved by: https://github.com/ezyang
2024-11-10 14:28:10 +00:00
d90c25e3e2 OpenReg: Support event (#140111)
Support events. Since the CPU backend doesn't support asynchronous execution, all event operations are executed immediately on the executor side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140111
Approved by: https://github.com/ezyang
2024-11-10 08:38:45 +00:00
c3087ace58 Update torch-xpu-ops commit pin (#139986)
Update the torch-xpu-ops commit to [5e29831 ](https://github.com/intel/torch-xpu-ops/commit/5e29831). Includes:
- OneAPI-2025 build issue fix
- Enhancement of the XPU operator coverage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139986
Approved by: https://github.com/guangyey, https://github.com/jansel
2024-11-10 06:49:38 +00:00
94c9bb73c0 [Inductor] [CPP] Update BRGEMM parameters for Half cpp gemm template (#140116)
Update BRGEMM parameters for the Half cpp gemm template, as the BRGEMM API was changed in https://github.com/pytorch/pytorch/pull/138184.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140116
Approved by: https://github.com/jansel
2024-11-10 06:37:10 +00:00
4f6b30bcbc Add testing for the utils surrounding dynamo_timed (#140094)
Summary: This will make it easier to verify that we don't break these utilities for the refactor in https://github.com/pytorch/pytorch/pull/139849.
It's one giant test. I can split it into multiple tests for better readability if people prefer that. My rationale for the giant test is that I found I was just resetting compilation and recompiling the same thing many times, which was slow and wasteful.

Test Plan: The new tests

Differential Revision: [D65682138](https://our.internmc.facebook.com/intern/diff/D65682138)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140094
Approved by: https://github.com/ezyang
2024-11-10 04:17:45 +00:00
5ef33e40b3 Add size param check of unfold (#139965)
Fixes #76617

Changes:

- Add a check of the input `size` value and give a user-friendly hint message
- Fix `FIXME: move to shape ops test suite` in the test file

Before
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: Storage size calculation overflowed with sizes=[9, -1] and strides=[1, 1]

```

After
```python
import torch
x = torch.arange(1., 8)
x.unfold(0, -1, 1)

Traceback (most recent call last):
  File "/home/zong/code/pytorch/../unfold.py", line 12, in <module>
    x.unfold(0, -1, 1)
RuntimeError: size is -1 but must be >= 0
```

Test Result:
```bash
pytest test/test_shape_ops.py
```

![image](https://github.com/user-attachments/assets/d7bcef62-04e6-4187-9c8f-bc5220ff6c33)

```bash
$ lintrunner
```

![image](https://github.com/user-attachments/assets/6b48d095-5c8a-4e75-9957-dc22d39a73bb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139965
Approved by: https://github.com/ezyang
2024-11-09 17:12:53 +00:00
f89b2b9630 Refactor conda-builder -> almalinux-builder (#140157)
This changes the conda-builder workflow to almalinux-builder and switches Docker file to almalinux.
Please note: Published conda-builder images will still be available, hence workflows that use these images will still work.
We will be switching workflows that use conda-builder images to almalinux-builder

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140157
Approved by: https://github.com/malfet
2024-11-09 16:06:40 +00:00
cyy
7d4f5f7508 [Environment Variable][6/N] Use thread-safe getenv functions (#140200)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140200
Approved by: https://github.com/ezyang
2024-11-09 15:05:51 +00:00
a2ac96cae0 [BE] Rectify some references to caffe2 (#140204)
- Rename `tools.build_pytorch_libs.build_caffe2` to `tools.build_pytorch_libs.build_pytorch`
- Delete a number of `if BUILD_CAFFE2` conditions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140204
Approved by: https://github.com/huydhn, https://github.com/r-barnes, https://github.com/atalman
2024-11-09 14:14:20 +00:00
5107d244ee [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We now don't want to log args and kwargs directly because, if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not even be supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337, https://github.com/kwen2501
2024-11-09 13:57:32 +00:00
052b67e2b4 Add torch.version.xpu (#139466)
# Motivation
We add a new attribute `torch.version.xpu` to facilitate the problem diagnosing and version control.

# Additional Context
It is aligned with `torch.version.cuda` and `torch.version.hip`.
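Expected usage mirrors the CUDA counterpart (on builds without XPU support the attribute is presumably `None`, just as `torch.version.cuda` is on CPU-only builds):

```python
import torch

print(torch.version.xpu)   # XPU toolkit version string on XPU builds
print(torch.version.cuda)  # existing CUDA counterpart
```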

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139466
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #139258
2024-11-09 13:31:21 +00:00
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the sycl library is named `sycl7.lib` with the old compiler but `sycl.lib` with the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, while with the new compiler we can link `libsycl.so` and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code keeps good backward compatibility with the old compiler. The new features (Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler are now gated by the macro `SYCL_COMPILER_VERSION`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
191971e01d [AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c_shim for XPU. (#136742)
[AOTI] Introduce an extensibility mechanism for the c shim codegen to make it easy to produce c shims for out-of-tree OP kernels as well. Add c shim for XPU.

### Motivation
The current c shim codegen only produces C wrappers for ops registered in `aten/src/ATen/native/native_functions.yaml`. For the same backend, a portion of out-of-tree ops are not registered in that file but are registered externally, for example in `third_party/torch-xpu-ops/yaml/native_functions.yaml`. In this case, the existing codegen can't extend the c shims that have already been produced for the in-tree ops with those out-of-tree ops.

### Design
To extend the c shim with more out-of-tree ops for a backend:
The PR provides a bool option `--aoti-extend` to indicate that the codegen should extend the c shim with out-of-tree ops.
The generated c shim is stored in the `extend` subdirectory, for example:
```
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/c_shim_xpu.cpp
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.h
torch/include/torch/csrc/inductor/aoti_torch/generated/extend/c_shim_xpu.cpp
```
example usage:
`python -m torchgen.gen --source-path third_party/torch-xpu-ops/yaml/ --xpu --aoti-extend --update-aoti-c-shim  `
`--xpu`:  generate c shim for XPU
`--aoti-extend`: extend the in-tree ops (defined in `aten/src/ATen/native/native_functions.yaml`) with out-of-tree ops (defined in `third_party/torch-xpu-ops/yaml/native_functions.yaml`)
`--update-aoti-c-shim`: always generate c_shim_xpu.h for the extend c_shim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136742
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #139025
2024-11-09 13:19:52 +00:00
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.

Motivation: There are two sets of aten ops for XPU: in-tree ops like the GEMM-related ops, and out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
0b650c360a Build magma for windows (#139924)
Copy the magma for windows job and script from pytorch/builder c9aac65e12/.github/workflows/build-magma-windows.yml

The linux version is moved here in https://github.com/pytorch/pytorch/pull/139888

Fixes #140001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139924
Approved by: https://github.com/atalman
2024-11-09 09:27:59 +00:00
e2e425b4f3 [CUDAGraph] Add dynamo timer to checkpoint, warmup, and record (#139818)
Summary: Add time logging to cudagraph, including `create deferred_cudagraphify wrapper`, `warmup`, `record`, and `checkpoint`.

Test Plan:
1. buck2 run fbcode//mode/opt //pytorch/benchmark:run -- resnet50 -d cuda -t train --inductor --pt2-triton-cudagraph

2. Found the result in [scuba table](https://fburl.com/scuba/pt2_compile_events/0oik8nu9).

 {F1954034920}

Differential Revision: D65505659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139818
Approved by: https://github.com/eellison
2024-11-09 05:27:11 +00:00
cyy
ab55a99283 Use TORCH_DECLARE_XXX (#139952)
Because those files use TORCH_API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139952
Approved by: https://github.com/ezyang
2024-11-09 04:56:28 +00:00
d2d1258b1b Speed up AMD AOT Inductor lowering by memoizing hipify trie to regex logic (#140156)
Summary:
AMD lowering takes 1.55x longer than on H100. Profiling shows hipification-related functions took 22% of the overall lowering time.

This diff cuts that time by safely memoizing the trie-to-regex logic. The trick is to incrementally build a state of the trie during trie construction; the state is the hash of all the words added to the trie.
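A rough Python sketch of that memoization idea (the class layout and the trie-to-regex details are assumptions, not the actual hipify code):

```python
# Sketch only: the expensive trie-to-regex conversion is cached, keyed by an
# incrementally maintained hash of all words added so far.
import re

class MemoizedTrieSketch:
    _regex_cache = {}                      # shared cache across instances

    def __init__(self):
        self._words = []
        self._state = hash(())             # incremental "state" of the trie

    def add(self, word):
        self._words.append(word)
        self._state = hash((self._state, word))

    def to_regex(self):
        cached = MemoizedTrieSketch._regex_cache.get(self._state)
        if cached is None:
            # Expensive step, now performed only once per distinct word set.
            alternation = "|".join(
                re.escape(w) for w in sorted(self._words, key=len, reverse=True)
            )
            cached = re.compile(alternation)
            MemoizedTrieSketch._regex_cache[self._state] = cached
        return cached
```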

Differential Revision: D65659445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140156
Approved by: https://github.com/ColinPeppler

Co-authored-by: Kefei Lu <kefeilu@meta.com>
2024-11-09 04:28:58 +00:00
8b2e3855a9 Make size a property with an assertion (#139794)
Fixes https://github.com/pytorch/pytorch/issues/120568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139794
Approved by: https://github.com/williamwen42
2024-11-09 03:39:41 +00:00
cyy
032135f8a2 [2/N] Turn inline static functions into static (#140068)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140068
Approved by: https://github.com/ezyang
2024-11-09 03:31:24 +00:00
3b8470c461 add special case for __round__ constant variables (#139583)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCUDA.test_cauchy_cuda_float64` when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139583
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935, #139587
2024-11-09 03:25:53 +00:00
f915409c26 FlopCounterMode: Decompose ops for inference mode (#138508)
Fixes #126268

I've basically followed @ezyang suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.

Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition thing actually works :D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
2024-11-09 03:13:53 +00:00
4488e23763 Fix another item memo loss location + bool specialization bug (#139587)
This fix was a bit more involved:
1) It fixes an item_memo loss location.
2) It updates a test to be eager instead of aot_eager, since it reveals a very obscure bug related to replacements that's not worth solving, because in practice inductor will regenerate the runtime asserts anyway
3) It updates tensorify to specialize more places now that the aforementioned bug is fixed.

Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

while ensuring `python test/dynamo/test_dynamic_shapes.py DynamicShapesMiscTests.test_runtime_assert_replacement_dynamic_shapes` doesn't regress

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139587
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896, #139935
2024-11-09 03:11:19 +00:00
4893e248a8 [DTensor][Test] Remove safe global context for weights_only torch.load() DTensor (#140173)
We have added DTensor related classes to allowed globals so we can torch.load(DTensor) with weights_only=True. So we don't need the safe_globals context for this test anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140173
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #139949
2024-11-09 02:21:44 +00:00
72976b2486 Use manylinux-builder images with main tag (#140158)
The magma build uses deprecated manylinux-builder images. Update it to use the images with "main" in the tag:

  pytorch/manylinux-builder:cuda<version>-main

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140158
Approved by: https://github.com/atalman
2024-11-09 02:16:00 +00:00
2ede4c9a38 [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-09 01:31:46 +00:00
9c678af9f9 Misc. non-contig NJT fixes (#140160)
This PR contains several fixes related to non-contiguous NJTs:
1. Propagates `lengths` through op calls appropriately (see desc of #138098)
    * SDPA now calls `nested_view_from_values_offsets_lengths()` instead of `nested_view_from_values_offsets()`
2. Allows non-contig NJTs in unsqueeze / transpose / select
3. Expands padded dense -> NJT conversion to support non-contig NJTs
4. (unrelated sorry) Updates `split` / `split_with_sizes` to allow for optional `dim`, matching the ATen signature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140160
Approved by: https://github.com/cpuhrsch
2024-11-09 01:18:26 +00:00
be172d2a60 [pt2, docs] Add new PT2 troubleshooting doc (#138620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138620
Approved by: https://github.com/ezyang

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-11-09 01:17:39 +00:00
de40a23f6c [dynamo] Remove dead code path for capturing __class__ in UserFunctionVariable (#140034)
This was introduced in https://github.com/pytorch/torchdynamo/commit/d0c10341
as limited support for pre-existing cells, since we know `__class__` wouldn't be modified
in most cases. It's no longer needed now that we have much more support for these cells.

Example:
```python
class Foo():
    def __init__(self):
        super().__init__()

print(Foo.__init__.__code__.co_freevars) # ('__class__',)
print(Foo.__init__.__closure__)          # (<cell at 0x1011fb310: type object at 0x10fe185b0>,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140034
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #140033
2024-11-09 01:03:24 +00:00
0b8652a999 [dynamo] Remove NestedUserFunctionVariable.closure_scope (#140033)
This was no longer needed after https://github.com/pytorch/torchdynamo/commit/663e4d92,
which removed the uses of `closure_scope` but not the field itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140033
Approved by: https://github.com/williamwen42, https://github.com/anijain2305, https://github.com/jansel
2024-11-09 01:03:24 +00:00
cyy
263d8f7a94 [8/N] Don't skip ASAN on some tests (#140081)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140081
Approved by: https://github.com/ezyang
2024-11-09 01:00:13 +00:00
58b661cda2 Revert "[c10d][Logging] Remove args and kwargs from c10d logging (#140169)"
This reverts commit e3b2f04f052fbc5dcf728f33ac59917d087c324c.

Reverted https://github.com/pytorch/pytorch/pull/140169 on behalf of https://github.com/ZainRizvi due to Man, this test really wants to fail on trunk. Sorry. Details:  distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11751023962/job/32740983427) [HUD commit link](e3b2f04f05) ([comment](https://github.com/pytorch/pytorch/pull/140169#issuecomment-2465933413))
2024-11-09 00:23:43 +00:00
090b778b8a Clarify meaning of rate parameter in Gamma distribution (#134847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134847
Approved by: https://github.com/fritzo
2024-11-09 00:22:13 +00:00
7eb66173e2 Revert "Fix split decomp returning self (#140065)"
This reverts commit 9d99dceb53884387665a2c273beca99a157193a5.

Reverted https://github.com/pytorch/pytorch/pull/140065 on behalf of https://github.com/ZainRizvi due to Diff been imported internally, but merged externally. And the internal diff has been updated so the diff and PR are now mismatched.  Reverting this PR to get things back into a consistent state. See D65635070 ([comment](https://github.com/pytorch/pytorch/pull/140065#issuecomment-2465928027))
2024-11-09 00:16:26 +00:00
a02e88d19c [miniz] Bump miniz version to 3.0.2 and add patch for zip64 (#140041)
Summary:
Bump miniz version from 2.1.0 to 3.0.2 and apply these patches:

* #79636 patches internal BUCK and bazel build
* #138959 adds `bool compute_crc32` argument
* miniz PR: https://github.com/richgel999/miniz/pull/324 to support
  zip64

Anyone bumping miniz version again, please apply these patches as well.

Test Plan:
Rely on unit test

Imported from OSS

Differential Revision: D65586230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140041
Approved by: https://github.com/mikaylagawarecki
2024-11-09 00:13:16 +00:00
1400fedf76 Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)"
This reverts commit e5574445b01f264e57653a8a42af1118e89acc9a.

Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))
2024-11-08 23:52:49 +00:00
ea0f60ecfa [Dynamo] allow dynamic callables on tensor variables (#137940)
Fixes https://github.com/pytorch/pytorch/issues/134844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137940
Approved by: https://github.com/williamwen42
2024-11-08 23:49:34 +00:00
beae7725be Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit d3788190685685cb828bdf6bed90270c0b60affc.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D65641103 for more details ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2465906839))
2024-11-08 23:44:41 +00:00
2af5172774 fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
As we upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include those ops while keeping the behavior with NumPy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/williamwen42
2024-11-08 23:38:53 +00:00
1659e241c8 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunk_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Differential Revision: [D65623152](https://our.internmc.facebook.com/intern/diff/D65623152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-08 23:28:25 +00:00
e3b2f04f05 [c10d][Logging] Remove args and kwargs from c10d logging (#140169)
This PR is trying to reland https://github.com/pytorch/pytorch/pull/139804

We now don't want to log args and kwargs directly because, if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not even be supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140169
Approved by: https://github.com/wz337
2024-11-08 23:24:52 +00:00
cc44b55b00 Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)
This is the big milestone for bf16 and should enable us to close https://github.com/pytorch/torchchat/issues/1253 .

Testing: ran `python torchchat.py generate llama3.2-1b --dtype bf16 --device cpu` on an x86 machine with AVX512-bf16 and observed similar tokens/sec with and without the MKL path hand-disabled. Also observed a speedup from ~2.1 tok/sec to 7.4 tok/sec on an x86 machine with only AVX2.

Differential Revision: [D65170967](https://our.internmc.facebook.com/intern/diff/D65170967/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139220
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081, #139208
2024-11-08 23:24:36 +00:00
25c469bac3 Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)
Very similar to #137917, but for bf16.

Differential Revision: [D65155971](https://our.internmc.facebook.com/intern/diff/D65155971/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139208
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558, #139081
2024-11-08 23:24:36 +00:00
7f0bf9f961 Move bf16_gemv_trans to ReducedPrecisionFloatGemvFastPathKernel (#139081)
Following the previous move of fp16_gemv_trans.

Testing: Checked for performance regression with llm_benchmarks' `python benchmarks/benchmark_torch_mm.py llm`, didn't find one
Differential Revision: [D64930872](https://our.internmc.facebook.com/intern/diff/D64930872/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139081
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090, #139558
2024-11-08 23:24:29 +00:00
44f6d1439e Unbreak vec128_half_neon comparison without FP16 hardware support (#139558)
Discovered this bug when working on Vectorized<BFloat16>; apparently we have no automated testing for aarch64 without FP16.

Testing: Manually disable FP16 feature for local vec_test_all_types run on Mac; see pass.

Differential Revision: [D65385267](https://our.internmc.facebook.com/intern/diff/D65385267/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139558
Approved by: https://github.com/malfet
ghstack dependencies: #139084, #139090
2024-11-08 23:24:22 +00:00
ac6b6c6f98 [BE][CI] Use pip3 instead of pip (#140185)
As on modern distros(see this oldie but goodie: https://launchpad.net/ubuntu/focal/+package/python-is-python3 ), `pip` alias might be missing or indeed point to Python2 installation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140185
Approved by: https://github.com/wdvr, https://github.com/huydhn, https://github.com/seemethere
2024-11-08 23:15:02 +00:00
1cdaf1d85f correctly keep track of processed tensors for foreach reductions (#140103)
Fixes #140066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140103
Approved by: https://github.com/janeyx99

Co-authored-by: Jane Xu <janeyx@meta.com>
2024-11-08 23:04:53 +00:00
f3cbf67686 [CD] Build aarch64 wheels without conda (#140093)
As manylinuxaarch64-builder already comes pre-built with all versions of python runtime

Refactor logic for setting path to DESIRED_PYTHON from `manywheel/build_common` into `set_desired_python.sh` and call it from aarch64_ci_setup.sh

In followup PRs move scons and ninja installation into base docker image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140093
Approved by: https://github.com/atalman
2024-11-08 22:24:28 +00:00
95198f8299 Remove uses of deleted operations (#139447)
resolves: https://github.com/pytorch/pytorch/issues/138721

Summary:

Delete the uses of deleted nodes. The double for-loop is icky here, but N should
be pretty small and removing it requires refactoring the datastructures
involved, which is a bigger endeavor.

Test Plan:

Normal test coverage should be sufficient. There were a couple of spots in the
scheduler code that didn't check users being deleted, so I'll run a perf test to see
what impact that has, and to make sure N^2 doesn't affect compile times.

Perf:
https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2029%20Oct%202024%2017%3A41%3A36%20GMT&stopTime=Tue%2C%2005%20Nov%202024%2018%3A41%3A36%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=exclamaforte/prune-deleted-users&lCommit=5cb1aa6f7d8a52acdae0c7cf36b8c2d536d7f0d1&rBranch=main&rCommit=f4ee5a243dbb31e6310e5632b1c87898b299df2c
off of nov4 nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139447
Approved by: https://github.com/eellison
2024-11-08 22:21:53 +00:00
347f96061f Revert "[cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)"
This reverts commit cf0bb6c435c58db4c72e489f462e1a0ebe310f14.

Reverted https://github.com/pytorch/pytorch/pull/136827 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. See D65605094 for more details ([comment](https://github.com/pytorch/pytorch/pull/136827#issuecomment-2465805271))
2024-11-08 21:52:33 +00:00
a7724518c0 Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit d72a308e77ec8895d48798dda05996cbc44ffa3e.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/ZainRizvi due to Sorry but the newly added tests in test_mkldnn_pattern_matcher.py fail internally. See D65661038 for more details ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2465797016))
2024-11-08 21:45:52 +00:00
80d0356b11 Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit c03324de2dfbbf0006818c86b88c92a3378f46b7.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/ZainRizvi due to This fails to build internally. See D65604944 for more details ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2465790157))
2024-11-08 21:40:10 +00:00
3483f7809e Revert "Fix typo in associative_scan tests (#139929)"
This reverts commit 7fa94f03635709a30ef85c6955dcdd5051e72e71.

Reverted https://github.com/pytorch/pytorch/pull/139929 on behalf of https://github.com/ZainRizvi due to This test is breaking in trunk somehow, which is really weird. functorch/test_control_flow.py::AssociativeScanTests::test_associative_scan_binary_operator_compile_mode_compile_dynamic_shape_combine_mode_pointwise_reverse_False_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11747748990/job/32732254909) [HUD commit link](7fa94f0363) ([comment](https://github.com/pytorch/pytorch/pull/139929#issuecomment-2465773366))
2024-11-08 21:26:41 +00:00
411203e7c1 Revert D65490202 (#140142)
Summary:
This diff reverts D65490202
This is causing tests to fail on open source. See distributed/test_c10d_logger.py::C10dErrorLoggerTest::test_exception_logger [GH job link](https://github.com/pytorch/pytorch/actions/runs/11736922614/job/32697709457) [HUD commit link](ba9645f6e5)

Test Plan: NA

Differential Revision: D65663063

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140142
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-11-08 21:22:32 +00:00
119e0699cc [ez] Add .lintrunner.private.toml to .gitignore (#140166)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140166
Approved by: https://github.com/Skylion007
2024-11-08 20:55:21 +00:00
63a0d6587e [AOTI] Update the OSS tutorial (#139956)
Summary: Update the OSS tutorial to use the new aoti_compile_and_package and aoti_load_package APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139956
Approved by: https://github.com/angelayi
ghstack dependencies: #139955
2024-11-08 20:46:57 +00:00
07ad74635b Revert "[Reland] Use static_assert to detect get_type_index used in device code (#139966)"
This reverts commit ca7fdfe4d25f91c4cae48fde6eeac990738447f2.

Reverted https://github.com/pytorch/pytorch/pull/139966 on behalf of https://github.com/malfet due to This approach will prevent one from using get_type_index from device code ([comment](https://github.com/pytorch/pytorch/pull/139966#issuecomment-2465701260))
2024-11-08 20:32:43 +00:00
e6c5a77485 [dynamo][guards] Profile guard manager in C++ (#140110)
This should remove the pybind noise from the profiling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140110
Approved by: https://github.com/jansel
ghstack dependencies: #139953
2024-11-08 18:44:08 +00:00
a140e65e0f [dynamo] Support method with different __self__ on user defined objects (#139953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139953
Approved by: https://github.com/jansel
2024-11-08 18:44:08 +00:00
d18bca4961 [dynamo] switch to get_framelocals_mapping for 3.10 and below (#140037)
Part of implementing https://github.com/pytorch/pytorch/issues/93753. Next step will be to use a lower overhead data structure over `py::dict`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140037
Approved by: https://github.com/jansel
ghstack dependencies: #139921, #139950
2024-11-08 18:43:54 +00:00
bbd427faf5 [dynamo] switch to get_framelocals_mapping for 3.11 (#139950)
Part of implementing https://github.com/pytorch/pytorch/issues/93753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139950
Approved by: https://github.com/jansel
ghstack dependencies: #139921
2024-11-08 18:43:54 +00:00
7fa94f0363 Fix typo in associative_scan tests (#139929)
Fix typo with Associative_Scan tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139929
Approved by: https://github.com/ydwu4
2024-11-08 18:42:26 +00:00
dfcf740a61 Fix traceback.format_exception(...) positional arguments error. (#140109)
Fix #140095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140109
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-11-08 18:22:32 +00:00
8d61add14a Add Vectorized<c10::BFloat16> specialization for ARM (#139090)
When we have hardware support, we can use it. When we don't have hardware support, we can still do better than vec_base.h. I'm not sure to what extent we're set up to properly test both `defined(__ARM_FEATURE_BF16)` and `!defined(__ARM_FEATURE_BF16)` builds, feedback especially welcome there.

Testing: vec_test_all_types should cover correctness. For perf, seems clear that using vectorized intrinsics should be better than vec_base?

Differential Revision: [D64997747](https://our.internmc.facebook.com/intern/diff/D64997747/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139090
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139084
2024-11-08 17:11:40 +00:00
8690f60f39 Extract value_type-generic NEON Vectorized<Half> functions to CRTP base class (#139084)
This is in preparation for adding NEON Vectorized<BFloat16>, which will be simplified by sharing this code.

Differential Revision: [D64997744](https://our.internmc.facebook.com/intern/diff/D64997744/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139084
Approved by: https://github.com/malfet
2024-11-08 17:11:40 +00:00
1868fc63d8 [AOTI] Update C++ runner API to take a const vector (#139955)
Summary: Tighten the AOTIModelContainerRunner::run interface to take a const vector of at::Tensor, which 1) makes it clear that the runner will not modify the input tensor vector; 2) runner will be able to take a temp vector of tensors as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139955
Approved by: https://github.com/chenyang78
2024-11-08 16:59:10 +00:00
fc6496c703 Revert "Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)"
This reverts commit ee7c3db092e09cde37ee33648dff1955bcd71e82.

Reverted https://github.com/pytorch/pytorch/pull/138623 on behalf of https://github.com/huydhn due to I think the link failure is legit, it complains about the wrong concurrency setting in the workflow ([comment](https://github.com/pytorch/pytorch/pull/138623#issuecomment-2465277228))
2024-11-08 16:58:05 +00:00
9d99dceb53 Fix split decomp returning self (#140065)
Previously the split decomp would return the input when there were no splits. this errors in torch.compile (or FakeTensorMode) with :

> RuntimeError: View operation returned a tensor that is the same as the input base tensor.  This is no longer allowed; you must explicitly create a new tensor (e.g., using .detach()). As a user, you could have made a mistake implementing __torch_dispatch__ or a Python operator decomposition or meta registration; if that's not the case, please report a bug to PyTorch or the backend you are using.

Fix for https://github.com/pytorch/pytorch/issues/133394
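One common way to avoid this class of error is for the decomposition to return a fresh alias (e.g. a view) rather than the input tensor object itself. The sketch below is an assumed illustration of that idea, not necessarily the exact change in this PR:

```python
# Illustrative only: never return `self` from a view-like decomposition, so the
# "view returned the same tensor as its base" check is not tripped.
import torch

def split_like_sketch(x: torch.Tensor, split_size: int, dim: int = 0):
    if split_size >= x.size(dim):
        # Nothing to split: return a new view aliasing the same storage,
        # instead of the input tensor itself.
        return (x.view(x.shape),)
    return torch.split(x, split_size, dim)
```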

Differential Revision: [D65635070](https://our.internmc.facebook.com/intern/diff/D65635070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140065
Approved by: https://github.com/bdhirsh
2024-11-08 16:53:18 +00:00
22cd1ee951 [CD] Enable 3.13 triton build (#140137)
Copied from https://github.com/pytorch/pytorch/pull/139652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140137
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-11-08 16:34:10 +00:00
dd79d2f5e7 Removing warning for Windows Arm64 (#139746)
This PR removes the warning message on Windows on Arm64, which was triggered by an issue in one of the DLLs, to improve the user experience.

`Microsoft Visual C++ Redistributable is not installed, this may lead to the DLL load failure.
                 It can be downloaded at https://aka.ms/vs/16/release/vc_redist.x64.exe`

The issue is being tracked here: https://developercommunity.visualstudio.com/t/VCRUNTIME140_1DLL-Miscompiled-for-Arm64/10781635?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139746
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-08 16:23:59 +00:00
1d2d9f0de8 Give the magma build job id-token write permissions (#140141)
The configure-aws-credentials action requires special permissions: https://github.com/aws-actions/configure-aws-credentials?tab=readme-ov-file#oidc

Give "id-token: write" permssion to the job that sets the AWS credentials to upload to the S3 bucket.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140141
Approved by: https://github.com/atalman
2024-11-08 15:59:49 +00:00
ee7c3db092 Enable inductor-rocm workflow for all trunk commits AND inductor-related PRs (#138623)
It should help with triaging ROCm-inductor-related breakages and surfacing them in the PRs itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138623
Approved by: https://github.com/huydhn
2024-11-08 15:54:09 +00:00
7167323644 Fix type description of torch.chunk (#140089)
Fixes #126278

- Change return type description of `torch.chunk` to tuple
- Add type for input parameters

**Before**
![image](https://github.com/user-attachments/assets/087b6cfa-0815-443b-a69a-785ca4b421d7)

**After**
![image](https://github.com/user-attachments/assets/19532553-6004-4246-a6cf-f7f685f5775c)
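The tuple return type can be checked directly:

```python
import torch

chunks = torch.chunk(torch.arange(6), 3)
print(type(chunks))   # <class 'tuple'>
print(len(chunks))    # 3
```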

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140089
Approved by: https://github.com/awgu
2024-11-08 15:21:13 +00:00
838958de94 [inductor] Support autotune restore_value for user-defined Triton kernels (#139851)
This PR adds support for the `restore_value` argument of the
`@triton.autotune` for the user-defined Triton kernels in PT2.

The `kernel.restore_idx` are extracted in the
`ir.UserDefinedTritonKernel` and the corresponding arg names are
placed into the `triton_meta["restore_value"]`. From there, those
are added to the existing `mutated_arg_names` in the caching autotuner
infra, which already exists and leads to the listed args being cloned.
This achieves the equivalent effect to the native `restore_value`.
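A hedged sketch of what a user-defined kernel using `restore_value` looks like (configuration values and the kernel body are illustrative only); the listed argument is mutated in place, so it has to be restored between benchmarking runs so that every config sees the same input:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[triton.Config({"BLOCK": 64}), triton.Config({"BLOCK": 128})],
    key=["n"],
    restore_value=["x_ptr"],   # x is mutated in place; restore it between autotune runs
)
@triton.jit
def inplace_scale_kernel(x_ptr, n, scale, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(x_ptr + offs, x * scale, mask=mask)
```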

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139851
Approved by: https://github.com/oulgen
2024-11-08 14:59:00 +00:00
a33fa37b4e [ROCm] Support new AMD triton stream pipeliner (#139881)
Fixes #139182

In Triton 3.2 num_stages=0 will be deprecated with Triton's AMD backend. Let's query default num_stages from the relevant triton backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139881
Approved by: https://github.com/bertmaher
2024-11-08 14:51:05 +00:00
c1c94cb0be Build magma binary tarballs for various cuda (#139888)
This is a first step towards removing builds dependency to conda.

Currently we build magma as a conda package in a pytorch conda channel, implemented in a1b372dbda/magma.

This commit adapts the logic from pytorch/builder as follows:
- use pytorch/manylinux-cuda<cuda-version> as base image
- apply patches and invoke the build.sh script directly (not anymore through conda build)
- stores license and build files along with the built artifact, in an info subfolder
- create a tarball file which resembles that created by conda, without any conda-specific metadata

A new matrix workflow is added, which runs the build for each supported cuda version and uploads the binaries to the pytorch s3 bucket.

For the upload, define an upload.sh script, which will be used by the magma windows job as well, to upload to `s3://ossci-*` buckets.

The build runs on PR and push, upload runs in DRY_RUN mode in case of PR.

Fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139888
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/seemethere
2024-11-08 13:28:27 +00:00
5f287df422 Add type information for FakeProcessGroup (#133211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211
Approved by: https://github.com/Skylion007
2024-11-08 11:18:52 +00:00
e5574445b0 add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and requires the `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add a `supports_coalescing` property to `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501
2024-11-08 11:08:45 +00:00
0b7a2d4aef [Windows XPU] Fix MSVC ambiguous symbol error (#138727)
The PT master build with XPU will fail due to an MSVC ambiguous-symbol error on 'std'. It was previously fixed with an MSVC flag in torch-xpu-ops https://github.com/intel/torch-xpu-ops/pull/946/files, but the error is observed in PT master too after the 2.5 and oneAPI updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138727
Approved by: https://github.com/guangyey, https://github.com/ezyang
2024-11-08 08:29:36 +00:00
a3052b3b7c Inductor cpp wrapper: clean-up hard-coded schema and related code (#139873)
Fixes https://github.com/pytorch/pytorch/issues/112552.

The non-ABI-compatible mode has been removed, so the following values are not needed anymore:
`extern_call_ops`
`cpp_op_schema`
`cpp_kernel_key`
`cpp_kernel_overload_name`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139873
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-11-08 08:15:51 +00:00
d9def02050 [Inductor] record time for 'compile time' autotuning (#139431)
Here are the cases that Inductor does autotuning at compile time:
1. pad mm: benchmark to decide if we should pad or not
2. template autotuning: benchmark triton/cutlass templates and the ATen kernel for matmul/conv and pick the fastest one.

The PR annotates these cases with `dynamo_timed`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139431
Approved by: https://github.com/ezyang
2024-11-08 07:17:00 +00:00
011781f29d Assert that bundled triton payload does not have sentinel value (#139375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139375
Approved by: https://github.com/ezyang
2024-11-08 07:11:40 +00:00
ba9645f6e5 Fix for T206766523 ("Your diff, D65462767, broke some tests") (#139804)
Summary:
This is trying to fix a regression caused by https://github.com/pytorch/pytorch/pull/139757. We now don't want to log args and kwargs directly because, if they contain a tensor or tensor subclass, converting them to a string can take a long time or may not even be supported.

Reviewed By: fduwjj

Differential Revision: D65490202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139804
Approved by: https://github.com/XilunWu
2024-11-08 05:57:30 +00:00
d72a308e77 [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of the activation (a limitation of the oneDNN library), so in that case we set the activation scale to 1 and the bias to none, and apply the scales and add the bias after `onednn.qlinear_pointwise`.
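For reference, here is a minimal Python sketch of the unfused computation the no-bias pattern above corresponds to; the tensor names, shapes, and scale layout are illustrative assumptions rather than the exact Torchao code:

```python
import torch

def smoothquant_int8_linear_unfused(a_int8, b_int8, a_scale, b_scale):
    # a_int8: (batch, seq, in) int8 activation; b_int8: (in, out) int8 weight
    a_2d = a_int8.reshape(-1, a_int8.shape[-1])              # reshape
    c_int32 = torch._int_mm(a_2d, b_int8)                    # _int_mm
    c = c_int32.to(torch.float32)                            # convert_element_type
    c = c * b_scale                                          # mul(b_scale): per-channel weight scale
    c = c * a_scale.reshape(-1, 1)                           # mul(a_scale): per-row activation scale
    return c.reshape(*a_int8.shape[:-1], b_int8.shape[-1])   # reshape back
```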

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-08 05:33:16 +00:00
8715fb8aff [DTensor][unpickler] Add DTensor related classes to allowed globals so we can still torch.load(DTensor) with weights_only=True (#139949)
Test uses `torch.load()` for DTensor state_dict:
```
python3 test/distributed/fsdp/test_fsdp_dtensor_state_dict.py -k TestFSDPWithDeviceMeshAndDTensor
```

In this PR, we add `DTensor`-related classes to the allowed safe globals so we can still `torch.load()` a `DTensor` with `weights_only=True`. We also need this for backward compatibility, since a `DTensor` could be `torch.load()`-ed before `weights_only` defaulted to True. Without the change, `torch.load()` of a `DTensor` runs into the following error:
```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torch.distributed.tensor.DTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([DTensor])` or the `torch.serialization.safe_globals([DTensor])` context manager to allowlist this global if you trust this class/function.
```
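For reference, a minimal sketch of the allowlisting workaround the error message itself suggests (the checkpoint path is a placeholder):

```python
import torch
from torch.distributed.tensor import DTensor

# Option 1: allowlist DTensor globally for weights_only loads
torch.serialization.add_safe_globals([DTensor])
state_dict = torch.load("dtensor_checkpoint.pt", weights_only=True)

# Option 2: allowlist only within a scope
with torch.serialization.safe_globals([DTensor]):
    state_dict = torch.load("dtensor_checkpoint.pt", weights_only=True)
```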

The unit test failure was not captured by CI when `weights_only` was being rolled out as the default for `torch.load()`. This is due to another issue: the test communication wrapper `with_comms` lets unit tests silently pass without capturing failures, due to a recent change (https://github.com/pytorch/pytorch/pull/138108). This wrapper issue is going to be fixed by a separate PR, https://github.com/pytorch/pytorch/pull/139637.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139949
Approved by: https://github.com/mikaylagawarecki
2024-11-08 05:06:11 +00:00
b042606d91 Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)
Previously we were checking for a last dim with stride == 1. When the size is <= 1 that also is sufficient because the stride is insignificant. Fix for https://github.com/pytorch/pytorch/issues/138317
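A minimal sketch of the loosened constraint described above; the helper name and argument shapes are hypothetical, not the exact Inductor code:

```python
def last_dim_stride_ok(size, stride):
    # Hypothetical check: a stride-1 last dim is contiguous enough, and so is a
    # last dim of size 0 or 1, where the stride carries no information.
    return stride[-1] == 1 or size[-1] <= 1
```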

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139787
Approved by: https://github.com/drisspg
2024-11-08 04:54:05 +00:00
114a0bc306 Make PGO work correctly with NJT inputs (#140046)
We were actually triggering a latent bug where nested ints were
uselessly being incorporated into the automatic dynamic state, even
though they were unconditionally ignored afterwards.  Now we munge
them out before putting them in.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65623303](https://our.internmc.facebook.com/intern/diff/D65623303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140046
Approved by: https://github.com/jbschlosser, https://github.com/bdhirsh
ghstack dependencies: #140042
2024-11-08 04:27:39 +00:00
af682f3cd7 Move put_code_state to only trigger on successful compile (#140042)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65623081](https://our.internmc.facebook.com/intern/diff/D65623081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140042
Approved by: https://github.com/markkm
2024-11-08 04:19:50 +00:00
cyy
43f0fe60a3 [Environment Variable][5/N] Use thread-safe getenv functions (#139762)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139762
Approved by: https://github.com/ezyang
2024-11-08 03:49:09 +00:00
86792a5a8d [invoke_subgraph] User facing API to support arbitrary args and kwargs (#139162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139162
Approved by: https://github.com/zou3519
2024-11-08 03:31:19 +00:00
4715b77001 Create manylinux 2.28 cuda 12.6 image (#139909)
Add a version of the manylinux 2.28 image with cuda 12.6.

Once this is done, cuda 12.6 can be enable for the new magma non-conda distribution provided by https://github.com/pytorch/pytorch/pull/139888

Partially-fixes #139397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139909
Approved by: https://github.com/atalman
2024-11-08 03:03:04 +00:00
1fcc99c6bf Update quantization.rst (#139824)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139824
Approved by: https://github.com/svekars
2024-11-08 02:34:50 +00:00
347d134ee2 [BE] Delete DeprecatedTypeProperties cast (#139358)
Differential Revision: [D65549001](https://our.internmc.facebook.com/intern/diff/D65549001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
2024-11-07 18:28:29 -08:00
f0ffaa5e16 Revert "[inductor] fix test_linear_binary_dynamic_shapes_cpp_wrapper (#139942)"
This reverts commit 0618c7fe667a4ca3891d0699bfd7cf2e4964924b.

Reverted https://github.com/pytorch/pytorch/pull/139942 on behalf of https://github.com/huydhn due to Sorry for revert this, but I think we miss running the test and it is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/139942#issuecomment-2463599298))
2024-11-08 01:55:48 +00:00
cyy
da1e120dfd [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-11-08 01:48:00 +00:00
81d077cca2 Fix to modules.rst: indent line with activation functions (#139667)
At line 205, I believe the code `x = self.activations[act](x)` should be indented so that it is in the body of the for loop. Otherwise, applying the four linear modules has the same effect as applying a single linear module, in the sense that it is still just a linear map so there is no point in having four of them.  In other words, each layer of this network should have a nonlinearity.
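A sketch of the corrected loop; the module layout below is an illustrative stand-in, not the exact example from modules.rst:

```python
import torch.nn as nn

class DynamicNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linears = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
        self.activations = nn.ModuleDict({"relu": nn.ReLU(), "tanh": nn.Tanh()})

    def forward(self, x, act):
        for linear in self.linears:
            x = linear(x)
            x = self.activations[act](x)  # indented: a nonlinearity after every layer
        return x
```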

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139667
Approved by: https://github.com/malfet
2024-11-08 01:12:52 +00:00
103cbd7231 [MPS] Restrict MSELoss to floating types (#139960)
Because if invoked with a long type it crashes deep in the MPSGraph framework, and to keep parity with CPU.

Add a test that validates that if the dtype is not floating point, both the CPU and MPS implementations error out.
Fix the reported function name for `mse_loss_out_mps`, as `__func__` for any structured op implementation is `impl`.
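A small sketch of the behaviour the new test checks, assuming the post-fix error surface on both backends:

```python
import torch
import torch.nn.functional as F

for device in ["cpu"] + (["mps"] if torch.backends.mps.is_available() else []):
    x = torch.ones(4, dtype=torch.long, device=device)
    y = torch.zeros(4, dtype=torch.long, device=device)
    try:
        F.mse_loss(x, y)
    except RuntimeError:
        print(f"{device}: non-floating dtype rejected as expected")
```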

Fixes https://github.com/pytorch/pytorch/issues/139723
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139960
Approved by: https://github.com/kimishpatel
ghstack dependencies: #139961, #139959
2024-11-08 00:28:54 +00:00
1127c82592 Revert #137523: Add functionality to call dump function of NCCL profiler plugin (#139847)
Reverts PR https://github.com/pytorch/pytorch/pull/137523

Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).

2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
c19c384690/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L415-L427)
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139847
Approved by: https://github.com/c-p-i-o
2024-11-08 00:24:29 +00:00
cyy
bf1b8adee6 Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 23:58:18 +00:00
dbaa431dfb Put remote fx cache dynamo_timed definition in OSS location (#140016)
Summary: I'm refactoring dynamo_timed and updating the params. It will be much easier to do this refactor entirely in OSS. So this diff essentially provides a couple aliases in the OSS area that I can update without affecting the internal usage.

Test Plan: Ran locally and made sure I still got samples: https://fburl.com/scuba/dynamo_compile/sandbox/qub89lwj

Reviewed By: oulgen

Differential Revision: D65580302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140016
Approved by: https://github.com/oulgen
2024-11-07 23:51:48 +00:00
ae01f2b61b Extend CPU implementation of MSELoss to BF16 (#139959)
It's strange that it has not been implemented for the type yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959
Approved by: https://github.com/jgong5, https://github.com/janeyx99
ghstack dependencies: #139961
2024-11-07 23:50:15 +00:00
22dd17c7bb [doc] fixing missing colon in custom op doc (#140060)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140060
Approved by: https://github.com/malfet
2024-11-07 23:48:44 +00:00
c076001ed9 handle AttrProxy._modules when module is overwritten as None (#139957)
Fixes tracing through `mod._modules` access, when one of the submodules has been reset to None

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139957
Approved by: https://github.com/zhxchen17
2024-11-07 23:39:48 +00:00
2ee91db03d Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662)
Fixes https://github.com/pytorch/pytorch/issues/139467

Refactor `nn.utils.clip_grad_norm_` into `nn.utils.get_total_norm` and `nn.utils.clip_grads_with_norm_`. `clip_grad_norm_` now calls into these two new ops.

`get_total_norm` is the generalized form (rather than `get_grad_norm`), per the discussion on the issue from @awgu.
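A usage sketch of how the two new helpers compose into the old behaviour; the argument names follow the description above and may not match the final signatures exactly:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)
model(torch.randn(2, 10)).sum().backward()

grads = [p.grad for p in model.parameters() if p.grad is not None]

# Roughly equivalent to nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0),
# but with the norm computation and the in-place scaling separated.
total_norm = nn.utils.get_total_norm(grads, norm_type=2.0)
nn.utils.clip_grads_with_norm_(model.parameters(), max_norm=1.0, total_norm=total_norm)
```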

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139662
Approved by: https://github.com/H-Huang
2024-11-07 23:13:23 +00:00
09ba38c4b7 Add an opt-out label to runner determinator on PR (#140054)
My sales pitch: I need to ssh into the runner from time to time on my PR to debug issues, but it's well-known that LF runners don't support SSH login anymore. So, the proposed fix here is to introduce a new label called ~no-runner-determinator~ `no-runner-experiments` that can be attached to the PR. Whenever `.github/scripts/runner_determinator.py` runs on a PR and sees this label, it will not apply any logic and will just use an empty prefix.

### Testing

With the label:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number "139947"

INFO    : Opt-out runner determinator because #139947 has no-runner-determinator label
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::
```

Without the label:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number "139947"

INFO    : Based on rollout percentage of 95%, enabling experiment lf.
INFO    : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```

Running in trunk commit without a PR number will use the regular logic:

```
python3 runner_determinator.py \
    --github-token "MY_TOKEN" \
    --github-issue "5132" \
    --github-branch "install-torchao-torchtune-et" \
    --github-actor "huydhn" \
    --github-issue-owner "huydhn" \
    --github-ref-type "branch" \
    --github-repo "pytorch/pytorch" \
    --eligible-experiments "" \
    --pr-number ""

INFO    : Based on rollout percentage of 95%, enabling experiment lf.
INFO    : Skipping experiment 'awsa100', as it is not a default experiment
WARNING : No env var found for GITHUB_OUTPUT, you must be running this code locally. Falling back to the deprecated print method.
::set-output name=label-type::lf.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140054
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-11-07 22:55:27 +00:00
ba499c32cb [export] Disable AttrProxy when every submodule has a unique path. (#139918)
Summary:
In most cases, we don't need to turn on AttrProxy tracing for two reasons:
1. It's only needed when you have one submodule owning multiple FQNs.
2. AND it will cause models that rely on module identity to be traced incorrectly (because we substitute module objects at tracing time).

Overall, after offline discussion with some export folks, we think it's better to turn off AttrProxy if we can make sure every submodule has a unique FQN, which tends to be the common case.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r module_dict_key

Differential Revision: D65555919

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139918
Approved by: https://github.com/tugsbayasgalan
2024-11-07 22:43:14 +00:00
75f3056c81 [hop-db] Import invoke_subgraph to avoid Dynamo error on mac (#140038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140038
Approved by: https://github.com/ydwu4
2024-11-07 22:36:57 +00:00
0618c7fe66 [inductor] fix test_linear_binary_dynamic_shapes_cpp_wrapper (#139942)
I recently added a new pattern here https://github.com/pytorch/pytorch/pull/139136 to remove pointless view/permute pairs. In that PR, I already updated the matched pattern/node count in `test_linear_binary` to account for the new pattern. But it looks like with cpp wrapper, one more pattern will be matched.

```
7 patterns without cpp-wrapper:

========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f6d25c67b50, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f6d25c67aa0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f6d176e5dc0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f6d176e5210, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f6d176d3ec0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======

8 patterns with cpp wrapper:
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view_pair at 0x7f8e78bf0870, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 581> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object pointless_view at 0x7f8e78bf07c0, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/joint_graph.py", line 568> =======
========== pattern matched <code object linear at 0x7f8e59c04190, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 1121> =======
========== pattern matched <code object reshape_linear_reshape_pattern at 0x7f8e59dfb520, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 732> =======
========== pattern matched <code object fn at 0x7f8e59dfa290, file "/home/shunting/ws/pytorch/torch/_inductor/fx_passes/mkldnn_fusion.py", line 476> =======
```

I fixed this test by adding 1 to the expected number when cpp wrapper is enabled. But fundamentally, can we avoid asserting on the total number of patterns matched in the test? That makes the test very fragile: people adding new patterns may keep breaking these unrelated tests. One possible improvement is to keep a counter for each specific pattern and, in the tests, check the match count for only the ***RELEVANT*** patterns instead of the total number of patterns matched. That should reduce false positives for broken tests. cc possible test creator @jgong5
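A rough sketch of the per-pattern counting idea suggested above; the counter plumbing is hypothetical and is not what the test currently does:

```python
from collections import Counter

# Hypothetical: record each match by pattern name instead of one global total.
matched_patterns = Counter()

def on_pattern_matched(pattern_name: str) -> None:
    matched_patterns[pattern_name] += 1

# A test then asserts only on the patterns it actually exercises, so unrelated
# new patterns cannot break it.
def check_relevant_patterns() -> None:
    assert matched_patterns["reshape_linear_reshape_pattern"] == 1
    assert matched_patterns["linear"] == 1
```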

Fixes https://github.com/pytorch/pytorch/issues/139812 (we need to have this to run this disabled test on your PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139942
Approved by: https://github.com/huydhn, https://github.com/jgong5
2024-11-07 22:34:25 +00:00
68f1b52d8a Revert "Turn static inline into static function (#139843)"
This reverts commit 72d3f5b26d90396f7a357fa3e5d82656ca74c102.

Reverted https://github.com/pytorch/pytorch/pull/139843 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing tests to fail on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11729669425/job/32675829894) [HUD commit link](72d3f5b26d) ([comment](https://github.com/pytorch/pytorch/pull/139843#issuecomment-2463354131))
2024-11-07 22:29:45 +00:00
d1a45800a3 refresh numbers after accepted less than noise regression (#140029)
https://github.com/pytorch/pytorch/pull/138363 regressed some benchmarks, but by less than the noise level; updating the values to avoid flakiness.
<img width="803" alt="Screenshot 2024-11-07 at 10 31 29 AM" src="https://github.com/user-attachments/assets/31326452-a6ad-44b8-b324-25e953355fcf">

PASS: benchmark ('add_loop_eager', 'compile_time_instruction_count') pass, actual result 3073605220 +1.21% is within expected 3037000000 ±1.50%

PASS: benchmark ('add_loop_eager_dynamic', 'compile_time_instruction_count') pass, actual result 5700849667 +1.37% is within expected 5624000000 ±2.50%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140029
Approved by: https://github.com/bobrenjc93
2024-11-07 22:27:00 +00:00
83e36a6bfa AOTI Minifier (#139351)
See documentation at https://docs-preview.pytorch.org/pytorch/pytorch/139351/torch.compiler_aot_inductor_minifier.html.

Add a minifier for AOTI.

Test Plan:
python test/inductor/test_minifier.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139351
Approved by: https://github.com/desertfire
2024-11-07 21:43:44 +00:00
8d070d23d6 [ROCm] Tune flex-attention and decode to num_stages=1 (#139883)
Fixes #139755 #139621

The new stream pipeliner in the AMD triton backend enables num_stages to function equivalently to the NV backend. This upgrade in triton 3.2 would cause OOM issues in flex attention due to the num_stages=3 setting; we have tuned this to num_stages=1, which is the best setting for flash attention kernels and avoids the shmem issues.

We will follow up this PR with some config tuning on AMD backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139883
Approved by: https://github.com/bertmaher
2024-11-07 21:16:52 +00:00
36e0f119d0 Revert "[experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)"
This reverts commit 5203138483e97141ad96a8906f1c6f8b7ff8adc6.

Reverted https://github.com/pytorch/pytorch/pull/139227 on behalf of https://github.com/yifuwang due to Need to address internal build failure D65605027 ([comment](https://github.com/pytorch/pytorch/pull/139227#issuecomment-2463204467))
2024-11-07 21:01:36 +00:00
d378819068 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-07 20:54:39 +00:00
b5286ba207 Small fix to Python rendering in documentation. (#138281)
The text was being rendered as normal text but I believe was meant to be code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138281
Approved by: https://github.com/janeyx99
2024-11-07 20:48:47 +00:00
d8afa21ef2 specialize symfloats for wrapped_gradient in get_fake_value (#139935)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_gradient_type_promotion_cpu` when `specialize_float=False`

Reviewers might wonder why we need to have this whitelist. Can't we rely on python_arg_parser.h to do the specialization generically? Alas this path doesn't actually FFI to C++ so we do need to do the specialization in pythonland.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139935
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454, #139896
2024-11-07 20:27:02 +00:00
bdeca2a24f [BE] Remove warn about using Half on CPUs (#139961)
Was added by https://github.com/pytorch/pytorch/pull/33021, but modern CPUs right now are quite capable of handling half precision types.
Alternatively one can guard the warning with `#ifdef x86_64`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139961
Approved by: https://github.com/jgong5
2024-11-07 20:23:42 +00:00
df136df8d5 Remove upload_test_stat_aggregates script (#139915)
Instead of moving these queries to ClickHouse, we're just going to remove it since it's not really used.  We do want something for test aggregates, but we can make a new script instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139915
Approved by: https://github.com/huydhn
2024-11-07 20:14:12 +00:00
cyy
83fa1014f1 [3/N] Replace c10::sv with std::sv (#139861)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139861
Approved by: https://github.com/ezyang
2024-11-07 20:03:57 +00:00
85204d0081 Don't wrap inf values as symfloat (#139896)
Fixes `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCPU.test_comprehensive_linalg_norm_cpu_float16` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139896
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846, #139454
2024-11-07 20:03:54 +00:00
cyy
9d09af981b Wrap torch_python with torch_compile_options (#136743)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136743
Approved by: https://github.com/ezyang
2024-11-07 19:36:40 +00:00
d0da40a8b9 [PT2][Optimus] fix the default alpha and beta values (#139857)
Summary:
We noticed that the default coefficient values for beta and alpha should be int 1 instead of float 1.0; the float defaults cause an error when the inputs to the add are int types.
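A small repro of why the float defaults matter for integer inputs; `torch.addmm` is used here as an assumed stand-in for the fused add+mm the pass generates:

```python
import torch

bias = torch.ones(2, 2, dtype=torch.int64)
a = torch.ones(2, 3, dtype=torch.int64)
b = torch.ones(3, 2, dtype=torch.int64)

torch.addmm(bias, a, b, beta=1, alpha=1)  # ok: integral coefficients

try:
    torch.addmm(bias, a, b, beta=1.0, alpha=1.0)
except RuntimeError as e:
    print("float coefficients rejected for int inputs:", e)
```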

More context:

https://fb.workplace.com/groups/1075192433118967/permalink/1539142760057263/

Test Plan:
# local reproduce
```
buck2 run mode/opt scripts/shuaiyang:test -- --optimus --flow_id 660724017 2>&1 | tee ~/local_run_shuai_660724017.txt
```

trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-11-05-21-18-17/trace.json.gz&bucket=gpu_traces

# E2E

before fix:
f660724017

after fix:

Differential Revision: D65521638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139857
Approved by: https://github.com/jackiexu1992
2024-11-07 19:12:23 +00:00
cyy
72d3f5b26d Turn static inline into static function (#139843)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139843
Approved by: https://github.com/ezyang
2024-11-07 19:08:41 +00:00
f5147e989c [dynamo] prefix some eval_frame.c functions with dynamo_ (#139921)
Fix https://github.com/pytorch/pytorch/issues/137994. I didn't prefix every function, but the ones that are on the hotpath.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139921
Approved by: https://github.com/ezyang
2024-11-07 19:07:23 +00:00
071d48c56e Add output_node util function to fx.Graph (#139770)
Summary: A util function for access output node for FX graph

Test Plan: OSS CI

Differential Revision: D65486457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139770
Approved by: https://github.com/ezyang, https://github.com/Chillee
2024-11-07 18:54:59 +00:00
ee54dfb64d [Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.

This way, the channels-last CK instances will be added to benchmark choices in max autotune

# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
2024-11-07 18:37:39 +00:00
edbf57b336 [pipelining] remove extra variables (#139817)
Cleaning up counters / extra variables not needed after https://github.com/pytorch/pytorch/pull/139415 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139817
Approved by: https://github.com/wconstab
2024-11-07 18:32:20 +00:00
8f4b29810b Fix aarch64 wheel builds (#140020)
Shell script still referencing builder checkout rather than PyTorch, which results in
```
python /builder/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
python: can't open file '/builder/aarch64_linux/aarch64_wheel_ci_build.py': [Errno 2] No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140020
Approved by: https://github.com/atalman
2024-11-07 18:24:34 +00:00
eabef5000f [user triton] reset kernel_side_table before test_tma_capture_and_functionalize (#139907)
The test was failing when I ran the whole test suite. I'm guessing that the exact indices would previously depend on the order that tests would run; by resetting the kernel_side_table we should hopefully get results that are reproducible independent of the test execution order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139907
Approved by: https://github.com/oulgen, https://github.com/aakhundov
2024-11-07 17:56:53 +00:00
cyy
ca7fdfe4d2 [Reland] Use static_assert to detect get_type_index used in device code (#139966)
#139173 was reverted due to an internal build break of using get_type_index in device code. This PR is created for ease of importing into META to further investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139966
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-07 17:36:47 +00:00
e474f0de82 [PGNCCL] Slimming watchdog loop (#139834)
- Refactored traceback code into `work.printTraceback()`.  cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @shuqiangzhang
- Refactored desync debug code into `class DesyncDebugger`.
- Moved occurrences of `futureWorkResult_->markCompleted` into `checkAndSetException` and `checkTimeout`, respectively. cc @shuqiangzhang
- Modularized dump signal broadcast code into `ProcessGroupNCCL::broadcastDumpSignal`. cc @fduwjj @c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139834
Approved by: https://github.com/shuqiangzhang
2024-11-07 17:22:44 +00:00
a60bc051e3 Revert "Fix the use of fsspec transactions (#135541)"
This reverts commit 59cf4bc5ae64aea2c6a9b870243821695adfc30b.

Reverted https://github.com/pytorch/pytorch/pull/135541 on behalf of https://github.com/ZainRizvi due to Breaking internally. See D65551490 ([comment](https://github.com/pytorch/pytorch/pull/135541#issuecomment-2462774239))
2024-11-07 17:03:37 +00:00
7e02386303 Revert "[2/N] Replace c10::sv with std::sv (#139456)"
This reverts commit 028c5d3426743673edbbe6e11a491d76f1402f7c.

Reverted https://github.com/pytorch/pytorch/pull/139456 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. @ezyang can you please help get this landed? See D65546398 for more details ([comment](https://github.com/pytorch/pytorch/pull/139456#issuecomment-2462768891))
2024-11-07 17:00:59 +00:00
781c68c865 [aotd] coerce_same_metadata_as_tangent with expected_type for e.g.AsyncCollectiveTensor (#139095)
Based on discussion here: https://github.com/pytorch/pytorch/pull/138731

Introducing the ability for a subclass to implement type conversion to expected_type.
```
    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
```
Here, `expected_type=None` means the subclass's own class is expected.

E.g. for `DTensor` we may find an `AsyncCollectiveTensor` tangent where we expected a `Tensor`; in this case the coercion will be called at runtime with `expected_type=Tensor`.

Adding an implementation to AsyncCollectiveTensor that just triggers `wait()`.
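A minimal, illustrative sketch of what such a hook could look like on a stand-in subclass; this is not the actual AsyncCollectiveTensor implementation:

```python
from typing import Any, Optional, Type

import torch

class MyAsyncTensor(torch.Tensor):
    # Stand-in for a subclass whose value becomes available after a wait().
    def wait(self) -> torch.Tensor:
        # Placeholder for "finish the collective": just unwrap to a plain Tensor.
        return self.as_subclass(torch.Tensor)

    def __coerce_same_metadata_as_tangent__(
        self, expected_metadata: Any, expected_type: Optional[Type] = None
    ):
        if expected_type is torch.Tensor:
            # A plain Tensor tangent is expected: resolve the async value.
            return self.wait()
        # expected_type=None means the same subclass is expected; nothing to do.
        return self
```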

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139095
Approved by: https://github.com/bdhirsh
2024-11-07 16:24:48 +00:00
8d3d47e439 Trigger symfloat specialization in argument binding code (#139454)
Fixes the test `python test/inductor/test_torchinductor.py CpuTests.test_upsample_cat_conv_cpu` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139454
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572, #139846
2024-11-07 16:10:23 +00:00
c35a01173b Remove compile event logging for automatic dynamic (#139891)
Summary: These events are a pretty large portion of the table, but not really currently used. Only log to tlparse for now.

Test Plan: Unit tests

Differential Revision: D65539986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139891
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-07 14:52:10 +00:00
81ecf98d23 Pass all arguments when quantizing embedding bag from float (#137697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137697
Approved by: https://github.com/snadampal, https://github.com/jerryzh168
2024-11-07 09:53:49 +00:00
314aa268ce In AMX GEMM micro-kernel, use same dtype for A & B only if B is dequantized (#139906)
@frost-intel discovered that some Inductor auto-tuning UTs for CPU are currently broken on machines supporting AMX ISA. That's because in #136688, I had reverted a change in the AMX GEMM micro-kernel that was introduced in #131887, but it looks like some other implementations introduced after the aforementioned change rely upon it, so it should not have been reverted.

Added a fix.

Ideally, a CI machine that supports AMX should cover these UTs (test/inductor/test_cpu_select_algorithm.py). We do have at least one CI machine that supports AMX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139906
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-11-07 09:18:59 +00:00
a4e7b8001c refuse to generate a symbolic variable if a float input is inf (#139846)
Fixes `PYTORCH_TEST_WITH_INDUCTOR=1 tlp python test/test_torch.py TestTorchDeviceTypeCPU.test_cauchy_cpu_float64` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139846
Approved by: https://github.com/ruidazeng, https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568, #139572
2024-11-07 09:16:55 +00:00
c4a323ed05 [Inductor] Generalize device-bias code newly introduced in scheduler.py (#139872)
[Inductor] Generalize device-bias code newly introduced in scheduler.py to align the Inductor behavior for xpu with cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139872
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
ghstack dependencies: #139705
2024-11-07 07:10:28 +00:00
320374b011 [Inductor] Refine triton_bundler.py to support correctly on Intel GPU and fix CI failures. (#139705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139705
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/guangyey
2024-11-07 07:10:28 +00:00
3caf56d97a Mark full_like as core ATen (#139937)
Fixes #139617

As titled. For ExecuTorch `full_like` is implemented so this should be fine: https://github.com/pytorch/executorch/blob/main/kernels/portable/cpu/op_full.cpp

Also there are decompositions for ops such as `fill.Scalar` that gives `full_like`: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L164

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139937
Approved by: https://github.com/tugsbayasgalan
2024-11-07 07:08:18 +00:00
c03324de2d Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-11-07 06:28:47 +00:00
ca30704f0b [Inductor][ROCm][CK] Add standalone runner (#139441)
Generate standalone executable to debug and profile CK gemm instances

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139441
Approved by: https://github.com/ColinPeppler
2024-11-07 06:21:27 +00:00
d36fdaf157 Openreg: Support stream (#136991)
Support stream. When the driver communicates with the executor, it will send the stream id corresponding to the execution command; when the executor receives the command with the stream id, it will ignore the stream id because the cpu backend doesn't support asynchronous execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136991
Approved by: https://github.com/ezyang
2024-11-07 06:09:07 +00:00
ab42967238 [hop free symbols] lift free symbols in example_value when create_graph_input (#138363)
There are several parts in this PR (they are hard to break into smaller pieces because they're highly coupled):
1. **Whenever we call create_graph_input, we try to bind the symbols in the graph input.**
Since we've enforced the invariant that all create_graph_input calls must provide an example value, we can intercept at the create_graph_input calls (this PR only handles free symbols in tensors).
2. **We cache the bound_symbols** to avoid lifting the same symbol repeatedly.
3. For lifted symbols, we re-use **lifted_freevars**, i.e. the mapping from a symbol's proxy in the parent graph to the lifted placeholders in the current subgraph, which is how we already handle lifted tensors. In this way, all hops that support lifted tensors should be able to handle lifted symints automatically (at least in the dynamo part).
4. For **unbacked symbols** created during tracing, we also need to bind these symbols to their proxies. This is to support the test cases where we want to lift unbacked symbols as inputs. We need the proxy of the unbacked symbol in the parent graph in order to properly create the args to the hop.
5. We update all the tests after free symbols are lifted in subgraphs, and also support the lifted symbols in existing higher order ops.

**The interaction of nested tracers:**
The previous design for lifting tensor closures is: suppose we're in nested tracers; whenever we see a new proxy that's not created by the current tracer, we recursively look for the proxy in the parent tracer until we find the tracer that created this proxy (either a placeholder or some intermediate result). More detail is in Note [Nested SubgraphTracer and free_variable handling].

Given the above design, the plan for lifting the free symbols is: whenever we lift a free tensor to be an input of the current subgraph, we'll look at the symbols in it and bind them at the same time.

For example, suppose we have the following function:
```python
def f(x: [s1, s2]):
  def true_f():
    def true_f_inner():
      return x.sin()
```
what will happen in time order:

1. we create a subtracer 1 and start to speculate the outer cond's true_f
2. we create another subtracer 2 and start to speculate the inner cond's true_f_inner.
3. dynamo realizes the tensor input x by calling wrap_tensor at the top level to create graph input x (tracer 0); we bind the symbols s1, s2 after the ph for x is created. So the graph now looks like:
```python
def gm(s1, s2, x):
```
4. when seeing TensorVariable.call_method of x, tracer 2 wants to create a call_function(sin, proxy_of_x), but it finds that proxy_of_x was not created by the current tracer. So it recursively looks up its parent tracer 1, finds that tracer 1 also doesn't track proxy_of_x, and then finds the root tracer 0, which is the creator and tracks it as a ph. Then tracer 1 calls create_graph_input to lift the closure to its input ph1 and adds the (proxy_of_x: ph1) k-v pair to the **lifted_freevars** of tracer 1.
Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(x):
```
5. Since there are free symbols inside this new tensor input, tracer 1 also binds the symbols (maybe_bind_symbol), which calls create_graph_input for s1 and s2. Now the graph looks like
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
```
6. then it goes back to tracer 2, calls create_graph_input for x to get ph2, and tracer 2's **lifted_freevars** records (ph1, ph2); tracer 2 also binds the symbols in this new tensor input. Now the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1, s2, x):
```
7. Finally the sin call_function node is created by tracer 2.

**This PR also handles the following cases:**
- What if we lift two tensors that share the same symbol? e.g. x1 [s1, s2], x2 [s2, s3]. Each subtracer maintains bound_symbols as a cache that maps a symbol's expr to its proxy in the current tracer. So when we see x1, we'll track s1 and s2 as inputs and bind s1 to ph1 and s2 to ph2. When we then try to bind the symbols of x2, s2 will already be tracked, so no graph input is created.
- what if a subgraph closes over a symint? e.g.
```python
def f(x):
  def true_f():
    c = x.size(0)
   def true_fn_inner():
     return c
```
When we speculate true_fn_inner, we find proxy_of_c is not tracked by tracer 2, so it recursively looks up its parent. At this point, x and its symbols have been lifted as inputs of true_f (as a result of lifting x while tracing true_f in tracer 1). Specifically the graph looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner():
```
So tracer 2 is able to find that s1 has been tracked as a ph in tracer 1, so it returns back to gm and calls create_graph_input on s1. The graph now looks like:
```python
def gm(s1, s2, x):
  def true_gm(s1, s2, x):
    def true_gm_inner(s1):
     return s1
```

- What if a subgraph closes over an unbacked symint? e.g.
```python
def f(x):
  def true_f():
    c =  x.item()
    def true_f_inner():
      return c
```
When x.item() is called, proxy_of_c and its symnode variable are created for tracer 1, and we also call track_unbacked_symbols to record this relationship. So when tracer 2 finds proxy_of_c is not created by the current tracer, it recursively looks up its parent tracer and finds that the expression u0 has been tracked as a result of track_unbacked_symbol in tracer 1. So it stops the recursion and calls create_graph_input for u0 in tracer 2. The graph looks like:
```python
def f(x):
  def true_f(s1, s2, x):
    c = x.item()
    def true_gm_inner(u0):
      return u0
    cond(pred, true_gm_inner, false_gm_inner, (c,))
```

- what if a subgraph closes over a tensor with unbacked symint shape?
```python
def f(x):
  def true_f():
    c = x.item()
    r = torch.randn((c,))
    def true_f_inner():
      return r + 1
```
This is the same as the case of closing over tensors with backed shapes, where we first lift r, then bind u0 in it; this recursively calls bind_symint for u0 in its parent and finds that u0 is tracked in the parent tracer as a result of the .item() call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138363
Approved by: https://github.com/zou3519
2024-11-07 04:44:32 +00:00
3368f3ad41 [ONNX] Update TorchTensor implementation to handle fake mode (#139534)
Update TorchTensor implementation to handle fake mode better. Specifically, we disable fake mode before calling detach() etc. when getting the weights if it is already a real tensor so we do not lose it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139534
Approved by: https://github.com/fatcat-z, https://github.com/titaiwangms
2024-11-07 04:36:24 +00:00
2037ea3e15 Add type annotations to Configs (#139833)
Summary:
Adds types to Configs, and fixes a bug in options that was caused by the lack of types.

fixes: https://github.com/pytorch/pytorch/issues/139822

Configs are used by many modules so not sure which label to put.

Types also allow https://github.com/pytorch/pytorch/pull/139736 to fuzz configs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139833
Approved by: https://github.com/c00w
2024-11-07 03:49:09 +00:00
5203138483 [experimental] async-tp impl with cutlass-based, progress aware kernel (#139227)
This PR introduces the following:

### torch.ops.symm_mem._async_input_mm

`_async_input_mm(Tensor a, Tensor b, Tensor a_chunk_signals, int a_chunk_pivot) -> Tensor`

An mm impl that supports consuming asynchronous input. It guarantees the following rasterization order, and that the corresponding signal arrives before an input chunk is consumed.
```
num_chunks = a_chunks_signals.numel()
for chunk_idx in range(a_chunk_pivot, num_chunks + a_chunk_pivot):
    chunk_idx = chunk_idx % num_chunks
    wait_signal(a_chunk_signals, chunk_idx)
    # Compute output tiles that consumes the input chunk
```

### PersistentAsyncInputScheduler

This is a forked version of PersistentScheduler that supports consuming asynchronous input. This tile scheduler introduces the following arguments:

- `tiles_per_chunk_m` – Specifies the size of an M chunk. Chunks are the granularity at which the asynchronous input becomes ready. It must be an integer multiple of the size of an M tile.
- `chunk_signals` – `chunk_signals[i] == 1` indicates that chunk i is ready. Before returning a work tile, get_current_work() waits for the signal to ensure that the corresponding chunk is ready.
- `tile_idx_pivot_m` – After applying swizzling, apply `pivot(m) => (m + tile_idx_pivot_m) % tiles_m` to `m`. In a distributed setting, this allows different ranks to process different m indices at the same time, thus avoiding communication hotspots.

Note that this scheduler currently only supports the `KernelTmaWarpSpecializedCooperative` kernel schedule. This is enforced via the template argument `KernelSchedule`.

Usage:
```
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
   Shape<int, int, int, int>,
   CollectiveMainloop,
   CollectiveEpilogue,
   cutlass::gemm::PersistentAsyncInputScheduler<KernelSchedule>>;
```

### _fused_all_gather_matmul_native
An ag-mm impl that combines `torch.ops.symm_mem._async_input_mm` and progress-aware all-gather. This is not yet enabled via the async-tp passes. We will use it as a backend to optimize the current decomposition-based async-tp impl.

## Benchmarks

### 4096x3584x8192
- cublas + nccl: 539us
- decomp-based async-tp w/o cuda graph: 694us
- decomp-based async-tp w/ cuda graph: 478us
- new cutlass kernel: 408us

<img width="478" alt="image" src="https://github.com/user-attachments/assets/39f316ab-36c5-4b41-af77-07854a385dfc">

### 2048x3584x8192
- cublas + nccl: 301us
- decomp-based async-tp w/o cuda graph: 687us
- decomp-based async-tp w/ cuda graph: 356us
- new cutlass kernel: 276us

<img width="441" alt="image" src="https://github.com/user-attachments/assets/9e23ce21-863b-43dd-a562-fb05d3a5a144">

## Next Steps
- Add tuning logic
- Use `_fused_all_gather_matmul_native` as a backend for the decomp-based async-tp impl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139227
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-11-07 03:43:12 +00:00
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
604e353cae Revert "Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)"
This reverts commit 060bee7f22a6ff5c14562713dc4bb6aa74923469.

Reverted https://github.com/pytorch/pytorch/pull/139787 on behalf of https://github.com/huydhn due to Sorry for reverting this, but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/139787#issuecomment-2461234683))
2024-11-07 03:17:16 +00:00
f459c3095f [dynamo] Document codegen and clean up some code paths (#139670)
This patch
1. Adds documentation to `PyCodegen.__call__`, `PyCodegen.tempvars` and
   the `allow_cache` flag.
2. Merges a few existing code paths in `PyCodegen.__call__`.
3. removes the `elif var in cg.tempvars` code path in
   `codegen_save_tempvars`, because it's no longer needed after #113725,
   as we have up-to-date `VariableTracker.source` now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139670
Approved by: https://github.com/jansel
ghstack dependencies: #139538
2024-11-07 03:14:16 +00:00
183b386cb2 [dynamo] Simplify Codegen for variables with MutableSideEffects (#139538)
This effectively undoes #115095, which is not longer be needed after #113725.

Why did we need #115095? I went back in history and found that [this line](https://github.com/pytorch/pytorch/pull/113725/files#diff-0bb1756725c4426408938314b0c9d3988ae5bf49994892d7038ad7746e209e9fR86)
actually fixed what #115095 fixed. Specifically, without the
`allow_cache` check for the "dup_top" optimization, we could incorrectly
codegen based on source, despite `codegen_update_mutated` requested to
codegen from value, for updates to pre-existing lists, etc. Since #113725 added
the `allow_cache` check, we no longer need the `mutable_side_effects_from_source`
code path from #115095.

However, #115442 introduced a `value_from_source` flag which didn't
account for the `mutable_side_effects_from_source` branch. So this patch
adds an extra check to keep existing behavior for export, and leaves a
TODO for investigating what exactly export wants from codegen, when it
comes to side effects and sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139538
Approved by: https://github.com/jansel
2024-11-07 03:14:16 +00:00
cf0bb6c435 [cpu] Modify inductor opt flag --- ftree-loop-vectorize (#136827)
Reopen https://github.com/pytorch/pytorch/pull/121782, as more optimizations have landed.

Fixes https://github.com/pytorch/pytorch/issues/115261, https://github.com/pytorch/pytorch/issues/113017.
For CPU inductor path, remove -ftree-loop-vectorize from optimization flags to fix functional issues.

### Validation on 3 benchmark suites

#### FP32
![image](https://github.com/user-attachments/assets/ec920928-fa36-467f-ba07-d2c05c51b92e)

Outlier models (speedup<0.8, single socket): None.

#### BF16
![image](https://github.com/user-attachments/assets/4a301e5e-147d-4b74-beb1-40290969ed80)

Outlier models (speedup<0.8, single socket multi threads):

- functorch_dp_cifar10 0.58
- opacus_cifar10 0.57

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136827
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-11-07 02:49:52 +00:00
617b4538f1 Support symbolic builtin round in export (#139549)
Differential Revision: D65380866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139549
Approved by: https://github.com/digantdesai, https://github.com/angelayi
2024-11-07 02:49:44 +00:00
FEI
54e680151b Optimize peak memory for flash _scaled_dot_product_attention_math (#139612) (#139613)
Fixes #139612

@drisspg @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139613
Approved by: https://github.com/drisspg
2024-11-07 02:25:39 +00:00
2b400236c2 [DCP] Cross-link DCP doc to tutorials (#139776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139776
Approved by: https://github.com/mhorowitz, https://github.com/LucasLLC, https://github.com/fduwjj
ghstack dependencies: #139938
2024-11-07 02:19:49 +00:00
b51b7e28ee Add DCP doc to DCP merge-rules (#139938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139938
Approved by: https://github.com/LucasLLC, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-11-07 02:19:49 +00:00
4e647871d6 Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786
Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305
ghstack dependencies: #139716
2024-11-07 01:58:05 +00:00
47446cb5f3 [fr][c10d] move logger out from utils.py (#139806)
Summary:
Move flight recorder logger class out from utils.py into its own file.
This makes the program more modular.
This is mostly a refactoring/non-functional change.

Test Plan:
Build fr_trace locally and ran it.
```
buck build //caffe2/fb/flight_recorder:fr_trace
Buck UI: https://www.internalfb.com/buck2/875ca6a3-e86e-4263-95a0-579502494c5c
Network: Up: 0B  Down: 0B
Jobs completed: 6818. Time elapsed: 0.2s.
BUILD SUCCEEDED
```
Ran it as follows:
```
cd buck-out/v2/gen/fbcode/caffe2/fb/flight_recorder

./fr_trace.par  -p trace_ /tmp
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

Differential Revision: D65503768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139806
Approved by: https://github.com/fduwjj
2024-11-07 01:44:12 +00:00
d0ffd6d142 [AOTI] Add data_ptr to RAIIAtenTensorHandle (#139895)
Summary: To increase the readability of the generated code. This is not BC-breaking, because RAIIAtenTensorHandle is implemented as header-only.

Differential Revision: [D65547216](https://our.internmc.facebook.com/intern/diff/D65547216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139895
Approved by: https://github.com/chenyang78
2024-11-07 01:36:28 +00:00
4ddf015e7d [ONNX export] exporting model to onnx error when tensor.index_fill ops met dim=0 #139594 (#139596)
When the index_fill op's param dim==0, there is no need to unsqueeze the index tensor's dimension, so we return the index tensor directly if the size of axes_i == 0.

Fixes #139594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139596
Approved by: https://github.com/justinchuby
2024-11-07 01:32:34 +00:00
bd5a2c2c71 [AOTI] Simplify the return code (#139889)
Summary:
```
    if constexpr (std::is_same_v<std::decay_t<decltype(buf3)>,RAIIAtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,AtenTensorHandle> || std::is_same_v<std::decay_t<decltype(buf3)>,ConstantHandle>) {
        output_handles[0] = buf3.release();
    } else {
        thread_local ThreadLocalCachedOutputTensor<std::decay_t<decltype(buf3)>> cached_output_0(buf3);
        cached_output_0.copy_data_from(buf3);
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&output_handles[0]));
        AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_assign_tensors(cached_output_0.tensor(), output_handles[0]));
    }
```
->
```
 output_handles[0] = buf3.release();
```

Test Plan: CI

Differential Revision: D65460719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139889
Approved by: https://github.com/chenyang78
2024-11-07 01:28:43 +00:00
6fcef86cfa [inductor] fix the unligned variable ranges issue in fuse node (#138568)
Fixes #138550.

### Description
When fusing two nodes, the node with fewer variables (`node_to_recomp`) aligns its variable ranges with the other node (`ref_node`). In detail, `node_to_recomp` changes its variable ranges to the original ranges of `ref_node`. However, if both nodes have changed their ranges, i.e., the simplified variable ranges differ from the original ones, the issue comes up.

### Solution
For the case where the `ref_node` also changes its variable ranges, we recompute the size and body for it, to ensure the nodes are simplified to the same size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138568
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-11-07 01:17:58 +00:00
ed0e63e938 Add NHWC support for group normalization (#126635)
Fixes #111824

Currently, if the user specifies their group normalization input to be in NHWC format, pytorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves, which is not intuitive. This PR adds support for NHWC on cuda by adding the necessary kernels.
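A usage sketch of the channels-last path this enables; with the new kernels the output is expected to stay in channels_last rather than being silently converted (assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():
    gn = nn.GroupNorm(num_groups=8, num_channels=32).cuda()
    x = torch.randn(8, 32, 64, 64, device="cuda").to(memory_format=torch.channels_last)
    out = gn(x)
    print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True
```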

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635
Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki
2024-11-07 01:12:08 +00:00
59ec011855 [numerical debugger] bumped up the starting handler id (#139666)
Differential Revision: D65445250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139666
Approved by: https://github.com/tarun292, https://github.com/dulinriley
2024-11-07 01:00:43 +00:00
e675c6702d justknobs: Remove JustKnobsConfig and justknobs_feature (#138767)
This never ended up getting used, and instead we're doing this
resolution within the configuration system.

Removing these unused internal features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138767
Approved by: https://github.com/ezyang
ghstack dependencies: #138766, #138956
2024-11-07 00:21:46 +00:00
52446d7f30 Revert D65290089 (#139893)
Summary:
This diff reverts D65290089
This change introduces more logging than I realized and could present problems for tlparse.

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D65541060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139893
Approved by: https://github.com/jamesjwu
2024-11-07 00:10:09 +00:00
ac5fa26e07 [dynamo][weakref] Support weakref.ref call (#139914)
Should fix - https://github.com/pytorch/pytorch/pull/135001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139914
Approved by: https://github.com/jansel
ghstack dependencies: #139856
2024-11-06 23:16:41 +00:00
738bfff5f9 [dynamo][user-defined] Fix bugs with method descriptors (#139856)
Should fix some problems in https://github.com/pytorch/pytorch/pull/138080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139856
Approved by: https://github.com/jansel
2024-11-06 23:16:40 +00:00
ed16f28f02 Fix ExecuTorch CI after landing #6564 (#139700)
After landing https://github.com/pytorch/executorch/pull/6564, we need to update the pinned ExecuTorch commit on PyTorch to fix the regression on the PyTorch side.  The change to `.ci/docker/common/install_executorch.sh` is needed because that is how the dependencies are set up on ExecuTorch CI now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139700
Approved by: https://github.com/larryliu0820, https://github.com/malfet
2024-11-06 23:04:35 +00:00
060bee7f22 Loosen last dim contiguity for sdpa constraint to include last dim 0,1 (#139787)
Previously we were checking for a last dim with stride == 1. When the size is <= 1 that is also sufficient, because the stride is insignificant. Fix for https://github.com/pytorch/pytorch/issues/138317
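A small illustration of why a stride other than 1 is harmless when the last dim has size <= 1 (shapes are illustrative):

```python
import torch

x = torch.randn(2, 1, 8)     # e.g. (batch, seq, head_dim)
v = x.transpose(-1, -2)      # shape (2, 8, 1); last-dim stride is now 8, not 1
print(v.shape, v.stride())   # torch.Size([2, 8, 1]) (8, 1, 8)
# With only one element along the last dim, that stride never changes which
# element is addressed, so requiring stride == 1 there is unnecessarily strict.
```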

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139787
Approved by: https://github.com/drisspg
2024-11-06 22:53:01 +00:00
56a40d4ebb Add conda to Manylinux Docker images (#139903)
We would like to switch https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job.yml from ``pytorch/conda-builder`` to
``pytorch/manylinux-builder`` and later to ``pytorch/manylinux_2_28-builder``. Hence we are adding conda to these images.

Test Infra PR that does the switch: https://github.com/pytorch/test-infra/pull/5867 - needs to be rebased after this PR is merged
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139903
Approved by: https://github.com/seemethere
2024-11-06 22:49:36 +00:00
b8cf324e50 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics, and having the relevant timing code in the OSS area of the codebase will make that much easier. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Temp Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-06 22:44:53 +00:00
8f077b811b [ROCm][Inductor]Fixing missing ck package warning when the backend is disabled (#139790)
```

test_addmm_multiple_dynamic_cuda (__main__.AOTInductorTestABICompatibleCuda) ... W1101 10:26:20.492000 1361741 torch/_inductor/utils.py:1207] Please pip install Composable Kernel package
AUTOTUNE addmm(16x6, 16x16, 16x6)
  triton_mm_0 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=0, num_stages=2, num_warps=1
  triton_mm_1 0.0104 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=16, B_PROLOGUE_CAST_TYPE=None, EVEN_K=True, GROUP_M=8, matrix_instr_nonkdim=16, num_stages=2, num_warps=1
SingleProcess AUTOTUNE benchmarking takes 0.2182 seconds and 0.2979 seconds precompiling for 2 choices
```
This PR disables the warning message when the CK backend is disabled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139790
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-11-06 22:04:32 +00:00
cbf449c83c [BE]: Add NT missing fp classification functions (#139890)
Follow-up to some issues that @malfet's recent PR (#139763) pointed out about missing ops. Tried to mirror it for other important nearby ops. It seems like we could automate/autogen this more for generic pointwise ops like this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139890
Approved by: https://github.com/malfet
2024-11-06 22:00:54 +00:00
aafb3deaf1 Remove multinomial from cudagraph skip list (#139897)
Since https://github.com/pytorch/pytorch/pull/134818/files we can run multinomial in cudagraph without error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139897
Approved by: https://github.com/BoyuanFeng
2024-11-06 21:28:42 +00:00
86475dfc9f [ONNX] Prioritize strict=False export strategy (#139905)
Prioritize the `strict=False` export strategy in ONNX export because it is preferred according to @SherlockNoMad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139905
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-11-06 21:27:29 +00:00
779c0b80cd [inductor] collect memory snapshot in the wrapper (#138429)
To collect a memory snapshot for a generated wrapper, run the wrapper with `--cuda-memory-snapshot`. E.g.
```
python /tmp/torchinductor_shunting/tmpyhtfwdlv/wp/cwpulanbieu4beruc6w5uc3podcs2x3rzdk5okftu37c4k3bnd4b.py --cuda-memory-snapshot
```
gives me:

<img width="800" alt="Screenshot 2024-11-05 at 3 53 47 PM" src="https://github.com/user-attachments/assets/82edd2d6-df57-488e-a390-8fa5fc00ba5f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138429
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #139136, #138756
2024-11-06 21:22:18 +00:00
2a857e940d config: Add env_name_default and env_name_force to Config (#138956)
This allows Configs to handle setting their defaults (or overriding
themselves) via environment variables.

The environment variables are resolved at install time (which is usually
import time). This is done 1) to avoid any race conditions between
threads etc..., but 2) to help encourage people to just go modify the
configs directly, vs overriding environment variables to change
pytorch behaviour.
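
A hypothetical sketch of the idea only — the class, parameter semantics, and env var name below are illustrative guesses, not the actual torch config module:

```python
import os

class Config:
    def __init__(self, default, env_name_default=None, env_name_force=None):
        # Environment is read once, when the config object is constructed
        # (i.e. roughly at import/install time).
        self.value = default
        if env_name_default is not None and env_name_default in os.environ:
            # only overrides the hard-coded default
            self.value = os.environ[env_name_default]
        if env_name_force is not None and env_name_force in os.environ:
            # assumed to take precedence over the default-level override
            self.value = os.environ[env_name_force]

cache_limit = Config(8, env_name_default="EXAMPLE_CACHE_LIMIT")
```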
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138956
Approved by: https://github.com/ezyang
ghstack dependencies: #138766
2024-11-06 21:20:42 +00:00
1270c78268 Add logging for num_triton_bundles (#139807)
Summary: Adding logs for number of inductor cache triton bundles

Test Plan:
Ran adhoc code and looked at dynamo_compile/sandbox

https://fburl.com/scuba/dynamo_compile/sandbox/nhktfy19

Differential Revision: D65490826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139807
Approved by: https://github.com/masnesral
2024-11-06 21:11:04 +00:00
9018326bb8 Revert "[pt2 logging] move remote cache get/put logging up one level (#139423)"
This reverts commit c412a42ae2a978122d8a41b94c3861290bc689e0.

Reverted https://github.com/pytorch/pytorch/pull/139423 on behalf of https://github.com/ZainRizvi due to Reverted internally. See D65541060 for more details ([comment](https://github.com/pytorch/pytorch/pull/139423#issuecomment-2460765579))
2024-11-06 20:59:54 +00:00
ff616c26fb Optimize isclose description (#139724)
Fixes #139563

Make description user friendly.

After Change:

![image](https://github.com/user-attachments/assets/88a805c0-0105-4441-812b-582c09abc72b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139724
Approved by: https://github.com/janeyx99
2024-11-06 19:30:44 +00:00
dd6738c1ad Revert "Use Manylinux2_28 for wheel builds (#138732)"
This reverts commit 5860c8ebd155bd06666d87811847b73040b55f7b.

Reverted https://github.com/pytorch/pytorch/pull/138732 on behalf of https://github.com/atalman due to Reverting for now will be relanding ([comment](https://github.com/pytorch/pytorch/pull/138732#issuecomment-2460570980))
2024-11-06 19:12:52 +00:00
3abbde976d Allow any single non-batch dim to be ragged for NJT (#137125)
Fixes #137512

Relaxes the restriction that the ragged dim is immediately next to the batch dim e.g. `(B, *, D_0, ..., D_N)`. This allows for constructing NJTs of shape e.g. `(B, D, j0)` directly. It's possible before this PR to get an NJT of e.g. shape `(B, D, j0)` by constructing an NJT of shape `(B, j0, D)` and transposing it. This PR allows a user to go straight there without the transpose. The standard `torch.nested.nested_tensor(list)` constructor has been updated to support this.

At the very least, this is useful for testing on transposed NJTs. I'm willing to make this functionality private if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137125
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2024-11-06 18:50:08 +00:00
d1e2e81ede [AOTI] Fix two test failures from #139471 (#139885)
Summary: https://github.com/pytorch/pytorch/pull/139471 caused two internal test failures due to different compiler path settings.

Differential Revision: D65519537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139885
Approved by: https://github.com/hl475
2024-11-06 18:41:28 +00:00
6ed237e5b5 [pytorch] Make global module hook to pass kwargs similar to how module hook works (#137403)
Differential Revision: D63576353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137403
Approved by: https://github.com/mikaylagawarecki
2024-11-06 18:20:57 +00:00
99deedff57 [ONNX] Describe memory usage of TorchDynamo-based exporter. (#139388)
Add new documentation to show one memory-usage benefit brought by the TorchDynamo-based ONNX exporter.

Also add a unit test to make sure TorchDynamo-based ONNX exporter works well under FakeTensorMode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139388
Approved by: https://github.com/xadupre
2024-11-06 17:29:11 +00:00
d6034016e2 Run slow jobs in trunk commits (#139842)
Per our discussion in https://fburl.com/gdoc/voce5o06, we will run slow jobs more frequently on all trunk commits.  Note that slowgradcheck jobs are moved to periodic as they are not about running slow tests.

There are currently 3 GPU + 2 ROCm + some CPU `linux.4xlarge` runners running slow jobs.  So, I don't expect to see a big increase in CI cost after this.

Also, these slow jobs will only run in trunk commits, not in PRs, so their duration won't affect PR TTS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139842
Approved by: https://github.com/clee2000
2024-11-06 17:21:39 +00:00
8d983aaf68 Add conda install to Manylinux 2_28 images (#139894)
This way we can use these images instead of conda-build images for all workflows in test-infra.

Please note:
- I am using the existing conda install script that's already used in https://github.com/pytorch/pytorch/blob/main/.ci/docker/conda/Dockerfile#L47
- A PR with the update to miniforge will be posted as a follow-up

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139894
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-11-06 17:14:27 +00:00
6bdbc86550 [AOTI] Fix a cubin file path issue (#139848)
Summary: When we use aoti_compile_and_package to package the AOTI compiled artifacts, cubin files will be included, and at deploy time we should set up the cubin file directory to point to the path that contains the unzipped cubin files.
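
A rough usage sketch of the packaging flow referenced above; the exact signatures of these helpers have shifted across releases, so treat the argument names here as assumptions rather than the precise API:

```python
import torch

ep = torch.export.export(model, (example_input,))  # model / example_input are placeholders
pkg_path = torch._inductor.aoti_compile_and_package(ep, package_path="model.pt2")
# At deploy time, loading the package should also point the compiled kernels
# at the unzipped cubin files inside the package.
runner = torch._inductor.aoti_load_package(pkg_path)
out = runner(example_input)
```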

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139848
Approved by: https://github.com/aakhundov
2024-11-06 16:45:30 +00:00
dd6a5de00d Allow OpOverloadPackets as safe torch functions, sanitize dynamo gm before running aotdispatch with cache (#139785)
Summary:
This diff implements two things to improve cache hit rates after testing AOTAutogradCache with internal cogwheel jobs:
- We should allow torch functions that are OpOverloadPackets
- When running with cache, there are some fields that dynamo puts into the input graph module to aotdispatch that are not stable between runs. We use a context manager to null these out so that they can't be used to affect the output of AOTAutograd, and then we put the fields back onto the gm before returning from AOTAutogradCache.load().

Test Plan:
New unit tests + running nanogpt with AOTAutogradCache.

Meta:

Run on a long running job
Cache miss:
 {F1953831996}

Cache hit:
 {F1953830872}

Servicelabs here:
https://www.internalfb.com/servicelab/experiment/4301352991/

Cache hit:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660597709-TrainingApplication/attempt_0/version_0/rank_0/index.html

Cache miss:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f660569960-TrainingApplication/attempt_0/version_0/rank_0/index.html

We can see that with these changes, autograd cache hits and saves compile time:
https://fburl.com/scuba/pt2_compile_events/ycddxstd

Differential Revision: D65436373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139785
Approved by: https://github.com/bdhirsh
2024-11-06 16:34:02 +00:00
e05a096c49 Ignore polyfill when reporting user backtraces in summarized form (#139850)
Fixes https://github.com/pytorch/pytorch/issues/139316

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139850
Approved by: https://github.com/bobrenjc93
2024-11-06 16:33:34 +00:00
68ef445c33 [MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors (#139791)
MacOS-15 and newer support those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.

Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```

Before the change, total memory use was 16Gb and the run needed 65 sec to complete; after it, usage drops to 14Gb and the run takes 50 sec to finish on an M2 Pro, though the generated image remains the same:
![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes https://github.com/pytorch/pytorch/issues/139389
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
2024-11-06 16:25:39 +00:00
59cf4bc5ae Fix the use of fsspec transactions (#135541)
fsspec transactions do not support concurrency and assumes that there is at most 1 running transaction per filesystem. This is *not* true in our usage, where because of multi-threading we usually have multiple concurrent transactions running at once.

Previously, this would just (unsafely) pass but lead to hard-to-debug race conditions (since the commit of one transaction will blow away the state of the other transaction). In fsspec 2024.3.0, trying to commit concurrent transactions will actually crash (see the code at 76ca4a6888/fsspec/transaction.py (L39) -- because each filesystem can have a single transaction, this tear-down logic will error).

Instead, let's manually handle committing / discarding changes to the file.
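
A minimal sketch of the manual commit/discard idea (the helper name and paths are illustrative, not the actual checkpoint code):

```python
import uuid
import fsspec

def atomic_write(fs: fsspec.AbstractFileSystem, path: str, data: bytes) -> None:
    # Write to a unique temp name, then move into place to "commit";
    # on failure, remove the temp file to "discard". No fs.transaction involved,
    # so concurrent writers on the same filesystem object don't trample each other.
    tmp = f"{path}.tmp-{uuid.uuid4().hex}"
    with fs.open(tmp, "wb") as f:
        f.write(data)
    try:
        fs.mv(tmp, path)
    except Exception:
        fs.rm(tmp)
        raise

atomic_write(fsspec.filesystem("file"), "/tmp/ckpt.bin", b"\x00" * 16)
```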

I don't have a minimal test-case, but in Meta this solves a broken test on `fsspec >= 2024.3.0`:

Before: https://www.internalfb.com/intern/testinfra/testrun/7318349626774607
After: https://www.internalfb.com/intern/testinfra/testrun/2251800062722633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135541
Approved by: https://github.com/Skylion007
2024-11-06 15:16:12 +00:00
641ca67d5a [ROCM] Fix hipBLASLt version check in TunableOp test (#139811)
Allow 3 or more digits for hipBLASLt version check in TunableOp test. Needed due to upcoming ROCm 6.3 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139811
Approved by: https://github.com/eqy, https://github.com/malfet
2024-11-06 14:37:45 +00:00
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
cyy
d558c1a047 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 13:42:20 +00:00
c0c6bf4ef2 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353
2024-11-06 13:34:45 +00:00
44e4949bcf Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)"
This reverts commit 22e89ea2aaa3e0ef0ec4504bd2dbf230447a6d2a.

Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/malfet due to It broke number of tests, see 22e89ea2aa ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2459754355))
2024-11-06 13:31:26 +00:00
10d7729333 Revert "Enable cppcoreguidelines-special-member-functions (#139132)"
This reverts commit a9b4989c726a29b4b89c64282e32b9e4fc0b7d68.

Reverted https://github.com/pytorch/pytorch/pull/139132 on behalf of https://github.com/ZainRizvi due to Sorry but this fails on trunk. See inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_smooth_quant_with_int_mm [GH job link](https://github.com/pytorch/pytorch/actions/runs/11699366379/job/32591132460) [HUD commit link](22e89ea2aa) ([comment](https://github.com/pytorch/pytorch/pull/139132#issuecomment-2459743145))
2024-11-06 13:27:42 +00:00
06ad404401 Revert "[BE] And delete DeprecatedTypProperties cast (#139358)"
This reverts commit b82a51bc6b1170da3db8f67816799f3a47530ff8.

Reverted https://github.com/pytorch/pytorch/pull/139358 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
53299b8a38 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 0058f7100222523fa8b9f74af9ea7d341a6458b4.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))
2024-11-06 13:23:43 +00:00
5f266b5a02 [ROCm] re-enable flex attention UTs (#139632)
https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632
Approved by: https://github.com/drisspg
2024-11-06 12:49:44 +00:00
d622b490d6 [Dynamo] Support tensor mro without source (#139838)
Fixes https://github.com/pytorch/pytorch/issues/137743

The issue here is that if `type` was called on a tensor without a source, we wouldn't have a source even for `torch.Tensor`, and the `__mro__` retrieval would fail. Since `torch.Tensor` is an internal torch type, I add handling for it in `call_type` in builtins.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139838
Approved by: https://github.com/williamwen42
2024-11-06 08:52:53 +00:00
cyy
a9b4989c72 Enable cppcoreguidelines-special-member-functions (#139132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132
Approved by: https://github.com/sraikund16
2024-11-06 07:59:09 +00:00
22e89ea2aa [Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595)
**About the PR**
In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between.
This PR adds a pass to fuse the corresponding patterns:
- (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape`
- (with bias) `pattern_no_bias -> add -> reshape -> reshape`

The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants.

Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`.
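
For reference, the unfused computation being matched looks roughly like the following sketch (names, shapes, and scale values are illustrative):

```python
import torch

def int8_linear_reference(a_int8, b_int8, a_scale, b_scale, bias=None):
    # _int_mm does the int8 x int8 -> int32 matmul; scales and the optional bias
    # are applied afterwards, which is the chain the fusion pass rewrites into
    # onednn.qlinear_pointwise with a prepacked weight.
    acc = torch._int_mm(a_int8, b_int8)              # int32 accumulator
    out = acc.to(torch.float32) * b_scale * a_scale  # dequantize
    if bias is not None:
        out = out + bias
    return out

a = torch.randint(-128, 127, (32, 64), dtype=torch.int8)
b = torch.randint(-128, 127, (64, 128), dtype=torch.int8)
print(int8_linear_reference(a, b, a_scale=0.02, b_scale=0.01).shape)
```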

**Validation results**
Accuracy/perplexity is not changed with or without this fusion pass.
Latency is improved by >10% with the fusion pass.
Test method:
- Model: EleutherAI/gpt-j-6b
- Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores
- Using Intel OMP and Tcmalloc
- Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile`

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-11-06 07:54:47 +00:00
d031d1bf4c Update to upload-artifacts and download-artifacts to v4 (#139808)
The 2 actions actions/download-artifact@v3 and
actions/upload-artifact@v3 will be deprecated December 5th, 2024. This change updates them to using v4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139808
Approved by: https://github.com/seemethere
2024-11-06 05:57:41 +00:00
157c18a180 [BE][Attention] Use isneginf (#139763)
Maybe I'm missing some vital piece of information, but it feels like
```c++
  const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(), at::TensorOptions().dtype(out.dtype()).device(out.device()));
  const auto masked = self.eq(neg_inf);
```
should be equivalent to [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call
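
In Python terms, the equivalence being relied on is roughly:

```python
import torch

x = torch.tensor([0.0, float("-inf"), float("inf"), -1.0])
masked_a = x.eq(float("-inf"))   # compare against a -inf value, as in the C++ above
masked_b = torch.isneginf(x)     # dedicated op
print(torch.equal(masked_a, masked_b))  # True
```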
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139763
Approved by: https://github.com/Skylion007
ghstack dependencies: #139788, #139784
2024-11-06 04:32:37 +00:00
1c63612567 Fix & unit test for c10::ArrayRef constructed from user-defined types (#139758)
Fixes #139391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139758
Approved by: https://github.com/ezyang
2024-11-06 04:23:05 +00:00
d35a600b74 [pgnccl] skip restart test for rocm (#139809)
Summary:
The PG restart test is flaky on ROCm: https://github.com/pytorch/pytorch/pull/139809, so skip the AMD/ROCm test for now
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139809
Approved by: https://github.com/kwen2501
2024-11-06 04:17:29 +00:00
96ca17fec4 [CD] Move linux-aarch64 build scripts (#139815)
All files in `.ci/aarch64_linux` folder are from 88590cd635/aarch64_linux
Companion PR to delete `aarch64_linux` folder in builder: https://github.com/pytorch/builder/pull/2030
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139815
Approved by: https://github.com/wdvr, https://github.com/huydhn
2024-11-06 04:16:48 +00:00
c19c384690 Fix torch.load (torch.utils.benchmark) after #137602 (#139810)
After #137602, the default `weights_only` has been set to True.  This test is failing in trunk slow jobs atm

benchmark_utils/test_benchmark_utils.py::TestBenchmarkUtils::test_collect_callgrind [GH job link](https://github.com/pytorch/pytorch/actions/runs/11672436111/job/32502454946) [HUD commit link](1aa71be56c)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139810
Approved by: https://github.com/kit1980
2024-11-06 03:08:29 +00:00
63b01f328e [inductor] support masked_scatter w/ unbacked sized source (#138083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138083
Approved by: https://github.com/jansel
2024-11-06 02:16:25 +00:00
cyy
028c5d3426 [2/N] Replace c10::sv with std::sv (#139456)
Follows  #139453

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456
Approved by: https://github.com/ezyang
2024-11-06 01:50:38 +00:00
39ede99a33 Add current FSDP2 path to old composable FSDP1 warning (#139759)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139759
Approved by: https://github.com/weifengpy, https://github.com/wz337
ghstack dependencies: #139650
2024-11-06 01:43:04 +00:00
bd45c00fde [BE][Attention] Code de-dup (#139784)
The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value we initialize the boolean tensor to.
Reduce duplication by introducing `convert_boolean_attn_mask_` that takes a `neg_inf` value and makes the above-mentioned implementations trivial one-line calls.
Also, as suggested by @Skylion007, replace `at::where(foo->logical_not, -inf, 0)` with `at::where(*foo, 0, -inf)`
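
The swap above is just the un-negated two-branch form of `where`; a quick check in Python terms (illustrative mask):

```python
import torch

mask = torch.tensor([[True, False], [False, True]])
neg_inf = float("-inf")
a = torch.where(mask.logical_not(), neg_inf, 0.0)  # old pattern
b = torch.where(mask, 0.0, neg_inf)                # new pattern
print(torch.equal(a, b))  # True
```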
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: #139788
2024-11-06 01:33:19 +00:00
aec179e2be Fix docs for logcumsumexp formula (#139768)
The previous formula was wrong and reused some indexing variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139768
Approved by: https://github.com/janeyx99
2024-11-06 01:19:09 +00:00
a787320d0f Do not try to optimize new implications in get_implications (#139738)
Summary:
Saves around 8% on the torchrec model.
In most cases the new implications are not an optimization anyway; in some cases they are,
but optimizing them is useless.

ex:
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding True
adding False
```

VS
```
generating implications for Eq(Mod(s0, 3), 0)
adding Eq(Mod(s0, 3), 0)
adding Eq(0, Mod(s0, 3))
adding Ne(Mod(s0, 3), 0)
adding Ne(0, Mod(s0, 3))
adding Mod(s0, 3) <= 0
adding 0 < Mod(s0, 3)
adding 0 <= Mod(s0, 3)
adding Mod(s0, 3) < 0
```
The main difference is that `0 <= Mod(s0, 3)` could be simplified to True and `Mod(s0, 3) < 0` to False, but with this change that won't happen. However, the True:True and False:False entries are useless anyway, so this is fine.
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

<img width="1082" alt="Screenshot 2024-11-04 at 9 25 51 PM" src="https://github.com/user-attachments/assets/a26e291b-9280-4b55-9275-f3201a36ac51">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738
Approved by: https://github.com/ezyang
ghstack dependencies: #139703
2024-11-06 00:23:40 +00:00
6a30c14a0a [Traceable FSDP2] Run any unexecuted post_backward at beginning of pre_backward hook (#139671)
Assuming the forward pass user code looks like:
```
for _ in range(2):
    x = layer(x)
```
and we have `fully_shard(layer)`, then:
- the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" (currently same for both eager and compile)
- the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time-> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile

The behavior in the backward pass is different between eager and compile, which is not ideal.

 I am currently trying to look for a way to fix this non-ideal behavior of compile - tried a few things:
1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side-effects.
2. Instead of custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to have bad interaction with register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice.
3. Force execute any pending post_backward before unshard in pre_backward hook, and rely on compiler to move the reshard to the right place to optimize peak memory. -> This PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671
Approved by: https://github.com/awgu
2024-11-06 00:19:06 +00:00
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure cuda12 is ok.

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
06f619d999 typing ir.py - part 2 (#131846)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846
Approved by: https://github.com/eellison
ghstack dependencies: #139238
2024-11-06 00:01:15 +00:00
c2109ec479 typing ir.py - Disallow untyped defs for ir.py (#139238)
- Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def"
- Mark some trivial functions with the proper return types (`None` and `torch.dtype`)
- Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper()
- `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-06 00:01:15 +00:00
82e4de4994 [Inductor][CPU] Enable the oneDNN Linear fusion for special case (#139172)
**Summary**
In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` gets decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1, enabling decomposition into `mm` instead.
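
Concretely, ATen already treats this layout as contiguous because the stride of a size-1 dim is ignored; a quick check using the shape and stride from the description:

```python
import torch

storage = torch.randn(4 * 4096)
x = storage.as_strided((4, 1, 4096), (4096, 128, 1))
print(x.is_contiguous())  # True: the stride of the size-1 dim does not matter
```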

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 23:49:53 +00:00
d1c26b0781 Improvements for associative_scan - slicing of xs (#138858)
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858
Approved by: https://github.com/ydwu4
2024-11-05 23:38:21 +00:00
eec153a69c [BE][Attention] Factor out common code (#139788)
- Compute attention mask before the switch
- Introduce `query_device_type` variable
- Refactor some of MPS-math checks into easily readable boolean names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139788
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-11-05 23:27:18 +00:00
faab564bda [doc] Fix grammar in export.ir_spec.rst (#139584)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139584
Approved by: https://github.com/zou3519
2024-11-05 23:26:36 +00:00
86d7d39bff Forward fix D65441551 for T206731737 (#139767)
Test Plan: -

Differential Revision: D65482429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139767
Approved by: https://github.com/awgu
2024-11-05 23:19:08 +00:00
c0d642a295 [pgnccl][simple] log started work numel (#139773)
Summary:
We saw some cases where the same work was started on multiple ranks but
did not complete. This info could give us more context on whether the numel matches across ranks.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-11-05 23:11:19 +00:00
1d28b8b6d5 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit e84d1121ad66a453c8c24fcc098625e2e9764fca.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))
2024-11-05 23:10:38 +00:00
f63ee13f2c [Test][DTensor] Skip test_dtensor_mm if ROCm (#139719)
Seems there are some numeric issues when running on ROCm.
```
PYTORCH_TEST_WITH_ROCM=1 python test/distributed/_tensor/test_matrix_ops.py DistMatrixOpsTest.test_dtensor_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139719
Approved by: https://github.com/XilunWu
2024-11-05 22:56:35 +00:00
16da289402 [Workspace Inductor] Fix dynamic shapes (#139777)
# Summary
Arg ordering was wrong when dynamic shapes are enabled and we pass in the additional size args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139777
Approved by: https://github.com/eellison
ghstack dependencies: #139157
2024-11-05 22:34:09 +00:00
d26dcda35e [test] Fix Triton test to use the correct divisibility attr (#139772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139772
Approved by: https://github.com/bertmaher
2024-11-05 22:28:18 +00:00
b09eb6ed6a [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on the main exposed by https://github.com/pytorch/pytorch/issues/139476

We have a dict tag optimization where, if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because users generally don't change their
parameters.

If we remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows
users to remove this optimization. Not ideal, but given how critical guard eval perf is,
we are in the gray area of the unsoundness vs. performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-05 21:48:07 +00:00
75eeefbfab [pp] pipelining + dcp unit test (#139633)
Currently there aren't any unit tests for PP and DCP; this unit test could be useful for quick experimentation in issues like https://github.com/pytorch/torchtitan/issues/474.

`python test/distributed/_composable/test_composability/test_pp_composability.py -k test_pp_and_dcp`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139633
Approved by: https://github.com/wconstab
2024-11-05 21:02:11 +00:00
1a70185309 Add Autograd Fallback for MTIA (#139211)
Summary: As title.

Test Plan: OSS and internal CIs.

Differential Revision: D65022481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139211
Approved by: https://github.com/jvandebon
2024-11-05 20:58:21 +00:00
59b66944d4 Migrate inductor-perf-test-nightly.yml to use linux.aws.a100 (#139657)
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-05 21:24:28 +01:00
6734cb7bf2 [hop free symbols] refactor tensor.to_list implementation to call wrap_fx_proxy. (#139663)
Refactoring only. Previously, we manually called SymNodeVariable.create; now we handle it with wrap_fx_proxy. This unifies the handling of operations that produce symints in wrap_fx_proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139663
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737, #138559
2024-11-05 20:19:09 +00:00
ae86939425 [aarch64] add CUDA 12.6 to docker for sbsa wheel (#138562)
Add cuda 12.6 installation for sbsa docker
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138562
Approved by: https://github.com/atalman
2024-11-05 20:15:51 +00:00
d549ddfb14 [fr][rfc] use a logger to control output for flight recorder analyzer (#139656)
Summary: Use a logger to control output to the console. This is useful for hiding debug/detail messages from the console vs. showing everything together.

Test Plan:
Ran `torchfrtrace` with various switches.

The `-v` verbose switch
```
torchfrtrace --prefix "trace_" /tmp/ -v
loaded 2 files in 0.2567298412322998s
built groups, memberships
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
appending a non-matching collective
built collectives, nccl_calls
Groups
                  id  desc          size
--------------------  ----------  ------
09000494312501845833  default_pg       2
Memberships
            group_id    global_rank
--------------------  -------------
09000494312501845833              0
09000494312501845833              1
Collectives
  id    group_id
----  ----------
   0           0
   1           0
NCCLCalls
  id    collective_id    group_id    global_rank    traceback_id  collective_type    sizes
----  ---------------  ----------  -------------  --------------  -----------------  --------
   0                0           0              0               0  nccl:all_reduce    [[3, 4]]
   1                0           0              1               0  nccl:all_reduce    [[3, 4]]
   2                1           0              0               0  nccl:all_reduce    [[3, 4]]
   3                1           0              1               0  nccl:all_reduce    [[3, 4]]
   4                            0              0               0  nccl:all_reduce    [[4, 5]]
```

Without the verbose switch
```
❯ torchfrtrace --prefix "trace_" /tmp/
Not all ranks joining collective 3 at entry 2
group info: 0:default_pg
collective: nccl:all_reduce
missing ranks: {1}
input sizes: [[4, 5]]
output sizes: [[4, 5]]
expected ranks: 2
collective state: scheduled
collective stack trace:
 <module> at /home/cpio/test/c.py:66
```

With the `-j` switch:
```
❯ torchfrtrace --prefix "trace_" /tmp/ -j
Rank 0                                             Rank 1
-------------------------------------------------  -------------------------------------------------
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[3, 4]], state=completed)  all_reduce(input_sizes=[[3, 4]], state=completed)
all_reduce(input_sizes=[[4, 5]], state=scheduled)
```

Differential Revision: D65438520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139656
Approved by: https://github.com/fduwjj
2024-11-05 20:14:18 +00:00
b9f0563aaf Add repro instructions to fx_graph_runnable.py (#139481)
This PR adds some instructions for how to add a TARGETS file to run the
fx_graph_runnable script. I'm planning to add some followups that will
add additional imports for custom ops and use autodeps to get the
dependencies, but I figure this PR is an easy first step.

Test Plan:
- pytest test/dynamo/test_structured_trace.py
- Does anyone have suggestions for how to test this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481
Approved by: https://github.com/eellison
2024-11-05 19:24:16 +00:00
01bcf37123 [dynamo][NFC] Remove some dead code paths (#139674)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139674
Approved by: https://github.com/Skylion007, https://github.com/anijain2305, https://github.com/mlazos
2024-11-05 19:12:17 +00:00
2b3a227b35 [dynamo] Add is_mutable() and is_immutable() methods to VariableTracker (#139341)
This patch adds 2 simple methods `VariableTracker.is_mutable()` and
`VariableTracker.is_immutable()`, which helps clarify intention. For
instance, rather than writing
```python
if var.mutation_type:
    ...
```
After this patch one can write
```python
if var.is_mutable():
    ...
```

This patch also simplifies `mutation_type` propagation in some
`ListVariable` methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341
Approved by: https://github.com/mlazos, https://github.com/anijain2305
ghstack dependencies: #139339, #139340
2024-11-05 19:11:41 +00:00
0ba3962b80 [dynamo][NFC] Move MutationType classes into variables/base.py (#139340)
As title, this addresses
https://github.com/pytorch/pytorch/pull/137905/files#r1806800222.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139340
Approved by: https://github.com/anijain2305
ghstack dependencies: #139339
2024-11-05 19:11:41 +00:00
693a0a1bd4 [dynamo][NFC] Rename mutable_local and add documentation (#139339)
This patch addresses the renaming part of #133027, specifically, it
renames the following and adds documentation for relevant classes.
1. `VariableTracker.mutable_local` to `mutation_type`
2. `MatableLocal `to `ValueMutationNew`
3. `MutableSideEffects `to `ValueMutationExisting`
4. `MutableLocalSource` to `SourceType`
5. `MutableLocalSource.Local` to `New`

Note that (2), (3) and (5) are mainly to bring consistency between them
and `AttributeMutationNew`, `AttributeMutationExisting`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2024-11-05 19:11:41 +00:00
5f2ed505eb [PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659)
### Motivation
Today, watchdog only reports that it found a collective timeout:
```
[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.
```
While this is nice, it is hard to associate the error with user's program or library stack.

### This PR
This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior.

The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users.

### Demo
[stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

```
TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py
```
`TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder.

Output:
```
[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation:
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40
```

From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-11-05 19:07:17 +00:00
ee42a99745 [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-05 18:47:24 +00:00
87059d4547 [AOTAutograd] Handle edge cases for donated buffer & enable in oss (#139669)
This PR enables donated buffer in OSS and handles two edge cases:

1. While donated buffer detection relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage, so we skip sparse tensor subclasses for donated buffers.
2. Handles missing "val" from n.meta. This is observed from `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`,
`functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and
`inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669
Approved by: https://github.com/bdhirsh
2024-11-05 18:38:20 +00:00
27ec3921bc Optimize mutable torch.library.custom_op overhead (#139513)
We don't need to loop over all the args and kwargs in the
ADInplaceOrView key; we just need to bump the version on the args and
kwargs that are mutable.
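
A minimal sketch of the kind of mutable custom op this applies to (the op name and body are illustrative):

```python
import torch

@torch.library.custom_op("mylib::relu_inplace", mutates_args={"x"})
def relu_inplace(x: torch.Tensor) -> None:
    x.clamp_(min=0)

t = torch.randn(4)
before = t._version
relu_inplace(t)
# Only the mutable arg needs its version bumped; other args/kwargs are skipped.
print(t._version > before)  # expected True
```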

On the benchmark mentioned in
https://github.com/pytorch/pytorch/issues/139494
this made the time go from
```
mutate2 = 61.72943878173828
no_mutate2 = 36.89440155029297
mutate = 236.3092498779297
no_mutate = 59.31964874267578

```
to
```
mutate2 = 47.976478576660156
no_mutate2 = 38.37468719482422
mutate = 71.21315002441406
no_mutate = 59.7432975769043
```

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513
Approved by: https://github.com/bdhirsh
ghstack dependencies: #139509
2024-11-05 18:30:53 +00:00
9dc5851f5d handle more devices in method_type method of TensorVariable (#138078)
Fixes #138077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138078
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-11-05 18:19:52 +00:00
de509abe1c [export] Dedup data-dependent errors based on stacktrace (#139540)
Summary:
Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we can actually dedup them if they are due to the same piece of code (e.g., there could be multiple calls to a piece of code that does some data-dependent computation).

This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data-dependent errors, but after deduping based on the stacktrace we now only get 4 errors.
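
The dedup idea itself is simple; an illustrative sketch (not the actual export code):

```python
import traceback

_seen_stacks = set()

def log_once_per_callsite(msg: str) -> None:
    # Key each data-dependent error on the stack that raised it, so repeated
    # executions of the same code path only log the first occurrence.
    key = "".join(traceback.format_stack()[:-1])
    if key not in _seen_stacks:
        _seen_stacks.add(key)
        print(msg)
```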

Test Plan: CI

Differential Revision: D65374254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540
Approved by: https://github.com/pianpwk, https://github.com/zou3519
2024-11-05 18:16:05 +00:00
cc25b6d7ba [inductor] Error on unsupported autotuner configs (#139658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139658
Approved by: https://github.com/aakhundov
2024-11-05 18:09:02 +00:00
41e4d88584 [logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757)
Summary: As discussed, we want to measure the time spent during pickling and unpickle.

Test Plan: CI

Reviewed By: wz337

Differential Revision: D65462767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139757
Approved by: https://github.com/awgu, https://github.com/Skylion007, https://github.com/fegin, https://github.com/c-p-i-o
2024-11-05 17:40:27 +00:00
5860c8ebd1 Use Manylinux2_28 for wheel builds (#138732)
Fixes https://github.com/pytorch/pytorch/issues/123649
Use Manylinux 2_28 Docker builds for PyTorch Nightly builds

This moves the wheels to a Docker image that uses : ``quay.io/pypa/manylinux_2_28_x86_64`` as a base rather then ``centos:7`` which is EOL on June 30, 2024.

Information:
https://github.com/pypa/manylinux#manylinux_2_28-almalinux-8-based

manylinux_2_28 (AlmaLinux 8 based)
Toolchain: GCC 13
Built wheels are also expected to be compatible with other distros using glibc 2.28 or later, including:
Debian 10+
Ubuntu 18.10+
Fedora 29+
CentOS/RHEL 8+

This migration should enable us to migrate to latest CUDNN version, and land this PR: https://github.com/pytorch/pytorch/pull/137978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138732
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 17:21:24 +00:00
c0d21b6581 End TritonBundle on non-cache write codepaths (#139698)
Summary:
When we bypass the cache write in Inductor, we were also forgetting to reset the bundle; this moves resetting the bundle into the post_compile step so it gets uniformly reset.

This diff also turns on the cache for internal so that we can do a code rollout.

Test Plan: updated tests

Differential Revision: D65457224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698
Approved by: https://github.com/ezyang
2024-11-05 17:00:40 +00:00
4d5cc1b4ef Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)"
This reverts commit e6ff07f00e04a9b58efb86a3dd70ed7280ae8522.

Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking internal tests. Please see D65430317 for more details ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2457620720))
2024-11-05 16:22:30 +00:00
cyy
a2bc2e38f9 Use clang-tidy 17 (#139678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678
Approved by: https://github.com/Skylion007
2024-11-05 16:00:25 +00:00
e0156f9faa HACK: use FB proxy for testowners (#139473)
I got fed up with this always timing out when I didn't have
correct proxy settings.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139473
Approved by: https://github.com/malfet
2024-11-05 15:35:41 +00:00
13eb3b3f6f [Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702)
Summary: During dynamic rendezvous, we shouldn't use the address from the store but just use `self._this_node.addr` directly, because sometimes the store host is not the host of rank 0. Passing the wrong host will cause a timeout error. This is a follow-up fix to S463164; for internal tests, we disable the TCPStore sharing for now.

Test Plan: CI.

Differential Revision: D65453312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702
Approved by: https://github.com/XilunWu
2024-11-05 15:28:03 +00:00
53f164cae5 [CUDA][CI][cusparselt] Only CUDA 11.8 ships the libcusparseLt.so.0, CUDA 12 would use PYPI libcusparselt (#138547)
since nvidia-cusparselt-cu12 is available and
nvidia-cusparselt-cu11 is not available

Related: #138175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138547
Approved by: https://github.com/atalman
2024-11-05 15:12:41 +00:00
349cd49406 Fix compiler collective TORCH_TRACE and improve code state printing (#139716)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139716
Approved by: https://github.com/yf225
2024-11-05 14:32:52 +00:00
cyy
546318e559 [7/N] Don't skip ASAN on some tests (#139675)
Follows #139565
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139675
Approved by: https://github.com/ezyang
2024-11-05 14:01:01 +00:00
f551d90552 Fix for gcc10 torch.compile compiler error when march=aarch64+sve (#137795)
Disable tree vectorization in vec_convert.h for gcc10 and aarch64+sve, which otherwise causes a compiler error.

```
/tmp/tmpuqk7lj9j/zx/czx2eyturb6j6m727xhvknkjbdu3y5nqqk66wgxcjkwnxuzvpm5r.cpp:3:18: internal compiler error: in vect_get_vector_types_for_stmt, at tree-vect-stmts.c:12252
    3 | extern "C"  void kernel(const float* in_ptr0,
```
Fixes #137775

I've not linked a gcc bug report yet as they require a minimal reproducer to be made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137795
Approved by: https://github.com/malfet
2024-11-05 12:46:42 +00:00
e84d1121ad Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-11-05 10:44:56 +00:00
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision input_t variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check here:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to the line:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
356fc41ae0 [Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error. (#139024)
[Intel GPU] Avoid target_link_libraries twice for torch_xpu_ops which will potentially cause multiple definition symbol linker error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139024
Approved by: https://github.com/EikanWang, https://github.com/fengyuan14, https://github.com/jansel
2024-11-05 08:18:09 +00:00
6ad52db8c8 use torch.sym_sum instead of incremental sum in _cat_meta (#139653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139653
Approved by: https://github.com/ezyang
2024-11-05 07:24:24 +00:00
51a3d6dbc3 Fix existing lint issues in ir.py (#139237)
- Remove stale mypy "type: ignores"
- Made ir.py pass the rest of the lints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139237
Approved by: https://github.com/Skylion007
2024-11-05 06:06:12 +00:00
b2f5a5311b RMSNorms docs - remove biases initialization (#139620)
RMSNorm doesn't use a bias in `elementwise_affine`, so I've removed it from the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139620
Approved by: https://github.com/mikaylagawarecki
2024-11-05 05:59:41 +00:00
9aaf3a04fa [profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316)
This PR makes the profiler-related UTs device-agnostic. It instantiates the profiler UTs for different device types and enables them on the XPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316
Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui
2024-11-05 05:46:13 +00:00
de4216bfda increase add_loop benchmark and refresh all results! (#139703)
See the comments at the end of https://github.com/pytorch/pytorch/pull/138756.
I am also refreshing all values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139703
Approved by: https://github.com/bobrenjc93
2024-11-05 05:41:21 +00:00
9e14d86573 [Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
`kernel_micro_gemm` generated using BRGEMM:
```
template <bool accum>
inline void kernel_micro_gemm(
    const half* __restrict__ A,
    const half* __restrict__ B,
    float* __restrict__ C,
    int64_t M,
    int64_t N,
    int64_t K,
    int64_t lda,
    int64_t ldb,
    int64_t ldc
) {
    at::native::cpublas::brgemm(
      M, N, K,
      lda, ldb, ldc,
      1.f, accum ? 1.f : 0.f,
      A,
      B,
      C);
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
c8a55eea88 [DCP] Fix process_group logging for DCP methods (#139428)
Summary:
Currently, we incorrectly log process_group for DCP based events.

We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available).

In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`.

Test Plan:
Before:

Always defaults to NCCL even though GLOO is passed by caller.

{F1950847585}

After:

GLOO backend shows up.

{F1950848375}

Differential Revision: D65255871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428
Approved by: https://github.com/teja-rao, https://github.com/mhorowitz
2024-11-05 05:24:38 +00:00
fe4fa1df9f [dynamo][eval_frame] Set the callback to None earlier for guard eval (#139655)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1536570810314458/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139655
Approved by: https://github.com/jansel, https://github.com/williamwen42
2024-11-05 05:18:46 +00:00
fdfd4c50ba Assign owners to periodic and slow jobs (#139519)
As an outcome of https://fburl.com/gdoc/voce5o06, I want to assign owners to any periodic or slow jobs that are still needed but can't run more frequently (too $$$, capacity constraints, don't fail that often).  They include:

* multigpu
* debug build
* ROCm (distributed, slow)

@malfet @soulitzer I put down your names as the owners of debug build and slowgradcheck respectively.  Please let me know if you are ok with that, or if you have a better option in mind.

Any jobs there without an owner are owned by us (PT Dev Infra)

### Testing

The owners show up in the job name: https://hud.pytorch.org/pr/139519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139519
Approved by: https://github.com/malfet
2024-11-05 04:48:12 +00:00
a766d84a3c Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-05 03:44:09 +00:00
1e9390a30a Add setuptools and wheel to cp312, cp313 and cp313t for Manylinux2_28 builds (#139636)
Install setuptools and wheel dependencies for cp312, cp313, cp313t on Manylinux 2_28 images.
This should resolve
```
ModuleNotFoundError: No module named 'setuptools'
```
On PR: https://github.com/pytorch/pytorch/pull/138732

This issue was addressed on XPU images already. We should apply the same fix for the rest of the images instead of keeping it XPU specific.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139636
Approved by: https://github.com/huydhn, https://github.com/chuanqi129
2024-11-05 03:25:35 +00:00
9039fbb47e [FSDP2] Make module-to-state mapping use weakrefs (#139650)
Without this, `del model` does not free memory of a module with FSDP2 applied.
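The general pattern (a sketch, not the FSDP2 code itself): holding the module-to-state mapping through weak references means the mapping no longer keeps the module alive.

```python
import gc
import weakref

class Module:  # stand-ins for illustration only
    pass

class State:
    pass

registry = weakref.WeakKeyDictionary()  # module -> state, holds modules weakly
m = Module()
registry[m] = State()

del m          # with a regular dict, this would not release the module
gc.collect()
print(len(registry))  # 0: the entry is dropped once the module is collected
```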

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139650
Approved by: https://github.com/yf225
2024-11-05 02:16:52 +00:00
cyy
5008d15ae9 [2/N] Remove usage of C array (#139589)
Follows  #139567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139589
Approved by: https://github.com/ezyang
2024-11-05 01:58:12 +00:00
c92de3b5df Add BRGEMM API versioning to be compatible with different oneDNN versions (#138184)
oneDNN v3.6 updated the ukernel APIs of `brgemm` and `brgemm_pack_B`. Considering the upgrade of oneDNN,  ukernel API versioning is needed to be compatible with different oneDNN versions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138184
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-11-05 01:26:27 +00:00
299dbcde61 [CI] Fix xpu ci test with s3 cache (#139604)
Fix a regression caused by https://github.com/pytorch/pytorch/pull/121323
Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139604
Approved by: https://github.com/atalman, https://github.com/malfet
2024-11-05 01:23:21 +00:00
eaf92b2484 [Python 3.13 CD] Enable Aarch64 py3.13 builds (#138629)
Adding CD aarch64. Part of: https://github.com/pytorch/pytorch/issues/130249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138629
Approved by: https://github.com/ZainRizvi
2024-11-05 01:16:37 +00:00
967cef294b [inductor][triton 3.2] fix test_codegen_config_option_dont_assume_alignment for triton 3.2 (#139640)
"divisible_by_16" was renamed "divisibility_16". Found in #139206.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139640
Approved by: https://github.com/aakhundov
2024-11-05 01:13:54 +00:00
3672c688e3 Fix layout for SetSourceTensorKernel (#137973)
Fixes #136837.
`aten.set_.source_Tensor` will make the size and stride of the first input and output follow that of the second input: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L440. If the layouts of the two inputs are different, the following `assert_size_stride` will fail.
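A small sketch of the eager semantics being matched (illustrative):

```python
import torch

src = torch.empty(3, 2).t()   # size (2, 3), stride (1, 2): transposed layout
dst = torch.empty(2, 3)       # contiguous, stride (3, 1)
dst.set_(src)                 # dst now shares src's storage, size and stride
print(dst.size(), dst.stride())  # torch.Size([2, 3]) (1, 2) -- follows the source
```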

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 00:55:17 +00:00
639162f39a Add cache size to pt2_compile_events (#139627)
Summary:
I realized I wanted to check "are my cache entries/IO unreasonably large"
and there's no easy way to do it.  This lets me do it.

Test Plan: servicelab

Differential Revision: D65390363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627
Approved by: https://github.com/c00w
2024-11-05 00:30:10 +00:00
0058f71002 Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-11-05 00:29:58 +00:00
b82a51bc6b [BE] And delete DeprecatedTypProperties cast (#139358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
Approved by: https://github.com/ezyang
ghstack dependencies: #139353
2024-11-05 00:23:12 +00:00
1b6f0b2a00 Revert "[BE] And delete DeprecatedTypProperties cast (#139358)"
This reverts commit 92a2a9ded22ef20a49e8c31dc2add93b40e8a78c.

Reverted https://github.com/pytorch/pytorch/pull/139358 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
4a3ee96427 Revert "Don't use deprecated type properties in UpsampleKernel (#139399)"
This reverts commit 9d096e4d9ffc2b57a19cbefd5d4b5cce7306945b.

Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))
2024-11-05 00:13:48 +00:00
cyy
64d9ee88d7 [11/N] Fix extra warnings brought by clang-tidy-17 (#139599)
Follows #139385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599
Approved by: https://github.com/sraikund16
2024-11-04 23:57:41 +00:00
3f248a5735 Classify miss-inplaced tensors in logs. (#139240)
Summary:
Use signpost logs.
A follow-up is to remove the field possibly_missed_reinplacing_opportunities from the dynamo compile table.

Differential Revision: D65180194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139240
Approved by: https://github.com/zou3519
2024-11-04 23:56:14 +00:00
e947649e8f [BE] Change _marked_safe_globals_list to set (#139303)
Prevent same global from being added multiple times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139303
Approved by: https://github.com/janeyx99
ghstack dependencies: #138936, #139221, #139433, #139541, #137602
2024-11-04 23:50:55 +00:00
1565eba4b4 [cuDNN][SDPA] Match query's memory layout ordering for output in cuDNN SDPA (#138354)
For #138340

~~We might consider more sophisticated logic here but the corresponding logic in other backends doesn't seem to do anything fancy for non BSHD/BHSD cases ea8ea2f33f/aten/src/ATen/native/transformers/cuda/attention.cu (L1145~~)

Ended up going with a more general approach that handles more or less arbitrary layouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138354
Approved by: https://github.com/drisspg
2024-11-04 23:49:09 +00:00
a678eaf1ad check fake/real mismatches during real tensor prop (#137747)
Summary:
While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these turned out to be due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output).

Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](929797dedb/torch/_subclasses/fake_utils.py (L78))

Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer?

Test Plan: test_export, test_fake_tensor

Differential Revision: D64210055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747
Approved by: https://github.com/zou3519
2024-11-04 23:39:48 +00:00
9919932783 Specialize symfloats that flow through is_integer (#139572)
Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139572
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457, #139568
2024-11-04 23:35:35 +00:00
350bc2a166 [export] Add support for symbool to make it usable for torch.cond (#138765)
# Why?

I want the following code to work.

minimal repro:
```
class M(torch.nn.Module):
    def forward(self, dilate_flag):
        return dilate_flag.item()

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()

ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```

error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program

second error will be handled by https://github.com/pytorch/pytorch/pull/138760

# Motivation

I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.

```
class M(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dilate_flag = 0

    def forward(self, dilate_flag):
        self.dilate_flag = dilate_flag.item()

        def true_fn(dilate_flag):
            return dilate_flag.clone()

        def false_fn(dilate_flag):
            return dilate_flag.clone()

        torch.cond(
            self.dilate_flag,
            true_fn,
            false_fn,
            (dilate_flag,),
        )
        return self.dilate_flag

input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()

for input in inputs:
    expected_output = model(*input)

    ep = torch.export.export(model, input, strict=False)
    path = torch._inductor.aot_compile(ep.module(), input)
    aot_model = torch._export.aot_load(path, device="cuda")
    actual_output = aot_model(*input)

    assert (
        expected_output == actual_output
    ), f"henry they are not equal {expected_output} != {actual_output}"
```

Differential Revision: D64867504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
2024-11-04 23:31:49 +00:00
6add86a29f Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit bf5cd8d0116d90d24b8acb38d578b8952dab22ef.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](bf5cd8d011) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))
2024-11-04 23:30:15 +00:00
23169a6bcc Disable foreach tests for complex128 internally (#139649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649
Approved by: https://github.com/ngimel
2024-11-04 23:24:47 +00:00
87a379b61b Move pippy to training IR (#139233)
Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233
Approved by: https://github.com/kwen2501
ghstack dependencies: #138658, #139209
2024-11-04 23:07:14 +00:00
397938b453 [hop free symbols][refactor] lift freevar to parent graph before lifting to subgraph (#138559)
This refactoring is for getting a deterministic ordering when binding tensors and the sizes of tensors. When seeing a free tensor x with shape (s0,) in a subgraph, the ordering of lifting changes from
```
lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent
```
to
```
lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child
```
This produces a deterministic ordering for handling the symints in lifted tensors.

This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428, #138558, #138737
2024-11-04 22:48:14 +00:00
c5b79699e1 [hop free symbols] replace ctx.save_for_backward to support symints/ints (#138737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138737
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Chillee
ghstack dependencies: #138345, #138428, #138558
2024-11-04 22:48:14 +00:00
ac20d0f893 [hop free symbols][refactor] make map's save_for_backward to handle int (#138558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138558
Approved by: https://github.com/zou3519
ghstack dependencies: #138345, #138428
2024-11-04 22:48:07 +00:00
dc3a6a9d08 [hop free symbols][refactor] make create_graph_input always take example_value (#138428)
Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that **all graph inputs must set their example values when creating the inputs**. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #138345
2024-11-04 22:47:49 +00:00
54c69a785b [hop free symbols][refactor] make bound_symbols a dictionary (#138345)
Code refactoring only. Change all self.tx.output.bound_symbols to self.tx.output.root_tracer.bound_symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138345
Approved by: https://github.com/zou3519
2024-11-04 22:47:41 +00:00
514c466cd9 Redirect the custom ops landing page :D (#139634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139634
Approved by: https://github.com/zou3519
2024-11-04 22:25:15 +00:00
bf5cd8d011 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-04 22:10:04 +00:00
080e0ca584 [aoti tests] enable some aoti package tests for fbcode (#139359)
Differential Revision: D65249372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139359
Approved by: https://github.com/angelayi
2024-11-04 22:06:07 +00:00
3d93caf664 [c10d] Add thread-safety initialization warning (#139638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139638
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/XilunWu
2024-11-04 21:38:47 +00:00
cyy
7deec3942f [6/N] Don't skip ASAN on some tests (#139565)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139565
Approved by: https://github.com/ezyang
2024-11-04 21:32:44 +00:00
91d38a5a82 Fix cuda Manylinux 2_28 docker images PATH setting (#139631)
Enabling Manywheel builds here: https://github.com/pytorch/pytorch/pull/138732

During the build I observe the failure with cuda jobs:

```
-- Compiler does not support SVE extension. Will not build perfkernels.
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is unknown
CMake Error at cmake/public/cuda.cmake:47 (enable_language):
  No CMAKE_CUDA_COMPILER could be found.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.
Call Stack (most recent call first):
  cmake/Dependencies.cmake:44 (include)
  CMakeLists.txt:851 (include)
```

While the correct sequence is supposed to be:
```
-- Found CUDA: /usr/local/cuda (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.8.89")
```

The issue turned out to be a missing PATH setting in the 2_28 Docker file.  This section exists in the CentOS Docker file here:
https://github.com/pytorch/pytorch/blob/main/.ci/docker/manywheel/Dockerfile#L174-L175

(Please note these Docker images are not used yet; https://github.com/pytorch/pytorch/pull/138732 should enable using these images.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139631
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-11-04 21:13:17 +00:00
888110841c [inductor] don't fuse two nodes if likely increase peak memory (#138756)
Partially fixing https://github.com/pytorch/pytorch/issues/138685

Add a (relatively safe?) heuristic to skip fusion if it could potentially increase peak memory.

The doc string mainly explains what this PR is doing:
```
        The implementation is more like a heuristic since we don't really know if we are at peak
        or not when trying to fuse these two nodes. The order of nodes may change later which makes the
        peak memory estimation hard.
        Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes:
        1. find all buffers read by each node with a single user. These buffers are supposed to
           be reused if we don't fuse these 2 nodes
        2. find the intersection of these buffers for the two nodes and sum the total buffer size.
           If we don't fuse these two nodes, we can at least avoid this much memory allocation.
           Note that the extra memory allocation is not necessarily causing a peak memory increase.
           This is just a heuristic.
        We return true only if the saving from fusion cannot offset the extra memory allocation.
```
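A standalone sketch that mirrors the steps in the docstring above (names and data structures are illustrative, not the actual Inductor scheduler API):

```python
def extra_alloc_lower_bound(node1_reads, node2_reads, num_users, buf_size):
    # Step 1: buffers read by each node that have a single user; these would
    # normally be freed/reused if the two nodes stay unfused.
    single_user_1 = {b for b in node1_reads if num_users[b] == 1}
    single_user_2 = {b for b in node2_reads if num_users[b] == 1}
    # Step 2: sum the sizes of the shared buffers as a lower bound on the
    # extra allocation that fusing the two nodes could cause.
    return sum(buf_size[b] for b in single_user_1 & single_user_2)

def should_skip_fusion(fusion_saving, node1_reads, node2_reads, num_users, buf_size):
    # Skip fusion only when the estimated saving cannot pay for the extra allocation.
    return fusion_saving < extra_alloc_lower_bound(
        node1_reads, node2_reads, num_users, buf_size
    )
```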

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756
Approved by: https://github.com/jansel
ghstack dependencies: #139136
2024-11-04 20:49:29 +00:00
1aa71be56c [PT2] Decouple decompose_triton_kernel_wrapper_functional from decompose_auto_functionalized (#139526)
As title. We may not always want to remove `triton_kernel_wrapper_functional`; see for example the references of [`unsafe_remove_auto_functionalized_pass`](c8ab9b06a2/torch/export/_remove_auto_functionalized_pass.py (L48)).

Test Plan: CI & [D62592946](https://www.internalfb.com/diff/D62592946)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139526
Approved by: https://github.com/zou3519
2024-11-04 20:16:18 +00:00
71dc5df93c [pipelining] Fix 'last backward' counting for dI / dW (#139415)
Since any stage can run a mixture of full backwards and split backwards,
it is important to count the sum of (full_backwards + backward_weight)
when comparing to num microbatches to determine last backward.
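In other words (illustrative numbers, not the actual scheduler code):

```python
# A stage may run some microbatches as full backwards and others as split
# dI (backward_input) / dW (backward_weight) steps; the last backward is
# reached once the combined count covers every microbatch.
num_microbatches = 8
full_backwards_done = 3
backward_weights_done = 5
is_last_backward = (full_backwards_done + backward_weights_done) == num_microbatches
print(is_last_backward)  # True
```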

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415
Approved by: https://github.com/H-Huang
2024-11-04 20:14:10 +00:00
99413cd1a8 [CMake] Fix local MPS builds (#139651)
Not sure how it works on some machines, but clean build fails for me after https://github.com/pytorch/pytorch/pull/138636 was landed, even though it works fine on another machine.

The solution is to create an empty file when the dependency is added; it will later be updated by the build rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139651
Approved by: https://github.com/atalman
2024-11-04 19:43:53 +00:00
30a83ca991 [dynamo] Improve codegen for DataPtrVariable and fix tensor reference issue (#139487)
This addresses
https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which
had to set `allow_cache=False` for codegen on `DataPtrVariable.base`,
which is a `TensorVariable`, otherwise we observe failure of
`test_no_grad_copy` when testing with Dynamo.

I've seen `test_no_grad_copy` failing a few times, and every single time
it's related to a cyclic reference; my best guess is that the cyclic reference
holds some tensor object in memory longer than necessary, preventing the
optimization introduced in #11165.

This patch makes `OutputGraph.cleanup()` more aggressive by clearing out
all fields that might reference a `VariableTracker`. As a result, we can
remove the aforementioned `allow_cache=False`, which helps generate
better code (e.g., in the case of `test_no_grad_copy`, it skipped generating
a redundant graph whose only op is returning the input tensor; instead we just
generate a single `LOAD_FAST`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487
Approved by: https://github.com/jansel, https://github.com/aakhundov
2024-11-04 19:14:06 +00:00
740054ffe6 [AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597)
Summary: Reland https://github.com/pytorch/pytorch/pull/139154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597
Approved by: https://github.com/angelayi
2024-11-04 18:53:17 +00:00
e76ce20177 Log to pt2 compile events (#139601)
Summary: This option was added after I wrote the original diff; let's publish to pt2_compile_events.

Test Plan: CI

Differential Revision: D65404910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139601
Approved by: https://github.com/jamesjwu
2024-11-04 18:39:06 +00:00
4930c4b716 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with; they show up in the linear+CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` will first view x as a [BxT, C] tensor, do the matmul to produce a [BxT, V] tensor, and then view the output back to a 3D tensor of shape [B, T, V]. User code then adds another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward pass when we compute gradients.

The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and I think it's in general a nice thing to do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-11-04 18:39:02 +00:00
ca43ecd599 Flip default on weights_only (#137602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #138936, #139221, #139433, #139541
2024-11-04 18:30:29 +00:00
f55dfbcf87 Remove hasattr(__slots__) for BUILD logic in weights_only unpickler (#139541)
This is tested in PR stacked above in

```python
python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load
```

We cannot depend on `hasattr(..., '__slots__')` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC, `hasattr(cls, '__slots__')` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to follow the pickle library's code:

```python

>>> from abc import ABC
>>> hasattr(ABC, "__slots__")
True
```

So

```python
import torch
from abc import ABC
from dataclasses import dataclass

class Foo(ABC):
    pass

class FooWrapper(Foo):
    def __init__(self, x, y):
        self.x = x
        self.y = y

f = FooWrapper(1, 2)
torch.save(f, "temp.pt")
with torch.serialization.safe_globals([FooWrapper]):
    torch.load("temp.pt")
```

Would fail on the previous code with
```
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load
    result = unpickler.load()
  File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load
    for k, v in slotstate.items():
```

As there is actually no slotstate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221, #139433
2024-11-04 18:30:29 +00:00
ae0e7042f6 Fix custom obj being input (#139209)
Differential Revision: [D65158939](https://our.internmc.facebook.com/intern/diff/D65158939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139209
Approved by: https://github.com/ydwu4
ghstack dependencies: #138658
2024-11-04 18:24:29 +00:00
85c3c4132d no-op torch.library.custom_op APIs on torch.deploy (#139509)
We forgot this case in the previous PR. Fixes
https://github.com/pytorch/pytorch/issues/137536

Test Plan:
- better tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139509
Approved by: https://github.com/williamwen42
2024-11-04 18:01:08 +00:00
6dada2136a Revert "Refactor FxGraphDrawer to use HTML-like labels (#137726)"
This reverts commit 1e738420296a84406cd0a1626074ea6447a6603a.

Reverted https://github.com/pytorch/pytorch/pull/137726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like some internal components are failing after this change and need to be updated ([comment](https://github.com/pytorch/pytorch/pull/137726#issuecomment-2455332612))
2024-11-04 17:44:44 +00:00
e080c89bdc Make test_torchbind.py training IR compatible (#138658)
In this diff, I make the test_torchbind.py tests handle the training IR. Today in the training IR, we don't see the effect token and HOP because this happens at FunctionalTensorMode. Maybe in the future we should move this logic up to the training IR so that writing passes etc. on the training IR is safer, but for migration purposes I think it is ok for now.  I also fixed a few issues:
1. ep.module() doesn't register all aliased constants in the module.
2. When we retrace, we need to fakify the original Torchbind object.
3. We don't run any DCE on training IR so we need to add some more torch ops to verifier.

Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-11-04 17:43:11 +00:00
68c515b292 don't run z3 analysis on backed symfloat nodes (#139568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139568
Approved by: https://github.com/ezyang
ghstack dependencies: #139569, #139457
2024-11-04 17:04:29 +00:00
d3fc13a9dd use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow dtypes by accessing more elements per thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-04 16:43:33 +00:00
3ca794783f Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755)"
This reverts commit 924e726c3a2566125f55cdbff4dff054d3db3232.

Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally.  Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))
2024-11-04 16:34:34 +00:00
87404b6ca6 support symfloats in translation validation (#139457)
fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesHigherOrderOpTests.test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes` when `specialize_float=False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139457
Approved by: https://github.com/ezyang
ghstack dependencies: #139569
2024-11-04 15:40:08 +00:00
6b8e3022f2 Remove c10::optional usages in PyTorch (#139525)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139525
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-11-04 15:35:23 +00:00
cyy
419a7e197d [6/N] Fix Wextra-semi warning (#139605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605
Approved by: https://github.com/ezyang
2024-11-04 13:43:16 +00:00
2ce2e4df4e Update slow tests (#139051)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139051
Approved by: https://github.com/pytorchbot
2024-11-04 11:49:06 +00:00
12d225d91c add opaque unary sin and cos to SYMPY_INTERP (#139569)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569
Approved by: https://github.com/ezyang
2024-11-04 07:37:11 +00:00
3337439dc0 [inductor] modify the heuristic for disabling vectorization (#136422)
Summary
Since we have already implemented tail-loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristic for disabling vectorization from a performance perspective. The heuristic is now: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable vectorization.
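As a plain-Python sketch of the re-tuned condition (names are illustrative, not the actual Inductor code):

```python
def disable_vectorization(num_elems_along_vec_dim, num_ops, tiling_factor):
    # Disable vectorization only for very small loops with few operations;
    # tail-loop mask vectorization covers the remaining cases.
    return num_elems_along_vec_dim < tiling_factor / 4 and num_ops < 10
```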

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-11-04 07:33:32 +00:00
f4ee5a243d Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402)
Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times.

In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish.

In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel.

The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat into our already low retention, so I might just remove it. Will discuss with @slarsen.

Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402
Approved by: https://github.com/oulgen
2024-11-04 06:37:18 +00:00
cyy
3179eb15ae [1/N] Remove usage of C array (#139567)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-11-04 04:52:46 +00:00
cadc50e7e9 LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696)
In the same spirit as https://github.com/pytorch/pytorch/pull/105695

Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@fb.com>
2024-11-04 04:43:42 +00:00
ed30fa74ab [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-04 04:28:40 +00:00
b6fb135c2c [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-04 04:28:40 +00:00
3d633f12ba [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-04 04:28:33 +00:00
66d5e2405d [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-04 04:28:25 +00:00
d189f92eb1 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-04 04:28:18 +00:00
e6ff07f00e [dynamo][guards] Consider tensors as immutable for dict tag matches (#139560)
This is a bug on the main exposed by https://github.com/pytorch/pytorch/issues/139476

We have dict tag optimization where if the dict tag does not change, we
skip guards on all the items of the dict that are "immutable". We
considered tensors as immutable in such scenarios. This is critical for
guard eval performance, because generally users dont change their
parameters.

If I try to remove this optimization, we see slowdowns, e.g., 3.03x to
2.95x on the conv_mixer TIMM benchmark.

So, I am adding a flag which keeps the current state but allows the
users to remove this optimization. Not ideal, but given how important guard eval perf is,
we are in the gray area of the unsoundness vs. performance tradeoff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560
Approved by: https://github.com/jansel
2024-11-04 00:54:20 +00:00
cyy
7f387fa612 [10/N] Fix extra warnings brought by clang-tidy-17 (#139385)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385
Approved by: https://github.com/Skylion007
2024-11-04 00:47:19 +00:00
3242049daa [profiler] Annotate triton kernels with kernel hash (#139531)
As above, annotates triton kernel hash in the profile attributes.

Added a new unit test in profiler to triton/dynamo events.

Testplan:

Running new unit test in CI

Internal:
  buck2 run @mode/dev-nosan caffe2/test:profiler -- -r test_pt2_triton_attributes

Running on an example, this is how the kernel hash file looks
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 1670242, "tid": 1670242,
    "ts": 2413669097354.058, "dur": 95.812,
    "args": {
      "External id": 3,"kernel_hash": "cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu",
"grid": "grid(100,)", "Record function id": 0, "stream": 0, "Concrete Inputs": ["", "", "", "100"], "kernel_file": "/tmp/torchinductor_bcoutinho/qa/cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu.py", "kernel_backend": "triton", "Input type": ["float", "float", "float", "Scalar"], "Input Strides": [[10, 1], [10, 1], [10, 1], []], "Input Dims": [[10, 10], [10, 10], [10, 10], []], "Ev Idx": 2

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139531
Approved by: https://github.com/davidberard98
2024-11-03 23:19:35 +00:00
924e726c3a [SymmetricMemory] introduce a binding for cuMemset32Async (#138755)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.

To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
2024-11-03 21:37:31 +00:00
5d07651c72 only use hint_size in _smart_symbol_sort for size type symbols (#139571)
Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_exponential_kstest_cpu_bfloat16` when specialize_float = False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139571
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484, #139486
2024-11-03 21:15:08 +00:00
cyy
57a49018b1 [5/N] Fix Wextra-semi warning (#139465)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139465
Approved by: https://github.com/ezyang
2024-11-03 20:40:50 +00:00
cyy
03e83111f5 Remove unnecessary check of CUDA 10.2 (#139566)
Since PyTorch now requires higher CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139566
Approved by: https://github.com/ezyang
2024-11-03 20:04:37 +00:00
d84a344410 [Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/138823: `coordinate_descent_tuning` doesn't help on CPU, so prefer lowering `mm`/`bmm` into ATen kernels or the CPP GEMM template.
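For context, the knob in question can be toggled like this (a sketch; after this change it no longer steers the `mm`/`bmm` decomposition on CPU):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.coordinate_descent_tuning = True  # existing Inductor config flag

@torch.compile
def f(a, b):
    return a @ b

f(torch.randn(64, 64), torch.randn(64, 64))
```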

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537
Approved by: https://github.com/jansel
2024-11-03 10:10:29 +00:00
585dbfa583 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-03 06:29:57 +00:00
3a2ab9584f Revert "[executorch hash update] update the pinned executorch hash (#139536)"
This reverts commit 468d592fbc12dfc67d89f954781ccbf540241470.

Reverted https://github.com/pytorch/pytorch/pull/139536 on behalf of https://github.com/huydhn due to This is breaking trunk, need to fix before relanding ([comment](https://github.com/pytorch/pytorch/pull/139536#issuecomment-2453313984))
2024-11-03 06:25:41 +00:00
a1370259ba always specialize float on export path (#139486)
This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout.

Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482, #139484
2024-11-03 04:47:12 +00:00
25f243ff5d Update tensorify pass to specialize symfloats we didn't tensorify away (#139564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139564
Approved by: https://github.com/huydhn
2024-11-03 04:27:43 +00:00
b3ad45733b [Lint] Clang-format all metal kernels (#139530)
Except Quantized.metal, where linting breaks all the ASCII art
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139530
Approved by: https://github.com/cyyever, https://github.com/Skylion007
ghstack dependencies: #139522
2024-11-03 04:14:20 +00:00
468d592fbc [executorch hash update] update the pinned executorch hash (#139536)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139536
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-03 03:14:06 +00:00
067d2a089d Revert "Expose Storage _use_count API in Python (#139426)"
This reverts commit e31136d07bbfb10735df101df953c73d22dde24b.

Reverted https://github.com/pytorch/pytorch/pull/139426 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor job in trunk ([comment](https://github.com/pytorch/pytorch/pull/139426#issuecomment-2453269063))
2024-11-03 02:40:45 +00:00
b8b60e0bc5 add is_integer to support example_value function whitelist (#139484)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_is_integer_dynamic_shapes when specialize_float=False

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139484
Approved by: https://github.com/ezyang
ghstack dependencies: #139451, #139482
2024-11-03 02:01:38 +00:00
f121eab018 [c10d] Remove dead Dynamo marker (#139545)
Per discussion with @anijain2305, `dynamo_unsupported_distributed_c10d_ops` is not referenced anywhere.
Removing this dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139545
Approved by: https://github.com/Skylion007
2024-11-03 00:40:26 +00:00
0f06dff4d7 Restores release_lock_on_cudamalloc behavior in CUDACachingAllocator (#139430)
In https://github.com/pytorch/pytorch/pull/134685, I transformed the following code:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
into:
```CPP
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        // At scope exit, acquire the lock again. This provides safety against
        // any potential exceptions in the cudaMallocMaybeCapturing function.
        auto sg = c10::make_scope_exit([&]() { lock.lock(); });
        lock.unlock();
      }
      auto active_pool = MemPoolContext::getActiveMemPool();
      if (active_pool && active_pool->allocator() &&
          p.pool->owner_PrivatePool) {
        ptr = active_pool->allocator()->raw_alloc(size);
        p.err = ptr ? cudaSuccess : cudaErrorMemoryAllocation;
      } else {
        p.err = cudaMallocMaybeCapturing(&ptr, size);
      }
      if (CUDAAllocatorConfig::release_lock_on_cudamalloc()) {
        TORCH_CHECK(
            lock.owns_lock(), "Failed to acquire lock after cudaMalloc");
      }
```
This is wrong because I didn't realize what `c10::make_scope_exit([&]() { lock.lock(); });` does, so my change doesn't let `release_lock_on_cudamalloc` unlock, execute the alloc, and then lock again; instead it just unlocks and immediately relocks. This PR rectifies that change, and in addition adds an ASSERT ensuring the active pool and p.pool are the same (mirroring the behavior from released_cached_blocks).

Thanks @zvon82 for reporting this!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139430
Approved by: https://github.com/ezyang
2024-11-03 00:04:30 +00:00
a3cb8ee38b AOTAutograd: Make general SymInt hashable when merging view inputs. (#139553)
Fix: #139111

This PR wraps `SymInt` input arguments with `SymIntEqByExpr`, making them hashable when
merging view inputs (`merge_view_inputs` function).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139553
Approved by: https://github.com/ezyang
2024-11-02 23:57:11 +00:00
b46e1fc141 [Dynamo] Fix graph break when tensor.split() is called within a device context manager (#139270)
Fixes: #139183

Note: this case can also be reproduced on cpu
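A hedged sketch of the pattern from #139183 (the exact repro may differ); with the fix this should compile without a graph break:

```python
import torch

@torch.compile(fullgraph=True)  # fullgraph=True surfaces any graph break as an error
def f(x):
    with torch.device("cpu"):   # device context manager around the split call
        return x.split(2)

print(f(torch.arange(6.0)))
```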

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139270
Approved by: https://github.com/ezyang

Co-authored-by: Vincent Moens <vincentmoens@gmail.com>
2024-11-02 23:55:51 +00:00
e31136d07b Expose Storage _use_count API in Python (#139426)
Would be nice to replace the torch._C._storage_Use_Count call in https://github.com/pytorch/torchtune/pull/1936, at least without needing to know about _cdata in OSS code.

Initially keeping it private as Tensor._use_count is also private.

In favor over https://github.com/pytorch/pytorch/pull/139109 in solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139426
Approved by: https://github.com/soulitzer
2024-11-02 23:36:31 +00:00
f6e5d09682 Raise error for int64 and bool dtypes in nanmean, even for empty tensors (#138745)
This PR ensures that the `nanmean()` function raises a `RuntimeError` when using `int64` or `bool` dtypes, even for empty tensors. Previously, non-empty tensors correctly raised errors for unsupported dtypes, while empty tensors did not. This change brings consistent error handling for both cases.

addressing the need raised in an issue by @hyperkai  (Issue [#131043](https://github.com/pytorch/pytorch/issues/131043)).

### Changes

- Added checks in `nanmean_out()` to raise errors for `int64` and `bool` dtypes regardless of tensor size.
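A sketch of the behavior this enforces (both calls are expected to raise after the change):

```python
import torch

for t in (torch.tensor([1, 2], dtype=torch.int64),  # non-empty: already raised before
          torch.tensor([], dtype=torch.int64)):     # empty: now raises too
    try:
        torch.nanmean(t)
    except RuntimeError as e:
        print("RuntimeError:", e)
```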

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138745
Approved by: https://github.com/ezyang
2024-11-02 22:52:40 +00:00
232af152b5 Fix graph breaks related to specialized float inputs (#139482)
Fixes issue with timm models where

example_value = 0.09999
proxy.node.target = <built-in function sub>

would fall through to

```
        unimplemented(
            "torch.* op returned non-Tensor "
            + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}",
            case_name="unsupported_operator",
        )
```

and graph break

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482
Approved by: https://github.com/ezyang
ghstack dependencies: #139451
2024-11-02 21:58:46 +00:00
854be65fa0 Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013)"
This reverts commit 55038aa66162372acc1041751d5cc5c8ed9bc304.

Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2453085932))
2024-11-02 18:29:10 +00:00
e9eb7b1b13 [CI] Skip test_cuda_tracker_equivalence for ROCm (#139543)
Test fails on ROCm, skipping it for this platform.
Resolves https://github.com/pytorch/pytorch/issues/139515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139543
Approved by: https://github.com/huydhn
2024-11-02 15:39:07 +00:00
92d7f29e59 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit f6be44c74e012fb4329e6e716ebb78e9f5092a3b.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))
2024-11-02 13:11:04 +00:00
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d207058d49f0ae40ca408214ee88b76b.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
55038aa661 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through until now because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-11-02 07:47:55 +00:00
2a3fe06ce0 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 39ec5a20ea3d7bc8c2147f8363f8a06f4bb1e953.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))
2024-11-02 07:19:22 +00:00
f3238106fd Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 030f70b40bca62993bd65d03c58ded45601abe35.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))
2024-11-02 06:53:48 +00:00
0863d6a08e Revert "[inductor] Remove SIMDKernel.last_usage (#139364)"
This reverts commit 286d3ce266ce01ca905afb1cc9ea5d81abf79ff7.

Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:11 +00:00
9331640e26 Revert "[inductor] Remove Node.last_usage mutation (#139365)"
This reverts commit 1e934b473cabe6bc003f66d9811082e97c958a31.

Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
dc4b459737 Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370)"
This reverts commit b57b4b7f9b168389def15ea06a4dcf9e5f6f4f04.

Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
66a401c9e1 Revert "[inductor] Simplify remove_kernel_local_buffers (#139452)"
This reverts commit 73c0762a34ef152450287dbc365cb8db930031b7.

Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
98e11b0021 Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)"
This reverts commit c53beab3775671b5b7ec6106737c0d8939b8455a.

Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
fdd298dcb7 add hex method on SymFloat (#139451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451
Approved by: https://github.com/ezyang
2024-11-02 05:33:19 +00:00
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf8736e4d66375688dfd8b45c401de3fef.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
540f3ef9b1 Fix flex_decode to build offsets off of strides (#139516)
Fixes PR: https://github.com/pytorch/pytorch/issues/139462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516
Approved by: https://github.com/Chillee
2024-11-02 03:17:46 +00:00
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
a46a79fe92 [AOTI] Ignore .o files in package_aoti (#139153)
Summary: There is no point to package .o files since a .so file is included in that package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153
Approved by: https://github.com/angelayi
2024-11-02 03:10:05 +00:00
c53beab377 [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-02 03:04:22 +00:00
387b120549 [ONNX] Remove type promotion rule for pow (#139527)
ONNX supports different input types in Pow, so type promotion is not needed.

The resulting graph is the following:

```py
ONNXProgram(
    model=
        <
            ir_version=9,
            opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1},
            producer_name='pytorch',
            producer_version='2.6.0a0+git59a1af5',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"x"<FLOAT16,[3]>
            ),
            outputs=(
                %"pow_1"<FLOAT16,[3]>
            ),
        ) {
            0 |  # node_Constant_0
                 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)}
            1 |  # node_Pow_1
                 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0")
            return %"pow_1"<FLOAT16,[3]>
        }
...
    ,
    exported_program=
        ExportedProgram:
            class GraphModule(torch.nn.Module):
                def forward(self, x: "f16[3]"):
                     # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0
                    pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0);  x = None
                    return (pow_1,)

        Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)])
        Range constraints: {}

)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527
Approved by: https://github.com/titaiwangms
2024-11-02 02:19:50 +00:00
7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
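
As a rough usage illustration (assuming an AVX2/AVX512 build; the acceleration is reached through the ordinary sort API, no opt-in is involved):

```py
import torch

# Large contiguous float32 sort: the case the tables above show benefiting most.
x = torch.randn(1_000_000, dtype=torch.float32)
values, indices = torch.sort(x)

# Discontiguous case: sorting along a strided dimension.
y = torch.randn(4096, 4096).t()
values_t, indices_t = torch.sort(y, dim=-1)
```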

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
edd3f5a94d [profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461)
Fix: https://github.com/pytorch/pytorch/issues/139460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16
2024-11-02 01:02:29 +00:00
092fe2f422 Handle nan case when checking mutations (#139483)
Test Plan: PT2 readiness models

Differential Revision: D65340986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483
Approved by: https://github.com/zou3519
2024-11-02 00:49:05 +00:00
b71e813bce [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-11-02 00:39:36 +00:00
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macos. Unify the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
aa54b2467f [executorch hash update] update the pinned executorch hash (#139133)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139133
Approved by: https://github.com/pytorchbot
2024-11-02 00:14:47 +00:00
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
c95adb9c5b Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit f5b9e725d14a9a2906b7f1701d97cb4e95891a92.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a bunch of tests are failing after it lands, it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2452723863))
2024-11-01 23:42:16 +00:00
b617d4813c Revert "fix dynamo tracking numpy 2 ops (#138686)"
This reverts commit 124eac255e3af04379721af09631a45a05c7fb05.

Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))
2024-11-01 23:34:06 +00:00
77b72d686e [BE][MPS] Make metal shaders compile cleanly (#139522)
I.e. without warnings, by deleting dead code and fixing one
signed-unsigned comparison warning

Also, pass `-Werror` to the metal compiler if the WERROR option is set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522
Approved by: https://github.com/Skylion007
2024-11-01 23:22:47 +00:00
2382b3b6d8 [Easy] Add joint graph passes, fallback_random to bisector (#139295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295
Approved by: https://github.com/zou3519, https://github.com/exclamaforte
2024-11-01 23:21:53 +00:00
1e73842029 Refactor FxGraphDrawer to use HTML-like labels (#137726)
Fixes https://github.com/pytorch/pytorch/issues/137499
Testing: Added a new unit test to make sure that the regression case succeeds.
I'm debating about whether to make the borders visible. I'm partial to no borders, but it might make it harder for some people to read?
![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932)
Vs.
![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726
Approved by: https://github.com/eellison, https://github.com/malfet
2024-11-01 23:19:50 +00:00
60542eeb33 [inductor] set sanitize_overflow=False for triton kernels (#139502)
In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502
Approved by: https://github.com/bertmaher
2024-11-01 23:10:21 +00:00
da395384a2 Delete Windows GPU jobs in periodic (#139336)
As an outcome of https://fburl.com/gdoc/voce5o06, we could stop running Windows GPU tests on periodic pending the green light from MS. No one is monitoring these jobs atm.

We already have Windows CUDA and CPU build jobs in trunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139336
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:26:22 +00:00
4c64a7f33f [pgnccl] add a restart test for PGs in blocking mode (#139496)
Summary:
Restarting (aborting and re-initializing a PG) is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.

Add this test to verify that this is supported by current NCCL.
Note that this restart test passes reliably only in blocking mode for now.
In nonblocking mode there is a problem in either NCCL init or abort that
needs further investigation.
Test Plan:
new UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-11-01 22:13:37 +00:00
0b13bdd877 Delete parallelnative jobs in periodic (#139328)
As an outcome of https://fburl.com/gdoc/voce5o06, we can now clean up parallelnative build and test jobs in periodic.  There is not much value in running them anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139328
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:05:13 +00:00
8eb75cbad6 Delete iOS jobs from periodic (#139345)
As an outcome of https://fburl.com/gdoc/voce5o06 and confirm with @iseeyuan, we can now clean up iOS lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/ios-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139345
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:04:27 +00:00
8ad76efb8d Delete Vulkan jobs from periodic (#139354)
As an outcome of https://fburl.com/gdoc/voce5o06, we can clean up this job now as the backend has been marked as deprecated https://pytorch.org/tutorials/prototype/vulkan_workflow.html to be replaced by the ExecuTorch Vulkan delegate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139354
Approved by: https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:03:12 +00:00
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and were compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see  https://github.com/pytorch/pytorch/pull/125550 )

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into `.metallib`

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds a check on whether full Xcode is installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using `-sectcreate` linker option and `getsectiondata` API call. `metallib_dummy.cpp` file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`

The bulk of the changes introduced in this PR is just moving code around. I.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder and the embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross, repeat and view |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated than it used to be
  - Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to populate this id automatically for internal MAST jobs, but generic OSS users will have to opt into it explicitly.
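
A small sketch of the behavior being profiled (plain torch.compile; the job-id opt-in itself is described above and not shown here):

```py
import torch

@torch.compile
def f(x):
    return x * 2

# Without a shared profile, the first call compiles with static shapes and the
# second, differently-shaped call triggers the automatic_dynamic recompile that
# marks the dimension dynamic. With a saved profile keyed by a job id, a fresh
# process could mark that dimension dynamic up front and skip the extra recompile.
f(torch.randn(8))
f(torch.randn(16))
```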

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
9c2ffce71a add condition for freeable input buffer (#139480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480
Approved by: https://github.com/yf225
ghstack dependencies: #139396
2024-11-01 21:15:40 +00:00
18f3b3c991 Clean up Android jobs in CI (#139350)
As an outcome of https://fburl.com/gdoc/voce5o06 and confirm with @iseeyuan, we can now clean up Android lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/android-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139350
Approved by: https://github.com/ZainRizvi
2024-11-01 21:10:19 +00:00
c412a42ae2 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics. It will be much easier to do in OSS, and having the relevant timing code in the OSS area of the codebase makes that refactor simpler. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-01 21:06:59 +00:00
0e57f2b589 [invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326
Approved by: https://github.com/zou3519
ghstack dependencies: #139216, #139130
2024-11-01 21:02:32 +00:00
6a268c3fbb [invoke_subgraph] Generate fake_inputs correctly (#139130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130
Approved by: https://github.com/zou3519
ghstack dependencies: #139216
2024-11-01 21:02:32 +00:00
4c756cacfd [invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216
Approved by: https://github.com/zou3519
2024-11-01 21:02:32 +00:00
5d67efb809 [ONNX] New registration API (#135403)
The ONNX custom ops registration API.

## Design

1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] | Callable" parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
		- When there is a single callable, it is the translation function for the op
		- When there is a Sequence of callables, the exporter's dispatcher will dispatch to these callables in order based on input dtypes.
		- The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function.
		- Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
Support for functions that handles complex inputs will be in separate PRs.

fixes https://github.com/pytorch/pytorch/issues/138391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
2024-11-01 20:58:54 +00:00
f5b9e725d1 use more elements per thread for narrow dtypes (#139449)
Fix perf issue for narrow type by accessing more elements per thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-01 20:41:13 +00:00
73c0762a34 [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-01 20:36:39 +00:00
dcdcb8b364 Avoid overflow in float32-to-int32 test (#139489)
Summary:

Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (2.0 is 0x40000000,
times 2 is 0x80000000 which overflows a signed int32).

Assertion `int32 overflow detected for operation mul` failed
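
A quick arithmetic check of that statement (not the test itself):

```py
import struct

# float32 2.0 has bit pattern 0x40000000; doubling that value exceeds INT32_MAX,
# which is exactly what Triton's sanitize_overflow assertion flags.
bits = struct.unpack("<I", struct.pack("<f", 2.0))[0]
assert bits == 0x40000000
assert bits * 2 == 0x80000000
assert bits * 2 > 2**31 - 1  # overflows a signed int32
```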

Fixes #139479

Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
2024-11-01 20:22:19 +00:00
0dbc284a72 [SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-11-01 20:17:15 +00:00
124eac255e fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
With the upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` as unsupported in dynamo tracing.
This PR changes the filtering rules to include those ops while keeping behavior with NumPy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
2024-11-01 19:51:40 +00:00
ea0e09b3f3 Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221)
Fixes https://github.com/pytorch/pytorch/issues/129698

https://github.com/pytorch/pytorch/pull/139106 without pickletools

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221
Approved by: https://github.com/malfet
ghstack dependencies: #138936
2024-11-01 19:31:39 +00:00
f3b485eb2a [reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been
more than two weeks since this landed. You can flip the config locally
as a workaround.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064
Approved by: https://github.com/albanD, https://github.com/eellison
2024-11-01 19:21:16 +00:00
abc5d59dcb config: create Config objects with JK support (#138766)
This teaches install_config_module (and the underlying code) to
understand Config objects. Additionally we've added a JK option to this,
which resolves the JK.

This config gets stored within the _ConfigEntry class and is evaluated
when __getattr__ is called. If justknobs is set, it'll call
justknobs_check to get the result.

Due to preceding work, basically everything works correctly here; we
had to update a couple of tests and modify the getattr behaviour.

Note that we are updating the justknob_check function to support a
default option, to make defaults work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766
Approved by: https://github.com/ezyang
2024-11-01 19:20:37 +00:00
eqy
6fc63b4ef1 [ROCM][CUDA][NCCL] Disable test_lowering_one_shot_all_reduce on ROCM (#139414)
I'm not sure this is expected to run if it requires buffer-registration support CC @yifuwang @huydhn @syed-ahmed #138029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139414
Approved by: https://github.com/huydhn, https://github.com/yifuwang
2024-11-01 18:39:47 +00:00
391ee62180 Ensure scalar tensor device matches attn_mask for convert_boolean_attn_mask_cudnn. (#139450)
This is causing a small performance hit when using SDPA with the cuDNN backend due to unnecessary host-to-device memcpy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139450
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-11-01 18:38:02 +00:00
d8b606ecb5 [fx graph cache] Support freezing with FX graph caching (#136505)
Summary: The main changes to support freezing are:
1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata.
2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places.
3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module that needs to be loaded.
4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along.
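
For orientation, a hedged sketch of enabling the two features whose interaction this change addresses (the config names below are the commonly used Inductor flags and are an assumption here, not quoted from the PR):

```py
import torch
import torch._inductor.config as inductor_config

inductor_config.freezing = True        # assumed flag enabling Inductor freezing
inductor_config.fx_graph_cache = True  # assumed flag enabling the FX graph cache

model = torch.nn.Linear(16, 16).eval()
compiled = torch.compile(model)
with torch.no_grad():
    compiled(torch.randn(2, 16))
```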

Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505
Approved by: https://github.com/eellison
2024-11-01 18:29:29 +00:00
7d644f025f make equation behind torch.isclose element-wise (#138459)
The current formula behind torch.isclose, according to the docs, is
![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd)

However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs say that `input` and `other` are the first and second tensors to compare, respectively. I propose the following change, to stress the element-wise nature of the norms in the equation:
![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c)
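
For readers without the image attachments, a reconstruction of the two forms (the first matches the documented formula; the subscript in the second is the proposed element-wise emphasis):

```latex
% Current form in the docs:
\lvert \mathrm{input} - \mathrm{other} \rvert \le \mathrm{atol} + \mathrm{rtol} \times \lvert \mathrm{other} \rvert
% Proposed element-wise form:
\lvert \mathrm{input}_i - \mathrm{other}_i \rvert \le \mathrm{atol} + \mathrm{rtol} \times \lvert \mathrm{other}_i \rvert
```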
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459
Approved by: https://github.com/soulitzer
2024-11-01 18:18:33 +00:00
1857be1b48 Fix S390 builds (#139491)
Caused by https://github.com/pytorch/pytorch/pull/137918. Fixed by guarding all cpuinfo use with `!defined(__s390x__) && !defined(__powerpc__)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139491
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-01 18:16:29 +00:00
51adab0829 [MPS] Fix reduction ops outputs for empty tensors (#139446)
By adding a switch over all reduction types that either sets the output to a given value or raises a runtime error.
Before this change, reduction ops returned uninitialized values in many cases.

Fixes https://github.com/pytorch/pytorch/issues/139400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139446
Approved by: https://github.com/Skylion007
2024-11-01 17:32:12 +00:00
7d081cabfb [AOTI] Forward fix #139458 (#139485)
Summary: A new test added in https://github.com/pytorch/pytorch/pull/139458 only fails on certain CI instances. Skip it for now as the failing test has a low priority.

@diff-train-skip-merge (to silent fb bot so that I can land this myself)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139485
Approved by: https://github.com/huydhn, https://github.com/hl475
2024-11-01 17:14:40 +00:00
3e0f4d18eb [PyTorch] Support non-zero beta in fp16_gemv_trans (#138275)
No real reason to have the zero-beta restriction, so let's lift it.
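
For context, the standard BLAS-style operation a transposed gemv computes is (a textbook definition, not quoted from the PR):

```latex
y \leftarrow \alpha \, A^{\mathsf{T}} x + \beta \, y
```

Lifting the restriction simply means beta is no longer required to be zero, i.e. the kernel can accumulate into an existing y.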

Testing: intentionally broke new paths locally to verify test coverage existed

Differential Revision: [D64407752](https://our.internmc.facebook.com/intern/diff/D64407752/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138275
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918, #138005
2024-11-01 16:49:05 +00:00
195b1b9a9b [PyTorch] Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures (#138005)
Following up on previous rev to use fp16_gemv_trans in gemv, not just gemm-used-for-gemv.

Differential Revision: [D64351092](https://our.internmc.facebook.com/intern/diff/D64351092/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138005
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083, #137918
2024-11-01 16:49:05 +00:00
fad5d89321 [PyTorch] Hook up fp16_gemv_trans to x86 fp16 GEMM (#137918)
This is the first big milestone we've been building towards!
(Following rev also hooks this up to actual gemv.)
Testing: To check perf, I ran python torchchat.py generate stories110M
--dtype fp16 --device cpu on an x86 machine without AVX512FP16. Observed roughly 5x tokens/sec increase.
Differential Revision: [D64280688](https://our.internmc.facebook.com/intern/diff/D64280688/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64280688/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137918
Approved by: https://github.com/malfet
ghstack dependencies: #139082, #139083
2024-11-01 16:48:56 +00:00
d79c5143d8 [PyTorch] Add efficient isnan for NEON half (#139083)
Same as the efficient one for float when f16 hardware support is available.

Testing: Added exhaustive isnan test coverage

Differential Revision: [D65003321](https://our.internmc.facebook.com/intern/diff/D65003321/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139083
Approved by: https://github.com/malfet
ghstack dependencies: #139082
2024-11-01 16:40:51 +00:00
9ecd7d1587 [PyTorch] Add efficient isnan for NEON float (#139082)
Just test x != x rather than applying element-by-element scalar isnan.

Testing: vec_test_all_types checks IsNan

Differential Revision: [D65001633](https://our.internmc.facebook.com/intern/diff/D65001633/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139082
Approved by: https://github.com/malfet
2024-11-01 16:40:51 +00:00
3cbf0c0bbf [Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688)
# Summary

The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights.

This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension.

The figure below is from the document linked above -

<img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94">

## Performance data

Collected on 48 physical cores of one socket of Intel Xeon  Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded.

|M | N | K | Latency with ATen _weight_int8pack_mm | Latency with codegened templated GEMM (current main branch) | Latency with codegened templated GEMM (this PR) |
|-----|-----|-----|------|----------|----|
|4096|4096|4096| 45.844 ms | 9.322 ms| 5.2181 ms |
|4096|11008|4096| 127.618 ms |24.6258 ms | 13.6046 ms|
|4096|4096|11008| 121.953 ms | 25.4692 ms | 10.2669 ms |
|4096|32000|4096| 478.450 ms| 75.3942 ms | 48.21 ms |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688
Approved by: https://github.com/jgong5
2024-11-01 16:32:22 +00:00
b57b4b7f9b [inductor] Move remove_kernel_local_buffers to Kernel (#139370)
This method mutates the kernel, so it fits better in that class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365
2024-11-01 16:28:15 +00:00
1e934b473c [inductor] Remove Node.last_usage mutation (#139365)
I can't figure out why this is needed.  Let's see if tests fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365
Approved by: https://github.com/shunting314
ghstack dependencies: #139364
2024-11-01 16:28:15 +00:00
286d3ce266 [inductor] Remove SIMDKernel.last_usage (#139364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 16:28:15 +00:00
df0c1eceb9 [pgnccl][simple] clean up unused members of PGNCCL (#139436)
Summary:
Found these unused members when prototyping something else.
Better to remove unused members.
Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436
Approved by: https://github.com/Skylion007
2024-11-01 16:25:04 +00:00
33dce10ece [AOTI][reland] Update zero size computation in clone_preserve_strides (#139458)
Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensor correctly.

Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458
Approved by: https://github.com/hl475
2024-11-01 13:51:02 +00:00
560a0704c5 Use a different test name for testConversionToStringView (#139448)
Summary:
The change comes from D65214804 (https://github.com/pytorch/pytorch/pull/139239)

`buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` doesn't like having 2 `testConversionToString` in the same suite `StringViewTest`, so just need to use a different name there.

Test Plan: `buck2 test @//fbobjc/mode/buck2/ios-tests fbsource//xplat/caffe2/c10:c10_testApple` passes

Differential Revision: D65314266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139448
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-11-01 13:25:16 +00:00
e6e140c3d7 [Inductor] fix a compilation time regression caused by user-visible output handling (#139420)
This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732.

The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference.

```
Before the offending PR

compilation_latency mean=176.022 seconds
compilation_latency mean=176.564 seconds

On the offending PR

compilation_latency mean=180.096 seconds
compilation_latency mean=179.101 seconds

On the fix

compilation_latency mean=173.153 seconds
compilation_latency mean=174.182 seconds
```

(I think the fix being faster than the baseline is due to noise)

The cause of the regression is an inefficiency in `is_user_visible_output()`. Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR had the assumption that `len(output_node.args[0])` is rather small. However, it has been proven false by the benchmark (it was 1900+ for timm_models/hrnet_w18).

The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`.
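
A hypothetical, self-contained sketch of that pattern (names are illustrative, not the actual Inductor code):

```py
class Node:
    pass

output_args = [Node() for _ in range(1900)]

# Before: a linear scan per query, O(N) each, O(N^2) over all output nodes.
def output_idx_slow(node):
    return output_args.index(node)

# After: build the mapping once, then every lookup is O(1).
output_index = {node: idx for idx, node in enumerate(output_args)}

def output_idx_fast(node):
    return output_index[node]

assert output_idx_slow(output_args[42]) == output_idx_fast(output_args[42])
```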

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420
Approved by: https://github.com/ezyang
2024-11-01 08:27:40 +00:00
307ee7926e [Workflow][1/3] Remove benchmark tests from rerun disabled tests (#139337)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest tests.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Manual Test
- Test run Inductor.yml:
https://github.com/pytorch/pytorch/actions/runs/11603287758/job/32309968542?pr=139337
- Test run inductor-unittest.yml ([3cbd83d](3cbd83d3d5))
https://github.com/pytorch/pytorch/actions/runs/11605399925/job/32315737205?pr=139337

# Steps to fix the issue

- [x]  [**THIS PR**] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139337
Approved by: https://github.com/huydhn
2024-11-01 08:23:51 +00:00
f7407b3de0 [Workflow][2/3] Remove benchmark tests from rerun disabled tests (#139407)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest tests.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [x] [**THIS PR**] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [ ] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139407
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-11-01 08:09:31 +00:00
5e4c8b671c [inductor] loaf-fix (#139376)
Fix https://github.com/pytorch/pytorch/issues/128063 .

Now for this snippet
```
        def f(x):
            y = torch.sum(torch.sum(x, dim=-1))

            z = x / 10.0
            z_t = z.t().contiguous().t()
            return y, z, z_t
```
Inductor can generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled), and the generated kernel reads `x` only ONCE (without proper handling, the two pointwise ops may each read `x` once even when they are fused).

The PR fixes 2 subtle bugs regarding LOAF:
1. When we reorder loops for a FusedSchedulerNode, we check whether each sub-node's sizes match. But some nodes have sizes of `list` type (if their loop is not reordered) while others have sizes of `tuple` type (if their loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement future code could break this, so I just convert sizes to a uniform type before comparison.
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache; otherwise a stale tiling decision can result in a (very) bad kernel.
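
A tiny hypothetical sketch of the first fix, normalizing the container type before comparing (names are illustrative):

```py
def sizes_match(a, b):
    # Some nodes carry their sizes as lists, others (after loop reordering)
    # as tuples; compare on a normalized type instead of relying on callers.
    return tuple(a) == tuple(b)

assert sizes_match([1024, 1024], (1024, 1024))
assert not sizes_match((8, 16), [16, 8])
```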

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
2024-11-01 07:54:32 +00:00
39ec5a20ea [Partitioner] Enumerate partitions by iterating partition ids (#136598)
Currently, we get all partition ids by iterating over `assignment`, whose size equals the number of nodes in the graph. But we can reach the same result by iterating over `partitions_by_id`, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch.
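
A self-contained toy illustration of the complexity argument (the data structures are invented for illustration, not the Partitioner's actual code):

```py
# node -> partition id, one entry per graph node (size N)
assignment = {f"n{i}": i % 3 for i in range(12)}

# partition id -> nodes, one entry per partition (size P, with P << N)
partitions_by_id = {}
for node, pid in assignment.items():
    partitions_by_id.setdefault(pid, []).append(node)

# Before: discover the partition ids by walking every node (O(N) per pass).
ids_via_nodes = {assignment[n] for n in assignment}
# After: walk the much smaller partition map directly (O(P) per pass).
ids_via_partitions = set(partitions_by_id)

assert ids_via_nodes == ids_via_partitions
```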

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598
Approved by: https://github.com/tarun292

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 07:42:36 +00:00
61df90e3f6 Add TORCHDYNAMO_EXTENDED_ADVICE (#137159) (#137196)
Fixes #137159

Happy to contribute to this project for the first time. If I missed any contribution guidelines, please let me know, I'm happy to adjust.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137196
Approved by: https://github.com/ezyang
2024-11-01 06:43:26 +00:00
86db2cd194 [export] Initial draft export (#139383)
Differential Revision: [D65288590](https://our.internmc.facebook.com/intern/diff/D65288590)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139383
Approved by: https://github.com/zou3519
2024-11-01 06:25:44 +00:00
300ca6368f Remove deprecated alias macro (2/3) (#137559)
**Detailed Descriptions:**
- Remove AT_ASSERTM Macro
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559
Approved by: https://github.com/ezyang
2024-11-01 06:17:57 +00:00
0c47657b05 [dynamo] ignore False/None callback in fail_on_recompile/force_backend stances (#139215)
Fix https://github.com/pytorch/pytorch/issues/139202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139215
Approved by: https://github.com/jansel
2024-11-01 06:15:28 +00:00
cyy
4a2da52137 [1/N] Replace c10::sv with std::sv (#139453)
Picks some safe replacements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453
Approved by: https://github.com/Skylion007
2024-11-01 05:39:37 +00:00
cyy
6ef6b3f586 Remove const fromDLPack overload (#139156)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139156
Approved by: https://github.com/ezyang
2024-11-01 04:12:46 +00:00
84416618a6 [Pipelining] Update schedules to use I, B actions. (#138886)
Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD)
consistently.

Previously, schedules would issue a 'B' operation and leave it ambiguous
whether that operation should be BACKWARD_INPUT or FULL_BACKWARD,
depending on a separate flag (use_full_backward) passed to the schedule
class, which would determine which behavior was taken at runtime.

Now, use_full_backward is removed and the schedule class is required to
produce unambiguous IR.  The logic for 'use_full_backward' is removed
from the runtime.

_validate_pipeline_order is replaced with _simulate_comms_compute. Both
offer similar functionality, to validate the correctness of a schedule
IR.  'validate' operates on compute-only IR, while simulate operates on
compute + comm IR.  To convert from using validate to simulate, you have
to first insert comm actions via '_add_send_recv'.

'simulate' was inefficiently written before this PR and needed to be
optimized to run quickly for extra large schedules with >32 ranks and
microbatches per rank used in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886
Approved by: https://github.com/H-Huang
2024-11-01 03:54:06 +00:00
094d288f40 Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:

1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization

This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:

1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
2024-11-01 03:18:02 +00:00
c8a648d4df Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309)
This diff considerably changes the column format of PT2 Compile Events:

- Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention.

- Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba.

Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309
Approved by: https://github.com/oulgen
2024-11-01 02:40:25 +00:00
46bca8a4b6 Export XPU oneDNN header to the public (#139177)
# Motivation
Export the oneDNN header to the public; for example, a third-party extension can now use `GpuStreamManager` to manage `dnnl::stream` to submit oneDNN kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139177
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/malfet
2024-11-01 02:36:16 +00:00
04382efe5e [Bash][3/3] Remove benchmark tests from rerun disabled tests (#139422)
Fixes [#5774](https://github.com/pytorch/test-infra/issues/5774)
# Overview
Remove benchmark tests from rerun-disabled-tests; these are considered non-unittest tests.
See one page doc: [[Bootcamp Task] Remove non-unittest test during rerun-disabled-tests](https://docs.google.com/document/d/1xffkt_LNC5ZLsoVQDmuKbNqYnMUW_xYYStv66Pr-qac/edit?tab=t.0)

# Steps to fix the issue
- [ ] Create inductor-unittest.yml to handle unit test and daily rerun for inductor
- [ ] Create Inductor-cu124-unittest.yml to handle unit tests and daily rerun for inductor-cu124
- [x] Disable benchmark test in mixed test such as CPP_Wrapper which includes both unittest and benchmark test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139422
Approved by: https://github.com/huydhn
2024-11-01 01:49:58 +00:00
030f70b40b Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
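
A minimal sketch of that pattern (shapes arbitrary; whether the buffer actually gets inplaced depends on the scheduler, per the summary above):

```py
import torch
import torch.nn.functional as F

def matmul_layernorm(x, w):
    y = x @ w
    # With this change, the LayerNorm can reuse (inplace) the matmul output
    # buffer, since its other users are inconsequential.
    return F.layer_norm(y, y.shape[-1:])

compiled = torch.compile(matmul_layernorm)
out = compiled(torch.randn(64, 128), torch.randn(128, 256))
```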

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-11-01 01:24:40 +00:00
8ace3e8023 Add sv starts/ends_with (#139261)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-01 01:17:42 +00:00
2a309c0997 Fix weights_only for BUILD instructions for user allowlisted objects with __slots__ (#138936)
Previously the `BUILD` instruction missed handling for `__slots__`. **This only applies to things allowlisted via `add_safe_globals`/`safe_globals` that use slots.**

### Background
When does pickle serialize a `BUILD` instruction? When `state` is not `None` and `state_setter` is `None` [[link](c5b99f5c2c/Lib/pickle.py (L765))]. In this case, the docs tell us that either `__setstate__` or a `__dict__` update will be performed [[link](https://github.com/python/cpython/blob/3.13/Lib/pickletools.py#L1984)]

`__reduce__`/`__reduce_ex__` are expected to return tuples of length 2 to 6 where `state` is the 3rd argument. When user doesn't patch `__reduce__` but patches `__setstate__`/`__getstate__`, state will be what is yielded by `__getstate__`

Note the return type for [`__getstate__` ](https://docs.python.org/3/library/pickle.html#object.__getstate__)

- For a class that has no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is None.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is `self.__dict__`.
- For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is a tuple consisting of two dictionaries: `self.__dict__`, and a dictionary mapping slot names to slot values. Only slots that have a value are included in the latter.
- For a class that has [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__) and no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__), the default state is a tuple whose first item is None and whose second item is a dictionary mapping slot names to slot values described in the previous bullet.

see handling in pickle code c5b99f5c2c/Lib/pickle.py (L1846-L1867)

Before this PR, we didn't account for the fact that when `__setstate__` is not defined, `state` might be a tuple, so this would fail:

```python
import torch
from dataclasses import dataclass

# Define the dataclass
@dataclass
class MyDataClass:
    __slots__ = ["x", "y"]
    x: int
    y: str
# Create an instance of the dataclass
my_data = MyDataClass(x=2, y=3)
# Save the dataclass to a file
torch.save(my_data, "my_data.pt")
with torch.serialization.safe_globals([MyDataClass]):
    loaded_my_data = torch.load("my_data.pt", weights_only=True)
# AttributeError: 'MyDataClass' object has no attribute '__dict__'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138936
Approved by: https://github.com/malfet
2024-11-01 00:59:29 +00:00
c2ffd41a86 [inductor] Enable AMD cooperative reduction tests (#139230)
Fixes #139099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139230
Approved by: https://github.com/eellison
2024-11-01 00:55:13 +00:00
f9ef880c0b [inductor] Refactor kernel args into SIMDKernelFeatures (#139327)
This is a refactor PR to move stuff around.  I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327
Approved by: https://github.com/eellison, https://github.com/shunting314
2024-11-01 00:30:14 +00:00
b6b9596607 Revert "[dynamo] Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit 44257c063e2f7bd9b35e6e4973f89d7f1cb65442.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break some internal tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2451050605))
2024-11-01 00:13:20 +00:00
d33849908d [aotd] Fuse tangents subclasses runtime traversals (#139068)
Reason:
Currently we have multiple traversals for tangents in runtime:
 - To check that types and structure are identical to what we guessed during tracing time
 - Coerce metadata
 - Coerce memory_format
 - Unwrap_tensor_subclass
All of them traverse tangents via `__tensor_flatten__` calls over the tree of subclasses.

Change:
To do everything in one traversal at runtime (including flattening)

Implementation details:

1. Add memory_format information inside SubclassCreationMeta; for PlainTensors, keep not only the (int) unwrapped_index but also the memory_format. Preparing memory_format is optional (controlled by with_memory_format=True).

2. Remove the unused subclass_utils.create_metadata_for_subclass, which has no usages inside torch and would otherwise require updating this logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068
Approved by: https://github.com/bdhirsh
2024-11-01 00:03:02 +00:00
86602a66d7 [orm] fix live_memory computation in lpmf algorithm (#139396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139396
Approved by: https://github.com/yf225
2024-10-31 23:45:30 +00:00
3d3551506d Revert "[dynamo, 3.13] fix bytecode nop tests (#139323)"
This reverts commit c2d754441f8e941c208579661a04b5ed1e5e71bc.

Reverted https://github.com/pytorch/pytorch/pull/139323 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in instruction count metric ([comment](https://github.com/pytorch/pytorch/pull/139323#issuecomment-2451017609))
2024-10-31 23:34:00 +00:00
6727f343b5 [c10d][fr][easy] Move check_no_missing_dump_files (#139417)
Summary:
Move check_no_missing_dump_files to after the "just print" location.
This allows us to print dump_files when there are actual missing files.

Test Plan:
```
torchfrtrace -j ~/pyper-training-online-924394600  --selected-ranks 1 2

Inferred common prefix nccl_trace_rank_
loaded 95 files in 0.040270328521728516s
built groups, memberships
Rank 1                                                              Rank 2
------------------------------------------------------------------  ------------------------------------------------------------------
broadcast(input_sizes=[[2]], state=completed)                       broadcast(input_sizes=[[2]], state=completed)
```
Without this change, the command was erroring out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139417
Approved by: https://github.com/Skylion007, https://github.com/fduwjj
2024-10-31 22:55:01 +00:00
8e8040a5c2 [Pipelining] Optimize ready_to_schedule logic (#138924)
Used in both simulator and add_send_recv pass, the ready_to_schedule
logic works by looking at all the previously scheduled ops on a rank to
see if any of them 'unblocks' the current op to be scheduled.  For example,
to schedule a FORWARD op, a previous RECV_F op is needed, unless this is
stage 0 or there is a previous stage on the same rank that ran FORWARD
already.

The old implementation iteratively compared the candidate op to the
previous ops.  The new implementation uses set lookups to reduce
complexity.  It also maintains the set of previous ops as ops are
scheduled rather than constructing a set on demand.

I did not save benchmark results, but this results in a 10-100x speedup
which is most noticeable for unit tests with artificially huge schedule
IR, the largest of which took longer than 20m before (I never let it
finish) but now takes less than 14s.  Most schedules take less than
10ms.
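
An illustrative sketch of the complexity change described above; the names `ready_old`, `SetScheduler`, and `deps` are hypothetical, not the actual pipelining API.

```python
def ready_old(candidate, prev_ops, deps):
    # Old style: rescan the full list of previously scheduled ops per query.
    return all(any(p == d for p in prev_ops) for d in deps.get(candidate, ()))

class SetScheduler:
    # New style: keep a set of scheduled ops, updated incrementally,
    # so each readiness check is a handful of O(1) membership lookups.
    def __init__(self, deps):
        self.deps = deps
        self.scheduled = set()

    def ready(self, candidate):
        return all(d in self.scheduled for d in self.deps.get(candidate, ()))

    def schedule(self, op):
        self.scheduled.add(op)

deps = {"1F0": {"1RECV_F0"}}          # stage-1 forward needs its recv first
sched = SetScheduler(deps)
print(sched.ready("1F0"))             # False
sched.schedule("1RECV_F0")
print(sched.ready("1F0"))             # True
```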

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928, #131762
2024-10-31 22:49:45 +00:00
c82e0d117a [Pipelining] Support separate dI / dW and V-schedules (#131762)
### Separate dI / dW:

PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD
or separate dI / dW operations.

Separating the B and W may add execution overhead or may be suboptimal
in cases where BW are 'fused', but it is worthwhile when separating B, W
lets the schedule be more efficient by filling in bubbles.  In some
cases, the schedule will still issue B followed by W at certain points,
so in these cases just merge them back into BW ops and execute them as
full backwards rather than executing a B followed by a W.

### V-schedules:

V-schedules have a special case where the last rank has 2 adjacent
stages.

E.g. if rank3 had stage 3 and stage 4, then we should implement direct
transfer of stage3 outputs to stage4 inputs without a
send/recv.

In the scheduling logic, we must also allow scheduling the stage 4 forward
after running the stage 3 forward, without expecting a stage 4 RECV_F.

In the runtime, we pass activations between adjacent stages without
using SEND/RECV ops since the stages are on the same rank/process.  We
add new APIs to the PipelineStage abstraction for passing the activations
both during forward and backward.  Currently the implementation directly
modifies the 'recv buffers' the stage is managing, so the
forward/backward execution logic does not need to know the difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762
Approved by: https://github.com/H-Huang
ghstack dependencies: #138928
2024-10-31 22:49:45 +00:00
45da80b970 reland D65167805 "[export] Update min_val and max_val to Optional[int] in serialization." (#139394)
Summary:
This had a land race with another diff, D65166035, which fixes the schema.

According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test

Differential Revision: D65273170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394
Approved by: https://github.com/yiming0416
2024-10-31 22:28:32 +00:00
01136fb9e0 Update MPS_ERROR_RUNTIME_TOO_LOW message (#139427)
https://github.com/pytorch/pytorch/pull/133141 updated min os requirement to 13.0, but missed the message

Fixes https://github.com/pytorch/pytorch/issues/139425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139427
Approved by: https://github.com/seemethere, https://github.com/kit1980
2024-10-31 22:04:08 +00:00
c1e7d85ce6 Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049)
#### Summary
This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others.

#### Changes
- **`weighted_huber_loss`**: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter.
- **`wmse_loss`** (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks.
- **`wmae_loss`** (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers.

#### Code Details
- **Input Validation**: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors.
- **Reduction Options**: Supports `none`, `mean`, and `sum` reductions to suit various computational needs.
- **Backward Compatibility**: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument.

#### Usage Example
```python
import torch
input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32)
target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32)
weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32)

loss = weighted_huber_loss(input, target, weights, delta=1.0)
print(loss)
```
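
To make the usage example above concrete, here is a hedged sketch of what a weights-aware Huber loss with the described reduction handling might look like; it is illustrative only, not the PR's actual implementation, and the normalization choice for `reduction="mean"` is an assumption.

```python
import torch

def weighted_huber_loss(input, target, weights, delta=1.0, reduction="mean"):
    if not (input.size() == target.size() == weights.size()):
        raise ValueError("input, target, and weights must have the same size")
    err = input - target
    abs_err = err.abs()
    quadratic = 0.5 * err ** 2
    linear = delta * (abs_err - 0.5 * delta)
    loss = weights * torch.where(abs_err <= delta, quadratic, linear)
    if reduction == "none":
        return loss
    if reduction == "sum":
        return loss.sum()
    return loss.mean()  # assumed normalization for "mean"
```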
---

Feedback on these implementations is welcome; please let me know if further modifications are required.

Resolves #132465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-31 21:59:43 +00:00
82e74ad40e [aot autograd] refactor CompiledFunction.backward: control flow (3/N) (#139347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139347
Approved by: https://github.com/zou3519
ghstack dependencies: #139331, #139343
2024-10-31 21:53:03 +00:00
8134456a27 [aot autograd] refactor CompiledFunction.backward: epilogue (2/N) (#139343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139343
Approved by: https://github.com/zou3519
ghstack dependencies: #139331
2024-10-31 21:53:03 +00:00
04ce9ec087 [aot autograd] refactor CompiledFunction.backward: prologue (1/N) (#139331)
For functional autograd + CA, most nodes are inlined in AOT autograd, but user-defined callables aren't safe to make_fx unless Dynamo traces through them, so the AOT backward must be inlined by Dynamo time. We plan to directly insert calls to the backward in the graph:
- call prologue
- call bwd graph
- call epilogue

Restructuring our AOT bwd implementation will make this implementation easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331
Approved by: https://github.com/zou3519
2024-10-31 21:53:03 +00:00
8c22e09e39 [aoti] Add masked_select to cshim (#139071)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071
Approved by: https://github.com/desertfire
2024-10-31 21:52:53 +00:00
b9acbde4fd Revert "Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)"
This reverts commit a49457279919b324d8ca1db85636d16d6dfd4e0f.

Reverted https://github.com/pytorch/pytorch/pull/138868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the new tests are failing on fbcode ([comment](https://github.com/pytorch/pytorch/pull/138868#issuecomment-2450863895))
2024-10-31 21:46:06 +00:00
6a1c451479 Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 21:16:55 +00:00
c4d9428b17 Revert "[AOTI] Update zero size computation in clone_preserve_strides (#139224)"
This reverts commit 206a8dde68faef052dfeedabb4180179ab24015e.

Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914))
2024-10-31 21:05:07 +00:00
ddb291a881 Fix and test several NJT reductions (#139317)
I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit.

It applies to the following ops:
* `sum` / `mean` / `prod`
* `all` / `any`
* `amin` / `amax`
* `min` / `max`
* `argmin` / `argmax`

The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, *args, **kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim.
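
A runnable illustration of the identity-element idea using public nested-tensor APIs (this is not the internal `_apply_reduction` helper):

```python
import torch

nt = torch.nested.nested_tensor([torch.tensor([1.0, 2.0]),
                                 torch.tensor([3.0, 4.0, 5.0])])
# sum's identity is 0.0, so padding the ragged dim with 0.0 preserves the result
padded = torch.nested.to_padded_tensor(nt, padding=0.0)
print(padded.sum(dim=1))   # tensor([ 3., 12.])
# amax's identity is -inf
padded = torch.nested.to_padded_tensor(nt, padding=float("-inf"))
print(padded.amax(dim=1))  # tensor([2., 5.])
```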

Extensive test coverage includes:
* reductions across ragged dim
* reductions across non-batch, non-ragged dims
* reductions across both batch and ragged dims
* multiple dim reductions (for ops that support this)
* full reduction -> scalar

Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317
Approved by: https://github.com/cpuhrsch
2024-10-31 20:55:38 +00:00
abb0dd4b00 Revert "[inductor] patterns to remove pointless view/permute pairs (#139136)"
This reverts commit 2b86cd74a60ca2483173ba3012506aeac85ab2d7.

Reverted https://github.com/pytorch/pytorch/pull/139136 on behalf of https://github.com/ZainRizvi due to Sorry but this PR seems to have broken on trunk. The failure: distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op [GH job link](https://github.com/pytorch/pytorch/actions/runs/11615060962/job/32346609889) [HUD commit link](2b86cd74a6) ([comment](https://github.com/pytorch/pytorch/pull/139136#issuecomment-2450796414))
2024-10-31 20:54:17 +00:00
76b5ee1119 [ONNX] Set flags correctly in tests (#139413)
Previously the flag was set via an envvar; since the envvar was read at initialization, it may not have been correctly set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139413
Approved by: https://github.com/titaiwangms
2024-10-31 20:46:23 +00:00
938803df94 Add bfloat16 support for per tensor/channel cpu/cuda fake quantize ops (#139306)
Summary: Fixes https://fb.workplace.com/groups/2240361332735959/permalink/8190736677698365

Test Plan:
buck2 test 'fbcode//mode/dev' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_tensor_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cuda (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_forward_per_channel_cachemask_cpu (caffe2.test.quantization.core.test_workflow_ops.TestFakeQuantizeOps)'

Differential Revision: D65221710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139306
Approved by: https://github.com/navsud
2024-10-31 20:41:15 +00:00
53c9c19e76 [Autotune Inductor] Some clean up and dataclassing (#139157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139157
Approved by: https://github.com/eellison
2024-10-31 20:04:55 +00:00
c2d754441f [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-10-31 20:03:43 +00:00
1518cf426b Remove @skipIfTorchDynamo from test_extremal_numerics_l1_loss_cpu test (#139318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139318
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2024-10-31 19:57:28 +00:00
886579af99 Revert "Use static_assert to detect get_type_index used in device code (#139173)"
This reverts commit d391ed3f4ec6b1a78f7b34e27cba74b37d885475.

Reverted https://github.com/pytorch/pytorch/pull/139173 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139173#issuecomment-2450695123))
2024-10-31 19:50:19 +00:00
ac7acfb894 [Profiler] Create Auto-Trace Frontend for Trace ID (#139310)
Summary:
This PR adds Auto-Trace implementation for Trace ID. By default, the python side will generate a uuid in the same format as the one set in the backend by kineto. Upon running an auto-trace, the python generated trace id will overwrite the one set in kineto using the Config variable. Since we don't expect users to generate on-demand traces after an auto-trace we can simply keep overwriting the backend trace id whenever autotrace is ran. If we one day want to eventually do something like this, we simply have to add a call in kineto on the backend to generate a new ID upon start of profiling.

We also implement a custom callback in the frontend such that users can generate their own trace ids if they wish to. This works similarly as the default, only difference being that they have to manually set this callback after a profiler is generated. We use a specific call to set this rather then putting it in the frontend initializer in case users want to change the trace_id for different repeats.

Test Plan: Tested both default and custom callbacks using the verbose prints added. Trace ids on the frontend and the prints on the backend for the manifold upload matched.

Differential Revision: D65178308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139310
Approved by: https://github.com/shengfukevin
2024-10-31 19:02:57 +00:00
7faf0ad913 [dynamo] fix deque.maxlen support when extending elements from left (#139279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139279
Approved by: https://github.com/jansel
2024-10-31 18:38:11 +00:00
8e27833e30 Ensure SWA boundary conditions w.r.t. definition (#133773)
According to the documentation, decay is a number in the [0, 1] range, [i.e.](https://pytorch.org/docs/stable/optim.html):
```
Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999.
```
An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts, as suggested in this PR, enforces a valid compute range (one way to ensure correct behavior; there are perhaps more suitable ones). The papers `torch` cites for the reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415).
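
A minimal sketch of the kind of range check being added (illustrative; `check_decay` is a hypothetical stand-in for the PR's actual asserts):

```python
def check_decay(decay: float) -> None:
    assert 0.0 <= decay <= 1.0, f"decay must be between 0 and 1, got {decay}"

check_decay(0.999)    # fine
# check_decay(1.5)    # would raise AssertionError
```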

Fixes https://github.com/pytorch/pytorch/issues/133772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773
Approved by: https://github.com/janeyx99
2024-10-31 18:24:08 +00:00
547d921462 [Pipelining] Remove unused special case from simulator (#138928)
The special case was added during experimentation with batched send/recv
ops.  The ops needed to be jointly scheduled or the simulator would
think that each op was unschedulable since each contained a recv that
depended on the other's send.  The workaround I added was to let the
scheduler 'peek' one op ahead for unblocking, which let batched ops be
scheduled but also changed the behavior of non-batched ops.  It let RECV
ops be simulated one step earlier than the unblocking SEND ops, which
shortened the simulated duration of schedules.

Removing this workaround simplifies the simulator but more importantly
lends to optimizing the runtime of the simulator by making it much
easier to avoid copying or extending lists of previous ops on each
iteration.  It also restores the output of the simulator for non-batched
ops to a more natural output where RECV must happen at the same time or
later than matching SEND, rather than possibly a step earlier.

For example, for this test:
`python test/distributed/pipelining/test_schedule.py -k test_send_recv_test_info0`

Before:

```
Step 0: 0F0      1RECV_F0
Step 1: 0SEND_F0
Step 2: 0F1      1RECV_F1
Step 3: 0SEND_F1 1F0
Step 4: 0RECV_B0 1B0
Step 5: 0B0      1SEND_B0
Step 6:          1F1
Step 7: 0RECV_B1 1B1
Step 8: 0B1      1SEND_B1
```

After:
```
Rank 0   Rank 1
Step 00: 0F0
Step 01: 0SEND_F0 1RECV_F0
Step 02: 0F1
Step 03: 0SEND_F1 1RECV_F1
Step 04:          1F0
Step 05:          1B0
Step 06: 0RECV_B0 1SEND_B0
Step 07: 0B0      1F1
Step 08:          1B1
Step 09: 0RECV_B1 1SEND_B1
Step 10: 0B1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138928
Approved by: https://github.com/H-Huang
2024-10-31 17:48:35 +00:00
9d096e4d9f Don't use deprecated type properties in UpsampleKernel (#139399)
By replacing the `at::CPU(dtype)` pattern with the `at::device(kCPU).dtype(dtype)` pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399
Approved by: https://github.com/Skylion007
ghstack dependencies: #139353, #139358
2024-10-31 17:32:19 +00:00
206a8dde68 [AOTI] Update zero size computation in clone_preserve_strides (#139224)
Summary: clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensor correctly. Fix that.

Differential Revision: [D65250405](https://our.internmc.facebook.com/intern/diff/D65250405)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139224
Approved by: https://github.com/angelayi
2024-10-31 17:07:18 +00:00
f93ebb2cf4 [Easy] Refactor post grad application of passes (#139293)
Refactors GraphTransformObserver to hook into the bisect manager's pass application, and reworks the post-grad passes to use it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139293
Approved by: https://github.com/exclamaforte
ghstack dependencies: #139292
2024-10-31 17:05:27 +00:00
5075046db2 [c10d] separate comm init from getNCClComm (#139362)
Summary:
This PR is functionally a no-op, but it clearly separates the init logic from
getNCCLComm. getNCCLComm is now purely a 'read'-only function.
Test Plan:
existing CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139362
Approved by: https://github.com/wconstab
2024-10-31 16:58:20 +00:00
864beebb41 [easy] Add start event metadata to collected metadata for PT2 Compile Events (#139289)
We should be logging metadata from event starts to PT2 Compile Events too.

Differential Revision: [D65070086](https://our.internmc.facebook.com/intern/diff/D65070086/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139289
Approved by: https://github.com/oulgen
2024-10-31 16:52:30 +00:00
dd6263e2fb Implement HPUHooksInterface (#137338)
Fixes #137262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137338
Approved by: https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2024-10-31 16:26:19 +00:00
87f1990697 Revert "Don't uselessly recompute axiom dict every static eval call (#138967)"
This reverts commit 24b695ae2d5d85a3bda0e493fb4631d5e0add290.

Reverted https://github.com/pytorch/pytorch/pull/138967 on behalf of https://github.com/ZainRizvi due to Sorry, looks like this PR introduced a failure that was incorrectly classified as flaky, and the log classifier didn't identify the right log line either ([comment](https://github.com/pytorch/pytorch/pull/138967#issuecomment-2450228525))
2024-10-31 15:54:18 +00:00
2b86cd74a6 [inductor] patterns to remove pointless view/permute pairs (#139136)
These are not artificial patterns I came up with; they show up in the linear + CrossEntropyLoss graph.

Consider this snippet:
```
        class LinearAndCEL(nn.Module):
            def __init__(self):
                super().__init__()
                self.linear = nn.Linear(C, V)
                self.ce = nn.CrossEntropyLoss()

            def forward(self, x, y):
                return self.ce(self.linear(x).view(B * T, V), y.view(-1))
```

`x` passed to `forward` is a 3D tensor of shape [B, T, C].
`self.linear` will first view x as a [BxT, C] tensor, do the matmul to produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. The user code then adds another view op to convert the tensor shape to [BxT, V]. This generates a pair of redundant views. A pair of redundant permutes likewise shows up in the backward pass when we compute gradients.

The view ops make it hard to chunk linear + CEL: when a view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler, and it is generally nice to do.
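
A hedged, runnable illustration of the redundant view pair (the shapes are made up, and this is not the pattern-matcher code itself):

```python
import torch

B, T, C, V = 2, 3, 4, 5
x = torch.randn(B, T, C)
linear = torch.nn.Linear(C, V)
out3d = linear(x)                 # in the lowered graph: view to [B*T, C], mm, view back to [B, T, V]
out2d = out3d.view(B * T, V)      # user code immediately views back to 2D -> redundant view pair
print(out2d.shape)                # torch.Size([6, 5])
```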

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-10-31 15:35:46 +00:00
d21a25c6b7 [fx graph cache] Refactor FxGraphCachePickler, step 2 (#138683)
Summary: Move all the custom `_reduce_*` functions inside the FxGraphCachePickler class. This is mostly a cosmetic change since they're conceptually members of FxGraphCachePickler. But also in an upcoming diff, I'll add a member variable to the class to control how we handle constant tensors, so it will be convenient to be able to query that setting via `self`. I made the analogous changes to AOTAutogradCachePickler for consistency.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138683
Approved by: https://github.com/eellison
ghstack dependencies: #138681, #138682
2024-10-31 15:12:18 +00:00
92a2a9ded2 [BE] And delete DeprecatedTypProperties cast (#139358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139358
Approved by: https://github.com/ezyang
ghstack dependencies: #139353
2024-10-31 14:39:22 +00:00
ea07718a5a Remove redundant warning compress (#139367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139367
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-31 14:39:19 +00:00
c934ed6567 init kineto after torch module initialized (#131448)
Fixes #131020

As discussed in the issue thread, we can use `KINETO_DAEMON_INIT_DELAY_S` to delay the initialization of `kineto` in case `kineto` is initialized before `libtorch_cuda.so`.

It's not clear how to set a proper value for the environment variable `KINETO_DAEMON_INIT_DELAY_S`, so here's a trick to make the initialization of `kineto` happen after module `torch` is initialized. I'm not sure whether this is an acceptable trick; please take a look at this PR, thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131448
Approved by: https://github.com/sraikund16, https://github.com/briancoutinho
2024-10-31 13:24:24 +00:00
ccaa2a206a [inductor] make requires_stride_order more unbacked-symint-aware (#137063)
Previously, we tried to sort SymInt strides to determine the stride
order. This PR makes the sorting more unbacked symint aware: given a Tensor
with sizes (u0, u1, u2), it has strides (u1 * u2, u1, 1), which is
sortable under the guard_size_oblivious assumptions.

Test Plan:
- test case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137063
Approved by: https://github.com/eellison
2024-10-31 13:11:02 +00:00
3192bdeea4 [AOTI] Use len(serialized_weights) when calculating consts_size (#139054)
Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
  consts_size = sum(
      get_nbytes_of_tensor(tensor, all_cuda)
      for (name, tensor) in graph.constants.items()
      if name not in graph.folded_constants
  )
```
Meanwhile, the actual bytes to serialize (`serialized_weights`) are computed using `graph.get_original_value_of_constant(name)`:
```
  serialized_weights = b"".join(
      _to_bytes(graph.get_original_value_of_constant(name), all_cuda)
      for name in graph.constants.keys()
      if name not in graph.folded_constants
  )
```

`tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)`, thus making `consts_size` inconsistent with the actual byte size of `serialized_weights` and resulting in the runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205.

This PR directly gets `consts_size` using `len(serialized_weights)`, which fixes the inconsistency.

We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issues on CI machines (earlier CPUs without VNNI).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-31 09:54:16 +00:00
24b695ae2d Don't uselessly recompute axiom dict every static eval call (#138967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967
Approved by: https://github.com/ezyang
2024-10-31 07:46:35 +00:00
73fde0d940 [PyTorch] Unbreak C10_ALWAYS_INLINE_ATTRIBUTE on MSVC (#139363)
At least one recent version refuses to accept it on a lambda, so disable.

Differential Revision: [D65250256](https://our.internmc.facebook.com/intern/diff/D65250256/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D65250256/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139363
Approved by: https://github.com/ngimel, https://github.com/malfet
2024-10-31 07:40:05 +00:00
f98bc9a49d Revert D65167805 (#139371)
Summary:
This diff reverts D65167805, which broke the release pipeline.

Test Plan: NA

Differential Revision: D65245198

@diff-train-skip-merge (to silence facebook-github-bot until I have a stamp to land this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139371
Approved by: https://github.com/malfet
2024-10-31 07:25:28 +00:00
86e6513c86 [BE] Remove deprecated AT_DISPATCH_ALL_TYPES_AND_HALF (#139353)
It's been deprecated for 2 years now; time to delete it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139353
Approved by: https://github.com/ezyang
2024-10-31 07:06:19 +00:00
a7479fa282 TunableOp use dense size calculations as minimum sizes (#139137)
Fixes #139116.  Also fixes other unreported issues with torch.bmm due to incorrect size calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139137
Approved by: https://github.com/yoyoyocmu
2024-10-31 06:01:58 +00:00
261d90c18f Add docs page for torch.inf and torch.nan (#138430)
Fixes #131040

## Description
Add docs for `torch.inf` and `torch.nan`.
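
A quick runnable check of the two constants, added here for illustration:

```python
import torch

print(torch.inf == float("inf"))   # True
print(torch.nan != torch.nan)      # True: NaN never compares equal to itself
```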

## Checklist
- [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER")
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430
Approved by: https://github.com/ezyang
2024-10-31 05:46:46 +00:00
cyy
f95c71867e [9/N] Fix extra warnings brought by clang-tidy-17 (#139286)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139286
Approved by: https://github.com/ezyang
2024-10-31 05:20:31 +00:00
42b5e191ae Fix the example of fx/interpreter (#139368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139368
Approved by: https://github.com/ezyang
2024-10-31 05:12:43 +00:00
d08dbd0436 Update torch-xpu-ops commit pin (#139041)
# Motivation
This PR intends to update torch-xpu-ops commit pin. It mainly includes the following two highlighted changes:
1. split the DLL library into 4 smaller libraries to avoid the 2G limitation on Windows;
2. some new operators added, for example, `cdist`, `pdist`, `maxunpool2d`, `maxunpool3d`, `upsample_trilinear3d`, `Bessel operators`, etc...

# Additional Context
We have to supply XPU device check logic in `cdist` and `pdist` ops.
This PR depends on https://github.com/pytorch/pytorch/pull/139050 to fix Windows build issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139041
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-10-31 05:06:06 +00:00
74b7fb9519 Add conjugate method on SymFloat (#139249)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249
Approved by: https://github.com/ezyang
2024-10-31 04:55:36 +00:00
0cf4cc3d5f [fx] split_module subgraph should always have an output node (#139275)
Fixes https://github.com/pytorch/pytorch/issues/138207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139275
Approved by: https://github.com/ezyang
2024-10-31 04:53:19 +00:00
e3e3ab805b [fx graph cache] Refactor FxGraphCachePickler (#138682)
Summary: In an upcoming change, we need to modify FxGraphCachePickler to behave differently depending on whether the graph has frozen parameters. To do that, it will be convenient to change FxGraphCachePickler into a regular object instead of a collection of classmethods.

Test Plan: unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138682
Approved by: https://github.com/eellison
ghstack dependencies: #138681
2024-10-31 03:31:51 +00:00
cyy
70ba471957 [3/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139248)
Follows #139158
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139248
Approved by: https://github.com/ezyang
2024-10-31 03:29:19 +00:00
cyy
1dd503c6fb [4/N] Fix Wextra-semi warning (#139256)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139256
Approved by: https://github.com/ezyang
2024-10-31 03:01:14 +00:00
bd88d40e5f [Submodule] update submodule onnx==1.17.0 (#139128)
Follow-up PR of: https://github.com/pytorch/pytorch/pull/138719

CC @malfet @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139128
Approved by: https://github.com/malfet
2024-10-31 02:50:00 +00:00
cyy
29297731bb [5/N] Don't skip ASAN on some tests (#139265)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139265
Approved by: https://github.com/ezyang
2024-10-31 02:49:03 +00:00
d7411c0cc1 [AOTI] add C shim for QConvPointWise (#138540)
This PR adds C shim for `QConvPointWisePT2E` and `QConvPointWiseBinaryPT2E` similar to https://github.com/pytorch/pytorch/pull/138439. Besides that, we aligned the implementation of `qconv_pointwise` with `qlinear_pointwise` in the following aspects:
1. The parameter orders of `qconv_pointwise` and `qlinear_pointwise` were quite different; we aligned the schema of `qconv_pointwise` to have a similar parameter order to `qlinear_pointwise` to make it more consistent.
2. We always converted `x_scale` and `x_zero_point` to Tensors, just like in the lowering of `qlinear_pointwise`. This avoids the need to create two separate C APIs (one for `double x_scale` and `int64_t x_zero_point`, and another for `Tensor` versions). Instead, we only need one API for `Tensor`-based `x_scale` and `x_zero_point`. If we later add dynamic quantization for qconv (which will use `Tensor` for `x_scale` and `x_zero_point`), we can reuse the code from this PR and don't need to change the C shim layer API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138540
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691, #138806
2024-10-31 02:03:01 +00:00
69ea2e726c Consolidate Triton cache into Inductor cache (#138239)
Summary:
This diff/PR attempts to consolidate Triton caching into the Inductor caching so that there can be just one cache that unifies them both, reducing network requests and increasing success rate.

Implementation details can be found via reading the code or the post: https://fb.workplace.com/groups/1553867532149891/posts/1605037517032892

I did not use the Autotune bundler code at all since I want to simplify that and merge it into this on the next diff/PR.

In terms of instrumentation
1) Dynamo compile: `triton_bundler_time_saved_s`, the sum over all triton.compile calls. We don't have to use the specific number; we can use this as a binary value.
2) Events table: I used dynamo_timed to measure how much time we spend on bundler collect and write functions which is all the work we do in this diff
3) TLParse: I emitted number of kernels and triton_bundler_time_saved_s into tlparse as well

Test Plan: Updated unit tests

Adhoc running
```
TORCHINDUCTOR_BUNDLE_TRITON_INTO_FX_GRAPH_CACHE=1 buck2 run @mode/opt //scripts/oulgen:runner
```
gives
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpmTZt6b/0_0_0/fx_graph_cache_hit_4.json
<img width="771" alt="image" src="https://github.com/user-attachments/assets/478782a2-ee47-40cb-b723-fcac2bf9dd93">

Differential Revision: D64504909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138239
Approved by: https://github.com/ezyang
2024-10-31 01:37:16 +00:00
c7f1fccd7a Globally enable Python dispatcher for all of Inductor compilation (#137621)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137621
Approved by: https://github.com/eellison
2024-10-31 01:35:23 +00:00
289e03a429 Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 8840889c3f6565b7975150adebcbe062f19035ee.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2448824206))
2024-10-31 01:32:15 +00:00
38429938de [cond] make cond not throw warnings on constant pred in eager mode (#138837)
We don't raise warnings for torch.cond in eager mode; the motivation is in https://github.com/pytorch/pytorch/issues/138782.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138837
Approved by: https://github.com/zou3519
2024-10-31 01:13:19 +00:00
b90503d9ae [DCP] Unit Test to validate the stateful and non-stateful loads (#139251)
Summary: Unit Test to validate the stateful and non-stateful loads. This test is a follow up to the fix in [#138575](https://github.com/pytorch/pytorch/pull/138575) which addresses an issue in stateful dict's in-place updates in distributed checkpoint loading. Also, added additional code comments regarding the stateful and non-stateful loads.

Test Plan:
```
buck2 test //caffe2/test/distributed/checkpoint/e2e:test_e2e_save_and_load
```

https://www.internalfb.com/intern/testinfra/testrun/8162774562859797

Differential Revision: D65188659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139251
Approved by: https://github.com/LucasLLC, https://github.com/fegin
2024-10-31 01:12:51 +00:00
7ed0d69004 [ROCm] Increase hipBLASLt default workspace size (#139300)
This PR increases hipBLASLt default workspace size to 76 MB which is the recommended default. This PR does not contain any bug fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139300
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2024-10-31 00:56:54 +00:00
42d790bb65 Revert "Add conjugate method on SymFloat (#139249)"
This reverts commit bcf8a0124fbadb469f6766eb7555a75ea0fa9d43.

Reverted https://github.com/pytorch/pytorch/pull/139249 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the doc build failure is legit ([comment](https://github.com/pytorch/pytorch/pull/139249#issuecomment-2448755839))
2024-10-31 00:45:48 +00:00
4db6b740bc [Easy] GraphTransformObserver Refactoring (#139292)
Uses `torch._inductor.config.trace.log_url_for_graph_xform` by default as the log url. It was only ever instantiated with this as the log_url argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139292
Approved by: https://github.com/shengfukevin, https://github.com/shunting314
2024-10-31 00:33:28 +00:00
8fa0bc3358 Use cached dnnl::stream in GpuStreamManager (#139176)
# Motivation
The code changes in the `GpuStreamManager` class intend to help manage `dnnl::stream` objects efficiently.

# Additional Context
Use the following code for a simple benchmark.
```python
import torch
import time

device = torch.device("xpu")

M, N, K = 64, 64, 64  # You can change these dimensions as needed
torch.manual_seed(0)

A = torch.randn(M, K, device=device)
B = torch.randn(K, N, device=device)

# Warm-up
for _ in range(10):
    torch.matmul(A, B)

s1 = torch.xpu.Stream()
s2 = torch.xpu.Stream()

# Measure the time for the GEMM operation
start_time = time.time()
with torch.xpu.stream(s1):
    for _ in range(50000):
        C = torch.matmul(A, B)

with torch.xpu.stream(s2):
    for _ in range(50000):
        D = torch.matmul(A, B)

torch.xpu.synchronize()
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time

# Print the results
print(f"Time taken for GEMM operation: {elapsed_time:.6f} seconds")
```
Compared with the old implementation, which takes 2.077069s, the new implementation takes 2.023017s, a ~2% performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139176
Approved by: https://github.com/gujinghui, https://github.com/jgong5
2024-10-31 00:23:39 +00:00
f81223938c support nesting of suppress_guards, suppress guards when generated compiled autograd graph (#138968)
Fixes https://github.com/pytorch/pytorch/issues/138920. See comments there for details.

I still need to try to get a smaller repro to write an actual test. But with the guards suppressed, I no longer see the specialization in the CA graph in the linked example:
```
        aot1_view_3: ... = torch.ops.aten.view.default(aot1_tangents_1, [aot1_sym_size_int, 48, 1])
        aot1_view_4: ... = torch.ops.aten.view.default(aot1_view_3, [aot1_sym_size_int, 48])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138968
Approved by: https://github.com/yf225, https://github.com/xmfan
2024-10-31 00:13:39 +00:00
cyy
d391ed3f4e Use static_assert to detect get_type_index used in device code (#139173)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139173
Approved by: https://github.com/r-barnes, https://github.com/ezyang
2024-10-31 00:06:53 +00:00
f747bd2947 Move slow test query to ClickHouse (#139322)
Example run: https://github.com/pytorch/pytorch/actions/runs/11602255032/job/32306827867?pr=139322 (pr creation commented out), also tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139322
Approved by: https://github.com/huydhn
2024-10-30 23:58:27 +00:00
48854cbfc4 Add missing operator and corresponding unittest (#138309)
Fixes https://github.com/pytorch/pytorch/issues/129690

Add operator.neg and operator.pos into _SYM_BOOL_OPS.

Provide a simple unit test under export/test_serialize.py that reproduces the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138309
Approved by: https://github.com/ezyang, https://github.com/angelayi
2024-10-30 23:50:24 +00:00
f32b9a5145 Fx graph always return tuple in fuse_as_graphmodule (#139236)
Summary: As title.

Test Plan: Let's see what OSS CI says

Differential Revision: D65147426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139236
Approved by: https://github.com/ezyang
2024-10-30 23:31:06 +00:00
a494572799 Update tensorify pass to specialize symfloats we didn't tensorify away (#138868)
As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things:

1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation)
2) It updates the tensorify pass to do the backup specialization

This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows:

1) Integrate turning off specialize float only in the automatic dynamic pass.
2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised.
3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868
Approved by: https://github.com/ezyang
2024-10-30 23:28:25 +00:00
bcf8a0124f Add conjugate method on SymFloat (#139249)
Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_conjugate_num_type4_dynamic_shapes

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139249
Approved by: https://github.com/ezyang
2024-10-30 23:28:09 +00:00
a426837f85 Don't set replacement if lhs is in the free symbols of the rhs (#139250)
Fixes python test/dynamo/test_functions.py FunctionTests.test_is_integer

when we turn off specialize float on eager: https://github.com/pytorch/pytorch/pull/138915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139250
Approved by: https://github.com/ezyang
2024-10-30 23:21:30 +00:00
754b262bdb Move close_nonexistent_disable_issues.py queries to ClickHouse (#139296)
Example run: https://github.com/pytorch/pytorch/actions/runs/11601996563/job/32305991204?pr=139296 (commented out the part that actually closes issues but the queries run)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139296
Approved by: https://github.com/huydhn
2024-10-30 23:09:39 +00:00
ae6cbd4256 Block more keys from config serialization (#139285)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139285
Approved by: https://github.com/jovianjaison, https://github.com/markkm, https://github.com/c00w
2024-10-30 23:05:59 +00:00
4a8d12227e [Pipelining] add schedule simulator and chrometrace dump (#138134)
Schedule simulator is useful for detecting hangs in schedules and
validating that they won't hang.  It also inserts bubbles (None actions)
at any timestep where a rank can not enqueue its next action due to
unmet dependencies, which can serve as a rough metric for schedule
efficiency.  The output can be visualized.  The simulator expects a full
comm + compute schedule as input.

Chrometrace dump is a basic visualization utility.  It currently just
renders one 'process' per rank, and lets users visualize the schedule in
a UI instead of as text.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138134
Approved by: https://github.com/H-Huang
2024-10-30 23:00:58 +00:00
ec5fbee6c0 Revert "Drop caffe2 string_utils (#139217)"
This reverts commit 1797a2035d92d25d3dcc46fd8facdd6569b30c53.

Reverted https://github.com/pytorch/pytorch/pull/139217 on behalf of https://github.com/huydhn due to Chatting with @r-barnes, this is still used in lots of place internally ([comment](https://github.com/pytorch/pytorch/pull/139217#issuecomment-2448568071))
2024-10-30 22:23:32 +00:00
fef5e94657 addmm: error on output dtype mismatch. (#138520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138520
Approved by: https://github.com/ezyang
ghstack dependencies: #138515
2024-10-30 21:46:39 +00:00
6da3a043a8 Add test for consistency between meta and CPU devices. (#138515)
Reference: https://github.com/pytorch/pytorch/issues/138399

This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.

I have added XFAIL to the set of operations that are currently failing.
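
A hedged sketch of the consistency check being tested (illustrative; the real test is OpInfo-based, and `out_dtype_behavior` is a hypothetical helper):

```python
import torch

def out_dtype_behavior(op, device, shape=(2, 2), out_dtype=torch.int32):
    a = torch.ones(shape, device=device)
    b = torch.ones(shape, device=device)
    out = torch.empty(shape, device=device, dtype=out_dtype)
    try:
        op(a, b, out=out)
        return "ok"
    except Exception as e:
        return f"error: {type(e).__name__}"

# The OpInfo test asserts that both devices behave the same way for each op.
print(out_dtype_behavior(torch.add, "cpu"))
print(out_dtype_behavior(torch.add, "meta"))
```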
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
2024-10-30 21:46:39 +00:00
24c9683355 [mergebot] Add ci-no-td label on revert (#139218)
Just in case?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139218
Approved by: https://github.com/wdvr
2024-10-30 21:36:09 +00:00
8840889c3f Allow inplacing buffer when other users are inconsequential (#138383)
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.

Implements:
https://github.com/pytorch/pytorch/issues/132826

Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
2024-10-30 21:35:50 +00:00
ad0883a288 [real_tensor_prop] Infer Fake kernels during real tensor prop (#139213)
This PR changes real_tensor_prop to also infer fake kernels when the
operator doesn't have one.

We infer the fake output to have the same properties as the real
output, with unbacked symints in the sizes and some stride order.
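
A hedged sketch of that inference rule (illustrative only; `describe_fake_from_real` is a hypothetical helper, not the real_tensor_prop code):

```python
import torch

def describe_fake_from_real(real: torch.Tensor):
    return {
        "sizes": [f"u{i}" for i in range(real.dim())],   # stand-ins for unbacked symints
        "stride_order": sorted(range(real.dim()), key=real.stride, reverse=True),
        "dtype": real.dtype,
        "device": real.device,
    }

print(describe_fake_from_real(torch.randn(2, 3, 4)))
```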

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139213
Approved by: https://github.com/pianpwk
ghstack dependencies: #139212
2024-10-30 21:29:33 +00:00
03ec25053a [export] Update min_val and max_val to Optional[int] in serialization. (#139223)
Summary: According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_serialize_infinite_sym_int

Differential Revision: D65167805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139223
Approved by: https://github.com/yiming0416
2024-10-30 21:14:17 +00:00
6d5944c9f1 turn off USE_MIMALLOC_ON_MKL temporary. (#139204)
Fixes #138994

We can turn off `USE_MIMALLOC_ON_MKL` temporarily, since it caused https://github.com/pytorch/pytorch/issues/138994.

For a complete fix, we need to address the `USE_STATIC_MKL` lost-functionality issue (https://github.com/pytorch/pytorch/pull/138996) and then detect the correct MKL linking type (shared/static). That still needs some time to pass all CI and builder scripts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139204
Approved by: https://github.com/ezyang
2024-10-30 21:09:21 +00:00
05cb98f91d [TF32][Inductor] Account for TF32 in test_inductor_layout_optimization_input_mutations (#138948)
Tests using a conv2d kernel which can dispatch to a TF32-backed implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138948
Approved by: https://github.com/ezyang
2024-10-30 20:34:16 +00:00
77e25d57b0 Create ciflow/inductor-periodic (#138763)
This is related to https://github.com/pytorch/pytorch/issues/138476.  This would save about 1/8 of the total cost, not a big number, but still a save I guess.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138763
Approved by: https://github.com/desertfire
2024-10-30 19:59:44 +00:00
ef380f7b8e [real tensor prop] Add some asserts for custom ops (#139212)
When we see a custom op:
- check that its mutation annotations are correct
- check that its aliasing constraints match our constraints for custom
  ops.

Otherwise, there may be undefined behavior.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139212
Approved by: https://github.com/angelayi
2024-10-30 19:29:11 +00:00
5c6d35482e [Inductor] Support Triton AttrsDescriptor cls field (#139193)
Fixes #139179

Adding corresponding changes to https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139193
Approved by: https://github.com/bertmaher
2024-10-30 18:16:38 +00:00
180d283156 [export] avoid debug name crash for dim hints (#139104)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139104
Approved by: https://github.com/ezyang
2024-10-30 18:12:44 +00:00
7765d1ef70 Preliminary registered-buffer collective support via Inductor (#138029)
```
NOTE [lowering-time collective optimization]

In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.

To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:

- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.

To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensure that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.

For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(communication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```

### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).

### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a necessary intermediate state before we tackle memory planning for comm buffers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028
2024-10-30 18:11:09 +00:00
421473c234 get_symm_mem_workspace(): print helpful error during graph capture (#138028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138028
Approved by: https://github.com/weifengpy
2024-10-30 18:11:09 +00:00
f4ab8b48c5 Allow schedules to run with single stage (#138925)
Ran into issues (https://github.com/pytorch/pytorch/pull/138863) when adding a schedule with a single stage, so this adds code to support that edge case (mostly for test purposes).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138925
Approved by: https://github.com/wconstab
2024-10-30 17:33:16 +00:00
ad637a4c5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 17:17:59 +00:00
f14f245747 [export] Remove custom forward func in swap (#139126)
Differential Revision: [D65100694](https://our.internmc.facebook.com/intern/diff/D65100694)

Remove the custom forward function and instead move the pytree flatten/unflatten ops into the graph. This allows us to natively run via the interpreter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139126
Approved by: https://github.com/avikchaudhuri
2024-10-30 16:50:57 +00:00
4b83302585 [MPS] Update error message for supported autocast type (#139192)
Autocast in MPS currently only supports dtype of `torch.float16`. This PR updates the error message to reflect this.
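
For reference, a minimal sketch of the only supported combination (assumes an MPS-enabled build):
```python
import torch

# Autocast on MPS currently only accepts torch.float16; other dtypes trigger
# the (now clearer) error message.
x = torch.randn(8, 8, device="mps")
with torch.autocast(device_type="mps", dtype=torch.float16):
    y = x @ x
```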

This PR was created using [Copilot Workspace](https://copilot-workspace.githubnext.com/pytorch/pytorch/issues/139190?shareId=5b510fda-380c-4e86-8e91-6b67a078f180) with no human input other than clicking buttons.

Fixes #139190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139192
Approved by: https://github.com/malfet
2024-10-30 16:48:29 +00:00
996c40e85e Adjusted install_user script for Ubuntu 24.04 support (#138815)
Fixes #138812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138815
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2024-10-30 16:31:09 +00:00
29eb65fce8 Fix in-place state dict updates for distributed checkpoint loading (#138575)
`dcp.load()` is documented as "operating in place", updating the state of existing state_dict elements instead of replacing them wherever possible. However, it appears that in the case of a stateful element, the code both updates its state in-place, then replaces it with a copy of itself in the state_dict. This looks like a simple oversight, so here's a PR that should fix it!

[From the docs:](https://pytorch.org/docs/stable/distributed.checkpoint.html)
> DCP is different than torch.save and torch.load in a few significant ways: *...*
> - It operates in place, meaning that the model should allocate its data first and DCP uses that storage instead.
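
A minimal sketch of the documented in-place contract (module and checkpoint path are illustrative):
```python
import torch
import torch.distributed.checkpoint as dcp

# Allocate the model first; DCP then fills the existing storages in state_dict
# in place rather than replacing the entries, which is what this fix restores
# for stateful elements.
model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}
dcp.load(state_dict, checkpoint_id="/tmp/ckpt")  # illustrative path
model.load_state_dict(state_dict["model"])
```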

This manifested as a strange bug in TorchTitan, causing a model loaded from a checkpoint to be saved incorrectly, resulting in a twice-resumed model being subtly broken.

Let me know if this makes sense, and if there's anything else I should add!

Thanks for all the work on PyTorch!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138575
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-10-30 16:10:24 +00:00
04eb15da44 [AOTI] Unify the default value of allow_stack_allocation (#139147)
Summary: Unify the default value of allow_stack_allocation for fbcode and OSS

Differential Revision: D65064673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139147
Approved by: https://github.com/hl475
2024-10-30 16:01:23 +00:00
6e85266a47 [MPS] Fixes SiLU on non-contiguous tensors (#139006)
Similar to #123049, `SiLU` also produces random values, `0.0`, or `NaN` as results if the input tensor is not contiguous, on macOS prior to 15.0.
Originally the problem was found at jy0205/Pyramid-Flow#113.
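
A hedged repro sketch of the pattern (assumes an MPS device on macOS earlier than 15.0):
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8, device="mps").t()  # transposed view -> non-contiguous
y = F.silu(x)                            # could return garbage/0.0/NaN before the fix
expected = F.silu(x.contiguous())        # contiguous input was always correct
```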
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139006
Approved by: https://github.com/malfet
2024-10-30 15:44:59 +00:00
49bfbed2eb Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit 383eba522922f0b7c525b88ed4348c64b40b95cf.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to larger memory usage apparently not acceptable ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2447382819))
2024-10-30 14:43:15 +00:00
456c87c8a2 [8/N] Fix extra warnings brought by clang-tidy-17 (#139151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-30 14:20:08 +00:00
44257c063e [dynamo] Fix constant propagation in builtins and UserClasses (#131354)
* Fixes https://github.com/pytorch/pytorch/issues/118675
* Replaces https://github.com/pytorch/pytorch/pull/118994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-10-30 12:47:20 +00:00
a951d99e16 Revert "Move reduce to template parameter in vectorized_reduction (#138672)"
This reverts commit 9b2c99d731695b76205d617ddc1e799ba11ae1a0.

Reverted https://github.com/pytorch/pytorch/pull/138672 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138672#issuecomment-2446927015))
2024-10-30 12:12:13 +00:00
9bbe4a67ad [dynamo] support maxlen for collections.deque (#138194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138194
Approved by: https://github.com/jansel, https://github.com/malfet
2024-10-30 10:08:02 +00:00
a4b35767cb Don't have random print in convert_frame (#139203)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139203
Approved by: https://github.com/Skylion007
2024-10-30 09:35:37 +00:00
a19bdfb36e [compiled autograd] reorder backward hooks to match eager behavior (#138553)
Fixes #138538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138553
Approved by: https://github.com/xmfan
2024-10-30 08:46:45 +00:00
b71ab3fc85 [DTensor][Bug Fix]Fix 2D DTensor mm with mesh_shape (1, n) or (n, 1) (#139134)
Fixes #138742. In the issue, matrix multiplication with DTensor failed when one of the mesh dimensions has size 1 and the mesh is more than 1D. We were missing tests covering the corner case where mesh_shape is (n, 1) or (1, n). The DTensor mm op is correct when the 1D mesh has shape (self.world_size,) or when a 2D mesh has no mesh dimension of size 1.

In this PR, we fix the corner case by updating `gen_einsum_strategies` in `_einsum_strategy.py`. Specifically, we cannot skip generating `mesh_dim_strategies` when `mesh_dim <= 1`, as skipping is not valid for an nD mesh where one of the mesh dimensions has size 1.

Without the fix, the OpStrategy generated for a 2D mesh with mesh_shape (1, n) or (n, 1) is wrong, as the generated OpStrategy is 1D.

```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies):::   [(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)[(R, R) -> R, (S(1), S(0)) -> P, (S(0), R) -> S(0), (R, S(1)) -> S(1)] @ mesh: (4, 1)
```

After the fix, we can see the OpStrategy generated is correct with 2D strategy.
```
all_mesh_dim_strategies=[[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]][[[Replicate(), Replicate(), Replicate()], [Partial(sum), Shard(dim=1), Shard(dim=0)], [Shard(dim=0), Shard(dim=0), Replicate()], [Shard(dim=1), Replicate(), Shard(dim=1)]]]
OpStrategy(all_strategies) = [(RR, RR) -> RR, (RS(1), RS(0)) -> RP, (RS(0), RR) -> RS(0), (RR, RS(1)) -> RS(1), (S(1)R, S(0)R) -> PR, (S(1)S(1), S(0)S(0)) -> PP, (S(1)S(0), S(0)R) -> PS(0), (S(1)R, S(0)S(1)) -> PS(1), (S(0)R, RR) -> S(0)R, (S(0)S(1), RS(0)) -> S(0)P, (S(0)S(0), RR) -> S(0)S(0), (S(0)R, RS(1)) -> S(0)S(1), (RR, S(1)R) -> S(1)R, (RS(1), S(1)S(0)) -> S(1)P, (RS(0), S(1)R) -> S(1)S(0), (RR, S(1)S(1)) -> S(1)S(1)] @ mesh: (4, 1)
```

**As a follow-up, we should add more test coverage for DTensor ops with 2D meshes, including 2D meshes where one of the mesh dimensions has size 1.**
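
A minimal sketch of the previously failing corner case (assumes a 4-rank job launched with torchrun and the private `torch.distributed._tensor` API):
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import Replicate, Shard, distribute_tensor

# 2D mesh where one dimension has size 1 - the case fixed by this PR.
mesh = init_device_mesh("cuda", (4, 1), mesh_dim_names=("dp", "tp"))
a = distribute_tensor(torch.randn(8, 16), mesh, [Shard(0), Replicate()])
b = distribute_tensor(torch.randn(16, 32), mesh, [Replicate(), Replicate()])
c = torch.mm(a, b)  # strategy generation previously produced a 1D OpStrategy
```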

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139134
Approved by: https://github.com/fegin
2024-10-30 08:09:39 +00:00
ceab24def4 [CI] Unify numpy version for python-3.9 and 3.10 configs (#139244)
Per dependabot numpy-1.21 is subject of CVE-2021-34141 so perhaps it's ok not to test against it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139244
Approved by: https://github.com/huydhn
2024-10-30 06:47:38 +00:00
3495ef78a2 Unbreak fp16 dot issues caused by #137917 (#139262)
See comment for explanation. In short, doing the fixup in float.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139262
Approved by: https://github.com/huydhn
2024-10-30 05:10:19 +00:00
cyy
4e5f9afc7f Enable c10::sv and std::sv constexpr conversions (#139239)
A small step towards moving c10::sv to std::sv; this tiny change shouldn't break META builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139239
Approved by: https://github.com/malfet
2024-10-30 03:57:47 +00:00
cd8f7730f4 [PT2E][Quant] Remove Redundant Method in X86 Quantizer (#139161)
**Summary**
Remove the redundant methods of X86 Inductor Quantizer: `get_supported_quantization_configs`, `get_supported_operator_for_quantization_config` and `get_supported_operators`. They are not required to implement a customized Quantizer and are not mentioned in the existing documentation on how to use X86 Inductor Quantizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139161
Approved by: https://github.com/jgong5
2024-10-30 03:31:17 +00:00
edcab61f93 Skip test for PT2E quantized ops in fbcode (#138792)
Skip these tests as they are failing in fbcode.
Submitting this PR per request from @jerryzh168.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138792
Approved by: https://github.com/jerryzh168
2024-10-30 02:37:38 +00:00
eqy
b4e4f84a06 Fix regex in test_static_inputs_address_mutation_log for Python 3.12 (#139229)
Otherwise Python 3.12's `re` seems to be unhappy with `re.error: global flags not at the start of the expression at position 113`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139229
Approved by: https://github.com/ezyang
2024-10-30 02:36:31 +00:00
cyy
b0f84aad5d [3/N] Fix Wextra-semi warnings (#139165)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139165
Approved by: https://github.com/ezyang
2024-10-30 02:08:13 +00:00
5861279f47 Revert "Add support for index_put_ in NT (#135722)"
This reverts commit b4836e5b5ce2891e9af21790d255720e2dbf8e91.

Reverted https://github.com/pytorch/pytorch/pull/135722 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/135722#issuecomment-2445651914))
2024-10-30 01:53:55 +00:00
1797a2035d Drop caffe2 string_utils (#139217)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139217
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2024-10-30 01:13:16 +00:00
cyy
da1c1a9884 [4/N] Don't skip ASAN on some tests (#139189)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139189
Approved by: https://github.com/ezyang
2024-10-30 00:59:32 +00:00
ba40dc19d2 [CI] Run aarch64 build/tests on every trunk commit (#139228)
As we have sccache now, this should be reasonably fast

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139228
Approved by: https://github.com/kit1980
2024-10-30 00:49:06 +00:00
f643499ddd Fix vec128_half_neon.h compilation with GCC (#139235)
`mask` is already defined as `uint16x8_t`, so there is no need to reinterpret it
bd369bb182/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h (L220)

Fixes
```
var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h: In static member function 'static at::vec::DEFAULT::Vectorized<c10::Half> at::vec::DEFAULT::Vectorized<c10::Half>::set(const at::vec::DEFAULT::Vectorized<c10::Half>&, const at::vec::DEFAULT::Vectorized<c10::Half>&, int64_t)':
/var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128_half_neon.h:227:39: error: cannot convert 'uint16x8_t' to 'float16x8_t'
  227 |                 vreinterpretq_u16_f16(mask),
      |                                       ^~~~
      |                                       |
      |                                       uint16x8_t
In file included from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/intrinsics.h:23,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec128/vec128.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/cpu/vec/vec.h:6,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.h:2,
                 from /var/lib/jenkins/workspace/aten/src/ATen/test/vec_test_all_types.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:5841:36: note:   initializing argument 1 of 'uint16x8_t vreinterpretq_u16_f16(float16x8_t)'
 5841 | vreinterpretq_u16_f16 (float16x8_t __a)
      |                        ~~~~~~~~~~~~^~~
```

introduced by https://github.com/pytorch/pytorch/pull/137911

Also, guard any use of NEON intrinsics in `ReducedPrecisionFloatGemvFastPathKernel.cpp` with `!defined(CPU_CAPABILITY_SVE)` otherwise compilation fails with
```
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::VectorizedN<c10::Half, 16>&)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:77:24: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
   77 |   return vaddvq_f32(t0 + t1);
      |                     ~~~^~~~
      |                        |
      |                        at::vec::SVE256::Vectorized<float>
In file included from /var/lib/jenkins/workspace/c10/util/Half.h:51,
                 from /var/lib/jenkins/workspace/c10/util/Float8_e5m2.h:17,
                 from /var/lib/jenkins/workspace/c10/core/ScalarType.h:8,
                 from /var/lib/jenkins/workspace/c10/core/TensorImpl.h:11,
                 from /var/lib/jenkins/workspace/c10/core/GeneratorImpl.h:8,
                 from /var/lib/jenkins/workspace/aten/src/ATen/core/Generator.h:18,
                 from /var/lib/jenkins/workspace/aten/src/ATen/CPUGeneratorImpl.h:3,
                 from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
                 from /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:2,
                 from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/usr/lib/gcc/aarch64-linux-gnu/11/include/arm_neon.h:10423:25: note:   initializing argument 1 of 'float32_t vaddvq_f32(float32x4_t)'
10423 | vaddvq_f32 (float32x4_t __a)
      |             ~~~~~~~~~~~~^~~
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp: In function 'float at::native::SVE256::reduce(at::vec::SVE256::Vectorized<float>)':
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:119:21: error: cannot convert 'at::vec::SVE256::Vectorized<float>' to 'float32x4_t'
  119 |   return vaddvq_f32(x);
      |                     ^
      |                     |
      |                     at::vec::SVE256::Vectorized<float>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139235
Approved by: https://github.com/huydhn
2024-10-30 00:48:57 +00:00
d9e87fb339 [draft-export] Include guards for constraint violation errors (#138748)
Summary:
Added where logs are being added to constrain violations in draft export.

Example output:
```
1. Constraint violation error.
    The specified input dynamic_shapes spec was found to be incorrect during tracing.
    Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
    This occured at the following stacktrace:
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1736, in _wrapped_call_impl
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/torch/nn/modules/module.py, lineno 1747, in _call_impl
        File /data/users/angelayi/fbsource/buck-out/v2/gen/fbcode/1beb9df83fd74b9a/scripts/angelayi/draft_export/__test_draft_export__/test_draft_export#link-tree/scripts/angelayi/draft_export/test_draft_export.py, lineno 138, in forward.
    Because of this, we have modified the dynamic shapes structure to be the following:
    ```
    dynamic_shapes = {'a': {0: 3}}
    ```
```

The result of this diff is also that `dynamic` logs are permanently turned on during draft export. Otherwise we cannot capture the `[guard added]` logs from symbolic_shapes.py.

Test Plan: `buck2 run @//mode/dev-nosan scripts/angelayi/draft_export:test_draft_export -- -r "test_shape_failure" `

Differential Revision: D64862374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138748
Approved by: https://github.com/ezyang
2024-10-30 00:24:17 +00:00
b4836e5b5c Add support for index_put_ in NT (#135722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135722
Approved by: https://github.com/jbschlosser
2024-10-30 00:03:21 +00:00
341a28f0ce Refactors empty_cache to return only MemPool memory to the system (#133602)
Canonically, the empty_cache API releases all cached blocks of the CUDACachingAllocator. There is no API that can release only the cached blocks of a given pool.

In this PR, we extend the functionality of empty_cache API such that it only releases the cached blocks of an active pool. When empty_cache API is called under a MemPoolContext, we only release the cached blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.
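
A rough sketch of the intended usage (assumes the MemPool / MemPoolContext APIs from #124807; exact spellings may differ):
```python
import torch

pool = torch.cuda.MemPool()
ctx = torch.cuda.MemPoolContext(pool)    # make `pool` the active pool
buf = torch.empty(1 << 20, device="cuda")
del buf
torch.cuda.empty_cache()                 # now releases only `pool`'s cached blocks
del ctx
```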

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133602
Approved by: https://github.com/ezyang
2024-10-29 23:58:44 +00:00
bd369bb182 Workaround torch.deploy failures (#139195)
Summary:
torch.deploy environments are backed by an older version of `typing_extensions`, but this runtime could not care less about type-checking.
So pretend that it has `TypeIs` by replacing it with `TypeGuard`.
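
A hedged sketch of the idea (not the actual patch applied to the torch.deploy runtime):
```python
import typing_extensions

# Older typing_extensions lacks TypeIs; alias it to TypeGuard so that
# `from typing_extensions import TypeIs` keeps working at import time.
if not hasattr(typing_extensions, "TypeIs"):
    typing_extensions.TypeIs = typing_extensions.TypeGuard
```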

Fixes test failures introduced by https://github.com/pytorch/pytorch/pull/133814 / D65030974

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//multipy/runtime:test_deploy -- --exact 'multipy/runtime:test_deploy - TorchpyTest.TestNumpy'`

Differential Revision: D65145409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139195
Approved by: https://github.com/Skylion007
2024-10-29 23:36:16 +00:00
fcb36a69cd [ONNX] Add a test file for _building.py (#139107)
Fixes #138761

Add a test file for _building.py to verify and guarantee the correct behavior of OpRecorder. Note that the tests do not validate the model itself, but the expected behavior of the evaluator adding extra ops during input preprocessing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139107
Approved by: https://github.com/justinchuby
2024-10-29 23:25:31 +00:00
a0e095dd9f config: Modify install_config_module to use a layered approach (#138758)
This modifies the config system to use a single mapping of config ->
ConfigEntry and to store the default and user values within it.

We could have used multiple dicts (i.e. user_override and default), but
as we add more fields (justknobs in this PR, perhaps testing and env
variables later), it quickly becomes painful.

There are a couple of design decisions we could change:
1) All configs we save store the resolved value - not the default and
   user override separately.
2) All configs we load apply the resolved value as a user override.

This means that certain complexities of default behaviour and deletion
(as well as JK) will change if you save + load a config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138758
Approved by: https://github.com/ezyang
2024-10-29 23:19:36 +00:00
46d0b635b9 [CMake] Remove pthread linking (#134436)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134436
Approved by: https://github.com/r-barnes
2024-10-29 23:14:40 +00:00
eqy
c9bd712305 [CUDA][AMP] Speed up fp16/bf16 casts on H100+ (#137053)
Similar to #110251 we're seeing cases where vectorization can benefit casts to fp16/bf16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137053
Approved by: https://github.com/drisspg
2024-10-29 23:01:16 +00:00
b29c170bee [PyTorch] Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)
Remove reasons to gate it on ARM.

Differential Revision: [D64280687](https://our.internmc.facebook.com/intern/diff/D64280687/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137917
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915, #137916
2024-10-29 22:38:01 +00:00
fc2d0da773 [PyTorch] Convert reduced precision gemv vectorized tail loop to use whole vector register instead of half (#137916)
The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
2024-10-29 22:38:01 +00:00
5be1556d4a [PyTorch] Clean up Registers/ElementsPerIteration constants (#137915)
In preparation for other vector instruction sets. (NEON and AVX512 have 32 registers, but AVX and AVX2 have only 16.)

Differential Revision: [D64265759](https://our.internmc.facebook.com/intern/diff/D64265759/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137915
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914
2024-10-29 22:37:49 +00:00
aafbea49b9 [PyTorch] Move FP16 dot and GEMV kernels to new file in ATen/native/cpu/ (#137914)
This is in preparation for supporting x86 as well; we need to
be in this directory so that we can get rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). Also incidentally starts
fulfilling request from @malfet to split the ARM64 fast path stuff
into its own file. BFloat16 will be in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
2024-10-29 22:37:37 +00:00
6502d6cf17 [PyTorch] Use Half, not float16_t, in fp16 gemv fast path signatures (#137913)
float16_t is ARM-specific. Half is not.

Differential Revision: [D64218427](https://our.internmc.facebook.com/intern/diff/D64218427/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137913
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912
2024-10-29 22:37:30 +00:00
9ede4b2746 [PyTorch] Migrate fp16 gemv fast path kernel from intrinsics to vec::Vectorized (#137912)
Migrated as much as possible and convenient; focusing on fp16
for now. (This is building toward enabling these fast paths on x86 for
machines without AVX-512fp16/bf16 to fix
https://github.com/pytorch/torchchat/issues/1253 .)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137912
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911
2024-10-29 22:37:24 +00:00
41d7471413 [PyTorch] Specialize Vectorized<Half> for NEON even if FP16 arithmetic isn't available (#137911)
We can do most of what this header does (by line count) anyway by converting to and from float.

Differential Revision: [D64265757](https://our.internmc.facebook.com/intern/diff/D64265757/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137911
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #137661
2024-10-29 22:37:17 +00:00
837538f040 [PyTorch] Move NEON VecConvert specialization from vec256_convert to vec128_convert (#137661)
NEON vectors are 128-bit and don't belong with 256 stuff.

Differential Revision: [D64143615](https://our.internmc.facebook.com/intern/diff/D64143615/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137661
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-29 22:37:10 +00:00
23d590e518 More flexible test parametrization with @reparametrize (#138369)
**Background:** The `@parametrize` decorator enjoys widespread usage as a convenient tool for ensuring extensive test coverage. One particular feature that makes this easy is the ability to stack such decorators, testing over the cross-product of inputs. Example:
```python
class MyTestClass(TestCase):
    @parametrize("x", range(3))
    @parametrize("y", [False, True])
    def test_foo(self, x, y):
        # Invoked with:
        # x=0, y=False
        # x=1, y=False
        # x=2, y=False
        # x=0, y=True
        # x=1, y=True
        # x=2, y=True
        ...
```

Note that the `@ops` and `@modules` decorators employ the same underlying machinery for parametrizing over `OpInfo` / `ModuleInfo` entries. These decorators also parametrize over op-specific `device` / `dtype` info *according to what is supported for each op*.
```python
class MyTestClass(TestCase):
    @ops(op_db)
    def test_foo(self, op, device, dtype):
        # Invoked each OpInfo in the db along with each device / dtype that corresponds
        # with this op according to the OpInfo entry.
        ...
```

Note that this in contrast to the naive cross product between ops and devices / dtypes, which would generate too many tests. Certain use cases benefit from a similar type of flexible parametrization that is more intelligent than simple cross-product composition. It is expensive to generate / run too many tests, even if the unneeded ones are skipped appropriately.

This PR attempts to generalize such flexible parametrization and satisfy these use cases through the introduction of a `@reparametrize` decorator, which operates on an existing parametrizer and allows for customized on-the-fly parametrization through the use of an `adapter_fn`. Examples:
```python
# adapter_fn that adds a new arg
 def include_is_even_arg(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    new_param_kwargs = dict(param_kwargs)
    new_param_kwargs["is_even"] = is_even
    is_even_suffix = "_even" if is_even else "_odd"
    new_test_name = f"{test_name}{is_even_suffix}"
    yield (new_test_name, new_param_kwargs)

# adapter_fn that excludes certain values
def exclude_odds(test_name, param_kwargs):
    x = param_kwargs["x"]
    is_even = x % 2 == 0
    yield None if not is_even else (test_name, param_kwargs)

class MyTestClass(TestCase):
    @reparametrize(parametrize("x", range(5)), include_is_even_arg)
    def test_foo(self, x, is_even):
        # Invoked with both the x value and the new is_even arg
        ...

    @reparametrize(parametrize("x", range(5)), exclude_odds)
    def test_bar(self, x):
        # Only invoked with even x values
        ...
```

For a more real-world use case, imagine you want to write a set of OpInfo tests that parametrize over additional op-specific things beyond `device` / `dtype` (in NJT's case, this includes contiguity type, whether to operate over the batch / ragged / other dims, etc.). The `@reparametrize` decorator allows you to customize the `@ops` parametrization to add in these additional args as they make sense on a per-op basis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138369
Approved by: https://github.com/janeyx99
2024-10-29 22:14:38 +00:00
ebaa774f96 Migrate inductor and torchbench workflows to start experimenting with a100 on aws (#139079)
Excluding nightly workflows, as they are more critical and run less frequently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139079
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-29 22:11:25 +00:00
80c7c7178e Make sure all SDPA tests are ran with tensor cores enabled (#135592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135592
Approved by: https://github.com/eqy
2024-10-29 20:53:10 +00:00
c81d4fd0a8 Upgrade sccache to v0.8.2 for CPU targets (#121323)
This essentially reverts https://github.com/pytorch/pytorch/pull/95997 but switches to building from source from the official Mozilla sccache repo for CPU builds, except the PCH one, see https://github.com/pytorch/pytorch/issues/139188
- Define `SCCACHE_REGION` for the jobs that needs it.
- Enable aarch64 builds to use sccache, which allows one to do incremental rebuilds under 10 min, see https://github.com/pytorch/pytorch/actions/runs/11565944328/job/32197278296

Fixes https://github.com/pytorch/pytorch/issues/121559
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121323
Approved by: https://github.com/atalman
2024-10-29 19:54:36 +00:00
2b577ae58f Implement NJT embedding backward (#138627)
Fixes #138352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138627
Approved by: https://github.com/jbschlosser
2024-10-29 18:44:58 +00:00
a884462bca Add workspace to TritonTemplates (#138050)
Here's a markdown summary for the PR:

# Add workspace buffer support for Triton templates

## Summary
Adds support for templates to allocate and use temporary workspace buffers

## Key Changes
- Add `WorkspaceArg` support in Triton template system
- Automatic workspace allocation/deallocation around kernel execution
- Zero-initialization support for workspace buffers
- Seamless integration with existing tensor management

## Example Usage
```python
def generate(self, ...):
    workspace_arg = WorkspaceArg(
        count=1024*1024,  # 1MB workspace
        zero_fill=True    # Zero-initialized
    )

    return TritonTemplateCaller(..., workspace_arg=workspace_arg)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138050
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-10-29 18:17:54 +00:00
7964bcc3dc [DeviceMesh] fix sub mesh size calculation in create_sub_mesh() (#138945)
**Summary**
This PR fixes a calculation error in DeviceMesh's create_sub_mesh().

**Error Description**
When users call `device_mesh["dim0", "dim1", "dim2", "dim3"]`, it creates a mesh slice, or what we call a "submesh". Users can also slice a submesh from a flattened mesh. For example:
```
flattened_mesh = device_mesh["dim0", "dim1", "dim2"]._flatten("dim0-2")
alias_flattened_mesh = device_mesh["dim0-2"]  # this mesh slice leads to error in current impl
```

This triggers the error below in the size calculation `reduce(lambda, mesh_dim)` in `create_sub_mesh`:
```
IndexError: Dimension out of range (expected to be in range of [-4, 3], but got 4)
```

**Fix**
The usage of the lambda is wrong: for `lambda x, y`, `x` is the accumulated value while `y` is the current element from the iterator.
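
A standalone illustration of the accumulator-vs-element distinction in `functools.reduce` (not the DeviceMesh code itself):
```python
from functools import reduce

mesh_dim_sizes = [4, 2, 1]
# reduce calls the lambda as (accumulated_value, next_element); swapping the
# roles of the two arguments is the kind of mistake fixed here.
submesh_size = reduce(lambda acc, size: acc * size, mesh_dim_sizes, 1)
assert submesh_size == 8
```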

**Test**
`pytest test/distributed/test_device_mesh.py -s -k test_flatten_mesh_4d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138945
Approved by: https://github.com/wz337
2024-10-29 17:56:56 +00:00
cyy
82a6d2db3f [2/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139158)
Follows #139007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139158
Approved by: https://github.com/Skylion007
2024-10-29 17:16:37 +00:00
c98c88a211 [Bugfix] UnicodeDecodeError: 'utf-8' codec can't decode byte (#139062)
Fixes #113564

When I used PyTorch's profiler to analyze the performance of vLLM, I encountered the following error. This error is similar to #113564. After analysis and troubleshooting, I changed the temporary file from text mode to binary mode, and it no longer reported an error and ran normally.

```bash
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 722, in stop
ERROR 10-28 10:25:50 engine.py:160]     self._transit_action(self.current_action, None)
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 751, in _transit_action
ERROR 10-28 10:25:50 engine.py:160]     action()
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 745, in _trace_ready
ERROR 10-28 10:25:50 engine.py:160]     self.on_trace_ready(self)
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 444, in handler_fn
ERROR 10-28 10:25:50 engine.py:160]     prof.export_chrome_trace(os.path.join(dir_name, file_name))
ERROR 10-28 10:25:50 engine.py:160]   File "/usr/local/lib/python3.12/dist-packages/torch/profiler/profiler.py", line 220, in export_chrome_trace
ERROR 10-28 10:25:50 engine.py:160]     fout.writelines(fin)
ERROR 10-28 10:25:50 engine.py:160]   File "<frozen codecs>", line 322, in decode
ERROR 10-28 10:25:50 engine.py:160] UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 5896: invalid start byte
```
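
A hedged sketch of the fix's idea (file names are illustrative; the real change lives in torch/profiler/profiler.py):
```python
import tempfile

# Copy the Kineto trace in binary mode so non-UTF-8 bytes never reach the
# text codec and raise UnicodeDecodeError.
with open("raw_trace.json", "rb") as fin, \
        tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as fout:
    fout.writelines(fin)
```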
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139062
Approved by: https://github.com/ezyang
2024-10-29 17:16:26 +00:00
68134a320e [Flex Attention] Paged Attention (#137164)
This PR adds paged attention for flex attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137164
Approved by: https://github.com/drisspg
2024-10-29 17:05:22 +00:00
cyy
3907f36808 Turn some variables and functions into static (#136847)
Re-check some files and mark variables and functions into static and fix other warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136847
Approved by: https://github.com/ezyang
2024-10-29 17:01:56 +00:00
3f9f6048da [aoti] Print output name for sympy.Expr as well (#138524)
To avoid
```
NotImplementedError: unsupported type of output=s0*s1
```

It seems like this was caused by the use of `_scaled_dot_product_flash_attention`.

Fallback kernek:
```
FallbackKernel(
  python_kernel_name='torch.ops.aten._scaled_dot_product_flash_attention.default',
  name=buf55,
  layout=MultiOutputLayout(device=device(type='cuda', index=0)),
  inputs=[ComputedBuffer(name='buf52', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99da20>, ranges=[1, 6, s0*s1, 64])), ComputedBuffer(name='buf53', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99d480>, ranges=[1, 6, s0*s1, 64])), ComputedBuffer(name='buf54', layout=FixedLayout('cuda', torch.bfloat16, size=[1, 6, s0*s1, 64], stride=[384*s0*s1, 64*s0*s1, 64, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function BaseView.make_loader.<locals>.loader at 0x7fcd7f99c430>, ranges=[1, 6, s0*s1, 64]))],
  constant_args=(0.125,),
  kwargs={'scale': 0.125},
  output_view=None,
  python_kernel_name=torch.ops.aten._scaled_dot_product_flash_attention.default,
  cpp_kernel_name=at::_ops::_scaled_dot_product_flash_attention::call,
  ordered_kwargs_for_cpp_kernel=['scale'],
  op_overload=aten._scaled_dot_product_flash_attention.default,
  arg_properties=[{'name': 'query', 'type': Tensor, 'default_value': None}, {'name': 'key', 'type': Tensor, 'default_value': None}, {'name': 'value', 'type': Tensor, 'default_value': None}, {'name': 'dropout_p', 'type': float, 'default_value': 0.0}, {'name': 'is_causal', 'type': bool, 'default_value': False}, {'name': 'return_debug_mask', 'type': bool, 'default_value': False}],
  kwarg_properties=None,
  unbacked_bindings=None,
  mutation_outputs=[],
  origin_node=None,
  origins=OrderedSet([_scaled_dot_product_flash_attention])
)
```

codegen with this pr
```
// Topologically Sorted Source Nodes: [scaled_dot_product_attention], Original ATen: [aten._scaled_dot_product_flash_attention]
    double var_147 = 0.125;
    AtenTensorHandle buf56_handle;
    AtenTensorHandle buf57_handle;
    auto buf55_4 = s0*s1;
    auto buf55_5 = s0*s1;
    AtenTensorHandle buf58_handle;
    AtenTensorHandle buf59_handle;
    AtenTensorHandle buf60_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cuda__scaled_dot_product_flash_attention(convert_arrayref_tensor_to_tensor(buf52), convert_arrayref_tensor_to_tensor(buf53), convert_arrayref_tensor_to_tensor(buf54), 0.0, 0, 0, &var_147, &buf56_handle, &buf57_handle, nullptr, nullptr, &buf55_4, &buf55_5, &buf58_handle, &buf59_handle, &buf60_handle));
    RAIIAtenTensorHandle buf56(buf56_handle);
    RAIIAtenTensorHandle buf57(buf57_handle);
    RAIIAtenTensorHandle buf58(buf58_handle);
    RAIIAtenTensorHandle buf59(buf59_handle);
    RAIIAtenTensorHandle buf60(buf60_handle);
```

Differential Revision: D64724460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138524
Approved by: https://github.com/chenyang78
2024-10-29 16:02:45 +00:00
a762dc0357 [inductor] Multi-kernel + cooperative reductions (#138893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138893
Approved by: https://github.com/shunting314
ghstack dependencies: #138533
2024-10-29 15:45:17 +00:00
77b0ae832d [inductor] Allow cooperative + persistent reductions (#138533)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138533
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-10-29 15:45:17 +00:00
9d7a0869f0 Make DDP Quantization hooks backend Agnostic (#138816)
The current DDP quantization hooks use the .cuda() API to move tensors and parameters onto the backend device. This limits the DDP quantization hooks to the CUDA backend.
The change makes the code backend agnostic and moves tensors/parameters based on **tensor.device**.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138816
Approved by: https://github.com/kwen2501
2024-10-29 15:02:45 +00:00
869d1ad0b4 [BE] Nested namespace in quantized folder (#139166)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139166
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-29 14:53:07 +00:00
489c66fdb3 [AOTI] fix pointer_to_list (#138806)
Fixes the `pointer_to_list` function to take `*(ptr + i)` instead of `*ptr`.
This fixes the runtime error when running INT8 yolo-v7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138806
Approved by: https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #138691
2024-10-29 14:33:16 +00:00
9af1816974 [AOTI] add C shim for _weight_int8pack_mm (#138691)
Fixes the error of running WOQ-INT8 LLaMA:
```
E           In file included from /home/user/inductor/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3,
E                            from /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:4:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp: In function ‘void inductor_entry_impl(AtenTensorOpaque**, AtenTensorOpaque**)’:
E           /tmp/torchinductor_user/sw/csw5gfmlzp5iooqvfwl2gwn574frwdpmtrx2y6nu2m6x76d3xcux.cpp:117:33: error: ‘aoti_torch_cpu__weight_int8pack_mm’ was not declared in this scope
E             117 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_cpu__weight_int8pack_mm(convert_arrayref_tensor_to_tensor(arg8_1), _frozen_param0, _frozen_param1, &buf0_handle));
E                 |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138691
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-29 13:53:36 +00:00
69d401d010 Update test_quantize_pt2e.py with HPU support (#137863)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

**CHANGES**
- Add support for HPU devices within the test_move_exported_model_bn using TEST_HPU flag
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137863
Approved by: https://github.com/jerryzh168
2024-10-29 13:01:03 +00:00
b9618c9b88 [Dynamo] Add itertools.compress() support (#139061)
Use polyfill to add `itertools.compress()` support in Dynamo.
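
A small sketch of what should now trace without a graph break (assuming this polyfill):
```python
import itertools
import torch

@torch.compile(fullgraph=True)
def pick(x, flags):
    # itertools.compress is handled via a polyfill instead of falling back to eager.
    return sum(itertools.compress([x, x + 1, x + 2], flags))

pick(torch.ones(3), [True, False, True])
```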

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139061
Approved by: https://github.com/jansel
2024-10-29 10:25:55 +00:00
cyy
e201460f8a [2/N] Fix Wextra-semi warnings (#139142)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139142
Approved by: https://github.com/ezyang
2024-10-29 08:14:37 +00:00
93d7f90c3a [inductor] getting AOT inductor to treat None args correctly (#139114)
Differential Revision: [D65102228](https://our.internmc.facebook.com/intern/diff/D65102228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139114
Approved by: https://github.com/aakhundov
2024-10-29 08:11:53 +00:00
8b08559c80 Move more workflows to 3.9 (#139145)
Specifically mergebot and others should be using 3.9 now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139145
Approved by: https://github.com/kit1980, https://github.com/Skylion007, https://github.com/huydhn
2024-10-29 05:39:46 +00:00
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bdd0789b0ea9b9348d552fb1b0e437ff.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
ea93e09896 [CI] Align XPU CI build with CD to fix build issue (#139050)
Works for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139050
Approved by: https://github.com/ezyang
2024-10-29 04:53:53 +00:00
e52ccb3ca6 [Device] Replace hardcoded devices with 'torch._C._get_accelerator()' (#139032)
I noticed some hard-coded device selection like `"cuda" if torch.cuda.is_available() else "cpu"`, which can be replaced with `torch._C._get_accelerator()`.
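
For illustration (a sketch; `torch._C._get_accelerator()` is the private helper referenced by the PR):
```python
import torch

# Before: device = "cuda" if torch.cuda.is_available() else "cpu"
device = torch._C._get_accelerator()  # available accelerator, falling back to cpu
x = torch.zeros(4, device=device)
```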

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139032
Approved by: https://github.com/ezyang
2024-10-29 04:51:47 +00:00
cyy
a0865b00fb [1/N] Fix clang-tidy warnings in python_variable_methods.cpp (#139007)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139007
Approved by: https://github.com/ezyang
2024-10-29 04:48:13 +00:00
cyy
0274d16c01 Fix clang-tidy warnings in jit code (#138974)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138974
Approved by: https://github.com/ezyang
2024-10-29 04:33:40 +00:00
48b55ca1b1 [export] Fix non-strict retracing with kwargs (#138927)
Summary:
`torch.fx.Interpreter.run()` only takes args as input. Currently we pass kwargs as well which causes errors during retracing.

Flattening the kwargs and concatenating them with args solves the issue, as sketched below.
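
A standalone illustration of the idea (not the actual export code):
```python
import torch
import torch.utils._pytree as pytree

args = (torch.randn(2, 2),)
kwargs = {"scale": 2.0}
# fx.Interpreter.run() only takes positional args, so flatten the kwargs and
# append them to args before replaying the traced module.
flat_kwargs, _spec = pytree.tree_flatten(kwargs)
interpreter_args = (*args, *flat_kwargs)
```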

Several previously failing tests under `_retraceability_non_strict` now pass.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability_non_strict
```

Differential Revision: D64980053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138927
Approved by: https://github.com/angelayi
2024-10-29 04:31:21 +00:00
3342b533bb Update setuptool to 72.1.0 (#139144)
As older versions are affected by CVE-2024-6345

Also, update `typing_extensions` to 4.11 to support `TypeIs`, otherwise some of the workflows report following error (but succeed somehow), see [this](https://github.com/pytorch/pytorch/actions/runs/11566785190/job/32196549021):
```
2024-10-29T03:55:01.3601410Z + /Users/ec2-user/runner/_work/_temp/miniconda/bin/conda run -p /Users/ec2-user/runner/_work/_temp/conda_environment_11566785190 --no-capture-output python3 -c 'import torch'
2024-10-29T03:55:01.3602260Z ~/runner/_work/_temp ~/runner/_work/pytorch/pytorch
2024-10-29T03:55:01.8043630Z Traceback (most recent call last):
2024-10-29T03:55:01.8044540Z   File "<string>", line 1, in <module>
2024-10-29T03:55:01.8045670Z   File "/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/torch/__init__.py", line 37, in <module>
2024-10-29T03:55:01.8046690Z     from typing_extensions import ParamSpec as _ParamSpec, TypeIs as _TypeIs
2024-10-29T03:55:01.8048010Z ImportError: cannot import name 'TypeIs' from 'typing_extensions' (/Users/ec2-user/runner/_work/_temp/conda_environment_11566785190/lib/python3.9/site-packages/typing_extensions.py)
```
Also delete macOS-X86 as we no longer build those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139144
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/huydhn
2024-10-29 04:24:51 +00:00
61d0686168 [PyTorch] Use intrusive_ptr(p, DontIncreaseRefcount) directly in TensorBase unsafe borrow ctor (#138934)
We observed ASAN failures stemming from 5ea6777861/torch/csrc/autograd/python_variable.cpp (L403) . Since it's possible that `tensor` is dead here, `borrowed()` needs to avoid dereferencing it. `intrusive_ptr::reclaim` dereferences the pointer in builds with debug checks enabled, so use the DontIncreaseRefcount ctor directly instead.

Differential Revision: [D64990707](https://our.internmc.facebook.com/intern/diff/D64990707/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138934
Approved by: https://github.com/ezyang
2024-10-29 04:20:11 +00:00
6aef58a249 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit c066f4a055020ae994dd10a1b1fafbe3774108cd.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the test is failing in trunk, maybe a landrace? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2443158194))
2024-10-29 04:08:11 +00:00
4ee514144b [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

----

**Update**: Did two items to prevent regression to existing use cases:

1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).

The risk of this new version of PR causing regression should be very low.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-29 03:31:19 +00:00
cyy
d8f99f39cb Avoid unnecessary tensor constructions (#139039)
Because Variable is an alias of Tensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139039
Approved by: https://github.com/Skylion007
2024-10-29 02:23:23 +00:00
e80fe7f13a [dynamo][guards] Skip guards on empty nn module hooks (#138942)
This introduces some unsoundness in guards. Earlier we were skipping the empty nn module hooks dict guard only on inbuilt nn modules, but as seen in https://github.com/pytorch/pytorch/issues/138386, there could still be significant guard overhead. With this PR, we reduce the guard eval latency from 420 us to 280 us (a 1.5x reduction).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138942
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #139040, #138954
2024-10-29 02:11:47 +00:00
2aa5348356 [dynamo][guards] Skip no tensor aliasing guards on parameters (#138954)
This is another unsound guard eval optimization. It's rare in practice to
compile a function with two different parameters as inputs, and then
later call the function with one parameter passed as two different inputs
(aliasing). This further reduces guard overhead from 280 us to 240 us
for the model in https://github.com/pytorch/pytorch/issues/138386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138954
Approved by: https://github.com/jansel
ghstack dependencies: #139040
2024-10-29 02:11:47 +00:00
dee7e715ba [dynamo][refactor] Remaining cleanup from config-cleanup of enable_cpp_guard_manager (#139040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139040
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-29 02:11:39 +00:00
7c7b2d89ba [ROCm] set hipblas workspace (#138791)
Fixes #138532.

This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG.
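
A hedged example of the env var; the value syntax is an assumption that it mirrors CUBLAS_WORKSPACE_CONFIG's `:<size>:<count>` format:
```python
import os

# Set before running GPU work; the size/count values here are purely illustrative.
os.environ["HIPBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```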

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791
Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:37:55 +00:00
eqy
07b0d633b8 [cuDNN][SDPA] Bail out of cuDNN SDPA for seqlen 1 inputs (#138531)
Forwarded #138529 to the cuDNN team but for now but we want to avoid dispatching to unsupported cases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-29 01:03:36 +00:00
1637a40796 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.

In this PR, we extend the functionality of the snapshot API such that it can return only the memory addresses of an active pool. When the snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.
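
A hedged sketch of the intended usage (the `MemPool`/`MemPoolContext` entry points are taken from the PR description and related MemPool work; exact names and availability may differ per release):

```python
import torch

pool = torch.cuda.MemPool()          # assumed MemPool constructor
with torch.cuda.use_mem_pool(pool):  # routes allocations below to `pool`
    x = torch.randn(1024, 1024, device="cuda")
    # With this PR, a snapshot taken while the pool's MemPoolContext is active
    # is described as returning only the segments belonging to `pool`.
    segments = torch.cuda.memory_snapshot()
```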

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-29 01:01:47 +00:00
c066f4a055 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics because the bmm in eager does the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-29 00:54:29 +00:00
2b937e4e6d [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
2024-10-29 00:45:53 +00:00
cyy
383d9e3de6 [4/N] Fix cppcoreguidelines-special-member-functions warnings (#139027)
Follows #138796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139027
Approved by: https://github.com/ezyang
2024-10-29 00:18:18 +00:00
5b39734a0a [DTensor][Test] Fix gloo backend failure when eager_init is turned on (#139097)
We should only pass the `device_id` when the backend is `nccl`. Otherwise, we would run into the following error:
```
RuntimeError: No backend for the parent process group or its backend does not support splitting
```

This also fixes an issue where a test failure was not asserted when using `with_comms()` or `with_comms(eager_init=False)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139097
Approved by: https://github.com/XilunWu
2024-10-29 00:04:06 +00:00
cyy
aa2b17c330 [3/N] Don't skip ASAN on some tests (#139058)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139058
Approved by: https://github.com/ezyang
2024-10-28 23:57:23 +00:00
cyy
5ab81099e3 [2/N] Fix object slice (#139036)
Follows #138880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139036
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-10-28 23:56:36 +00:00
e00ead400c Add a temporary Survey about the search (#139096)
- Add a link to the new search survey
- Add .css classes needed for the search banner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139096
Approved by: https://github.com/seemethere, https://github.com/cjyabraham
2024-10-28 23:43:25 +00:00
ab09c4d913 Add host-side TMA support to AOTInductor (#138878)
This adds host-side Triton TMA support to AOTInductor. Notes:

- Two helper functions, `init1DTMADescriptor` and `init2DTMADescriptor` are added to the C++ wrapper codegen on GPU, conditioned on the model having user-defined Triton kernels with host-side TMA (CUDA-specific).
- C++ wrapper codegen on GPU emits TMA descriptor initialization via the aforementioned helper functions.
- Special handling added for the TMA descriptors (in the Python wrapper codegen) during the compile-time autotuning, as the underlying tensor can't be passed directly to the user-defined Triton kernel. TMA descriptors are generated in-between the source tensor's buffer and the kernel call, like in the full Python wrapper codegen.
- This PR concludes the host-side Triton TMA support in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138878
Approved by: https://github.com/desertfire, https://github.com/chenyang78
ghstack dependencies: #138759, #138877
2024-10-28 23:39:53 +00:00
fd9f4e6770 Back out "[compiled autograd] tls access helpers (#138061)" and Back out "[compiled autograd] Compiled autograd configs in TLS (#137821)" (#139086)
Summary:
Original commit changeset: 9bf80c1492d7

Original Phabricator Diff: D64796226

Original commit changeset: aa1d9ef8f6e6

Original Phabricator Diff: D64796212

Differential Revision: D65072644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139086
Approved by: https://github.com/malfet
2024-10-28 23:37:05 +00:00
18ad44e830 [BE] Test collect env against torch-2.* (#139122)
And also update Python version to 3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139122
Approved by: https://github.com/kit1980
2024-10-28 23:17:38 +00:00
ba749755f5 Bump rexml from 3.3.3 to 3.3.9 in /ios/TestApp (#139088)
Bumps [rexml](https://github.com/ruby/rexml) from 3.3.3 to 3.3.9.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.3.3...v3.3.9)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 15:47:10 -07:00
23fb8baf37 Bump certifi from 2024.2.2 to 2024.7.4 in /tools/build/bazel (#130173)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-28 15:44:49 -07:00
b7524b05d2 Make test_export training IR compatible (#138517)
In this PR, I make test_export compatible with the training IR. The idea is that when we flip the IR to the non-functional training IR, all these tests should stay green. The changes involve reading through each test case and adding the necessary decompositions, etc., to make sure the tests pass. For example, if a test expects mutated buffers to be returned, we need to obtain them by running run_decomp.

Differential Revision: [D64732360](https://our.internmc.facebook.com/intern/diff/D64732360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138517
Approved by: https://github.com/avikchaudhuri
2024-10-28 22:38:19 +00:00
904816d1ed [dynamo] handle 3.13.0 __dict__ watcher bug (#138284)
https://github.com/python/cpython/pull/116115 introduced a bug (https://github.com/python/cpython/issues/125608) where changing the attributes of an object may not fire the dict watchers registered to the object's `__dict__`. It has been fixed by https://github.com/python/cpython/pull/125611 but will only be in 3.13.1+.

This PR disables the dict watcher guard shortcut for `__dict__`s on 3.13.0 and warns the user to try using 3.13.1+ instead. We also added a simple test to check for this functionality in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138284
Approved by: https://github.com/jansel
ghstack dependencies: #138030
2024-10-28 22:25:21 +00:00
35be6aef69 [dynamo] add some cpython debugging methods (#138030)
This PR enables you to inspect PyObjects in C using `INSPECT(...)` without requiring https://docs.python.org/3/howto/gdb_helpers.html. `torch._dynamo.eval_frame.raise_sigtrap` can also be used to set gdb breakpoints while running Python code, e.g.

```python
x = x + 1
torch._dynamo.eval_frame.raise_sigtrap();
# can breakpoint on ceval.c:CALL to breakpoint the `sin` call in C.
x = torch.sin(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138030
Approved by: https://github.com/jansel
2024-10-28 22:25:21 +00:00
edf2a1be97 [ROCm][CK] Explicit cast values to half (#138751)
Addresses ambiguous conversions and calls introduced by these two pull requests:
[[ROCm] CK-based GEMM](https://github.com/pytorch/pytorch/pull/131004)
[[AMD] Fix torch ck backend build with 6.2.1](https://github.com/pytorch/pytorch/pull/138434)

Co-authored-by: cjatin <cjatin@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138751
Approved by: https://github.com/jeffdaily

Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Co-authored-by: cjatin <cjatin@users.noreply.github.com>
2024-10-28 22:00:26 +00:00
ded83d2b16 support torch._utils._flatten_dense_tensors/_unflatten_dense_tensors … (#139023)
Fixes #138897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139023
Approved by: https://github.com/ezyang
2024-10-28 21:59:07 +00:00
8785353f2f Fix tensor subclass + dynamic shapes in torch.compile + aot autograd (#125941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125941
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133337
2024-10-28 21:58:59 +00:00
6baccb430b Update TwoTensor impl. to accept outer_size/outer_stride (#133337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133337
Approved by: https://github.com/bdhirsh
2024-10-28 21:58:59 +00:00
cyy
f4f0f2995d Fix Wextra-semi warnings (#139000)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139000
Approved by: https://github.com/ezyang
2024-10-28 21:48:51 +00:00
52c80f663d change name of dynamo CI chard to dynamo_wrapped (#138233)
Implements https://github.com/pytorch/pytorch/issues/118127
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138233
Approved by: https://github.com/clee2000
2024-10-28 21:42:33 +00:00
02339e674d Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013)"
This reverts commit 74878ac271feecfa3ff3d32f78c7d889bcac97d6.

Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be breaking on trunk. See: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False [GH job link](https://github.com/pytorch/pytorch/actions/runs/11559910615/job/32177150816) [HUD commit link](74878ac271) ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2442667605))
2024-10-28 21:30:28 +00:00
1a275fea4b Remove numpy dependency for maia serialization (#137600)
See rationale in #137444 description

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137600
Approved by: https://github.com/albanD
2024-10-28 20:57:35 +00:00
dd688099af Update unbacked symints in torch.nonzero more precisely (#137663)
### Summary
The fake impl for `nonzero` sets the symint's upper range to `sys.maxsize - 1` if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

See https://github.com/pytorch/pytorch/pull/134899 as a merged solution for a similar problem for a different op.
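
For intuition, a minimal sketch of the bound being computed (illustrative numbers; not the actual fake-impl code):

```python
import math
import sys

# Suppose the input has symbolic shape (s0, 3) and s0 has an upper range of 64.
upper_bounds = [64, 3]

# nonzero returns at most numel(input) rows, so the unbacked output size can be
# bounded by the product of the dims' upper ranges instead of sys.maxsize - 1.
nnz_upper = math.prod(upper_bounds)
assert nnz_upper == 192 and nnz_upper < sys.maxsize - 1
```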

### Test plan
Added unit test to verify upper bound reduction calculation (`python test/export/test_export.py TestExport.test_nonzero_dynamic`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137663
Approved by: https://github.com/ezyang
2024-10-28 20:57:23 +00:00
8fa0479dd8 [inductor] Enable cpp wrapper for test_torchinductor (#138579)
Summary: Expand cpp wrapper testing to test_torchinductor. Using skip_cpp_wrapper to skip failing tests for now, and fixes are coming later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138579
Approved by: https://github.com/chenyang78, https://github.com/benjaminglass1
2024-10-28 20:35:25 +00:00
e5595f10c8 Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)"
This reverts commit a688c57033b4536ef59356cdad241d65ca52a869.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))
2024-10-28 20:13:46 +00:00
8ba9063002 FlexAttention support for NJT (#136792)
This PR adds FlexAttention + NJT support. In particular:
* To handle raggedness, it treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR automatically handles the conversion from indices within the giant "stacked sequence" to sequence-relative indices.
* Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately
* Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported
* Tests that FlexAttention with a causal mask matches causal SDPA
* Adds a new public API for FlexAttention usage:
    * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space.
      * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term.

Example usage:
```python
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

query = ... # NJT of shape (B, H, S*, D)
key = ... # NJT of shape (B, H, S*, D)
value = ... # NJT of shape (B, H, S*, D)
# create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space
block_mask = create_nested_block_mask(causal_mask, 1, 1, query)  # block mask conceptual shape is (B, H, sum(S*), sum(S*))
output = flex_attention(query, key, value, block_mask=block_mask)

def causal_score_mod(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs
output2 = flex_attention(query, key, value, score_mod=causal_score_mod)
```

TODO:
* ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though
* ~~Some cleanup~~
* ~~`njt_score_mod_adapter`~~
* ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~
* Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices?
    * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though.
* ~~Demonstrate non-causal mask~~
* Support non-contiguous NJTs with holes (**booted to future PR**)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792
Approved by: https://github.com/drisspg
ghstack dependencies: #138841
2024-10-28 20:01:27 +00:00
4cd985a886 [dynamo] Remove some files from dynamo_expected_failures (#138935)
Some tests in `test/dynamo` are marked as "expected failure" when testing
with `PYTORCH_TEST_WITH_DYNAMO=1`, i.e., we added files with those test
names in the `dynamo_expected_failures` folder.

However, a lot of those dynamo tests seem to be passing with
`PYTORCH_TEST_WITH_DYNAMO=1`, so this patch removes them from
`dynamo_expected_failures`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138935
Approved by: https://github.com/anijain2305
2024-10-28 19:41:26 +00:00
9e06b5b5cb fix unflatten with HOPs (#138978)
Summary:
Unflatten was broken for HOPs for a couple of reasons:
(1) we didn't expect `get_attr` nodes in the exported program, but they can occur to hold graph arguments to HOPs; such attributes must be moved from the exported program to the corresponding unflattened submodule containing the HOP call.
(2) we don't record metadata for graph arguments on serialization (there's nothing to hold it in our schema), and accordingly the `get_attr` nodes we create on deserialization don't have `nn_module_stack` metadata, which obviously wrecks unflatten.

Test Plan: added a couple of tests

Differential Revision: D65013647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138978
Approved by: https://github.com/zhxchen17
2024-10-28 19:30:56 +00:00
c2ded9ec0d Fix dot reference checks (#138596)
The dot reference implementation should be consistent with the CPU/CUDA implementations since it may be used for meta dispatch.

i.e.
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```

However the below does not raise an exception
```python
x.to("meta").dot(y.to("meta"))
```
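
A hedged sketch of the kind of dtype consistency check the reference implementation should perform (illustrative only; not the actual `torch._refs` code):

```python
import torch

def dot_ref_dtype_check(a: torch.Tensor, b: torch.Tensor) -> None:
    # Mirror the eager error so meta dispatch raises consistently.
    if a.dtype != b.dtype:
        raise RuntimeError(
            "dot : expected both vectors to have same dtype, but found "
            f"{a.dtype} and {b.dtype}"
        )
```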
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
2024-10-28 19:11:40 +00:00
068f7e7a78 torch::optional -> std::optional (#138987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138987
Approved by: https://github.com/Skylion007
2024-10-28 19:09:46 +00:00
228963ad60 Revert "Add test for consistency between meta and CPU devices. (#138515)"
This reverts commit 006130d8eae834d17e3d3e21e61c506740cce6dc.

Reverted https://github.com/pytorch/pytorch/pull/138515 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the test is failing in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/138515#issuecomment-2442357471))
2024-10-28 18:45:09 +00:00
f466df63a9 [torch] Address -Wreturn-type warning when compiling for AMD (#138951)
Summary: Yep yep see title

Test Plan: CI

Differential Revision: D64971115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138951
Approved by: https://github.com/cyyever, https://github.com/adamomainz
2024-10-28 18:26:40 +00:00
817e57f832 Remove Python 3.8 from README (#139089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139089
Approved by: https://github.com/clee2000, https://github.com/malfet
2024-10-28 18:12:11 +00:00
475ba1df8d Expliclty avoid recording when should_record_events is false in record_shapeenv_event (#138965)
Looking at the function record_shapeenv_event, it is hard to tell that it does not always run;
we disable it by setting the top-level is_recording to True when self.should_record_events is false.
This change makes that explicit, to avoid confusion and overloading is_recording.

Alternatively, we could rename is_recording to do_not_record.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138965
Approved by: https://github.com/ezyang
ghstack dependencies: #138804
2024-10-28 18:12:06 +00:00
a688c57033 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager async_op=True collective if under allow_inflight_collective_as_graph_input_ctx() context manager (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-28 18:11:23 +00:00
5c49db98b4 [EZ] Update minversion to 3.9.0 (#139085)
Fixes https://github.com/pytorch/pytorch/issues/138979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139085
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/seemethere, https://github.com/Skylion007
2024-10-28 18:04:29 +00:00
74878ac271 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option which is meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug slipped through previously (and still does today) because there is no CI test with the following recipe: eager init + new group + P2P in that new group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-10-28 18:03:25 +00:00
fb2c750e9d [AOTI][refactor] Move convert_arrayref_tensor_to_tensor logic (#139030)
Summary: Move convert_arrayref_tensor_to_tensor codegen logic to cpp_wrapper_cpu_array_ref.py

Test Plan: CI

Differential Revision: D64904187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139030
Approved by: https://github.com/hl475
2024-10-28 18:00:41 +00:00
949fdd2997 remove redundant a (#139046)
As per title, only one "a" is sufficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139046
Approved by: https://github.com/Skylion007
2024-10-28 17:47:24 +00:00
66a3c249ae Linter for no workflows on fork (#138849)
Minor: adds a linter that ensures that all jobs running on pull_request, schedule, push, etc. have an `if: github.repository_owner == 'pytorch'` condition or depend on a job that has that check.

There is also a setting in Github repos that can disable all workflows for that repo

A lot of these are unnecessary because many jobs use reusable workflows that have that check.  However, this is a one time change so I'm not that bothered

Unfortunately I can't put this at the workflow level, which would make this better

Lots of weird string parsing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138849
Approved by: https://github.com/malfet
2024-10-28 17:46:50 +00:00
01b055abe3 Make masked_scatter core aten (#137949)
Summary: Making `masked_scatter` core aten since it is hard to decompose and we now have a portable kernel for it

Test Plan: N/A

Differential Revision: D64368725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137949
Approved by: https://github.com/larryliu0820
2024-10-28 17:31:53 +00:00
bca696ae81 Switch times to us in CompilationMetrics and improvements (#138975)
Companion logger diff: https://www.internalfb.com/diff/D65012523

* Using float seconds for timestamps is bad because our internal system defaults to float32 precision, and you don't even get second precision for timestamps in float32.
* We decided to use microseconds instead of milliseconds because with millisecond granularity you can end up with the same timestamp if compilation is happening very quickly; it is much better to force non-overlapping spans.
* Because there are so many new fields and I don't feel like reimplementing each on BwdCompilationMetrics, BwdCompilationMetrics is no more; everything in CompilationMetrics is now optional.
* The actual frame compile time collection is not modified (still float) to reduce blast radius, so I just convert to microseconds before making the record (see the sketch after this list). At float64 precision (Python's default), you get about microsecond precision on timestamps, so this shouldn't be a data problem (https://www.leebutterman.com/2021/02/01/store-your-unix-epoch-times-as-float64.html).
* I renamed some entries for clarity. In particular, whenever a timing contains all of its lower phases (e.g., how Inductor also contains Triton compilation) we put "cumulative" in its name. If something doesn't happen at compile time but is delayed until we have actual real inputs, we put "runtime" in its name.
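
A hedged sketch of the seconds-to-microseconds conversion described above (variable names are illustrative, not the actual CompilationMetrics code):

```python
import time

start_s = time.time()   # existing collection still uses float seconds
# ... compilation work happens here ...
end_s = time.time()

# Convert to integer microseconds just before recording; at float64 precision
# this keeps roughly microsecond resolution and forces non-overlapping spans.
start_us = int(start_s * 1_000_000)
duration_us = int((end_s - start_s) * 1_000_000)
```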

Test plan:

```
buck2 run @mode/opt @mode/inplace //scripts/oulgen:runner
```

And then inspect https://fburl.com/scuba/dynamo_compile/sandbox/mslu7f5w and verify the us columns are populated and meaningful.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138975
Approved by: https://github.com/masnesral
2024-10-28 17:17:18 +00:00
cyy
9b2c99d731 Move reduce to template parameter in vectorized_reduction (#138672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138672
Approved by: https://github.com/soulitzer
2024-10-28 17:13:12 +00:00
3685c630b8 [pytorch] Plumb compile context from dynamo.export to aot_compile (#138793)
Summary:
tlparse shows "unknown" for certain items when _export.aot_compile() passes the graph obtained from dynamo.export() to inductor.aot_compile(); we also do not have access to the dynamo trace in the GraphModule exported by dynamo.

This change plumbs through the compile_context into aot_compile as a part of GraphModule.meta without a major change to APIs within dynamo.

Addresses issue: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505

Test Plan:
```
buck2 test mode/opt //caffe2/test/dynamo:test_dynamo
Buck UI: https://www.internalfb.com/buck2/ad64c267-65be-47cf-a94f-e4b26e6e030b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9288674286334710
Network: Up: 83KiB  Down: 314KiB  (reSessionID-1dad223b-c91d-4718-97a4-bb2c81e480f0)
Jobs completed: 10750. Time elapsed: 19:18.5s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 5365. Fail 2. Fatal 0. Skip 4. Build failure 0

buck2 test mode/opt //caffe2/test/dynamo:test_dynamo_fb
Buck UI: https://www.internalfb.com/buck2/179a60bb-34e1-43b3-97ad-91af8a93ab01
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275046340687
Network: Up: 201KiB  Down: 1.8GiB  (reSessionID-36f33983-6d78-4ec9-aa1b-34cee80dcb4f)
Jobs completed: 17. Time elapsed: 42.9s.
Cache hits: 0%. Commands: 1 (cached: 0, remote: 0, local: 1)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxZGXf6/index.html
Report fixed: https://github.com/pytorch/pytorch/issues/123759?fbclid=IwY2xjawGE0LBleHRuA2FlbQIxMQABHS-PRpxvsrsHCDPdStHpqr1jQvx1YOnrPsRAfYAb-oXkU8MxidkIUENY-Q_aem_MAT2oaOgD03C8ggBNm575Q#issuecomment-2430722505

Differential Revision: D64863946

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138793
Approved by: https://github.com/ezyang
2024-10-28 17:07:44 +00:00
91ded0576d Add sym_log2 (#137980)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980
Approved by: https://github.com/bobrenjc93
2024-10-28 17:03:14 +00:00
006130d8ea Add test for consistency between meta and CPU devices. (#138515)
Reference: https://github.com/pytorch/pytorch/issues/138399

This PR introduces an `OpInfo` test that checks whether running each `out=` operation
using meta inputs is consistent with using concrete (e.g. CPU) inputs. More specifically,
it tests the case where the output tensors are not of the expected data type. According to
the `out=` specification, some operations should error.

I have added XFAIL to the set of operations that are currently failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138515
Approved by: https://github.com/ezyang
2024-10-28 16:58:48 +00:00
4dd04db5d0 Revert "[Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)"
This reverts commit 4d92d6e60436b1aeffbf4dfce51f16923505251b.

Reverted https://github.com/pytorch/pytorch/pull/138643 on behalf of https://github.com/wdvr due to reverting due to a large number of internal failures, see below ([comment](https://github.com/pytorch/pytorch/pull/138643#issuecomment-2442036958))
2024-10-28 16:18:38 +00:00
d90717e4e2 Add option to save real tensors in TORCH_COMPILE_DEBUG repro (#138110)
This PR adds a utility that tries to construct the corresponding real tensor values of fake tensors by checking whether their meta storage is contained in the meta converter.

Then, we are able to save real tensor values for fx_graph_runnable if `TORCH_COMPILE_DEBUG_SAVE_REAL=1` is set.
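
A hedged usage sketch: only the `TORCH_COMPILE_DEBUG_SAVE_REAL` variable name comes from the PR; its interaction with the rest of the TORCH_COMPILE_DEBUG repro flow is assumed.

```python
import os

# Enable saving real tensor values into the fx_graph_runnable repro; set before
# compilation runs in this process.
os.environ["TORCH_COMPILE_DEBUG_SAVE_REAL"] = "1"
os.environ["TORCH_COMPILE_DEBUG"] = "1"  # assumed companion flag for the repro flow
```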

Differential Revision: [D64502744](https://our.internmc.facebook.com/intern/diff/D64502744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138110
Approved by: https://github.com/ezyang
2024-10-28 16:18:22 +00:00
2922b9fee1 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/eqy, https://github.com/jeffdaily
2024-10-28 16:07:11 +00:00
ad933578ed [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-28 15:23:56 +00:00
3b0f39336c Revert "Adds snapshot API for MemPools to get pool memory segments (#133601)"
This reverts commit 00504aa6b8b0ae68761b89f023184202e8c79bc8.

Reverted https://github.com/pytorch/pytorch/pull/133601 on behalf of https://github.com/wdvr due to reverting for now as this breaks lots of internal tests. Details below ([comment](https://github.com/pytorch/pytorch/pull/133601#issuecomment-2441864871))
2024-10-28 15:12:20 +00:00
5916def695 Fix MKL status check wrong to MKLDNN. (#139049)
Fix a check that wrongly used the MKLDNN status in place of the MKL status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139049
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-28 14:28:56 +00:00
4d8090cabb Avoid file encoding issues when loading cpp extensions (#138565)
I've found that when using `torch.utils.cpp_extension.load` on my Windows system, decoding errors occur when my .cpp/.cu files contain certain non-English characters.

`test.py`:
```py
from torch.utils.cpp_extension import load
my_lib = load(name='my_cuda_kernel', sources=['my_cuda_kernel.cu'], extra_cuda_cflags=['-O2', '-std=c++17'])
# ......
```

`my_cuda_kernel.cu`:
```cpp
#include <torch/types.h>
#include <torch/extension.h>
// 向量化 <------ some chinese characters

// ......
```

Errors will be reported as:
```
Traceback (most recent call last):
  File "E:\test\test.py", line 8, in <module>
    my_lib = load(
                 ^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1314, in load
    return _jit_compile(
           ^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\cpp_extension.py", line 1680, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 46, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\XXX\AppData\Roaming\Python\Python311\site-packages\torch\utils\_cpp_extension_versioner.py", line 17, in hash_source_files
    hash_value = update_hash(hash_value, file.read())
                                         ^^^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0x96 in position 141: illegal multibyte sequence
```

The issue lies in the fact that the `open()` function in Python is platform-dependent, which can cause decoding errors when a file contains characters that are not supported by the default encoding. PyTorch uses the file contents to generate a hash string:
60c1433041/torch/utils/_cpp_extension_versioner.py (L16-L17)

On my Windows system the default encoding is `gbk`, but all of my cpp files are encoded in `utf-8`.

There is a simple solution to this problem I think: just change the file reading mode to binary mode, which can avoid issues related to file encoding. It works perfectly on my computer.

```diff
- with open(filename) as file:
+ with open(filename, 'rb') as file:
    hash_value = update_hash(hash_value, file.read())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138565
Approved by: https://github.com/malfet, https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-28 14:06:34 +00:00
cyy
1ec76dd1dc Enable clang-tidy on torch/csrc/distributed (#139043)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139043
Approved by: https://github.com/Skylion007
2024-10-28 13:56:54 +00:00
60d1c7138d Revert "[inductor] Cooperative reductions (#137756)"
This reverts commit fed37dbfbceefe306af648ff4fe1e0124c4d7844.

Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322))
2024-10-28 13:24:33 +00:00
2487a834a4 Revert "Add sym_log2 (#137980)"
This reverts commit 5d450d7facd7480482132408acc4c23d80933bab.

Reverted https://github.com/pytorch/pytorch/pull/137980 on behalf of https://github.com/jeanschmidt due to lint broke from this onwards on main ([comment](https://github.com/pytorch/pytorch/pull/137980#issuecomment-2441570186))
2024-10-28 13:21:08 +00:00
8274dadac5 Make OpaqueUnaryFn pickleable (#138395)
Fixes https://github.com/pytorch/pytorch/issues/138070

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138395
Approved by: https://github.com/XuehaiPan, https://github.com/bobrenjc93
2024-10-28 13:10:04 +00:00
cyy
4d9b5a87e4 [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 10:53:11 +00:00
2265c2d48c Add pytorch.wait_counter.actual_codegen_and_compile WaitCounter (#138010)
The current pytorch.wait_counter.codegen_and_compile counter scopes over
cache hit/miss, so it doesn't accurately say whether you're actually
spending time doing an Inductor compile or not. The new counter is /only/
triggered when we're actually about to spend time in Inductor.
It covers Inductor lowering and codegen as well as Triton compilation.
It does NOT cover Triton compilation that occurs when you get a cache hit.

Some more bikeshedding may be needed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138010
Approved by: https://github.com/markkm
2024-10-28 08:06:24 +00:00
46132dc026 [Dynamo] Refactor wrap_fx_proxy (#138933)
During the work to dedup graphs for hierarchical compilation I tried to tame the `wrap_fx_proxy_cls` mess by separating the wrapping into three distinct scenarios (vs. a jumble of conditionals). These are:
1) wrapping a preexisting tensor (`_wrap_fx_preexisting_tensor`)
2) wrapping and tracing a new op into the graph (`_wrap_fx_proxy`)
3) handling a value that is some other proxyable data structure

See `wrap_fx_proxy_cls` for the conditional tree handling these three cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138933
Approved by: https://github.com/williamwen42
2024-10-28 08:05:33 +00:00
9ca749d6cd Revert " [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)"
This reverts commit 7cb3cef05f4b1d1b448a82a01420e2a9ed1ccfe0.

Reverted https://github.com/pytorch/pytorch/pull/138796 on behalf of https://github.com/wdvr due to reverting since this started failing a windows test ([comment](https://github.com/pytorch/pytorch/pull/138796#issuecomment-2440710865))
2024-10-28 07:06:00 +00:00
633dcf1a2d Constant folding for lifted graph (#135060)
Summary:
The current implementation for lifted graphs takes a dict of [constant name: constant value]. The constant values are used to run_node and execute the constant graph to get the folded values, and then new getattr nodes are created for the folded values.

We don't have constant values for the lifted graph during model compilation on MTIA. I think it is more general to allow the constant folding pass to take only the constant names to produce the constant graph, and to represent the folded nodes as placeholders, making it consistent with the lifted graph. Additionally, this mimics the real situation on Sigmoid, where Sigmoid executes the constant graph, gets the folded values, and sets them on the main graph. This diff updates the pass to work with a list of constant names.

Test Plan:
```
buck run mode/opt caffe2/test:test_export -- -r split_const_gm
```

Differential Revision: D62144791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135060
Approved by: https://github.com/SherlockNoMad

Co-authored-by: Tuan Trieu <tuant@meta.com>
2024-10-28 06:28:31 +00:00
a99e8eeb97 Propagate real tensor tracing with torchbind + fixing side effects (#138797)
Summary:
* Fixed real tensor tracing w/ torchbind objs by passing the cloned tensor obj. For now I just catch the exception and have an error message if the `_clone` fails, but up for discussion on what to do here
  * Separate question, should we require people to set up FakeScriptObjects and stuff for draft mode?
* Prevent side effects from happening when we do the first pass of custom ops profiling by cloning/copying everything. Not sure if deepcopying the model will succeed in all cases... But also I guess this path can be removed once custom ops profiling turns into one pass.

Test Plan: `buck2 run @//mode/dev-nosan //scripts/angelayi/draft_export:test_draft_export`

Reviewed By: ydwu4

Differential Revision: D64124825

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138797
Approved by: https://github.com/ydwu4
2024-10-28 06:27:36 +00:00
dd9ff9f139 [compiled autograd] add tests for bwd hooks relative firing order (#139004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139004
Approved by: https://github.com/yf225
ghstack dependencies: #139003
2024-10-28 05:55:56 +00:00
fac74687a6 [compiled autograd] fix node origin graph comments (#139003)
the comment update was done after prehooks were already collected, so prehooks would appear as part of the previous node

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139003
Approved by: https://github.com/yf225
2024-10-28 05:55:56 +00:00
cyy
f9ae3fac8c [Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903
Approved by: https://github.com/ezyang
2024-10-28 05:29:25 +00:00
cyy
39aa3cb8d6 Re-enable skipped ubsan tests (#139008)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139008
Approved by: https://github.com/ezyang
2024-10-28 05:21:31 +00:00
d2052ea84d Update test_multiarray.py to support numpy 2.0+ (#138461)
Import _core instead of core.

Addresses partially #137182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138461
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-28 04:30:50 +00:00
4c6ae39afd Fix some nits in symbolic_shapes.py (#139018)
While I was reading through this file for understanding, I fixed some nits.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139018
Approved by: https://github.com/ezyang
2024-10-28 04:27:12 +00:00
1fad37a023 [audio hash update] update the pinned audio hash (#138402)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138402
Approved by: https://github.com/pytorchbot
2024-10-28 04:04:28 +00:00
6f5d538972 [executorch hash update] update the pinned executorch hash (#138661)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138661
Approved by: https://github.com/pytorchbot
2024-10-28 03:44:00 +00:00
d72241d045 [Ez][BE]: Fix one more incorrect TypeIs (#139010)
One other case where the side conditions could cause inaccurate typing info. Follow up to #138990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139010
Approved by: https://github.com/malfet
2024-10-28 03:36:45 +00:00
cyy
f7dc13806e [2/N] Don't skip ASAN on some tests (#138663)
Follows #138571
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138663
Approved by: https://github.com/ezyang
2024-10-28 03:35:57 +00:00
5d450d7fac Add sym_log2 (#137980)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1515595595745313/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137980
Approved by: https://github.com/bobrenjc93
2024-10-28 03:09:11 +00:00
c056dc4cb8 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
As the title says; we also avoid calling defer_assert when we statically know the guard result.
Timing for pnasnet5large:

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
which matches the timing without the diff:
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-28 02:19:55 +00:00
7cb3cef05f [3/N] Fix cppcoreguidelines-special-member-functions warnings (#138796)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138796
Approved by: https://github.com/ezyang
2024-10-28 01:38:02 +00:00
cyy
d2ec289787 Turn header static function into inline (#138671)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138671
Approved by: https://github.com/ezyang
2024-10-27 20:07:39 +00:00
192385e261 Add sym_sum to TorchInGraphFunctionVariable (#138848)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138848
Approved by: https://github.com/Skylion007
2024-10-27 20:04:35 +00:00
beb15c80fb print USE_STATIC_MKL for further debug. (#138902)
Print `USE_STATIC_MKL` for further debugging.
<img width="257" alt="image" src="https://github.com/user-attachments/assets/cd45bada-c28a-441a-b271-35956cfe1f21">
If we use `MKL`, this shows how it is linked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138902
Approved by: https://github.com/ezyang
2024-10-27 18:08:30 +00:00
652a2ab93e [BE] Skip print(foo) tests (#139009)
Skipped `test_exponential` and `test_multinomial` because simply printing the result of an operator does not constitute a test. The testing framework does not attempt to interpret the output.
Modified `test_print_non_contiguous` to get the tensors' string representation, which is an equivalent operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139009
Approved by: https://github.com/Skylion007
2024-10-27 18:04:03 +00:00
ee11e2da1e [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so it has the same effect as blocking init and offers no "opening" compared to eager mode.
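
For context, a hedged sketch of the eager-init path this PR targets (passing `device_id` to `init_process_group` requests eager NCCL comm creation; the single-process rendezvous values below are placeholders for illustration):

```python
import torch
import torch.distributed as dist

# Single-process example values; real jobs get these from the launcher.
dist.init_process_group(
    "nccl",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
    device_id=torch.device("cuda", 0),  # requests eager comm init
)
dist.destroy_process_group()
```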

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
2024-10-27 17:40:43 +00:00
fed37dbfbc [inductor] Cooperative reductions (#137756)
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
    xnumel = 1
    rnumel = 1048576
    rsplit_id = tl.program_id(0)
    num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
    rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
    rsplit_start = rsplit_chunk * rsplit_id
    rsplit_end = rsplit_chunk * (rsplit_id + 1)
    xoffset = tl.program_id(1) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
    rbase = tl.arange(0, RBLOCK)[None, :]
    _tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
    for roffset in range(rsplit_start, rsplit_end, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r0 = rindex
        tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
        tmp2 = tmp0 + tmp1
        tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
        tmp5 = _tmp4 + tmp3
        _tmp4 = tl.where(rmask, tmp5, _tmp4)
    tmp4 = tl.sum(_tmp4, 1)[:, None]
    if RSPLIT > 1:
        tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
        tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
    if RSPLIT > 1:
        triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
    if RSPLIT > 1:
        tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
        tmp4 = tl.sum(tmp4_peers, 1)[:, None]
    if rsplit_id == (0 % RSPLIT):
        tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
ghstack dependencies: #138970
2024-10-27 16:31:38 +00:00
3217ae2082 [inductor] Only apply score_fusion_memory_threshold to horizontal fusions (#138970)
PR #136782 made `x.sum()+1` become two kernels, which hurts compile
times as @ezyang noticed and breaks a lot of the tests in this stack.  This reworks that heuristic to not apply as often.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138970
Approved by: https://github.com/shunting314
2024-10-27 16:31:38 +00:00
bae3426af7 reimport pr137735 due to merging check issues (#138959)
This is a cherry-pick from #137735 by @mikaylagawarecki, which cannot be merged due to a (wrongly) failing check for codev.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959
Approved by: https://github.com/mikaylagawarecki
2024-10-27 16:31:34 +00:00
144d75d934 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 07e30eae2a8241e531890b6c9a33ab5a80c5ccaf.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2440070035))
2024-10-27 15:39:33 +00:00
d969b34377 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)"
This reverts commit f1a677cba5ef7514f2cf303753d3117528867a33.

Reverted https://github.com/pytorch/pytorch/pull/138804 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to fail pr_time_benchmarks job in trunk ([comment](https://github.com/pytorch/pytorch/pull/138804#issuecomment-2440069407))
2024-10-27 15:36:46 +00:00
5d074746e9 [BE]: Add better optional typing (#138426)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138426
Approved by: https://github.com/XuehaiPan, https://github.com/malfet
2024-10-27 14:19:00 +00:00
d9534a50a9 [AOTI][refactor] Separate header codegen (#138882)
Summary: Move arrayref specific header codegen logic to cpp_wrapper_cpu_array_ref.py, and consolidate some header files codegen logic

Test Plan: CI

Differential Revision: D64899248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138882
Approved by: https://github.com/hl475
2024-10-27 14:14:27 +00:00
40c098f731 Introduce a device-agnostic runtime API design (#132204)
# Motivation
According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design.
I personally prefer the **Simple Version** APIs that no longer accept the device type as an input argument. This means we will leverage `getAccelerator` to fetch the current accelerator, and it remains flexible to expand these APIs to handle multiple accelerator scenarios. The design does **NOT** break the previous design philosophies.
I also believe that the torch.accelerator namespace is better: it lets users know that the APIs they are calling run on an accelerator rather than the CPU. This is important. Meanwhile, we can follow a simple API design principle:
1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter.
2. Device-specific APIs should be placed under device-specific submodules.
3. APIS required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter.

Also, I list the pros and cons of **Simple Version** here:
Pros:
- `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience;
- more concise, facilitate the developer to write a device-agnostic code.

Cons:
- no obvious drawbacks.

# Additional Context
I list the new APIs here:
```python
torch.accelerator.is_available() -> bool:
torch.accelerator.current_accelerator() -> torch.device:
torch.accelerator.device_count() -> int:
torch.accelerator.current_device_idx() -> int:
torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None:
torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream:
torch.accelerator.set_stream(stream: torch.Stream) -> None:
torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None:
```
Following the discussion with Alban, we decided to rename `set_device` to `set_device_idx` and `current_device` to `current_device_idx` to be more explicit. A follow-up PR will add support for device and stream context managers.
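
As a quick illustration (not part of the original PR text), here is a minimal sketch of how these APIs compose, using the names exactly as listed above:

```python
import torch

# Minimal sketch using the proposed device-agnostic APIs as listed above;
# actual availability depends on which accelerator the build supports.
if torch.accelerator.is_available():
    dev = torch.accelerator.current_accelerator()
    print("accelerator:", dev, "count:", torch.accelerator.device_count())

    torch.accelerator.set_device_idx(0)
    stream = torch.accelerator.current_stream()

    x = torch.ones(4, device=dev)
    y = x * 2
    torch.accelerator.synchronize()  # device-agnostic analogue of torch.cuda.synchronize()
    print(y, stream)
```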

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204
Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD
2024-10-27 10:37:09 +00:00
1152726feb [PGNCCL] Use recursive mutex in NCCLComm (#138997)
Fixes #138995: [PGNCCL][BUG] mutex acquired in recursive way may deadlock

The fix: use `std::recursive_mutex` to replace `std::mutex`.

Found and proposed by @dsjohns2. Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138997
Approved by: https://github.com/dsjohns2
2024-10-27 08:58:47 +00:00
4681539f42 [inductor] force strides for efficient attn bwd (#138879)
Try to fix https://github.com/pytorch/pytorch/issues/138772 .

aten._scaled_dot_product_efficient_attention_backward requires the out and gradient_out to have stride order (3, 1, 2, 0).  When Inductor layout optimization is enabled, Inductor may change tensor strides if they are not user visible. For efficient_attention_backward, Inductor tries to follow eager strides. But the eager strides Inductor gets for backward graph may be the one after optimization. There are a few possible fixes:
1. change the kernel to allow stride order other than  (3, 1, 2, 0). This is probably hard
2. backout https://github.com/pytorch/pytorch/pull/112045/files and don't do layout optimization if the model contains efficient_attention.
3. Force (3, 1, 2, 0) strides order for the relevant tensors
4. Pass original eager layouts to Inductor for the backward graph. Let Inductor follow those layouts for tensors with extra layout requirement.

The PR implements option 3. Option 4 looks more general to me, I think we can do this in long term.

I tried to add a test but failed to repro: https://gist.github.com/shunting314/fe37a246aad269de9ea00199446688f6

Here is the original command to repro the issue:
```
TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 time python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64
```
benchmark.py is https://github.com/huggingface/pytorch-image-models/blob/main/benchmark.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138879
Approved by: https://github.com/drisspg, https://github.com/eellison
2024-10-27 04:54:15 +00:00
c480a479b1 Make automatic_dynamic state live per CodeId, rather than on code object (#138740)
This changes semantics if you are dealing with multiple code objects that have exactly the same filename/firstlineno/name but are distinct objects and need non-aliasing automatic dynamic state. Otherwise, this should be equivalent (modulo lifetime). I want to do this because when I do PGO I can't index on code object identity; I need a stable identifier.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138740
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693, #138717
2024-10-27 03:08:41 +00:00
14a45d7793 Refactor core algorithm for automatic dynamic shapes (#138717)
While working on automatic dynamic PGO (https://github.com/pytorch/pytorch/pull/138052) one abstract property I was looking for out of profile information is that it formed a semilattice: I could join together two profiles and get a merged profile that is consistent with the profiles that I saw in both cases. While working on this data structure that supported joins, I realized that the base automatic dynamic algorithm could be implemented in this way, therefore this refactor.

The basic recipe is that we now support a join operation on FrameStateSizeEntry. Intuitively, if you join two sizes that are equal, you get back that size (join(2, 2) == 2), but if you join two different sizes you get a special singleton auto_dynamic indicating that the size of the tensor is dynamic (join(2, 3) == auto_dynamic). So now, the automatic dynamic algorithm is: (1) compute the FrameStateSizeEntry that corresponds to the concrete values we've seen, and (2) join it into the ambient FrameStateSizeEntry. As a bonus, compiler collectives can buy into the same abstraction (we're simply distributing FrameStateSizeEntry from each node to every other node). For convenience, I also added the necessary `auto_unset` extra state which is the identity element (which makes our semilattice bounded from both top and bottom). Here, join(2, auto_unset) == 2.

While doing this, there was a complication: the infer stride algorithm wasn't technically a semilattice. Here, I did what I suggested in the original code review https://github.com/pytorch/pytorch/pull/130232 which is stop using a heuristic, and instead replicate the stride inference algorithm in automatic dynamic. This means that when I join strides together, I don't join their concrete values, instead, if a stride can be inferred as the contiguous stride for a particular inner dimension, then you represent it as InferStride(dim). There's an example in code which I recommend looking at.
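
To make the join behaviour concrete, here is a small self-contained sketch of the lattice described above (illustrative only, not the actual FrameStateSizeEntry code):

```python
# auto_unset is the identity element, auto_dynamic absorbs everything,
# and two different concrete sizes join to auto_dynamic.
AUTO_UNSET = object()
AUTO_DYNAMIC = object()

def join(a, b):
    if a is AUTO_UNSET:
        return b
    if b is AUTO_UNSET:
        return a
    if a is AUTO_DYNAMIC or b is AUTO_DYNAMIC:
        return AUTO_DYNAMIC
    return a if a == b else AUTO_DYNAMIC

assert join(2, 2) == 2                        # same size stays static
assert join(2, AUTO_UNSET) == 2               # identity element
assert join(2, 3) is AUTO_DYNAMIC             # differing sizes become dynamic
assert join(AUTO_DYNAMIC, 7) is AUTO_DYNAMIC  # dynamic stays dynamic
```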

Some other extra things that are happening in this PR:

* I tried to deduplicate the size/stride automatic dynamic logic as much as possible. So hopefully less code to review here.
* I had to reimplement all the logging. For the most part I tried to track the logging as closely to the original as possible, but I think we could be emitting less Chrome events here
* The `marked_dynamic` handling is still preserved as is, but I kind of don't like it and we should figure out how to put it somewhere else

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138717
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #138693
2024-10-27 03:08:41 +00:00
28013aa527 [AOTInductor] Disable comprehensive_padding when use_runtime_constant_folding=True (#138872)
Summary:
Disable comprehensive_padding when use_runtime_constant_folding=True.
We need to disable comprehensive padding because it modifies the strides, so the stride information between the constant graph and the main graph would differ.

Test Plan:
```
buck2 run mode/opt -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=a100  caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/643940255/17/gpu_lowering/input.predictor.disagg.gpu.merge  --lower-backend="AOT_INDUCTOR_EP" --aot-inductor-config="{'max_autotune': True, 'aot_inductor.use_runtime_constant_folding': True}"
```

Reviewed By: 22quinn, henryoier

Differential Revision: D64927546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138872
Approved by: https://github.com/chenyang78
2024-10-27 01:12:27 +00:00
fee17d530d [AOTInductor] Add relu_nan_to_num option for pre-grad passes (#138545)
Summary: Add a relu_nan_to_num in pre-grad pass.

Test Plan: Included in commit

Differential Revision: D64724780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138545
Approved by: https://github.com/chenyang78
2024-10-27 00:57:11 +00:00
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
cyy
fb36daac9f [7/N] Fix extra warnings brought by clang-tidy-17 (#138972)
Fix extra warnings brought by clang-tidy-17

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138972
Approved by: https://github.com/Skylion007
2024-10-26 19:09:47 +00:00
3a6f014381 [Inductor] improve the stride preservation logic of user-visible outputs (#136732)
## Context

Previously, the stride preservation of user-visible nodes worked as follows:

- After joint-graph tracing, we recorded the **names** of user-visible nodes and passed them to GraphLowering.
- In GraphLowering, we determined whether we needed to preserve the striding for a certain node by checking if the node's name was in `user_visible_outputs`.
- We obtained the original strides by checking `node.meta["val"].stride()`.

However, there's a problem with this approach: the nodes in output_node.args[0] and their strides could change between the completion of joint-graph tracing and the consumption of `user_visible_outputs` (e.g., during post-grad passes), making it unreliable.

## This PR

- After joint graph tracing:
  - Record the original strides for all nodes in `output_nodes.args[0]` as `output_node.meta["original_output_strides"]` (recording for all nodes in case we need the info for other purposes such as debugging).
  - Record the indices of user-visible outputs as `output_node.meta["user_visible_output_idxs"]`.
- Remove the original plumbing of `user_visible_outputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136732
Approved by: https://github.com/Chillee
2024-10-26 18:49:14 +00:00
1d83a893c5 [BE][MPS] Use templates in Repeat shader (#138962)
- Instead of generating shader from templated code on host, just define two specializations of one kernel template
- Get rid of unused `threads_per_threadgroup` argument
- Replace `if (typeid(scalar_t) == typeid(int32_t))` with `if constexpr (std::is_same_v<scalar_t, int32_t>)` in the host code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138962
Approved by: https://github.com/janeyx99
2024-10-26 17:42:07 +00:00
e78c4ded48 Use the unicode variant of the Windows API (#47422) (#138605)
Use the unicode variant of the Windows API in c10/util/Backtrace.cpp
- #47422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138605
Approved by: https://github.com/peterjc123, https://github.com/malfet
2024-10-26 17:41:39 +00:00
cyy
1a73255102 Concat namespaces in jit code (#138976)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138976
Approved by: https://github.com/Skylion007
2024-10-26 17:41:27 +00:00
4de93d1ead [BE][Ez]: Fix bad TypeIs conversion (#138990)
Fixes on TypeIs / TypeGuard conversion error. Follow up to #133814
Thanks for @ezyang for reminding me to double check the side conditions here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138990
Approved by: https://github.com/malfet
2024-10-26 17:37:40 +00:00
705f5b3489 Several enhancements for check_results.py (#137925)
1) Always generate expected_results.csv rounded to the accuracy of the first three digits, e.g. 112313212312 --> 1120000000, etc. (see the sketch after this list).
2) Regenerate all records in expected_results.csv, not just the failed ones. Why? Because if something changes by 1.3% while the noise threshold is 1.5%, we still want to reflect that.
3) Add the message "please update all results that changed significantly, and not only the failed ones".

```
(myenv) [lsakka@devgpu005.nha1 ~/pytorch/benchmarks/dynamo/pr_time_benchmarks (check_result_ehancements)]$ python check_results.py test_check_result/expected_test.csv te
st_check_result/result_test.csv out
WIN: benchmark ('a', 'instruction count') failed, actual result 9011111111 is -18.16% lower than expected 11011111111 ±1.00% please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('b', 'memory') failed, actual result 20011111111 is 99.89% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
REGRESSION: benchmark ('c', 'something') failed, actual result 107111111111 is 969.92% higher than expected 10011111111 ±+10.00% if this is an expected regression, please update the expected results.

please update all results that changed significantly, and not only the failed ones
MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

new expected results file content if needed:
a,instruction count,9011000000,0.01
b,memory,20010000000,0.1
c,something,107100000000,0.1

There was some failures you can use the new reference expected result stored at path:out and printed above

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137925
Approved by: https://github.com/aorenste
2024-10-26 16:27:55 +00:00
1a2dc89f17 [Dynamo] Allow torch.cond() to handle empty arguments (#138190)
Fixes #138150

```python
import torch

@torch.compile(fullgraph=True)
def foo(x, y, z):
    def f():
        return y + 2

    def g():
        return z + 1

    return torch.cond(x, f, g)

print(foo(torch.zeros(1), torch.ones(1), torch.ones(1))) # tensor([2.])
print(foo(torch.ones(1), torch.ones(1), torch.ones(1))) # tensor([3.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138190
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-10-26 15:26:21 +00:00
c84f9b2069 [dynamo][guards] Log average time of constructed guard_manager (#138941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138941
Approved by: https://github.com/jansel
ghstack dependencies: #138512, #138896
2024-10-26 15:14:46 +00:00
dba6887dc6 [dynamo][refactor][config-cleanp] Use guard_manager consistently instead of check_fn (#138896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138896
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #138512
2024-10-26 15:14:46 +00:00
49ed365b22 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/
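
For readers unfamiliar with the difference, a small standalone illustration of the `TypeIs` form (independent of this PR's code, assuming `typing_extensions` is installed):

```python
from typing import List
from typing_extensions import TypeIs

def is_str_list(val: List[object]) -> TypeIs[List[str]]:
    return all(isinstance(x, str) for x in val)

def demo(val: List[object]) -> None:
    if is_str_list(val):
        print(", ".join(val))        # narrowed to List[str] here
    else:
        print("mixed list:", val)    # unlike TypeGuard, also narrowed in the negative branch
```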

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-10-26 15:07:13 +00:00
eb6c7b93a7 Log AOTAutogradCache state to PT2 Compile Events (#138604)
Same as previous diff for inductor, but for autograd instead

Differential Revision: [D64765199](https://our.internmc.facebook.com/intern/diff/D64765199/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138604
Approved by: https://github.com/oulgen
2024-10-26 15:04:38 +00:00
f1a677cba5 In Inductor, be willing to generate deferred runtime asserts when unbacked (#138804)
As per the title; additionally, we avoid calling defer_assert when we statically know the guard results.
Timing for pnasnet5large:

```
TIMING: code_gen:21.79672 inductor_compile:39.57726 backend_compile:65.30649 entire_frame_compile:95.22052 total_wall_time:95.22052
```
matches the timing without the diff:
```
TIMING: code_gen:21.89314 inductor_compile:39.72298 backend_compile:65.38539 entire_frame_compile:95.0854 total_wall_time:95.0854
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138804
Approved by: https://github.com/ezyang
2024-10-26 15:03:53 +00:00
14a17ad630 Elide calls to is_nested in Dynamo-traced graphs (#138841)
Before this PR, calling `is_nested` in-graph would result in graph code like the following:
```python
  class GraphModule(torch.nn.Module):
      def forward(self, L_nt_: "f64[3, s1, 5]", s1: "Sym(s1)"):
          l_nt_ = L_nt_

          # Note this useless line!
          getattr_1 = l_nt_.is_nested;  getattr_1 = None

          add: "f64[3, s1, 5]" = l_nt_ + 2;  l_nt_ = None
          return (add,)
```

This PR follows what is done for `is_sparse` / `is_quantized`: store it onto `TensorVariable` and have `getattr` calls to `is_nested` return the stored value as a constant. This removes the useless line above from the graph. Note that guarding is handled through tensor type check guards, so no need to guard on `is_nested` status.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138841
Approved by: https://github.com/soulitzer
2024-10-26 15:03:32 +00:00
3234b251b3 Fix typos in CreateTMADescriptorVariable (#138877)
This fixes some leftover typos in
CreateTMADescriptorVariable.call_function (and close).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138877
Approved by: https://github.com/davidberard98, https://github.com/zou3519, https://github.com/Skylion007
ghstack dependencies: #138759
2024-10-26 15:03:07 +00:00
043864afdf enable test_x86inductor_quantizer.py UTs on Windows. (#138937)
These UTs failed months ago, but as the main branch moved forward, some PRs fixed them. Let's turn them back on.

Local test passed:
<img width="863" alt="image" src="https://github.com/user-attachments/assets/a2ec160c-cdf1-404d-bc24-2f60faa8d791">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138937
Approved by: https://github.com/jansel
2024-10-26 12:48:51 +00:00
a3aca24ae5 [AOTI] add C shim for QLinearPointwise (#138439)
This PR adds C shim for `QLinearPointwisePT2E` and `QLinearPointwiseBinaryPT2E`.

The below changes are needed:
- We moved the qlinear API out of the anonymous namespace since we need to call it in the shim layer.

- We fixed the code which generated the `inputs` and `constant_args` so that we can directly leverage the `codegen` of the parent class.

- `x_scale` and `x_zp` are ensured to be tensor during the lowering stage, thus we can remove the code which handles whether they're tensor or not.
  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L492-L496)

  fb0da32377/torch/_inductor/mkldnn_lowerings.py (L499-L503)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138439
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-10-26 08:04:15 +00:00
99608ceed6 Scoped extension building for C++ backed custom ops tests (#136695)
FIXES #125579 #131103 #133197 #133283 #134738 #135369 #135685

Tests that create C++ extensions can cause flakiness in CI due to library namespace conflict and test ordering. We can build them in temp dirs to ensure isolation.

An alternative would be to build these as part of the build process, so failures would surface as build-time errors.
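
A rough sketch of the temp-dir isolation idea, assuming `torch.utils.cpp_extension.load` with a placeholder source file `my_op.cpp`:

```python
import tempfile
from torch.utils.cpp_extension import load

# Build each test's extension into its own temporary directory so that
# parallel or reordered tests never share build artifacts or clash on a
# library namespace.  "my_op.cpp" is a placeholder for the test's source.
with tempfile.TemporaryDirectory() as build_dir:
    ext = load(
        name="isolated_test_ext",
        sources=["my_op.cpp"],
        build_directory=build_dir,
        verbose=False,
    )
```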

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136695
Approved by: https://github.com/zou3519
2024-10-26 07:41:00 +00:00
10e2840ce3 Enable failing diffs on update_hint_regression and sum_floordiv_regression and autograd benchmarks regression (#137548)
update_hint_regression has been behaving, so I am setting a 2% noise threshold for it, and 1.5% for sum_floordiv_regression.

I have one concern with the way we do regression detection: small changes below the threshold will accumulate and eventually trigger a failure. To avoid that, we would have to keep an eye on the dashboard and refresh the expected-results file regularly, even when there are no failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137548
Approved by: https://github.com/aorenste
2024-10-26 07:28:49 +00:00
07e30eae2a [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by user nor set via env -- `ProcessGroupNCCL` can have some preferred logic. And torch-level API semantics does not change whether the NCCL comm is blocking or non-blocking (handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- the blast radius is too big.
2. There is no gain from doing lazy init in non-blocking mode: the very next CPU call is a collective, and we will block there waiting for the comm to be ready, so the effect is the same as blocking init, with no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #138860
2024-10-26 06:53:15 +00:00
00504aa6b8 Adds snapshot API for MemPools to get pool memory segments (#133601)
Canonically, the snapshot API returns the entire memory state of the CUDACachingAllocator (using `get_all_blocks`). There is no API that can only return the memory state of a given pool.

In this PR, we extend the functionality of snapshot API such that it can only return the memory addresses of an active pool. When snapshot API is called under a MemPoolContext, we only return the blocks that correspond to the pool id of the active pool.

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133601
Approved by: https://github.com/ezyang
2024-10-26 03:34:59 +00:00
940658405b [test/test_cuda] Use temp file for test_improper_device_name (#138856)
Use `tempfile.NamedTemporaryFile()` to have test_specify_improper_device_name save/load to a tmp file rather than the current-working-directory
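
A minimal illustration of the pattern (not the test's exact code):

```python
import tempfile
import torch

# Save/load through a named temporary file instead of polluting the CWD.
# Re-opening the file by name works on POSIX; Windows needs delete=False.
with tempfile.NamedTemporaryFile(suffix=".pt") as f:
    torch.save(torch.arange(4), f.name)
    print(torch.load(f.name, weights_only=True))
```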
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138856
Approved by: https://github.com/Skylion007
2024-10-26 02:42:25 +00:00
0ac9a663ec [hop] always trace subgraph with fake to support .item in eager mode (#138771)
Fixes https://github.com/pytorch/pytorch/issues/138664

When we eagerly run torch.cond with autograd keys set, we'll create_fw_bw_graph using real tensors. This PR forces fakification when we cannot detect a fake mode, so that the .item() calls can be traced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138771
Approved by: https://github.com/zou3519, https://github.com/malfet
2024-10-26 02:17:17 +00:00
f14247d5aa [dynamo] Accurately identify mutated cells captured by multiple functions (#138632)
This patch changes `mutated_closure_cell_contents: Set[str]` to
`mutated_closure_cell_ids: Set[int]` so that Dynamo can more accurately
identify closure cells across different instances of
`UserFunctionVariable`. This prevents Dynamo from mistakenly treating a
cell as immutable, even though it will be mutated when referenced as a
closure cell from another function.
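
A minimal example of the kind of pattern involved (illustrative only, not the regression test from this PR):

```python
import torch

def make_pair():
    counter = 0  # one cell captured by both closures below

    def bump():
        nonlocal counter  # mutates the shared cell
        counter += 1

    def read(x):
        return x + counter  # reads the same cell

    return bump, read

bump, read = make_pair()
compiled_read = torch.compile(read)

print(compiled_read(torch.zeros(1)))  # counter == 0
bump()                                # cell mutated via the other closure
print(compiled_read(torch.zeros(1)))  # must reflect counter == 1, not a stale constant
```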

More context in
https://github.com/pytorch/pytorch/issues/138112#issuecomment-2420580779.

Fixes #138112.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138632
Approved by: https://github.com/jansel
ghstack dependencies: #138639
2024-10-26 02:17:07 +00:00
1e1f0ceb40 Allow Lazy Module to be modelled as UnspecializedNNModuleVariable (#138639)
This patch
- removes the `is_lazy_module` check from `is_dynamic_nn_module`, and
  adds a regression test.
- removes a series of dynamo expected failures on lazy modules. The few
  I checked were all failing due to speculation log divergence, similar
  to #138489.

Note that #100047 introduced the conditional removed in this patch, and
it was trying to fix #100001. But I've confirmed locally that #100001 no
longer repros after this patch.

Fixes #138489. See more context in the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138639
Approved by: https://github.com/jansel
2024-10-26 02:17:07 +00:00
4af93fdb77 [BE]: Update cudnn_frontend submodule to 1.8.0 (#138709)
Update cudnn frontend. Let's see what breaks

@eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138709
Approved by: https://github.com/eqy
2024-10-26 01:55:33 +00:00
565a53d326 Use DLPack for creating tensors out of custom classes, when available. (#138697)
Fixes #120614
Takes over #120615

In summary, this PR:
- Adds a `__dlpack__` attribute check in the tensor creation path (i.e. [`internal_new_from_data` @ tensor_new.cpp](cdfe1bffd1/torch/csrc/utils/tensor_new.cpp (L266)))
    - Creates the tensor by using the DLPack machinery, instead of an element-by-element copy
    - No changes since #120615
- Adds a test, making sure the DLPack machinery is used
    - Wraps a tensor in a fresh `TensorDLPackWrapper` class that implements only the DLPack methods
    - Creates a new tensor from an instance of `TensorDLPackWrapper`
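
A rough sketch of the wrapper idea described above (hypothetical class name; the actual test uses `TensorDLPackWrapper`):

```python
import torch

class DLPackOnly:
    """Wraps a tensor and exposes only the DLPack protocol."""

    def __init__(self, t: torch.Tensor):
        self._t = t

    def __dlpack__(self, stream=None):
        return self._t.__dlpack__(stream=stream)

    def __dlpack_device__(self):
        return self._t.__dlpack_device__()

wrapped = DLPackOnly(torch.arange(6, dtype=torch.float32))
# With this PR, tensor creation should go through the DLPack machinery
# rather than an element-by-element copy.
out = torch.tensor(wrapped)
print(out)
```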
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138697
Approved by: https://github.com/ezyang

Co-authored-by: Wenzel Jakob <wenzel.jakob@epfl.ch>
2024-10-26 01:27:05 +00:00
e299193423 Bug fix: Use oneDNN for torch._int_mm CPU only when avx512_vnni is supported (#136942)
Fixes #136746

If AVX512_VNNI is not supported, overflow occurs inside oneDNN. Fall back to ref path in such case.
UT is also updated to catch the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136942
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-26 01:17:11 +00:00
a3de067975 [PyTorch] Use 128-bit vectors for ARM64 (#137426)
The correct vector length for ARM64 is 128 bits (16
bytes). We were previously using double this, apparently just because
that would be the same length as AVX2.

Differential Revision: [D63984039](https://our.internmc.facebook.com/intern/diff/D63984039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137426
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716, #138744
2024-10-26 00:20:35 +00:00
7ada814107 [c10/util] Add explicit include of <mutex> to c10/util/env.cpp (#138854)
Add explicit include of `<mutex>` to `c10/util/env.cpp` since it has usages of `std::lock_guard` which is defined in the header `<mutex>`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138854
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-26 00:16:05 +00:00
cyy
1605d4aeb8 Fix object slice (#138880)
To avoid casting Tensor to Tensorbase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138880
Approved by: https://github.com/Skylion007
2024-10-26 00:13:19 +00:00
939fc4e335 [PGNCCL] Fix P2P data corruption in non-blocking mode (#138860)
In non-blocking mode, it seems a single `ncclRecv` or `ncclSend` call can "early return" `ncclSuccess` before the kernel is fully enqueued. This causes the event record below missing the P2P the kernel, leading to data corruption.

Side note: per NCCL, it is legal to call `ncclSend` or `ncclRecv` only if there is only one P2P op. This is true whether we are in blocking or non-blocking mode.

In this fix, we use ncclGroup semantics to ensure that the kernel is enqueued for single-P2P ops. The ncclGroup call itself should introduce minimal overhead.

Added a test `test_non_blocking_p2p`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138860
Approved by: https://github.com/shuqiangzhang
2024-10-25 23:58:43 +00:00
54d13a9348 [c10d][CI] Improve world size setting in some tests (#138846)
Following change in #137161 , bumping world size for some test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846
Approved by: https://github.com/fduwjj
2024-10-25 23:02:17 +00:00
a57e418c1f [PGNCCL] Use ncclSend and ncclRecv (#138875)
Stop routing to `torch::cuda::nccl`. Use native `ncclSend` and `ncclRecv` APIs instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138875
Approved by: https://github.com/shuqiangzhang
2024-10-25 22:17:10 +00:00
4d92d6e604 [Inductor][ROCm][CK] Enable lowering conv2d instances in CK Inductor backend (#138643)
Set PYTORCH_MIOPEN_SUGGEST_NHWC environment variable to force output layout to channels-last.

This way, the channels-last CK instances will be added to benchmark choices in max autotune

# Testing
```
pytest test/inductor/test_ck_backend.py -k conv2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138643
Approved by: https://github.com/chenyang78
2024-10-25 22:11:44 +00:00
36b7135c6f Revert "[fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)"
This reverts commit 6cadf616aeb612f3c866b734268919ad1616ffaf.

Reverted https://github.com/pytorch/pytorch/pull/138681 on behalf of https://github.com/jeanschmidt due to Introduced regressions on linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138681#issuecomment-2438945493))
2024-10-25 22:07:30 +00:00
14b8028c81 [Pytorch][ATEN] Enable FP8 NCCL in Pytorch ATEN (#138776)
Summary: Enable FP8 NCCL in Pytorch ATEN to unblock FP8 collective communication such as FP8 all-to-all

Test Plan: CI & D64374424

Differential Revision: D64866426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138776
Approved by: https://github.com/eqy, https://github.com/jianyuh
2024-10-25 21:56:47 +00:00
86b45bde19 [pt2] Add logger logging for remote fx graph cache get + put (#138164)
Summary: Capture the timing for the remote fx graph cache get and put operations and add them to the logger logging.

Test Plan:
1) Landed D64483593 and waited for logger actualization.
2) Ran test script on devserver: `buck2 run mode/opt scripts/slarsen/torch_compile_model:run`
3) Queried dynamo_compile/sandbox:
```
(pytorch-3.10_4) devvm2296:~/local/pytorch-3.10_4  $ scuba -e="select time,co_filename,remote_fx_graph_cache_get_time_s,remote_fx_graph_cache_put_time_s from \`dynamo_compile/sandbox\` where remote_fx_graph_cache_put_time_s is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
|    time    |                                                                                    co_filename                                                                                    | remote_fx_graph_cache_get_time_s | remote_fx_graph_cache_put_time_s |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
| 1729136266 | null                                                                                                                                                                              |              0.05652284622192383 |               0.9691152572631836 |
| 1729136263 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/289bb46b326874c6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |               0.8298435211181641 |              0.18642282485961914 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+----------------------------------+
```

Reviewed By: oulgen

Differential Revision: D64484025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138164
Approved by: https://github.com/jamesjwu, https://github.com/ezyang
2024-10-25 21:30:18 +00:00
78377ec130 [PT2][Optimus] Normalize Clamp to use kwargs (#138723)
Summary: The current clamp normalization does not cover torch.clamp calls whose min and max are not normalized to kwargs; as a result, the batch fusion of clamp can hit the "min and max are both empty" problem.
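
A simplified sketch of what normalizing positional min/max into kwargs could look like at the FX level (hypothetical helper, not the actual Optimus pass):

```python
import torch
from torch.fx import GraphModule

def normalize_clamp_to_kwargs(gm: GraphModule) -> GraphModule:
    # Move positional min/max arguments of torch.clamp into kwargs so that
    # later batch-fusion passes can rely on kwargs being populated.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.clamp:
            args, kwargs = list(node.args), dict(node.kwargs)
            if len(args) > 1 and "min" not in kwargs:
                kwargs["min"] = args[1]
            if len(args) > 2 and "max" not in kwargs:
                kwargs["max"] = args[2]
            node.args = (args[0],)
            node.kwargs = kwargs
    gm.recompile()
    return gm
```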

Test Plan:
```
buck2 run mode/opt servicelab/ai_ml/auto_tune:local_model_pt2 -- --flow_id 654509735 --test_mode split
```

GPU type: NVIDIA PG509-210
=============Print full analysis for offsite_cvr_oba_optout_dedicated_model================
| Metric             | Value            |
|:-------------------|:-----------------|
| GPU type           | A100             |
| Batch size         | 10               |
| Latency            | 227.13 ms        |
| Model size         | 2322763344 bytes |
| Flops/example      | 1136.52 G        |
| TFLOPS             | 50.04            |
| MFU                | 16.04%           |
| Activation/example | 2722.49 MB       |
I1023 112249.043 local_model_with_pt2.py:25] benchmark results [('batch_size', 10), ('latency_ms', 22712), ('model_size_bytes', 2322763344), ('flops_per_example', 113652), ('tflops_g', 5003), ('mfu', 1603), ('activation_per_example_mb', 272249)

Differential Revision: D64848369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138723
Approved by: https://github.com/jackiexu1992
2024-10-25 21:05:39 +00:00
a874ec85e8 [Functorch] Fix devices Parameter Type in benchmark_utilization Function (#138774)
Summary:
Issue described in https://github.com/pytorch/pytorch/issues/136697

Original user does not have CLA privileges so this is my commandeer

Test Plan: OSS CI

Differential Revision: D64872833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138774
Approved by: https://github.com/davidberard98
2024-10-25 19:25:18 +00:00
3a0c361899 Remove preserve ops (#138371)
Summary:
CI
#buildall

Test Plan: CI

Reviewed By: StellarrZ

Differential Revision: D64151426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138371
Approved by: https://github.com/bdhirsh
2024-10-25 19:13:55 +00:00
b988388bac Add CUDA 12.6 to Linux CD docker images (#138563)
Reference https://github.com/pytorch/builder/pull/1003/files
Related to #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138563
Approved by: https://github.com/malfet
2024-10-25 19:10:07 +00:00
846b4e614b [TF32][cuDNN][Convolution] Add some missing TF32 decorators (#138768)
Newer cuDNN versions seem to be able to dispatch to cuDNN kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138768
Approved by: https://github.com/Skylion007
2024-10-25 19:03:42 +00:00
c6bb9b53f4 [scan] better error handling and remove redundant tests (#137967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137967
Approved by: https://github.com/zou3519
2024-10-25 19:01:25 +00:00
7d283309d8 Avoid calling realize() on LazyVariableTracker on reconstruct (#138495)
Fixes: https://github.com/pytorch/pytorch/issues/137686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138495
Approved by: https://github.com/zou3519
2024-10-25 19:01:15 +00:00
392221b390 Made DDPOptimizer work with HOPs (#138787)
Fixes https://github.com/pytorch/pytorch/issues/137481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138787
Approved by: https://github.com/yf225
ghstack dependencies: #138733, #138794, #138881
2024-10-25 18:59:01 +00:00
07dbc42881 Stop force realizing to prevent recursion errors unless it's much bigger (#138881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138881
Approved by: https://github.com/shunting314
ghstack dependencies: #138733, #138794
2024-10-25 18:59:01 +00:00
de54246c42 Recommend pip install -r requirements in the unit testing guidelines. (#137797)
Somehow `make setup-env`, as recommended in CONTRIBUTING.MD, is not installing all the dependencies required to run tests.

This makes it slightly clearer when running tests.

Specific repro on my side was
```
git checkout e7679663070e3149ae7cd6e28d376d86852ce9e4
make setup-env
conda activate pytorch-deps
python test/test_utils_internal.py
```

which is what my reading of the instructions implies should be correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137797
Approved by: https://github.com/albanD
2024-10-25 18:47:44 +00:00
03f9136870 Add wait counter on cuda::device_synchronize (#138883)
The wait counter is typically only minute precision, but if there is a collective in the queue it will show up. We think this explains up to eight minutes of delay in some compile traces we're looking at, but the counter would definitively prove it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64944970](https://our.internmc.facebook.com/intern/diff/D64944970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138883
Approved by: https://github.com/eqy
2024-10-25 18:13:57 +00:00
dbbdfd9df5 Add pytorch.wait_counter.dynamo_compile (#138072)
I was discussing with James March how the current fx_codegen_and_compile
counter doesn't actually capture all compile time.  This one is more
accurate and corresponds closely to the existing events in dynamo_compile
table.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138072
Approved by: https://github.com/markkm
2024-10-25 18:12:34 +00:00
77587f43d2 Add one more shard for CPU pull jobs (#138894)
The first shard is close to 3.5 hours and timing out flakily in trunk now, for example https://github.com/pytorch/pytorch/actions/runs/11509141659/job/32039126506.  So, I think we could just add one more shard in the same spirit as https://github.com/pytorch/pytorch/pull/137433
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138894
Approved by: https://github.com/Skylion007
2024-10-25 18:09:50 +00:00
ba6526814a Add dtype attribute to CSEVariable (#136778)
Summary:
- This diff introduces `dtype` attribute to `TritonCSEVariable` and a dtype propagation helper function to infer dtype from input to output for each op.

- There will be a follow-up diff that uses this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.

Test Plan: CI

Differential Revision: D61815079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-10-25 18:00:30 +00:00
d0640b945b [inductor][nit] removing unnecessary else statements (#138789)
Summary: while reading through inductor template code I found a few places where else statements were driving me crazy. Fixing them as I read

Test Plan: CI

Differential Revision: D64882385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138789
Approved by: https://github.com/aakhundov
2024-10-25 17:59:25 +00:00
69af467d4f Eliminate c10::value_or_else (#138818)
Test Plan: Sandcastle

Differential Revision: D64857418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138818
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-10-25 17:59:01 +00:00
a6287b5c27 Fixing issue in move pass for copying Parameter (#138855)
Summary: Fixing bug for Parameter copy during move pass of exported graph.

Test Plan:
UT

runs on APS models.

Differential Revision: D64876951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138855
Approved by: https://github.com/pianpwk

Co-authored-by: Gagan Jain <gaganj@meta.com>
2024-10-25 17:57:27 +00:00
375d71cc5a plumb is_export flag to FunctionalTensorMode in analysis pass (#138836)
Summary: there is an issue with functionalization V2 in export. This is a quick fix that plumbs `is_export` through to `run_functionalized_fw_and_collect_metadata`.

Test Plan: CI

Differential Revision: D64915263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138836
Approved by: https://github.com/tugsbayasgalan
2024-10-25 17:56:14 +00:00
3d0aa6f049 Update readme with std::optional (#138914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138914
Approved by: https://github.com/malfet
2024-10-25 17:40:58 +00:00
6f66398ab8 Revert "[aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)"
This reverts commit 245026af2d2f26c74993cb90e01bddbd627c6797.

Reverted https://github.com/pytorch/pytorch/pull/138731 on behalf of https://github.com/jeanschmidt due to introduced regressions on linux-focal-cuda12.1-py3.10-gcc9-bazel-test ([comment](https://github.com/pytorch/pytorch/pull/138731#issuecomment-2438417669))
2024-10-25 17:37:32 +00:00
447bb72822 Revert "[c10d][CI] Improve world size setting in some tests (#138846)"
This reverts commit 9c35e33d9b02e384f0d504f942a916e9e849b163.

Reverted https://github.com/pytorch/pytorch/pull/138846 on behalf of https://github.com/jeanschmidt due to introduced breaks in linux-focal-cuda11.8-py3.10-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/138846#issuecomment-2438415315))
2024-10-25 17:35:27 +00:00
2980aed65b [inductor][memory] restructuring memory.py and turn on the flag (#137205)
Addressing additional comments given in PR https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137205
Approved by: https://github.com/eellison
2024-10-25 17:19:34 +00:00
817b4988e4 [dynamo][config-cleanup] Remove enable_cpp_guard_manager=False codepath (#138512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138512
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-25 16:41:55 +00:00
fe18a221eb Add debug backend that applies CrossRefFakeMode, use in compiler bisector (#138651)
I was debugging an internal NE divergence for a while that turned out to be caused by a bad meta. I added an explicit config option and an explicit backend `aot_eager_decomp_partition_crossref` to enable the FakeCrossRefMode when running the graph. I added an explicit backend because I suspect it will be useful for internal models, but I'm also happy to leave it as a config option.

It will only test ops that have a meta, to avoid the memory overhead of hitting the fallback path and running in eager.
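
A minimal usage sketch with the backend named above (assuming the backend is registered under exactly that name):

```python
import torch

def f(x):
    return x.sin() + 1

# Cross-reference fake-tensor (meta) results against eager to surface bad metas.
g = torch.compile(f, backend="aot_eager_decomp_partition_crossref")
print(g(torch.randn(4)))
```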

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138651
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-25 15:58:36 +00:00
6cadf616ae [fx graph cache] FxGraphPickler: Remove hack to stabilize device string hashes (#138681)
Summary: With the fast pickling mode, we don't need the custom hack for replacing device strings in tensors. This was previously needed because, e.g., two strings "cuda" will pickle differently if they are the same object vs. not.

Test Plan:
The new test fails with fast mode commented out, but succeeds when enabled:
`python test/inductor/test_codecache.py -k test_stable_strings`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138681
Approved by: https://github.com/oulgen
2024-10-25 15:52:58 +00:00
78a0158540 [Dynamo] Improve args in higher_order_ops [1/N] (#138799)
Replaced hard-coded argument indices with meaningful variable names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138799
Approved by: https://github.com/zou3519
2024-10-25 13:55:41 +00:00
45b8155a07 [CI] Run periodic jobs only on pytorch/pytorch repo (#138874)
GitHub by default tries not to run periodic jobs on forks, see https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/disabling-and-enabling-a-workflow
But there is a special test repo called `pytorch/canary` that will run those workflows for the next 60 days, which is a waste of resources.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138874
Approved by: https://github.com/huydhn
2024-10-25 13:42:37 +00:00
245026af2d [aotd] Unwrap unseen AsyncCollectiveTensor tangents (#138731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138731
Approved by: https://github.com/bdhirsh
2024-10-25 12:35:52 +00:00
2c82f73647 [Pipelining] Clean up hooks in zero bubble (#138720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138720
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504, #138735
2024-10-25 12:06:54 +00:00
12755f45ff [Pipelining] small comments and variable renames (#138735)
Addressing the comments in previous PRs to update the variable names and add additional code comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138735
Approved by: https://github.com/wconstab
ghstack dependencies: #138119, #138504
2024-10-25 12:06:54 +00:00
9c35e33d9b [c10d][CI] Improve world size setting in some tests (#138846)
Following change in #137161 , bumping world size for some test suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138846
Approved by: https://github.com/fduwjj
2024-10-25 10:40:21 +00:00
a1175e3437 [BE] Strides are always non-negative, remove pointless test (#138784)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138784
Approved by: https://github.com/Chillee
2024-10-25 10:39:32 +00:00
22d2e2d9a0 Set RUNPATH so installed tests can find the required shared libraries (#136627)
This change fixes the RUNPATH of installed c++ tests so that the linker can find the shared libraries they depend on.

For example, currently:
```bash
venv/lib/python3.10/site-packages/torch $ ./bin/test_lazy
./bin/test_lazy: error while loading shared libraries: libtorch.so: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136627
Approved by: https://github.com/malfet
2024-10-25 09:38:08 +00:00
86d4b7d60b [FX][export][dynamo] use tuple instead of list in normalized args_spec (#138212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138212
Approved by: https://github.com/jansel
2024-10-25 06:43:55 +00:00
ce631939f0 [Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692
Approved by: https://github.com/ezyang
2024-10-25 05:32:38 +00:00
b999daf7a9 Add sets to list of safe objects to de-serialize (#138866)
Lists, dicts and tuples are already allowed; it's a bit weird to exclude set from the list of basic containers.

Test plan (in addition to unittest):
```python
torch.save({1, 2, 3}, "foo.pt")
torch.load("foo.pt", weights_only=True)
```

Fixes https://github.com/pytorch/pytorch/issues/138851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138866
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2024-10-25 05:23:08 +00:00
907f001a68 Bump onnx from 1.16.1 to 1.17.0 in /.ci/docker (#138719)
Bumps [onnx](https://github.com/onnx/onnx) from 1.16.1 to 1.17.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/onnx/onnx/releases">onnx's releases</a>.</em></p>
<blockquote>
<h2>v1.17.0</h2>
<p>ONNX v1.17.0 is now available with exciting new features! We would like to thank everyone who contributed to this release!
Please visit <a href="https://onnx.ai/">onnx.ai</a> to learn more about ONNX and associated projects.</p>
<h1>Key Updates</h1>
<h2>ai.onnx Opset 22</h2>
<ul>
<li>Update to support bfloat16:
<ul>
<li><a href="https://onnx.ai/onnx/operators/onnx__Acos.html#acos-22">Acos</a>, <a href="https://onnx.ai/onnx/operators/onnx__Acosh.html#acosh-22">Acosh</a>, <a href="https://onnx.ai/onnx/operators/onnx__Asin.html#asin-22">Asin</a>, <a href="https://onnx.ai/onnx/operators/onnx__Asinh.html#asinh-22">Asinh</a>, <a href="https://onnx.ai/onnx/operators/onnx__Atan.html#atan-22">Atan</a>, <a href="https://onnx.ai/onnx/operators/onnx__Atanh.html#atanh-22">Atanh</a>, <a href="https://onnx.ai/onnx/operators/onnx__AveragePool.html#averagepool-22">AveragePool</a>, <a href="https://onnx.ai/onnx/operators/onnx__Bernoulli.html#bernoulli-22">Bernoulli</a>, <a href="https://onnx.ai/onnx/operators/onnx__Conv.html#conv-22">Conv</a>, <a href="https://onnx.ai/onnx/operators/onnx__ConvTranspose.html#convtranspose-22">ConvTranspose</a>, <a href="https://onnx.ai/onnx/operators/onnx__Cos.html#cos-22">Cos</a>, <a href="https://onnx.ai/onnx/operators/onnx__Cosh.html#cosh-22">Cosh</a>, <a href="https://onnx.ai/onnx/operators/onnx__DeformConv.html#deformconv-22">DeformConv</a>, <a href="https://onnx.ai/onnx/operators/onnx__Det.html#det-22">Det</a>, <a href="https://onnx.ai/onnx/operators/onnx__Dropout.html#dropout-22">Dropout</a>, <a href="https://onnx.ai/onnx/operators/onnx__Elu.html#elu-22">Elu</a>, <a href="https://onnx.ai/onnx/operators/onnx__EyeLike.html#eyelike-22">EyeLike</a>, <a href="https://onnx.ai/onnx/operators/onnx__GRU.html#gru-22">GRU</a>, <a href="https://onnx.ai/onnx/operators/onnx__GlobalAveragePool.html#globalaveragepool-22">GlobalAveragePool</a>, <a href="https://onnx.ai/onnx/operators/onnx__GlobalLpPool.html#globallppool-22">GlobalLpPool</a>, <a href="https://onnx.ai/onnx/operators/onnx__GlobalMaxPool.html#globalmaxpool-22">GlobalMaxPool</a>, <a href="https://onnx.ai/onnx/operators/onnx__GridSample.html#gridsample-22">GridSample</a>, <a href="https://onnx.ai/onnx/operators/onnx__HardSigmoid.html#hardsigmoid-22">HardSigmoid</a>, <a href="https://onnx.ai/onnx/operators/onnx__HardSwish.html#hardswish-22">HardSwish</a>, <a href="https://onnx.ai/onnx/operators/onnx__InstanceNormalization.html#instancenormalization-22">InstanceNormalization</a>, <a href="https://onnx.ai/onnx/operators/onnx__LSTM.html#lstm-22">LSTM</a>, <a href="https://onnx.ai/onnx/operators/onnx__LpNormalization.html#lpnormalization-22">LpNormalization</a>, <a href="https://onnx.ai/onnx/operators/onnx__LpPool.html#lppool-22">LpPool</a>, <a href="https://onnx.ai/onnx/operators/onnx__MaxPool.html#maxpool-22">MaxPool</a>, <a href="https://onnx.ai/onnx/operators/onnx__MaxRoiPool.html#maxroipool-22">MaxRoiPool</a>, <a href="https://onnx.ai/onnx/operators/onnx__MaxUnpool.html#maxunpool-22">MaxUnpool</a>, <a href="https://onnx.ai/onnx/operators/onnx__Mish.html#mish-22">Mish</a>, <a href="https://onnx.ai/onnx/operators/onnx__Multinomial.html#multinomial-22">Multinomial</a>, <a href="https://onnx.ai/onnx/operators/onnx__NegativeLogLikelihoodLoss.html#negativeloglikelihoodloss-22">NegativeLogLikelihoodLoss</a>, <a href="https://onnx.ai/onnx/operators/onnx__RNN.html#rnn-22">RNN</a>, <a href="https://onnx.ai/onnx/operators/onnx__RandomNormal.html#randomnormal-22">RandomNormal</a>, <a href="https://onnx.ai/onnx/operators/onnx__RandomNormalLike.html#randomnormallike-22">RandomNormalLike</a>, <a href="https://onnx.ai/onnx/operators/onnx__RandomUniform.html#randomuniform-22">RandomUniform</a>, <a href="https://onnx.ai/onnx/operators/onnx__RandomUniformLike.html#randomuniformlike-22">RandomUniformLike</a>, <a 
href="https://onnx.ai/onnx/operators/onnx__RoiAlign.html#roialign-22">RoiAlign</a>, <a href="https://onnx.ai/onnx/operators/onnx__Round.html#round-22">Round</a>, <a href="https://onnx.ai/onnx/operators/onnx__Selu.html#selu-22">Selu</a>, <a href="https://onnx.ai/onnx/operators/onnx__Sin.html#sin-22">Sin</a>, <a href="https://onnx.ai/onnx/operators/onnx__Sinh.html#sinh-22">Sinh</a>, <a href="https://onnx.ai/onnx/operators/onnx__Softplus.html#softplus-22">Softplus</a>, <a href="https://onnx.ai/onnx/operators/onnx__Softsign.html#softsign-22">Softsign</a>, <a href="https://onnx.ai/onnx/operators/onnx__Tan.html#tan-22">Tan</a>, <a href="https://onnx.ai/onnx/operators/onnx__ThresholdedRelu.html#thresholdedrelu-22">ThresholdedRelu</a></li>
</ul>
</li>
</ul>
<h2>Python Changes</h2>
<ul>
<li>Support for numpy &gt;= 2.0</li>
</ul>
<h1>Bug fixes and infrastructure improvements</h1>
<ul>
<li>Fix Check URLs errors <a href="https://redirect.github.com/onnx/onnx/pull/5972">5972</a></li>
<li>Use CMAKE_PREFIX_PATH in finding libprotobuf <a href="https://redirect.github.com/onnx/onnx/pull/5975">5975</a></li>
<li>Bump main VERSION_NUMBER to 1.17.0 <a href="https://redirect.github.com/onnx/onnx/pull/5968">5968</a></li>
<li>Fix source and pip tar.gz builds on s390x systems <a href="https://redirect.github.com/onnx/onnx/pull/5984">5984</a></li>
<li>Fix unique_name <a href="https://redirect.github.com/onnx/onnx/pull/5992">5992</a></li>
<li>Fix SegFault bug in shape inference <a href="https://redirect.github.com/onnx/onnx/pull/5990">5990</a></li>
<li>Fix onnx.compose when connecting subgraphs <a href="https://redirect.github.com/onnx/onnx/pull/5991">5991</a></li>
<li>Fix conversion from split 11 to split 18 <a href="https://redirect.github.com/onnx/onnx/pull/6020">6020</a></li>
<li>Update error messages for NegativeLogLikelihoodLoss inference function <a href="https://redirect.github.com/onnx/onnx/pull/6021">6021</a></li>
<li>Generalize input/output number check in shape inference <a href="https://redirect.github.com/onnx/onnx/pull/6005">6005</a></li>
<li>Replace rank inference with shape inference for Einsum op <a href="https://redirect.github.com/onnx/onnx/pull/6010">6010</a></li>
<li>build from source instruction with latest cmake change <a href="https://redirect.github.com/onnx/onnx/pull/6038">6038</a></li>
<li>Handle OneHot's depth value during shape inference <a href="https://redirect.github.com/onnx/onnx/pull/5963">5963</a></li>
<li>Not to install cmake in pyproject.toml on Windows <a href="https://redirect.github.com/onnx/onnx/pull/6045">6045</a></li>
<li>fix a skipped shape infer code <a href="https://redirect.github.com/onnx/onnx/pull/6049">6049</a></li>
<li>Include the &quot;.onnxtext&quot; extension in supported serialization format <a href="https://redirect.github.com/onnx/onnx/pull/6051">6051</a></li>
<li>Allow ReferenceEvaluator to return intermediate results <a href="https://redirect.github.com/onnx/onnx/pull/6066">6066</a></li>
<li>Fix 1 typo in numpy_helper.py <a href="https://redirect.github.com/onnx/onnx/pull/6041">6041</a></li>
<li>Remove benchmarking code <a href="https://redirect.github.com/onnx/onnx/pull/6076">6076</a></li>
<li>Prevent crash on import after GCC 8 builds <a href="https://redirect.github.com/onnx/onnx/pull/6048">6048</a></li>
<li>Check graph outputs are defined <a href="https://redirect.github.com/onnx/onnx/pull/6083">6083</a></li>
<li>Enable additional ruff rules <a href="https://redirect.github.com/onnx/onnx/pull/6032">6032</a></li>
<li>Add missing shape inference check for DequantizeLinear <a href="https://redirect.github.com/onnx/onnx/pull/6080">6080</a></li>
<li>Add bfloat16 to all relevant ops <a href="https://redirect.github.com/onnx/onnx/pull/6099">6099</a></li>
<li>fix(ci): install python dependencies with --only-binary :all: in manylinux <a href="https://redirect.github.com/onnx/onnx/pull/6120">6120</a></li>
<li>fix: install google-re2 with --only-binary option <a href="https://redirect.github.com/onnx/onnx/pull/6129">6129</a></li>
<li>Specify axis parameter for DequantizeLinear when input rank is 1 <a href="https://redirect.github.com/onnx/onnx/pull/6095">6095</a></li>
<li>Pin onnxruntime to 1.17.3 for release CIs <a href="https://redirect.github.com/onnx/onnx/pull/6143">6143</a></li>
<li>Fix INT4 TensorProto byte size is 5x larger than expected with negative values <a href="https://redirect.github.com/onnx/onnx/pull/6161">6161</a></li>
<li>Mitigate tarball directory traversal risks <a href="https://redirect.github.com/onnx/onnx/pull/6164">6164</a></li>
<li>Fix reference implementation for ScatterND with 4D tensors <a href="https://redirect.github.com/onnx/onnx/pull/6174">6174</a></li>
<li>Addition of group &gt; 1 in test and in backend for ConvTranspose <a href="https://redirect.github.com/onnx/onnx/pull/6175">6175</a></li>
<li>Support for bfloat16 for binary, unary operators in reference implementation <a href="https://redirect.github.com/onnx/onnx/pull/6166">6166</a></li>
<li>Refactor windows workflow to work on standard windows <a href="https://redirect.github.com/onnx/onnx/pull/6190">6190</a></li>
<li>Fix a few crashes while running shape inference <a href="https://redirect.github.com/onnx/onnx/pull/6195">6195</a></li>
<li>Update onnx to work with numpy&gt;=2.0 <a href="https://redirect.github.com/onnx/onnx/pull/6196">6196</a></li>
<li>Use sets to improve performance of dfs search <a href="https://redirect.github.com/onnx/onnx/pull/6213">6213</a></li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="b8baa84466"><code>b8baa84</code></a> Set version 1.17.0 for official release (<a href="https://redirect.github.com/onnx/onnx/issues/6405">#6405</a>)</li>
<li><a href="6d77b80821"><code>6d77b80</code></a> [Cherry-Pick] Fix main url checks (<a href="https://redirect.github.com/onnx/onnx/issues/6312">#6312</a>) (<a href="https://redirect.github.com/onnx/onnx/issues/6327">#6327</a>)</li>
<li><a href="174938d8b7"><code>174938d</code></a> [Cherry-Pick] Fix protobuf pkg 5.28.0 failing on Windows (<a href="https://redirect.github.com/onnx/onnx/issues/6342">#6342</a>) (<a href="https://redirect.github.com/onnx/onnx/issues/6347">#6347</a>)</li>
<li><a href="f18d5931ad"><code>f18d593</code></a> [Cherry-Pick] Remove unused variables (<a href="https://redirect.github.com/onnx/onnx/issues/6303">#6303</a>) (<a href="https://redirect.github.com/onnx/onnx/issues/6324">#6324</a>)</li>
<li><a href="c58890537f"><code>c588905</code></a> Set version in rel-1.17.0 to 1.17.0rc1 (<a href="https://redirect.github.com/onnx/onnx/issues/6317">#6317</a>)</li>
<li><a href="4392c2c9ae"><code>4392c2c</code></a> Prepare for rel-1.17.0 (<a href="https://redirect.github.com/onnx/onnx/issues/6281">#6281</a>)</li>
<li><a href="cb54169e4f"><code>cb54169</code></a> Update ort filter to 1.20.0 to skip tests known to fail with ort 1.19.0 (<a href="https://redirect.github.com/onnx/onnx/issues/6306">#6306</a>)</li>
<li><a href="99e1fd352c"><code>99e1fd3</code></a> Bump reviewdog/action-misspell from 1.21.0 to 1.23.0 (<a href="https://redirect.github.com/onnx/onnx/issues/6268">#6268</a>)</li>
<li><a href="1920565505"><code>1920565</code></a> Bump ossf/scorecard-action from 2.3.3 to 2.4.0 (<a href="https://redirect.github.com/onnx/onnx/issues/6273">#6273</a>)</li>
<li><a href="2e8f2289b9"><code>2e8f228</code></a> Bump mypy from 1.10.1 to 1.11.1 (<a href="https://redirect.github.com/onnx/onnx/issues/6275">#6275</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/onnx/onnx/compare/v1.16.1...v1.17.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=onnx&package-manager=pip&previous-version=1.16.1&new-version=1.17.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138719
Approved by: https://github.com/ezyang

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-25 03:53:25 +00:00
94e341c6a3 [user triton] fix codegen for tl.constexpr globals (#138757)
Fixes #138509

tl.constexpr globals would be codegen-ed as `constexpr()` instead of `tl.constexpr()` if they were un-annotated. This fixes the issue (and adds a test). The correct handling was already added but the corrected string was not being used in the un-annotated branch.

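A hedged sketch of the distinction, with a made-up kernel and values (illustrative only, not the actual generated source):

```python
import triton
import triton.language as tl

# An un-annotated global captured by a user-defined Triton kernel: the fixed
# codegen should emit the wrapper qualified with the tl module.
BLOCK = tl.constexpr(64)      # correct form
# BLOCK = constexpr(64)       # buggy form: bare `constexpr` is undefined in the generated module

@triton.jit
def double_kernel(x_ptr, out_ptr, n):
    idx = tl.arange(0, BLOCK)
    mask = idx < n
    tl.store(out_ptr + idx, tl.load(x_ptr + idx, mask=mask) * 2, mask=mask)
```
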
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138757
Approved by: https://github.com/oulgen
2024-10-25 03:00:42 +00:00
36c6ad71ba [tlparse] Add dynamo_graph_break_reason logging to trace_structured (#138778)
A common challenge during torch.compile enablement is to answer user's question: "where is the graph break?". This PR will help make it easier to answer by surfacing graph breaks and their corresponding user stack trace / compiler stack trace in a direct link e.g. `0_0_0/dynamo_graph_break_reason_0.txt` from tlparse index.html.

![image](https://github.com/user-attachments/assets/79cd43f5-af14-4d08-9d5b-cb47d8203851)

![image](https://github.com/user-attachments/assets/23233ee2-0d56-4526-bf9a-d22c337c4d18)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138778
Approved by: https://github.com/ezyang
2024-10-25 02:00:04 +00:00
9425c0767d Fix free symbol handling in FlexAttention (#138794)
Fixes https://github.com/pytorch/pytorch/issues/136196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138794
Approved by: https://github.com/Skylion007
ghstack dependencies: #138733
2024-10-25 01:20:42 +00:00
f737e3fe2f [inductor] Fix ReinterpretView call in TMADescriptor IR (#138759)
As a result of #137768, `ReinterpretView` call in the `TMADescriptor`
has become invalid. This leads to some TMA tests breaking in
test_triton_kernels.py. In this PR, we fix this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138759
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-10-25 00:45:44 +00:00
ed9169df98 Removed the typing information for already deleted ProcessGroupCudaP2P (#138753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138753
Approved by: https://github.com/weifengpy
2024-10-25 00:32:07 +00:00
2f4af0f4e6 [Profiler] Disable Dynamo-Sensitive Profiler Tests (#138762)
Summary: During compilation, a profiler context gets ignored so we should temporarily turn off tests that are failing due to dynamo. Once profiler integration with dynamo is introduced we can reintroduce these tests

Test Plan: Make sure CI is passing again

Differential Revision: D64867447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138762
Approved by: https://github.com/davidberard98
2024-10-25 00:25:49 +00:00
1d98a526dd preserve signatures with multiple calls + buffer mutations (#138669)
As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).

This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.

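A hedged sketch of the sharing mechanism described above (illustrative module names, not the actual export/unflatten code):

```python
import torch

shared = torch.zeros(4)        # one storage, installed as an attribute on every module involved

class Producer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.intermediate = shared          # creator of the intermediate value

    def forward(self, x):
        self.intermediate.copy_(x * 2)      # creator copies the value into its attribute
        return x

class Reader(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.intermediate = shared          # same storage, installed on the reader

    def forward(self, x):
        return x + self.intermediate        # reader reads the attribute instead of a passed value
```

Because the attributes share storage, the value produced in one submodule call is visible to later reads without being passed around explicitly.
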
Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.

Differential Revision: D64806175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
2024-10-25 00:13:25 +00:00
4c91481656 [c10d] allow sub group to be eagerly inited even if default one is not (#138665)
Summary:
Currently, eager mode is applied either to all PGs or NONE of them.
There are cases where we don't want to initialize the comms for default
PG, but we still want to initialize the comms for a sub PG. Now that a
device_id can be passed to new_group, we can support this case (see the hedged sketch below).
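
A hedged usage sketch based on the description above (argument name taken from this PR's description; check the distributed docs for the exact API):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")                  # default PG: no device_id, so comms stay lazy
sub_pg = dist.new_group(
    ranks=[0, 1],
    device_id=torch.device("cuda", 0),           # eagerly initialize only this subgroup's comm
)
```
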
Test Plan:
newly added UT

Tags:

Resolves https://github.com/pytorch/pytorch/issues/137018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665
Approved by: https://github.com/kwen2501
ghstack dependencies: #138781
2024-10-24 23:51:28 +00:00
277b32c930 fix unflatten training ir test suffix (#138840)
Test Plan: none

Differential Revision: D64917214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138840
Approved by: https://github.com/zhxchen17
2024-10-24 23:42:54 +00:00
425ce2a7ee [c10d] use a promise to delay watchdog shutdown (#138828)
Summary:
We always need to give the heartbeat monitor thread time to write out flight recorder dumps. Otherwise, the watchdog thread kills the heartbeat monitor thread before it has had time to write out the Flight Recorder logs.
This change:
1. Removes the "sleep after exception" JK. We don't need to sleep for 8 minutes.
2. Uses a promise between the watchdog thread and the heartbeat monitor thread to delay shutdown by at most one minute, giving Flight Recorder time to write out its log on timeout (see the sketch after this list).

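A minimal Python illustration of the promise/future handoff (the real change is in the C++ ProcessGroupNCCL code; names here are made up):

```python
import concurrent.futures
import threading
import time

dump_done: concurrent.futures.Future = concurrent.futures.Future()

def heartbeat_monitor():
    time.sleep(1.5)               # stand-in for writing the flight recorder dump
    dump_done.set_result(True)    # fulfil the promise once the dump is on disk

def watchdog_shutdown():
    try:
        dump_done.result(timeout=60)   # wait at most one minute instead of a fixed 8-minute sleep
    except concurrent.futures.TimeoutError:
        pass                           # dump took too long; shut down anyway

threading.Thread(target=heartbeat_monitor).start()
watchdog_shutdown()
```
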
Test Plan:
Tested on my local job and flight recorder successfully executed for the job.
https://fburl.com/mlhub/38fj5yne
The watchdog thread gives heartbeat thread time to write out the logs.

In the logs we see:
```
[trainer4]:I1023 17:39:29.755507 12592 ProcessGroupNCCL.cpp:1950] [PG ID 0 PG GUID 0(precheck) Rank 12] slept for 1647ms giving time for flight recorder dumps to finish.
```

Differential Revision: D64857928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138828
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
2024-10-24 23:42:29 +00:00
751987eed1 [pt2] improve error logs for torch.cond and aoti package (#138647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138647
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-10-24 23:38:07 +00:00
3e4ba18eb5 [aoti] fix typo in codegen_dynamic_scalar (#138760)
Summary: appears to be a typo

Test Plan: ci

Differential Revision: D64867271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138760
Approved by: https://github.com/ezyang
2024-10-24 23:16:30 +00:00
09848c892a [aot_compile] propagate ShapeEnv during lowering (#138362)
We found that `export() -> _inductor.aot_compile()` lowering, 3 different ShapeEnvs get created, leading to errors when one ShapeEnv processes expressions created by another ShapeEnv. This plumbs the 2 places where ShapeEnv creation happens, detecting the original ShapeEnv from the GraphModule example values, so the original ShapeEnv is just reused.

Differential Revision: D64613290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138362
Approved by: https://github.com/angelayi
2024-10-24 22:22:14 +00:00
51f6b946ae [torchbind] Add generic __deepcopy__ method (#137613)
Summary: Added a generic `__deepcopy__` method which will use the torchbind object's existing `__getattr__` and `__setattr__` to copy the torchbind object. This will later be used in [D64124825](https://www.internalfb.com/diff/D64124825)

Differential Revision: D64124826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137613
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-10-24 22:14:55 +00:00
282e6383c1 Add inductor cache metrics (#138603)
Each inductor event should have exactly one hit, miss, bypass etc. Add it to the inductor compile event.

Add triton_compile as a compiler phase with `dynamo_timed`. This way, we get PT2 Compile Event Logs for triton as well.

Here's what triton events look like:  {F1941513932}
And this on a cache hit(since we still redo this work):
 {F1941514350}

Inductor cache info:
 {F1941528530}

Differential Revision: [D64703392](https://our.internmc.facebook.com/intern/diff/D64703392/)

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138603
Approved by: https://github.com/oulgen
2024-10-24 22:09:34 +00:00
e78a3e260b [export] Add serdes_non_strict to tests (#138662)
Summary: We expand the tests to cover serdes_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _serdes_non_strict
```

Differential Revision: D64709285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138662
Approved by: https://github.com/avikchaudhuri
2024-10-24 21:35:32 +00:00
500b2bc781 Have as_tensor always return a float64 tensor in dynamo (#138598)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138598
Approved by: https://github.com/ezyang
ghstack dependencies: #138595
2024-10-24 20:50:28 +00:00
5b50b0a9bc remove dead code (#138690)
Fixes issue-138673: [issue](https://github.com/pytorch/pytorch/issues/138673)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138690
Approved by: https://github.com/Aidyn-A, https://github.com/colesbury
2024-10-24 20:29:24 +00:00
10a34dcd57 [PyTorch] Fix out-of-bounds array access in atomic_add_vec (#138744)
There is no guarantee that `len` here is enough for a full vector. This was causing at least one test failure on https://github.com/pytorch/pytorch/pull/137426.

Differential Revision: [D64857786](https://our.internmc.facebook.com/intern/diff/D64857786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138744
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655, #138716
2024-10-24 19:37:12 +00:00
0af7632c10 [PyTorch] Fix ASAN failures for vec_test_all_types Cast test (#138716)
The size of the destination array was too small.

Differential Revision: [D64843491](https://our.internmc.facebook.com/intern/diff/D64843491/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138716
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542, #138655
2024-10-24 19:37:12 +00:00
cbafe1e7f3 [PyTorch] Unbreak VectorizedN fmadd/fmsub/clamp (#138655)
These are ternary ops, not binary ops.

Differential Revision: [D64794253](https://our.internmc.facebook.com/intern/diff/D64794253/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138655
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486, #138542
2024-10-24 19:37:02 +00:00
ead5738ff2 [PyTorch] Fix inductor bug with unrolled vectorized prod (#138542)
This issue is one of two inductor bugs blocking land of #137426. Turned out to be simple

Differential Revision: [D64734116](https://our.internmc.facebook.com/intern/diff/D64734116/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138542
Approved by: https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #138486

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-10-24 19:36:51 +00:00
6aa673377b [PyTorch] Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486)
In this case, it looks like we expect the body to be a VecMask (unify_mask_base_type is called by where()), but we didn't make it a VecMask. Now we do.

Differential Revision: [D64702918](https://our.internmc.facebook.com/intern/diff/D64702918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138486
Approved by: https://github.com/leslie-fang-intel, https://github.com/malfet
2024-10-24 19:36:41 +00:00
239a21f37e [Inductor] don't set XBLOCK larger than xnumel (#138730)
When the fp8 dtype is involved, Inductor may set min_elem_per_thread to a positive value. This forces XBLOCK to be increased even for a small xnumel (e.g. 1). Inductor then reports an error later when sanity-checking the triton config.

The simple fix here is to just not let XBLOCK be larger than xnumel.

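A minimal sketch of the clamp (hypothetical helper name, not Inductor's actual code):

```python
def pick_xblock(xnumel: int, min_elem_per_thread: int, num_warps: int = 4) -> int:
    threads = num_warps * 32
    # a positive min_elem_per_thread pushes XBLOCK up...
    xblock = max(1, min_elem_per_thread * threads)
    # ...but never let it exceed the number of elements, e.g. xnumel == 1 -> XBLOCK == 1
    return min(xblock, xnumel)
```
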
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138730
Approved by: https://github.com/Chillee
ghstack dependencies: #136782
2024-10-24 18:31:10 +00:00
e7f1e306df Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)"
This reverts commit 362ca54f03f9bb72ba7633ed580fb788b1a8dea9.

Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/wdvr due to this change is breaking our prod training pipeline (verified with bisect) by increasing memory consumption 4x and causing OOM ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2435962833))
2024-10-24 17:46:09 +00:00
8197e4c70d Revert "[sparse] add search for optimal alg_id to torch.compile (#137427)"
This reverts commit 39bfba3f561e3125ce035de0bf90c8c7bcccd3ce.

Reverted https://github.com/pytorch/pytorch/pull/137427 on behalf of https://github.com/jcaip due to this PR breaks AO tests ([comment](https://github.com/pytorch/pytorch/pull/137427#issuecomment-2435906592))
2024-10-24 17:27:06 +00:00
5ea6777861 [subclass] Unwrap_tensor_subclasses micro optimization (#138498)
unwrap_tensor_subclasses -> get_plain_tensors

It is used at runtime. For small models this overhead is noticeable in comparison with the small compiled kernel.

1/ Removing asserts from the runtime path
2/ Removing list creation by using an optional output list to append arguments into (see the sketch below)
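
Sketch of the out-parameter pattern in point 2/ (generic illustration, not the actual `get_plain_tensors` code):

```python
def collect_leaves(node, out=None):
    out = [] if out is None else out      # caller can pass one shared output list
    if isinstance(node, (list, tuple)):
        for child in node:
            collect_leaves(child, out)    # no per-call list allocation or concatenation
    else:
        out.append(node)
    return out
```
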
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138498
Approved by: https://github.com/bdhirsh
2024-10-24 16:54:54 +00:00
fe458eef80 [c10d] fix a logic of using ncclCommSplit (#138781)
Summary:
Currently, whether split should be used depends on the size of subgroup.
It's possible that default PG is not eagerly initialized yet, but split is still
called.

This PR fixes this issue by removing split's dependency on the subgroup size.
Test Plan:
Modified UT
Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138781
Approved by: https://github.com/kwen2501
2024-10-24 16:16:35 +00:00
b021486405 Enable Windows Arm64 (#133088)
This PR enables Pytorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet

Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com>
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2024-10-24 16:10:44 +00:00
eqy
f7bb11dcc2 [cuDNN][cuDNN Frontend] Check in test for previously broken dBias check (#138725)
see https://github.com/pytorch/pytorch/issues/137347, let's try to land before https://github.com/pytorch/pytorch/pull/138709

CC @malfet @drisspg @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138725
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-24 15:33:58 +00:00
8f62832189 c10::nullopt -> std::nullopt (#138701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138701
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-10-24 15:03:32 +00:00
7e62ac51a1 [pt2] [testing] Skip inductor_freezing - test_cpp_wrapper_cuda internally (#138366)
Summary: It's been failing CI since probably forever; skip for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138366
Approved by: https://github.com/eellison
2024-10-24 14:40:13 +00:00
5c88a9f6c0 Assume that indices are non-negative in _unsafe_masked_index (#137315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137315
Approved by: https://github.com/eellison
2024-10-24 12:39:31 +00:00
0d9fb51028 Fix lru_cache where config is used (#134235)
Ensure that any use of functools.lru_cache does not prevent config from being changed after the function has already run.

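A hedged illustration of the failure mode (hypothetical config, not the Inductor code):

```python
import functools

config = {"debug": False}

@functools.lru_cache(maxsize=None)
def debug_enabled():
    return config["debug"]           # the value from the first call is cached forever

debug_enabled()                      # caches False
config["debug"] = True
assert debug_enabled() is False      # stale: the later config change is ignored

# One fix: key the cache on the config value so a change produces a fresh entry.
@functools.lru_cache(maxsize=None)
def _render(debug: bool) -> str:
    return "debug build" if debug else "release build"

def render() -> str:
    return _render(config["debug"])  # re-reads config on every call
```
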
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134235
Approved by: https://github.com/masnesral
2024-10-24 10:43:34 +00:00
e7d4de0e59 Eliminate C10_TYPENAME_CONSTEXPR (#138702)
Test Plan: Sandcastle

Differential Revision: D64833560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138702
Approved by: https://github.com/malfet
2024-10-24 10:21:01 +00:00
0efa590d43 [CI] Fix XPU CI failure (#138548)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.

# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-biased code was introduced by https://github.com/pytorch/pytorch/issues/138472; we just generalize it to `GPU_TYPE`.

# Additional Context
> Why update torch-xpu-ops commit pin here?

We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).

> What does the feature of torch-xpu-ops update?

1. Add some foreach ops, like `unary ops` and `foreach_clamp_max`, etc.;
2. Add forward and backward for some pooling ops, like `avg_pool3d` and `max_pool3d`;
3. Add some other ops, like `log_normal_`, `index_copy`, and `mode`, etc.;
4. Fix a build failure related to `C10_UNUSED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-24 07:56:26 +00:00
dbf0fa811a Remove C10_HOST_CONSTEXPR_EXCEPT_WIN_CUDA and CONSTEXPR_EXCEPT_WIN_CUDA (#138479)
BC linter suppressed due to removal of `tools/linter/adapters/constexpr_linter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138479
Approved by: https://github.com/eqy, https://github.com/malfet
2024-10-24 07:51:05 +00:00
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We did a lot of optimization for PyTorch Windows and made good progress. But some models still have a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that `ResNet50` is a typical case of this.

Let's focus on the `ResNet50`, and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found that the major kernel consuming CPU resources is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch C10 module. That helps PyTorch Windows a lot, but it does not cover `MKL`'s and `MKLDNN`'s intermediary temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL`'s and `MKLDNN`'s intermediary temporary memory.

So I discussed with the Intel MKL team and got a method to register a high-performance memory allocation API with MKL, which helps MKL boost memory performance. Please check the online documentation: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a cmake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, it is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
a94c501b84 Fixed max-autotune in FlexAttention to reset kernel options appropriately (#138733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138733
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-24 05:18:09 +00:00
cyy
2bcfbf2505 [Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465)
Follows  #137404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465
Approved by: https://github.com/ezyang
2024-10-24 04:58:49 +00:00
cyy
53e356a1c0 [2/N] Enable cppcoreguidelines-special-member-functions (#138670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138670
Approved by: https://github.com/sraikund16
2024-10-24 04:35:18 +00:00
cfdf658a91 [dynamo][modules] Support overridden __call__ on nn modules (#138619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138619
Approved by: https://github.com/williamwen42
ghstack dependencies: #138657
2024-10-24 03:49:26 +00:00
b1acd0978e [dynamo] Support range_iterator as a function input (#138657)
Fixes https://github.com/pytorch/pytorch/issues/138654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138657
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-24 03:49:26 +00:00
e5c3d7ab77 [ROCm] Improve performance of reductions on 1D and 2D tensors. (#137737)
This patch improves the performance of individual reductions on MI300X. These improvements are measured on individual sum reduction operations of varying sizes. The patch impacts the following tensor types:
- 1D tensors
- 2D tensors when reducing along dimension 0
- 2D tensors when reducing along dimension 1

Runtime is reduced by between 0 and 75%, depending on tensor shape.

The patch uses the maximum number of threads per CU and the number of CUs to control the number of threadblocks in various situations (i.e. for various reduction types and tensor dimensions).

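A hedged sketch of the occupancy-based sizing idea (made-up helper; the real logic lives in the HIP reduction kernels):

```python
def pick_num_blocks(numel: int, block_threads: int,
                    num_cus: int, max_threads_per_cu: int) -> int:
    blocks_for_work = (numel + block_threads - 1) // block_threads
    # cap at how many blocks the device can keep resident at once
    resident_blocks = num_cus * max(1, max_threads_per_cu // block_threads)
    return max(1, min(blocks_for_work, resident_blocks))
```
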
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137737
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/xw285cornell
2024-10-24 03:41:16 +00:00
d8f22a1141 [c10d] Reorder GIL checker and c++ stack trace print with comments (#138734)
We found one case when the GIL deadlock happens and then FR timeout, I am wondering if we can do the GIL check before cpp stack trace print which can lead to hang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138734
Approved by: https://github.com/c-p-i-o
2024-10-24 02:21:37 +00:00
0b9320b7c5 fx_graph_cache: Remove custom amd JK (#137501)
This split in JKs was never actually used (we just set both JKs to the same values, except when we accidentally didn't, due to being humans who make mistakes). This simplifies the overall JK structure and eventually will let us delete the duplicate JK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137501
Approved by: https://github.com/oulgen
2024-10-24 01:30:39 +00:00
32a3dbc645 [Pipelining] Free memory usage earlier in last stage (#138504)
This fix is similar to the one done in #138119, except this is an edge case for the last stage. For the last stage, we perform backward on the `loss`, which we detached in the previous PR. However, we also keep the `stage_outputs` alive because we return all the output chunks in `merge_output_chunks()` after the step is over. This still keeps the autograd graph alive, so detaching these tensors frees the memory earlier.

pre-fix:
<img width="1780" alt="image" src="https://github.com/user-attachments/assets/bb78bde7-fd5c-4eba-bfc9-f0359e20bbab">

post-fix:
<img width="1788" alt="image" src="https://github.com/user-attachments/assets/a26102d9-9db2-4fc8-946c-336b8430657c">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138504
Approved by: https://github.com/wconstab
ghstack dependencies: #138119
2024-10-24 00:44:03 +00:00
8945309c08 [Pipelining] fix extra memory usage in zero bubble (#138119)
Full debugging details in here: https://docs.google.com/document/d/1Pe_E0KWAfsJ6MCvKZ5aR28rTXX-rYLg13XxwXd6AALw/edit?usp=sharing

In zero bubble, we have two methods, `stage_backward_input` and `stage_backward_weight`. During `stage_backward_input` we compute the gradients of the input with respect to the stage outputs and also retain the autograd graph (different from 1F1B, where `retain_graph=False`). The output/loss was still being retained across the next schedule step() because we return the loss to the user and use the output in the next step. To allow autograd to free the variables in the graph, we need to detach the output/loss once autograd no longer needs it.

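A minimal sketch of the fix's shape (simplified and hypothetical; the real change is in the pipelining schedule code):

```python
import torch

def finish_microbatch(stage_outputs, loss):
    # backward has already run with retain_graph=True for zero bubble; detaching
    # here drops the last references into the retained autograd graph so it can
    # be freed now rather than surviving until the next step().
    detached_outputs = [out.detach() for out in stage_outputs]
    return detached_outputs, loss.detach()
```
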
Pre-fix:
<img width="1021" alt="image" src="https://github.com/user-attachments/assets/6c8bf469-32b1-4dac-85ff-b97991f9f0e3">

Post-fix:
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/a1875038-e80b-4dd4-84f2-38727d7792dc">

without AC (7B model on titan):
10% memory improvement

with AC (7B model on titan)
50% memory improvement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138119
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-10-24 00:44:03 +00:00
889717aabd [CI/CD] Disable split build (#138752)
See https://github.com/pytorch/pytorch/issues/138750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138752
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-10-23 22:38:30 +00:00
1b31248933 [EZ] Fix typo in test_mps.py (#138738)
s/emedding_weight/embedding_weight/

Stolen from 074766d9b4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138738
Approved by: https://github.com/atalman
2024-10-23 22:15:35 +00:00
c92459488b Fix test on windows (#138641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138641
Approved by: https://github.com/huydhn
2024-10-23 21:53:32 +00:00
dd4dd85210 [hierarchical-compilation][inductor] Support invoke_subgraph HOP (#138031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138031
Approved by: https://github.com/eellison
ghstack dependencies: #137538, #138036, #137965
2024-10-23 21:32:14 +00:00
7622ede3cd Add dump_launch_params config in triton/inductor (#137143)
Summary: Moves the checking of TORCHINDUCTOR_DUMP_LAUNCH_PARAMS into the config module to pull it out of the critical path.

Test Plan: Existing unit tests cover this env variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137143
Approved by: https://github.com/eellison
2024-10-23 21:20:46 +00:00
9eadd7434e Refactor: Move _nested_int_aware_sort top level (#138693)
I need to use it from some other places later in the PR stack

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138693
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-10-23 21:15:05 +00:00
9b77d3109b [export] fix test_unbacked_bindings_for_divisible_u_symint (#138607)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138607
Approved by: https://github.com/angelayi
2024-10-23 21:10:05 +00:00
dbd6ada8c3 Clean up a c10::optional and fix documentation (#138700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138700
Approved by: https://github.com/Skylion007
2024-10-23 20:42:28 +00:00
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
cd9c6e9408 Do not run CI on forks (#138714)
Add `if: github.repository_owner == 'pytorch'` for some jobs that were missing it

Fixes #138564
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138714
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-10-23 18:23:05 +00:00
ed313a5ca2 Introduce torch.sym_add, variadic add (#138660)
Tested internally here: https://www.internalfb.com/diff/D64057744
This is a reland after previous internal failures.
main change is
```
if min is None and max is None:
    torch._check_is_size(size)
    return
```

Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

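A hedged illustration of the cost argument using sympy directly (not the torch.sym_add implementation):

```python
import sympy

xs = sympy.symbols("x0:100")

# Chained binary adds: N-1 separate Add constructions, each O(N), so O(N^2) total.
chained = xs[0]
for x in xs[1:]:
    chained = chained + x

# Variadic: a single Add over all the arguments at once.
variadic = sympy.Add(*xs)

assert chained == variadic   # same value, built far more cheaply
```
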
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660
Approved by: https://github.com/ezyang, https://github.com/bobrenjc93
2024-10-23 17:42:41 +00:00
72ea7ba89f Generate slice.Tensor view operations instead of as_strided when split is used in the original program. (#137225)
test_recompile asserts that the changes do not add more recompilations by comparing with the eager backend.
The reason for this change is that slice can be lowered in a more efficient way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137225
Approved by: https://github.com/zou3519
2024-10-23 17:42:16 +00:00
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
c272526ea5 [SJD] [RFC] force setting last progress time (#138615)
Summary:
Currently, if watchdog + healthcheck are enabled via knobs but the watchdog is disabled via SJD config, we observe a hang when the watchdog loop attempts to open the watchdog file path. This is because the FileTimerClient that is usually set in TorchElasticWatchdog will not be set, since disabling the watchdog via SJD config bypasses the TorchElasticWatchdog initialization.

The workaround is to update the healthcheck time when calling `get_last_progress_time`.

Test Plan:

Logs show that the progress time value is being changed despite client not being set

Behavior when watchdog is enabled with SJD config is left unchanged

Differential Revision: D64733766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138615
Approved by: https://github.com/gag1jain
2024-10-23 15:29:00 +00:00
cdfe1bffd1 Revert "[PGNCCL] Use non-blocking mode by default in eager init (#138527)"
This reverts commit 8fbf866904661b16cba4c799af81121557ba9da8.

Reverted https://github.com/pytorch/pytorch/pull/138527 on behalf of https://github.com/jeanschmidt due to Seems to have introduce regressions on main, pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.g4dn.12xlarge.nvidia.gpu) checking if revert will do ([comment](https://github.com/pytorch/pytorch/pull/138527#issuecomment-2432479338))
2024-10-23 14:49:49 +00:00
2f007e5de5 Make trace log dir persist through multiple set_logs() calls (#137793)
Summary: Currently, calling `torch._logging.set_logs()` resets the log directory leading to multiple tlparse outputs. This prevents the dir from resetting after the first call.

Reviewed By: ezyang

Differential Revision: D64118047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137793
Approved by: https://github.com/ezyang
2024-10-23 14:23:03 +00:00
ecf2240243 [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel, https://github.com/huydhn
2024-10-23 14:13:49 +00:00
75c6787a16 [CI] Introduces experiment awsa100 to inductor-perf-compare.yml workflow using _runner-determinator.yml (#138204)
Adds the job `get-test-label-type` in `.github/workflows/inductor-perf-compare.yml` checking for the experiment `awsa100`.

It is then used by the job `linux-focal-cuda12_1-py3_10-gcc9-inductor-build` to define the prefix for the runners that will run the benchmark.

Those runners temporarily accept the labels `awsa100.linux.gcp.a100` and `linux.aws.a100`. This is used so we can migrate via experimentation from `linux.gcp.a100`. After successfully experimenting with those instances, we will remove those labels and update the workflows to use `linux.aws.a100` and decommission the gcp fleet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138204
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2024-10-23 13:47:26 +00:00
04103f6ae9 Eliminate c10 string_utils (#138499)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138499
Approved by: https://github.com/swolchok
2024-10-23 13:40:19 +00:00
c2d26418c3 [Quant][Inductor] expand quantization conv-binary(-unary) pattern fusion inside inductor (#138051)
### Summary
Expand quantization conv-binary(-unary) pattern fusion inside inductor to support the following two patterns:
Pattern 1:
```
    Conv(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```
Pattern 2:
```
    extra input   Conv(X)
           \   /
            Add
             |
        Optional(relu)
             |
             Y
```

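A hedged eager-mode sketch of the module shape these patterns correspond to (illustrative only, not the Inductor pattern-matcher code):

```python
import torch

class ConvAddReLU(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def forward(self, x, extra):
        y = self.conv(x) + extra      # pattern 1; pattern 2 is `extra + self.conv(x)`
        return torch.relu(y)          # the optional unary op
```
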
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138051
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2024-10-23 13:12:17 +00:00
2f1842fa83 [CD] fix xpu support packages version (#138189)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138189
Approved by: https://github.com/EikanWang, https://github.com/malfet, https://github.com/atalman
2024-10-23 12:25:43 +00:00
8fbf866904 [PGNCCL] Use non-blocking mode by default in eager init (#138527)
### Why use non-blocking mode in eager init?
For overlapping comm init and model init, etc.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling -- i.e. not passed in by the user nor set via env -- `ProcessGroupNCCL` can apply its preferred logic. And torch-level API semantics do not change whether the NCCL comm is blocking or non-blocking (that is handled within `ProcessGroupNCCL`).

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard -- too big a blast radius.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective and we will block there waiting for the comm to be ready -- the same effect as blocking init, with no "opening" compared to eager mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384
2024-10-23 08:51:54 +00:00
2d7e586c13 Fixed dead lock in execution trace (#136892)
Summary:
This DIFF fixes a deadlock issue in execution trace. ExecutionTraceObserver takes a lock in recordOperatorStart and onFunctionExit. However, inside these two functions, the input/output values are evaluated, which can acquire the Python GIL in some use cases. In this case, the lock order is: ET lock -> GIL.

One of the ads applications takes the GIL first, then calls all-gather to collect some metrics from all ranks. When ET is on, all-gather is captured by the ET observer. In this case, the lock order is: GIL -> ET lock.

That is why the deadlock happens. To fix it, I changed the ET lock scope so that the input/output evaluation is no longer inside the scope of the ET lock.

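A Python-flavored sketch of the lock-scope change (the actual fix is in the C++ ExecutionTraceObserver; this only shows the ordering idea):

```python
import threading

et_lock = threading.Lock()
records = []

def record_operator(op, evaluate_io):
    io_values = evaluate_io(op)   # may acquire the GIL or other locks, so keep it outside
    with et_lock:                 # the lock now only guards the cheap bookkeeping
        records.append((op, io_values))
```
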
Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda

Differential Revision: D63556608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136892
Approved by: https://github.com/aaronenyeshi
2024-10-23 07:53:56 +00:00
cab5f54dee [ONNX] Fix sequence handling in graph building (#138656)
Prior to this PR, op.Concat was called without its required attribute `axis`, and `val` and `arg` seem to be wrongly coded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138656
Approved by: https://github.com/justinchuby
2024-10-23 07:47:58 +00:00
5402677021 add CUDA 12.6 to conda docker image (#138417)
Adds cuda 12.6 to common installation script.
Adds cuda 12.6 to conda docker image build matrix.

fixes #138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138417
Approved by: https://github.com/cyyever, https://github.com/atalman
2024-10-23 07:30:51 +00:00
5ceef8c470 Add support for SymFloats in split_module fx pass (#138599)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138599
Approved by: https://github.com/ezyang
2024-10-23 06:56:13 +00:00
96c86758e2 Support conditionals on sym node variables in the __bool__ and __len__ case (#138595)
As discussed with @ezyang, this set of diffs extracts fixes for problems discovered while flipping `specialize_float=False` in https://github.com/pytorch/pytorch/pull/137782. Since these codepaths are exercised in existing tests, I'm going to bias towards shipping speed and put these up with the global CI as the primary test plan. These code paths are all tested via existing tests when `specialize_float=False`, and it feels a bit wonky to add more gated tests that only test behavior when this flag is True, especially since these code paths are already covered. That being said, I'm happy to add individual tests if reviewers insist or have a different POV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138595
Approved by: https://github.com/ezyang
2024-10-23 06:44:09 +00:00
72dde6e84b [ONNX] Avoid optimize onnx_dynamo-fallback (#138265)
Prior to this PR, when a model fails to be exported, we fall back to trying the legacy torchscript exporter. However, we didn't stop there: even when the model is exported with the torchscript exporter, an optimization is applied to the graph.

Ideally the optimization could also boost the performance of a model exported with the legacy torchscript exporter, but currently, for benchmarking purposes and for the fallback guarantee to users, we should keep it simple and only return the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138265
Approved by: https://github.com/xadupre, https://github.com/justinchuby
2024-10-23 04:13:32 +00:00
bb65c9b883 [PyTorch] Classify Unsupported mutated Dynamic Shapes as User Error (#137054)
Summary: We don't need an assert for unsupported dynamic-shape inputs; remove the assert and raise a user exception instead.

Differential Revision: D63661569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137054
Approved by: https://github.com/bdhirsh
2024-10-23 03:15:37 +00:00
cyy
fbd14315f9 Update ruff to 0.7.0 (#138597)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138597
Approved by: https://github.com/ezyang
2024-10-23 03:00:30 +00:00
06b5330674 [easy] Log subproc pool creation (#138642)
Summary: Request from internal to log subproc pool creation

Test Plan:
```
$ TORCH_LOGS=+torch._inductor.async_compile python ~/add.py
I1022 14:12:41.915000 444394 torch/_inductor/async_compile.py:165] Creating subprocess pool with 32 workers
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138642
Approved by: https://github.com/eellison
2024-10-23 02:41:42 +00:00
cyy
86cca3fb05 [1/N] Don't skip ASAN on some tests (#138571)
Clang 15's ASAN is new enough that it's possible to re-evaluate the disabled ASAN tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138571
Approved by: https://github.com/ezyang
2024-10-23 02:38:45 +00:00
d437df342b [tests] fix broken tests caused by AotEagerAndRecordGraphs typo (#138492)
Summary:
Name change happened in https://github.com/pytorch/pytorch/pull/138231

AttributeError: module 'torch._dynamo.testing' has no attribute 'AOTEagerAndRecordGraphs'. Did you mean: 'AotEagerAndRecordGraphs'?

Test Plan: ci

Differential Revision: D64704686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138492
Approved by: https://github.com/aakhundov
2024-10-23 02:25:21 +00:00
fee2f331ce Update torchbench.txt (#138569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138569
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-10-23 01:42:25 +00:00
f2ebf6d94a [PGNCCL] Ensure comm is ready before all accesses (#138384)
Previously we only waited for the comm to become ready after its initialization.
That's not enough. There are other NCCL APIs that can cause the comm to be InProgress, e.g. P2P calls, commSplit, commFinalize, etc.
Therefore, we now ensure the comm is ready every "next time" we need to access ncclComm.
The place to add such a gatekeeper is `getNcclComm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138384
Approved by: https://github.com/shuqiangzhang, https://github.com/fduwjj
ghstack dependencies: #137855, #138488, #138374
2024-10-23 01:36:58 +00:00
37149d032c Fix .to(cpu) for Storage (#138011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138011
Approved by: https://github.com/albanD
2024-10-23 01:31:48 +00:00
555bddbef7 [AOTI][refactor] Move use_minimal_arrayref_interface logic (#138250)
Summary: Move use_minimal_arrayref_interface specific logic from CppWrapperCpu to CppWrapperCpuArrayRef. This is a copy-on-write-style refactor, to simplify the default AOTI-generated code.

Differential Revision: [D64598715](https://our.internmc.facebook.com/intern/diff/D64598715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138250
Approved by: https://github.com/chenyang78
ghstack dependencies: #138544, #138379
2024-10-23 01:00:34 +00:00
2cee5a39ad [AOTI] Fix check_model_with_multiple_inputs in test_aot_inductor (#138379)
Summary: Add missing use_minimal_arrayref_interface setting to check_model_with_multiple_inputs.

Differential Revision: [D64635211](https://our.internmc.facebook.com/intern/diff/D64635211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138379
Approved by: https://github.com/hl475
ghstack dependencies: #138544
2024-10-23 00:54:29 +00:00
d428d81c7f Remove some pre-cpp17 stuff (#138410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138410
Approved by: https://github.com/Skylion007
2024-10-23 00:38:03 +00:00
f4b3813989 Wrap autograd and autocast ops in training IR (#138516)
Differential Revision: [D64732361](https://our.internmc.facebook.com/intern/diff/D64732361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138516
Approved by: https://github.com/yushangdi
ghstack dependencies: #138261
2024-10-23 00:37:54 +00:00
9f7b987087 Revert "[Inductor] New Triton Attrs Descriptor Fixups (#138390)"
This reverts commit 215999452eb5517213b3a31f72eb9a7e843d12a0.

Reverted https://github.com/pytorch/pytorch/pull/138390 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it still has another lint error ([comment](https://github.com/pytorch/pytorch/pull/138390#issuecomment-2430566004))
2024-10-23 00:37:28 +00:00
69f18587d6 Move test_serialize to training IR (#138261)
Differential Revision: [D64572253](https://our.internmc.facebook.com/intern/diff/D64572253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138261
Approved by: https://github.com/yushangdi
2024-10-23 00:32:32 +00:00
662d07e93e Remove parallel_and and parallel_or (#138135)
Not used, suggested by @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135
Approved by: https://github.com/ezyang
2024-10-23 00:22:22 +00:00
cyy
38d3c27849 [1/N] Enable cppcoreguidelines-special-member-functions (#137405)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137405
Approved by: https://github.com/ezyang
2024-10-23 00:16:53 +00:00
7e951c1675 [EZ][DTensor] Update DTensor readme to use the new import path (#138625)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138625
Approved by: https://github.com/XilunWu
2024-10-23 00:08:36 +00:00
3441ea7d74 [dynamo] reset compiler stance after test (#138277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138277
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-23 00:07:33 +00:00
a825667670 [executorch hash update] update the pinned executorch hash (#135287)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135287
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-22 23:40:57 +00:00
5942b29850 Disabling amp context when invoking compiler (#138624)
Fix for https://github.com/pytorch/pytorch/issues/133974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138624
Approved by: https://github.com/bdhirsh, https://github.com/drisspg
2024-10-22 23:21:55 +00:00
215999452e [Inductor] New Triton Attrs Descriptor Fixups (#138390)
Fixes additional areas where we need to use the new Triton AttrsDescriptor if it is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138390
Approved by: https://github.com/jansel
2024-10-22 23:16:05 +00:00
10f16cc7da Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 8aacbee8e0d6c03096f2ce94b70e2a8fab17ee81.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/wdvr due to this one has failing internal tests, not related to a landrace with #138398 - reverting this one ([comment](https://github.com/pytorch/pytorch/pull/136526#issuecomment-2430460176))
2024-10-22 22:53:56 +00:00
39bfba3f56 [sparse] add search for optimal alg_id to torch.compile (#137427)
Summary:

This PR adds a lowering for `torch._cslt_sparse_mm` to find the optimal
alg_id and cache it when running with `torch.compile`

Seeing speedups on both bfloat16 and float8 dtypes:
<img width="641" alt="Screenshot 2024-10-17 at 2 10 38 PM" src="https://github.com/user-attachments/assets/b928cd11-32a3-43e5-b209-8e4028896f0b">
<img width="1274" alt="Screenshot 2024-10-17 at 1 39 03 PM" src="https://github.com/user-attachments/assets/d9edd684-a8ec-46fd-b3da-2e76dbcb7bb6">

* `torch._cslt_sparse_mm_search` has been modified to return optimal
  split-k parameters as well as max alg_id.

* max_id is now available in `torch.backends.cusparselt` via
  `torch.backends.cusparselt.get_max_alg_id()`

* fixed meta registrations for float8

Test Plan:

python test/test_sparse_semi_structured.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137427
Approved by: https://github.com/cpuhrsch
2024-10-22 22:39:42 +00:00
b4cfb9c014 [EZ] Use at::detail nested namespace in Dispatch.h (#138633)
Instead of `namespace at { namespace detail {`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138633
Approved by: https://github.com/Skylion007
2024-10-22 22:13:21 +00:00
54fbd897d9 [AOTI][refactor] Clean up test_aot_inductor skip list (#138544)
Summary: Remove skips for already fixed tests. Change remaining skip to xfail so that the failure list can be more proactively maintained.

Differential Revision: [D64761257](https://our.internmc.facebook.com/intern/diff/D64761257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138544
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-22 21:32:49 +00:00
a16476b671 Add support for adding extra metadata to chromium events, log to separate columns (#138477)
This diff does a few things:

## Add metadata to events in progress
Adds the ability to add extra metadata to Chromium Events via `add_event_data`.
Metadata can only be added to chromium events that have started, but not ended (so, in progress events)
- When you add the data, the metadata is appended to the metadata when you call log_event_end().
- The metadata appears in chromium events in tlparse. It also gets logged to scuba.

## New `dynamo` chromium event
We add a new `dynamo` chromium event to the top of the stack, where we collect various metadata found in dynamo_compile. So the new order of events goes:

```
__start__
-> dynamo (dynamo compile metrics)
-> entire_frame_compile (compile.inner)
-> backend_compile (i.e. aotdispatch)
-> create_aot_dispatch_function
-> inductor_compile
-> ...
```

BackwardCompilationMetrics doesn't have any dynamo specific information (as it's mostly inductor timings). So we don't include that here.

*FAQ: Why can't we use `entire_frame_compile` as the event?*
This is mostly due to backward compatibility with `dynamo_compile`. `dynamo_compile` collects CompilationMetrics outside of `compile.compile_inner`, and uses `dynamo_timed` to grab timings from phases of the compiler, including `entire_frame_compile`. So we don't have a CompilationMetric object until after an `entire_frame_compile` event ends! Separately, `dynamo` as a name for all of dynamo compile is more descriptive than `entire_frame_compile`, imo.

## Log metadata as separate columns
(Meta only): Separately, this also changes the `metadata` column in PT2 Compile Events. Instead of logging a single metadata column in JSON, it separates the JSON into separate columns. This is much better for data analysis. Now that this table is more mature, I think logging keys to separate columns is a better system.

Differential Revision: [D64696287](https://our.internmc.facebook.com/intern/diff/D64696287/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D64696287/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138477
Approved by: https://github.com/aorenste
2024-10-22 21:17:44 +00:00
3b2b5486ea Fixes issue with torch._dynamo.assume_constant_result with global functions (#132431)
This PR fixes an issue with `torch._dynamo.assume_constant_result` causing global values to be overwritten.
Currently `torch._dynamo.assume_constant_result` saves the constant result into a global variable derived from the name of the function.  This causes that function to be overwritten in the global scope.  This PR checks that the name is unique in the global scope as well, avoiding the issue of overriding the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132431
Approved by: https://github.com/jansel
2024-10-22 21:14:26 +00:00
e3af290165 [export] Add retraceability_non_strict to tests (#138380)
Summary: We expand the tests to cover retraceability_non_strict. Currently failing tests are skipped.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test:test_export -- -r _retraceability
```

Differential Revision: D64611532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138380
Approved by: https://github.com/angelayi
2024-10-22 21:05:51 +00:00
d1be61ce4e Update copyrights to 2024 (#138638)
Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well
Fixes https://github.com/pytorch/pytorch/issues/138630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638
Approved by: https://github.com/atalman
2024-10-22 21:00:58 +00:00
dbd0a39c79 Bump webrick from 1.7.0 to 1.8.2 in /ios/TestApp (#136593)
Bumps [webrick](https://github.com/ruby/webrick) from 1.7.0 to 1.8.2.
- [Release notes](https://github.com/ruby/webrick/releases)
- [Commits](https://github.com/ruby/webrick/compare/v1.7.0...v1.8.2)

---
updated-dependencies:
- dependency-name: webrick
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-22 13:32:50 -07:00
f089d5ffef Improve input validation for NJT pointwise ops (#138602)
Before this PR, NJT would dispatch e.g. `NJT * nested_int` to `mul.Tensor`, wrongly interpreting the SymInt as a tensor and outputting garbage. This PR verifies that there are no nested ints in the list of args before dispatching for pointwise ops.

I originally tried checking that `the number of passed tensor args == the number of func schema tensor args`, but this wrongly disallows `nt * 2`, which (non-intuitively to me at least at first) dispatches via the `mul.Tensor` overload.
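A minimal sketch of the validation idea (the placement and names are assumed, not the actual torch.nested internals):

```python
import torch

def check_no_nested_ints(args):
    # A nested int is a SymInt, so rejecting SymInt args before pointwise dispatch
    # turns e.g. `njt * njt.size(1)` into a clear error instead of silent garbage.
    for a in args:
        if isinstance(a, torch.SymInt):
            raise RuntimeError("pointwise NJT ops do not accept nested int arguments")
```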
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138602
Approved by: https://github.com/soulitzer
2024-10-22 20:13:12 +00:00
cyy
1c77b13c06 [6/N] Fix extra warnings brought by clang-tidy-17 (#138572)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138572
Approved by: https://github.com/Skylion007
2024-10-22 19:46:38 +00:00
a71723bf12 [ONNX] Add complex constant support (#138279)
Transform complex python constant to float representation as well, like what we have with tensors.

PS: I don't think it's reasonable to add "complex->float" on the IR side, so I put it here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138279
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-22 19:42:59 +00:00
c7a20939b4 Remove unused enforce_cond_guards_match Dynamo feature flag. (#138589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138589
Approved by: https://github.com/clee2000
2024-10-22 19:36:01 +00:00
078dca1ce8 Aarch64 binary builds - fix passing env_file to Docker (#138588)
Aarch64 builds skipped the logic of sourcing the binary env file. As a result, the PYTORCH_EXTRA_INSTALL_REQUIREMENTS passed to Aarch64 builds did not include the triton dependency constraint. This PR makes sure Aarch64 builds follow the same path as our regular manywheel builds.

To work around this issue we had to inject triton into aarch64 builds for release 2.5, which is not ideal: https://github.com/pytorch/builder/pull/2011
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138588
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2024-10-22 19:04:19 +00:00
eqy
c0e8458aab [Flex Attention] Don't compute fill order to compute stride order just to get fill order back (#138376)
Was a bit confusing to read when working on #138354

"computer-assisted proof"
```
import random

def argsort(seq):
    # preserve original order for equal strides
    getter = seq.__getitem__
    a_r = range(len(seq))
    return list(reversed(sorted(a_r, key=getter, reverse=True)))  # noqa: C413

def stride_order2fill_order(order):
    """
    Convert stride order to fill order
    For channel last format,

    stride order = [3, 0, 2, 1] and fill order = [1, 3, 2, 0]
    """
    lookup = {pos: idx for idx, pos in enumerate(order)}
    fill_order = [lookup[i] for i in range(len(order))]
    return fill_order

def get_stride_order(seq):
    """
    Convert strides to stride order
    """
    sorted_idx: List[int] = argsort(seq)
    out = [0 for _ in range(len(seq))]
    a = sorted_idx.copy()
    for i, elem in enumerate(sorted_idx):
        out[elem] = i
    fillorder = stride_order2fill_order(out)
    assert fillorder == sorted_idx
    return out

for _ in range(1000):
    a = [0, 1, 2, 3]
    random.shuffle(a)
    get_stride_order(a)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138376
Approved by: https://github.com/drisspg
2024-10-22 18:40:39 +00:00
2dab4ccb65 [Inductor][ROCm][CK] add CK grouped conv2d fwd kernels to ROCm codegen (#137947)
Plug into lowering and end to end test in a later PR

Instance parsing companion PR https://github.com/ROCm/composable_kernel/pull/1585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137947
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-22 18:25:23 +00:00
6e4c19289c [EZ] [BE] Remove (now) unused scale config (#138511)
Final step of moving scale config files to test-infra repo.  Details in https://github.com/pytorch/test-infra/pull/5767

Scale configs are now read from test-infra.  This PR is just cleaning up stale files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138511
Approved by: https://github.com/clee2000
2024-10-22 18:08:42 +00:00
f7e36d8d6f Fix for MSVC problem on Windows Arm64 (#136765)
This PR proposes a workaround for an internal issue introduced in MSVC 14.37 for the Windows Arm64 target. It is still an ongoing problem.
The fix will be released with future versions of Visual Studio 2022, but until then the changes to cpu/vec/vec_base.h should be sufficient.
We also opened a new ticket on Visual Studio Developer Community, it can be found here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136765
Approved by: https://github.com/malfet

Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2024-10-22 18:07:58 +00:00
fc9093c3d2 Revert "Remove C10_DEPRECATED (#138406)"
This reverts commit 70ec86d7542d461ff6f01ba1a1c9a4f38637af8e.

Reverted https://github.com/pytorch/pytorch/pull/138406 on behalf of https://github.com/wdvr due to failing internal tests - see D64714374 ([comment](https://github.com/pytorch/pytorch/pull/138406#issuecomment-2429912896))
2024-10-22 18:00:41 +00:00
cc93c1e5e4 Upload artifacts during test run (#125799)
Zip and upload artifacts while run_test is running
Upgrade boto3 because I get errors about not having `botocore.vendored.six.move` if I don't
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125799
Approved by: https://github.com/huydhn
2024-10-22 16:48:57 +00:00
2e48788a35 [hierarchical-compilation][invoke_subgraph] Use tracing context to cache artifacts of dispatch keys (#137965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137965
Approved by: https://github.com/zou3519
ghstack dependencies: #137538, #138036
2024-10-22 15:33:42 +00:00
e045e8f0df [hierarchical-compilation][invoke_subgraph] Graph break on input mutation or aliasing (#138036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138036
Approved by: https://github.com/zou3519
ghstack dependencies: #137538
2024-10-22 15:33:42 +00:00
4dd4d38ca9 [hierarchical-compilation][hop] Introduce invoke_subgraph (#137538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137538
Approved by: https://github.com/zou3519
2024-10-22 15:33:34 +00:00
046f02d2de [ROCm] index_put performance improvement (#138259)
On ROCm, using a non-vectorized index_put kernel provides ~2x perf improvement over the hipified CUDA kernel.  None of the existing unit tests were exercising the large index case so a new unit test was added.

It was also noted that the scale value in the original kernel was hard-coded to 1.0 which would be a no-op, so it was removed from the simplified rocm kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138259
Approved by: https://github.com/xw285cornell, https://github.com/leitian, https://github.com/eqy
2024-10-22 15:21:43 +00:00
2827befe61 [AOTI][reland] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138541)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.

Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is that init_backend_registration had incorrectly cached the CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. A code check is also added to regex-match AOTInductorModelRunMinimalArrayrefInterface so this kind of false-passing signal won't go unnoticed.

Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei
2024-10-22 14:17:27 +00:00
bb8bc7d6b3 config: simplify most of the config handling and fix some bugs (#138377)
This PR combines a number of cleanups in one PR. If any of the specific cleanups don't seem to make sense, let me know and I can remove them.

Cleanups

- This PR adds a set of test suites for the config module code, which handles basically all the APIs and ways it is used. Please let me know if you see anything critical that is not tested that I missed. This test suite is primarily used as the regression test suite for later changes in this diff. Note that there is some dynamo specific testing of the config module, but it isn't as verbose.
- I removed all internal usage of shallow_copy_dict. Those usages could all use the deep copy, and did not depend on the reference behavior of certain config values that shallow_copy_dict allows.
- I removed shallow copy semantics for configuration with a deprecation warning. I think this requires a release note, so hopefully I did that correctly. Let me know if we want to continue to expose shallow copy value semantics, but I just can't find a case where I expect anyone would want it. It also complicated later internal changes to the API (i.e. breaking apart various layers of the config changes).
- I fixed what I believe is a bug in how hashes are calculated on configs. In particular, if you got the hash, then made a config change, and then got the hash again, it would not update the hash. @oulgen, please let me know if I'm misunderstanding this behavior and it is desired.
- I switched our multiple implementations of iterating through the dictionary to a single one. This is primarily to make later changes easier, but it also makes it clear how inconsistent our various config ignoring options are. Let me know if people would be interested in me unifying the various options for ignoring config values.
- I updated the test patcher (not the performance critical one, just the normal one), to use __setattr__ and __getattr__ to remove direct API access to the underlying config fetcher.

For release notes: not sure exactly how to communicate this, but something like
"ConfigModule.to_dict, and ConfigModule.shallow_copy_dict no longer retain their shallow copy semantics, which allowed reference values objects to be modified. If you wish to modify the config object, call load_config explicitly".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138377
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jovianjaison
2024-10-22 13:40:26 +00:00
1b61313acd Add type stub for SymInt.rsub (#138543)
Fixes https://github.com/pytorch/pytorch/issues/138478

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138543
Approved by: https://github.com/malfet
2024-10-22 13:27:32 +00:00
8c840fb921 Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626)
As in the title.

Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-10-22 09:52:25 +00:00
5a13282c75 [compiled autograd] tls access helpers (#138061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138061
Approved by: https://github.com/yf225
ghstack dependencies: #137953, #137821
2024-10-22 08:03:52 +00:00
49fa437097 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-22 08:03:52 +00:00
75259145ec [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-22 08:03:52 +00:00
60c1433041 [aoti] Cond symint input support (#138373)
If the input is a SymInt, we don't want to add the aoti_torch_assign_tensors_out call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138373
Approved by: https://github.com/larryliu0820, https://github.com/desertfire
2024-10-22 07:53:22 +00:00
51045e6251 make DimHints compatible with Dims (#138490)
Previously we'd been raising UserErrors when `Dim()` and DimHints (`Dim.AUTO/Dim.DYNAMIC`) were both specified in `dynamic_shapes`. This PR stops that, and uses `Dim()` objects to guide DimHints.

The key to this was making the `EqualityConstraint` class happy when it checks that inferred equivalence relations were specified in the original `dynamic_shapes` spec, and this introduces a `RelaxedConstraint` object to mark the hinted dimensions, so equality checks between `RelaxedConstraints` and other constraints are treated as valid.

Current behavior is that:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x - y

inputs = (torch.randn(4, 4), torch.randn(4, 4))
shapes = {
    "x": (Dim.AUTO, Dim("d1", min=3)),
    "y": (Dim("d0", max=8), Dim.DYNAMIC),
}
ep = export(Foo(), inputs, dynamic_shapes=shapes)
```

The dimensions marked `AUTO` and `DYNAMIC` will have max & min ranges of 8 & 3 respectively. Note that inferred equality between `Dim()` objects & `Dim.STATIC` will still raise errors - `Dim()` suggests not specializing to a constant.

Differential Revision: D64636101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138490
Approved by: https://github.com/avikchaudhuri
2024-10-22 07:43:48 +00:00
9a9a0abc28 [SDPA-CUDNN] Make CuDNN Attention Opt in (#138522)
# Summary
Currently we have a `cudnn_order` that says, on H100 with a new enough CuDNN backend (we ship a 9.1 version in OSS), to try running CuDNN attention first. We have already encountered a few bugs since the release of 2.5:

1. https://github.com/pytorch/pytorch/issues/138529
2. https://github.com/huggingface/diffusers/issues/9704
3. https://github.com/pytorch/pytorch/pull/138354

In light of the above we are going to make the CuDNN backend Opt-in by default.

This can be done easily with the context manager for choosing backends I.e.:
``` Python
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

```

This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).

Cc @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
2024-10-22 07:23:06 +00:00
2b4af6fa74 Mark torch.get_device as overridable at the python level (#132706)
Summary:
- add a value to `get_testing_overrides` function for `torch.get_device()`
- remove `torch.get_device()` from the `get_ignored_functions` list

Test Plan:
Existing override testing infra, which should pick up the updates to these two variables.

Closes the loop on:
https://github.com/pytorch/pytorch/pull/132567
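A hedged sketch of what this enables (assuming, as described above, that `torch.get_device` now routes through the `__torch_function__` override machinery; the class and return value here are illustrative):

```python
import torch

class LoggingTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.get_device:
            print("intercepted torch.get_device")
            return -1  # pretend everything lives on CPU, just to show interception
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(2).as_subclass(LoggingTensor)
torch.get_device(t)  # should now hit the override instead of being ignored
```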

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132706
Approved by: https://github.com/ezyang
2024-10-22 07:20:42 +00:00
84e5f34fd1 bug in unbacked_bindings for a*u0 (#138136)
Summary: we were storing a*u0 instead of u0 in unbacked_bindings / unbacked_var_to_val

Test Plan: -

Differential Revision: D64508936

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138136
Approved by: https://github.com/ezyang
2024-10-22 07:04:30 +00:00
a80b87353c [pt2] Log is_forward field to dynamo_compile scuba table (#138505)
Differential Revision: [D64711721](https://our.internmc.facebook.com/intern/diff/D64711721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138505
Approved by: https://github.com/oulgen
2024-10-22 05:50:49 +00:00
0b4a071a1d [CP] Implement AllGather based context parallelism (#132820)
Summary:

This implementation does not take advantage of the fact that, after the allgather, we could perform SDPA directly without the ring-based SDPA; however, we can overlap the communication with the first sharded KV computation. This implementation shows some performance benefit and memory saving compared to the original alltoall implementation in certain cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132820
Approved by: https://github.com/XilunWu
2024-10-22 05:25:50 +00:00
6b29d40e9b [PGNCCL] Add default value for nccl_nonblocking_timeout (#138374)
- Added default value for `nccl_nonblocking_timeout` (30 mins, previous: -1).
- Reuse C10D_CHECK_TIMEOUT in other CHECK macros

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138374
Approved by: https://github.com/eqy
ghstack dependencies: #137855, #138488
2024-10-22 05:06:18 +00:00
03c72976a5 Properly uses ref-counting for torch.cuda.use_mem_pool (#133600)
This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`.

The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting the MemPool abstraction up to the user, the MemPool object itself now needs to hold an extra reference as well.
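A hedged usage sketch based on the description above (exact constructor arguments and semantics may differ from the real API):

```python
import torch

pool = torch.cuda.MemPool()               # the MemPool object itself holds one reference
with torch.cuda.use_mem_pool(pool):       # entering the context bumps the pool's use count
    x = torch.empty(1024, device="cuda")  # served from `pool` while the context is active
# exiting drops the context's reference; `pool` keeps the underlying pool alive until it is GC'd
```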

Part of https://github.com/pytorch/pytorch/issues/124807.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-10-22 03:21:53 +00:00
89067402d4 [easy] in ROCmTemplate set kwargs when creating Buffer (#138521)
Summary: https://github.com/pytorch/pytorch/pull/137768 makes Inductor IR kw only

Test Plan: CI

Differential Revision: D64723804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138521
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-22 03:13:16 +00:00
cyy
f881094366 Use Wmissing-prototypes on torch_cuda (#136080)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136080
Approved by: https://github.com/ezyang
2024-10-22 02:04:19 +00:00
9f7c26bef3 Fix training IR bug by changing passes order (#138292)
Inserting runtime assertions causes the gm to have different names, but the graph signature was populated earlier. To avoid this kind of error in the future, I refactored these steps into a helper function.

Differential Revision: [D64576251](https://our.internmc.facebook.com/intern/diff/D64576251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138292
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #138266
2024-10-22 01:24:14 +00:00
012ff2a0aa Don't try to load cufile (#138501)
Trying to loading it caused a big issue with 2.5.0 release - https://github.com/pytorch/pytorch/issues/138324

cufile is not actually used currently by default, see #133489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138501
Approved by: https://github.com/atalman, https://github.com/mikaylagawarecki, https://github.com/malfet
2024-10-22 01:13:27 +00:00
5adc33d3b8 Training IR should preserve custom metadata (#138266)
Differential Revision: [D64576252](https://our.internmc.facebook.com/intern/diff/D64576252)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138266
Approved by: https://github.com/yushangdi
2024-10-22 01:09:56 +00:00
0a38c0ec89 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, giving 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-22 00:50:00 +00:00
3b186c5659 Revert "[AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)"
This reverts commit 1417b2cd0562e0e4d4349024ef7c731b99214890.

Reverted https://github.com/pytorch/pytorch/pull/138303 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138303#issuecomment-2427991065))
2024-10-22 00:46:48 +00:00
d7e0e1dbc4 [DeviceMesh] Use split_group to create sub_groups for nccl backend if the default pg is eagerly initialized (#138129)
Use `split_group()` to create sub_groups for nccl backend if the default pg is eagerly initialized. Otherwise, it will still go through the normal lazy init process and call `new_group()` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138129
Approved by: https://github.com/kwen2501
2024-10-22 00:00:05 +00:00
a7f49de485 Fixes issue with enums in a tuple for dynamo (#133123)
Currently, when tuple values are encountered in dynamo, they are encoded using `repr(arg)`. This causes an issue if one of the values inside the tuple cannot be properly encoded this way: if an enum is contained inside a tuple, invalid Python code will be generated.
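For example (plain Python, independent of dynamo's actual encoder), round-tripping a tuple through `repr` breaks as soon as it contains an enum member:

```python
import enum

class Color(enum.Enum):
    RED = 1

args = (1, Color.RED)
print(repr(args))   # (1, <Color.RED: 1>)  -- "<Color.RED: 1>" is not valid Python source
# eval(repr(args))  # would raise SyntaxError, so any code generated from this repr is invalid
```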

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133123
Approved by: https://github.com/jansel
2024-10-21 23:45:11 +00:00
e24871eb3c Add environment variable to force no weights_only load (#138225)
In preparation for `weights_only` flip, if users don't have access to the `torch.load` call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225
Approved by: https://github.com/albanD
2024-10-21 23:26:15 +00:00
ec4ce094b2 [Traceable FSDP2][CI] Skip more tests on rocm (#138497)
Some of the test checks don't work well with ROCm.

Fixes https://github.com/pytorch/pytorch/issues/138409.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138497
Approved by: https://github.com/fduwjj
2024-10-21 23:11:01 +00:00
77868697b7 [inductor][subgraph] Add size asserts (#138424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138424
Approved by: https://github.com/eellison
ghstack dependencies: #137555
2024-10-21 22:43:49 +00:00
853da168fc [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix the same.

Test Plan: tbd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-10-21 21:45:13 +00:00
20a2d39557 Log all failing test repros to scuba (#138394)
This has the benefit that

1) It's much easier to aggregate test failure repros into say a CSV or shell script from scuba
2) We can do analysis (e.g. take the set difference of the tests across two PRs)
3) We can get results faster at the test-level granularity instead of job-level granularity we see in the HUD/GH.

I tested this by introducing a breaking change, adding ci-scribe label and then verifying that the failed tests were logged to scuba: https://fburl.com/scuba/torch_open_source_signpost/w6qt7qr9

I then reverted the breaking change and published this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138394
Approved by: https://github.com/ezyang
2024-10-21 21:35:47 +00:00
ef52bbbf23 More appropriate socket errors and debug messages (#130347)
Fixes #128998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130347
Approved by: https://github.com/fduwjj
2024-10-21 21:28:40 +00:00
364340c7ee [Forward Fix][PGNCCL] Add define guard for NCCL_SPLIT_NOCOLOR (#138488)
Forward fix for build issue introduced by #137855:
```
In file included from fbcode/caffe2/torch/csrc/distributed/c10d/NCCLUtils.cpp:2:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp:508:21: error: use of undeclared identifier 'NCCL_SPLIT_NOCOLOR'
  508 |     int split_color{NCCL_SPLIT_NOCOLOR - 1};
      |                     ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138488
Approved by: https://github.com/fduwjj
ghstack dependencies: #137855
2024-10-21 21:14:20 +00:00
134f6cda7e Support record_stream() for NJT (#137099)
Does what it says on the tin. I believe the right behavior here is to ensure that `record_stream()` is called on all tensor components of the NJT to ensure they all live until stream computation is complete.
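A minimal sketch of that behavior (the component accessor names are assumed, not the actual NestedTensor internals):

```python
def njt_record_stream(njt, stream):
    # Forward record_stream to every dense component so all of them are kept
    # alive until the work queued on `stream` has finished.
    for component in (njt._values, njt._offsets, njt._lengths):
        if component is not None:
            component.record_stream(stream)
```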

This is an ask from torchrec as the op is used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137099
Approved by: https://github.com/ngimel
2024-10-21 21:10:42 +00:00
70ec86d754 Remove C10_DEPRECATED (#138406)
Looking in the code I see
```
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
```
But looking at the [MSVC C++ support table](https://learn.microsoft.com/en-us/cpp/overview/visual-cpp-language-conformance?view=msvc-170) I see that the `[[deprecated]]` attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 _or later_.

Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support `[[deprecated]]`.

Therefore, since we are finished deprecating old MSVCs we can deprecate `C10_DEPRECATED`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138406
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-21 20:57:27 +00:00
bb2e090b7d [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-21 20:34:49 +00:00
60081c29ec Use cuda 12.4 pytorch_extra_install_requirements as default (#138458)
Since cuda 12.4 binaries are default binaries on pypi now. The pytorch_extra_install_requirements need to use 12.4.
This would need to be cherry-picked to release 2.5 branch to avoid injecting these versions into metadata during pypi promotion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138458
Approved by: https://github.com/malfet
2024-10-21 20:16:37 +00:00
c1ead6fba3 Bugfix for passing None args to user defined Triton kernel (#138472)
add test

fewer failing tests

more tests passing

tests passing

lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138472
Approved by: https://github.com/aakhundov
2024-10-21 20:00:04 +00:00
8ad191ae21 [dynamo] Replace __str__ with __repr__ in some places (#136316)
## The problem

In a typical debugger, `repr()` is used to display variables and not `str()`.

Several classes in Dynamo have a `__str__()` method that returns useful information and a  `__repr__()` that does not. Having to call `str(x)` or `[str(i) for i in x]` in the debugger all the time is a chore.

`str()` should be ["informal, nicely printable"](https://docs.python.org/3/library/stdtypes.html#str) and `repr()` should ["attempt to return a string that would yield an object with the same value when passed to eval()](https://docs.python.org/3/library/functions.html#repr)".

## The solution

In the Python object model, if there is no `__str__` method, `__repr__`  is used instead (but not the other way around).

So renaming `__str__` to `__repr__` in a few cases where no `__repr__` method exists now should not change observable behavior, and should make debugging easier.
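A small illustration of the object-model rule this relies on:

```python
class OnlyRepr:
    def __repr__(self):
        return "OnlyRepr(...)"

class OnlyStr:
    def __str__(self):
        return "nice"

print(str(OnlyRepr()), repr(OnlyRepr()))  # OnlyRepr(...) OnlyRepr(...) -- str() falls back to __repr__
print(str(OnlyStr()))                     # nice
print(repr(OnlyStr()))                    # <__main__.OnlyStr object at 0x...> -- repr() never falls back to __str__
```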

The specific classes changed were all in `torch._dynamo.variables`:

* `builtin.BuiltinVariable`
* `constant.ConstantVariable`
* `constant.EnumVariable`
* `functions.UserMethodVariable`
* `lazy.LazyVariableTracker`
* `lazy.LazySymNodeFormatString`
* `misc.GetAttrVariable`
* `misc.NullVariable`
* `user_defined.UserDefinedObjectVariable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136316
Approved by: https://github.com/XuehaiPan, https://github.com/jansel
2024-10-21 19:50:38 +00:00
41f7d01ccf Increase Docker push timeout limit from 15 to 30m (#138487)
Some images now take more than 15 minutes to finish pushing and keep timing out, for example, https://github.com/pytorch/pytorch/actions/runs/11442231435/job/31832143440
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138487
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/ZainRizvi
2024-10-21 19:44:52 +00:00
32d4582e02 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit 16caa8c1b3a02e47b5f52d3c2d40d7931cc427dc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to checking if this will solve inductor errors ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2427565425))
2024-10-21 19:40:58 +00:00
ff2f751bfb [tools] fix nightly pull tool when the conda environment not exists (#138448)
Now, `conda env remove --name env` exits with errors if the given environment does not exist. This PR checks the existence of the environment before trying to remove it.
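A hedged sketch of such a check (illustrative only; the tool's actual implementation and environment name may differ):

```python
import subprocess

def conda_env_exists(name: str) -> bool:
    # Parse `conda env list` and look for the environment name in the first column.
    result = subprocess.run(["conda", "env", "list"], capture_output=True, text=True, check=True)
    envs = [line.split()[0] for line in result.stdout.splitlines()
            if line.strip() and not line.startswith("#")]
    return name in envs

if conda_env_exists("pytorch-deps"):  # environment name is just an example
    subprocess.run(["conda", "env", "remove", "--name", "pytorch-deps", "--yes"], check=True)
```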
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138448
Approved by: https://github.com/ezyang
2024-10-21 19:35:48 +00:00
071f6f2de8 Revert "[ROCm] Fix ADDMM hipBLASLt regression (#138267)"
This reverts commit 14a3e12985e4550440a8a1755d3418e9b02b4950.

Reverted https://github.com/pytorch/pytorch/pull/138267 on behalf of https://github.com/jeffdaily due to this PR went too far when partially reverting #137604; the env var default should be the same on ROCm and CUDA ([comment](https://github.com/pytorch/pytorch/pull/138267#issuecomment-2427550465))
2024-10-21 19:33:13 +00:00
abbd71d29d [BE][Easy] enable PYFMT for torch.fx (#138443)
Reproduce command:

```bash
ghstack checkout https://github.com/pytorch/pytorch/pull/138443
git checkout HEAD~1 torch/
lintrunner -a --take "PYFMT" --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138443
Approved by: https://github.com/ezyang
2024-10-21 19:15:49 +00:00
8231180147 [dynamo][refactor] Refactor Wrap HOP to reuse it for invoke_subgraph (#137555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137555
Approved by: https://github.com/zou3519
2024-10-21 18:26:29 +00:00
c6609ece84 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
ghstack dependencies: #137789
2024-10-21 18:17:48 +00:00
07cc4bd3e2 typing compile_fx.py (#138033)
Type annotations for compile_fx.
- Some of the stuff here is pretty complicated (functions which return functions that take functions) so I bailed on those and used `Any` just to get the rest landed.
- There are also changes to type signatures in other files which I did just to let mypy know more about the types in compile_fx.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138033
Approved by: https://github.com/Skylion007
2024-10-21 18:14:59 +00:00
81738403a2 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available; see the sketch after the test command below), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`
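A rough sketch of the `pynvml`-based check mentioned above (assumed usage; the actual test may differ):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
# On ranks other than 0, this process should not appear in the device-0 context list.
print(f"{len(procs)} process(es) currently hold a CUDA context on device 0")
pynvml.nvmlShutdown()
```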

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-21 17:52:21 +00:00
6e38c87ad0 [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.

Differential Revision: [D64412947](https://our.internmc.facebook.com/intern/diff/D64412947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-21 17:50:28 +00:00
af0bc75460 Remove deprecated alias macro(1/3) (#137556)
**Detailed Descriptions:**
- Remove AT_ERROR Macro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137556
Approved by: https://github.com/ezyang
2024-10-21 17:32:32 +00:00
16caa8c1b3 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/
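A small example of the inference difference (`TypeIs` is importable from `typing_extensions`; the function names here are illustrative):

```python
from typing_extensions import TypeIs

def is_str(x: object) -> TypeIs[str]:
    return isinstance(x, str)

def handle(x: "int | str") -> None:
    if is_str(x):
        pass  # type checkers narrow x to `str` here
    else:
        pass  # with TypeIs, x is narrowed to `int` here; with TypeGuard it would stay `int | str`
```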

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-10-21 17:20:06 +00:00
9bb327bfc6 Revert "[AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)"
This reverts commit a8b912f39d36bd2e6d204808d866439d0075f1a5.

Reverted https://github.com/pytorch/pytorch/pull/137785 on behalf of https://github.com/ezyang due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/137785#issuecomment-2427295668))
2024-10-21 17:18:56 +00:00
02dd3b8e32 [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-21 16:43:34 +00:00
1032ce6bd3 Only upload test/test-reports as artifacts (#138019)
Fixes https://github.com/pytorch/pytorch/issues/137851

This is possibly too restrictive, but I spot-checked and I don't think any of the files outside of test/test-reports are important; I can't guarantee, though, that no one was putting something elsewhere and expecting it to still be zipped.

Outputs can be seen on HUD by clicking "show artifacts".
Some examples:
Logs
<img width="293" alt="image" src="https://github.com/user-attachments/assets/9a2db9b1-0f62-4209-909b-4f56a908619d">

XMLs
<img width="234" alt="image" src="https://github.com/user-attachments/assets/a639fe38-a112-4ea5-abba-ad1d5b25bb43">

JSONs
<img width="180" alt="image" src="https://github.com/user-attachments/assets/be7a49ac-5258-4bc5-981d-3f134ebd343d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138019
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/ZainRizvi
2024-10-21 16:43:30 +00:00
0a4197490c Delay mul/pow expansion for _SympyT to enable more folding (#138235)
Instead of calling `safe_expand` right after symbolic expression construction, we invoke it in `ShapeEnv.simplify`. This enables more simplification with product form, e.g.,
```
(a + b)^2 / (a + b) --> (a + b)
```
which won't happen if we expand eagerly during product construction:
```
(a^2 + 2ab + b^2) / (a + b) --> no change
```

Fixes #136044.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138235
Approved by: https://github.com/ezyang
2024-10-21 16:38:47 +00:00
701ddf962a [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes we skip the metadata transfer; so if you have a `add(z, mm(x, y))`, then the metadata won't be transferred right now.
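A hedged sketch of the propagation (illustrative names, not the exact inductor code):

```python
def propagate_original_meta(original_node, replacement_nodes):
    # Copy selected metadata (e.g. meta["original_aten"]) from the matched node onto
    # each replacement node so later passes (like kernel naming) still see it.
    for new_node in replacement_nodes:
        for key in ("original_aten",):
            if key in original_node.meta and key not in new_node.meta:
                new_node.meta[key] = original_node.meta[key]
```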

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-21 16:33:12 +00:00
279ddfc6ee Add type check for dilation in torch.quantized_max_pool3d() (#137845)
Fixes #136716

repro:

```python
import torch

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
torch.quantized_max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1)) # crash

input = torch.randn([1, 1, 1, 1, 1])
input = torch.quantize_per_tensor(input, 0.1, 10, torch.qint32)
result = torch.nn.functional.max_pool3d(input, (1, 1, 1), (1, 1, 1), (0, 0, 0), (-3, 1, 1))  # crash
```

result:

```
RuntimeError: Expected dilation >= 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137845
Approved by: https://github.com/albanD
2024-10-21 16:15:57 +00:00
a8b912f39d [AC] Backward Pass Aware AC - adding hooks to partitioner to pass callable (#137785)
Summary: same as title. Plan is to pass a callable to the partitioner to perform custom autoAC via an ILP. This is the same as a previous diff D63714905 which was landed and then subsequently reverted by PyTorch Release Engineering because of a failing unit test (f7b8d36c28). We think the unit test is buggy, and we also fix the same.

Test Plan: tbd

Differential Revision: D64246495

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137785
Approved by: https://github.com/basilwong
2024-10-21 15:30:07 +00:00
cyy
7ec21a6f0f Enable clang-tidy on torch/csrc/api (#138437)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138437
Approved by: https://github.com/r-barnes
2024-10-21 14:22:38 +00:00
8aacbee8e0 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #138323
2024-10-21 13:51:54 +00:00
649f8117ad Add deprecated warning for lazyInitXXX API (#138323)
Detailed Descriptions:
Involved APIs are as followed:
- ``lazyInitCUDA``
- ``lazyInitHIP``
- ``lazyInitXPU``
- ``lazyInitMTIA``
- ``lazyInitPrivateUse1``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138323
Approved by: https://github.com/malfet
2024-10-21 13:51:54 +00:00
1417b2cd05 [AOTI] Fix test_index_put_with_none_index_cpu_with_stack_allocation (#138303)
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement. Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole.

Differential Revision: [D64598714](https://our.internmc.facebook.com/intern/diff/D64598714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138303
Approved by: https://github.com/hl475
2024-10-21 13:47:50 +00:00
8f3efb8797 Update slow tests (#133203)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weeekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133203
Approved by: https://github.com/pytorchbot
2024-10-21 12:00:52 +00:00
cyy
14fc6b70ea Remove torch/csrc/api/include/torch/linalg.h (#138435)
Only one place in OSS uses it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138435
Approved by: https://github.com/r-barnes
2024-10-21 07:04:27 +00:00
5f940a44af [AMD] Fix torch ck backend build with 6.2.1 (#138434)
Summary: It's complaining about missing __hip_bfloat162 definition w/o this header.

Differential Revision: D64673284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138434
Approved by: https://github.com/yaoyj11, https://github.com/houseroad
2024-10-21 06:38:38 +00:00
362ca54f03 [c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763)
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.

This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.

------

Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_work_registry`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_work_registry`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`

------

Differential Revision: [D64511994](https://our.internmc.facebook.com/intern/diff/D64511994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
2024-10-21 06:02:57 +00:00
cyy
a170ff4167 Prepare to enable ASAN on CUDA (#138404)
See which tests fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138404
Approved by: https://github.com/ezyang
2024-10-21 03:55:29 +00:00
9ad2736627 Remove extraneous C++14 comment (#138408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138408
Approved by: https://github.com/Skylion007
2024-10-21 03:54:41 +00:00
6987bfb40a Revert "[dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)"
This reverts commit 3c7d9d6c7fa565e811675be7dd84e5ef7c8ba7a0.

Reverted https://github.com/pytorch/pytorch/pull/137906 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137906#issuecomment-2425505452))
2024-10-21 03:42:38 +00:00
fb0da32377 [DeviceMesh] Small refactor to optimize DeviceMesh subgroup creation (#138117)
As `backend`, `pg_options`, and `group_desc` are the same for each mesh dimension, we don't need to get or create these args for `new_group` multiple times. This PR moves it from the inner loop of the subgroup creation (each subgroup ranks of each mesh dimension) to the outer loop (each mesh_dimension).

For example, given we have a 2 * 4 DeviceMesh, we are re-creating the variables `backend`, `pg_options`, and `group_desc` 2*4 = 8 times. After the change, we only create these variables once per mesh dimension, which is 2 times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138117
Approved by: https://github.com/kwen2501
2024-10-21 03:04:24 +00:00
cyy
a05b64a38f [5/N] Fix extra warnings brought by clang-tidy-17 (#138403)
Follows #137983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138403
Approved by: https://github.com/ezyang
2024-10-21 02:59:54 +00:00
cyy
82eb09aafd [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-21 02:58:59 +00:00
2d3455e7d9 [c10d] try fix the unstableness of test_get_future_result (#138415)
Summary:
It seems that, depending on the platform, either an NCCL error or a timeout may be raised
first on rank 0. Now we try to force the timeout by not exiting the other ranks.
Test Plan:
Tests pass locally

Tags:

Fixes https://github.com/pytorch/pytorch/issues/138397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138415
Approved by: https://github.com/kwen2501
2024-10-21 01:17:30 +00:00
cyy
e7b8a9a4c1 [5/N] Fix clang-tidy warnings in torch/csrc/api/ (#138389)
Follows #138382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138389
Approved by: https://github.com/ezyang
2024-10-21 01:12:37 +00:00
e4ad02892f Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-20 23:48:54 +00:00
4f45a052ad Fix try_solve for s1*s2 == 0 when both symbols are unknown (#137919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137919
Approved by: https://github.com/ezyang
2024-10-20 23:33:08 +00:00
09cf163ae3 Fix for mixed_mm tests failures on SM70 and lower (#138183)
This PR fixes mixed_mm tests that are failing on SM70 and lower as discussed here https://github.com/pytorch/pytorch/pull/123762#issuecomment-2406601729.

The failure occurs because some of the mixed_mm tests expect triton code to be generated, but on SM70 and lower, the generation of triton code is skipped (see https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L693). These tests will now be skipped when running on SM70 and lower. I do not have access to an SM70 GPU, so I was not able to test these changes.
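
For illustration, the skip amounts to a compute-capability guard along these lines (a hedged sketch; the exact decorator used in the test suite may differ):

```python
import unittest

import torch

def requires_newer_than_sm70(fn):
    # Hedged sketch of a capability guard; the PR's actual skip mechanism may differ.
    ok = torch.cuda.is_available() and torch.cuda.get_device_capability() > (7, 0)
    return unittest.skipUnless(ok, "Triton mixed_mm codegen is skipped on SM70 and lower")(fn)
```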

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138183
Approved by: https://github.com/ezyang
2024-10-20 21:14:31 +00:00
a1899b5a9e Revert "[Environment Variable][4/N] Use thread-safe getenv functions (#137843)"
This reverts commit 239ad73cb1c8a91f0a2de21d27af3d98f5a8dddc.

Reverted https://github.com/pytorch/pytorch/pull/137843 on behalf of https://github.com/yf225 due to Sorry for reverting your PR but I believe this PR breaks the binary builds. Example: https://ossci-raw-job-status.s3.amazonaws.com/log/31790258895, with error message: `getenv is not a member of c10::utils`, might be easier to search for `not a member of` in the log ([comment](https://github.com/pytorch/pytorch/pull/137843#issuecomment-2425192780))
2024-10-20 19:48:14 +00:00
a9f4f89cd5 [CI] Add Compiled DDP / Compiled FSDP2 / compute-comm reordering tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-20 19:38:18 +00:00
cyy
239ad73cb1 [Environment Variable][4/N] Use thread-safe getenv functions (#137843)
Follows #137328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137843
Approved by: https://github.com/ezyang
2024-10-20 13:05:04 +00:00
07fd61e106 [SDPA] Fix warning message (#138278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138278
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-20 08:00:56 +00:00
f568d48890 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after the ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, became too long https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 07:18:44 +00:00
f8303740f7 Revert "Enable git long paths checkout on Windows (#138411)"
This reverts commit 12283035f8c08cd3487bfaac25ccef7da90952ba.

Reverted https://github.com/pytorch/pytorch/pull/138411 on behalf of https://github.com/huydhn due to Opps, I forgot Windows binary build, let me revert and reland this one ([comment](https://github.com/pytorch/pytorch/pull/138411#issuecomment-2424661640))
2024-10-20 06:50:48 +00:00
12283035f8 Enable git long paths checkout on Windows (#138411)
Checking out PyTorch on Windows started to fail after the ROCm change https://github.com/pytorch/pytorch/pull/131004, in which one of the submodule paths, `third_party/composable_kernel`, became too long https://hud.pytorch.org/pr/pytorch/pytorch/131004#31778700376

According to https://github.com/actions/checkout/issues/1285, there is no fix in GHA checkout, but we can set `git config --system core.longpaths true` to enable long paths support in Git as a workaround.

### Testing

Windows checkout is ok now https://github.com/pytorch/pytorch/actions/runs/11423112351/job/31781916540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138411
Approved by: https://github.com/wdvr
2024-10-20 06:32:34 +00:00
d1027c2be6 Revert "Update sympy version constraint to 1.13.3 (#138338)"
This reverts commit d8279ad9d162b5ce71699f462d3664c3745b14f5.

Reverted https://github.com/pytorch/pytorch/pull/138338 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think a bunch of inductor tests and test_dynamic_shapes are failing in trunk after this lands d8279ad9d1 ([comment](https://github.com/pytorch/pytorch/pull/138338#issuecomment-2424487225))
2024-10-20 03:19:02 +00:00
3f3b692a00 [ROCm] CK-based GEMM (#131004)
- composable_kernel as a third_party submodule
- "ck" as a `torch.backends.cuda.preferred_linalg_library()`
- reference CK gemm implementations for float, bfloat16, and half types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131004
Approved by: https://github.com/xw285cornell, https://github.com/pruthvistony

Co-authored-by: Andres Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2024-10-20 02:57:43 +00:00
0a2407b93c [dynamo] Support omegaconf DictConfig (#138378)
Fixes https://github.com/pytorch/pytorch/issues/138224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138378
Approved by: https://github.com/jansel
ghstack dependencies: #138359
2024-10-20 02:43:17 +00:00
f892543c1f [dynamo] Support TypedDict (#138359)
Seen in vLLM.

Fixes https://github.com/pytorch/pytorch/issues/132629
Fixes https://github.com/pytorch/pytorch/issues/133613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138359
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-20 02:43:17 +00:00
cyy
1f349eed61 [4/N] Fix extra warnings brought by clang-tidy-17 (#137983)
Follows #137552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137983
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-10-20 01:02:33 +00:00
b1b7c714ed Add deprecated C10_UNUSED and C10_NODISCARD macros back (#138398)
For backwards compatibility. Disallow internal use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138398
Approved by: https://github.com/malfet
2024-10-20 00:21:19 +00:00
d8279ad9d1 Update sympy version constraint to 1.13.3 (#138338)
`sympy` was pinned to version 1.13.1 due to test failures with version 1.13.2 on Windows and macOS, as reported in https://github.com/pytorch/pytorch/pull/133235. Now that a newer version, 1.13.3, has been released, this PR aims to verify whether the test failure has been resolved and also to allow building with newer versions for packaging purposes (e.g., https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277#discussion_r1806721862).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138338
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-20 00:20:02 +00:00
14a3e12985 [ROCm] Fix ADDMM hipBLASLt regression (#138267)
Fixes #138067

A partial reversion of this PR: https://github.com/pytorch/pytorch/pull/137604

The breakage is on AMD GPUs that do not fully support hipBLASLt, e.g. gfx1100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138267
Approved by: https://github.com/malfet
2024-10-20 00:19:10 +00:00
47e80abc7a Revert "[inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)"
This reverts commit fb44658415e50b5be6a187ff3f14243c0fdf3daf.

Reverted https://github.com/pytorch/pytorch/pull/138089 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but the new test_original_aten_preserved_pad_mm test runs OOM in trunk fb44658415 ([comment](https://github.com/pytorch/pytorch/pull/138089#issuecomment-2424297269))
2024-10-19 23:55:01 +00:00
fcedf93d1e [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
2024-10-19 19:10:31 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
fb44658415 [inductor] Preserve metadata across replace_by_example and register_replacement patterns (#138089)
replace_by_example is used to implement some pattern-matching passes in inductor. Previously, replace_by_example would generate nodes with very little metadata. In particular, `meta["original_aten"]` would be lost; that meant that when generating triton kernel names, you could get empty names like `triton_tem_fused_0` if the input nodes to the fused kernel were the result of a pattern-matching pass that used replace_by_example.

This also adds metadata to register_replacement patterns, including pad_mm.

This fixes the issue by copying metadata from the original node to the replacement nodes. If there are multiple original nodes, we skip the metadata transfer; so if you have an `add(z, mm(x, y))`, then the metadata won't be transferred right now.
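
A hedged sketch of the single-original-node transfer described above (the helper name and key list are illustrative; the real pattern-matcher code is more involved):

```python
# Hedged sketch: copy selected metadata from a single original FX node to its
# replacement nodes, skipping the transfer when provenance is ambiguous.
def transfer_meta(original_nodes, replacement_nodes, keys=("original_aten",)):
    if len(original_nodes) != 1:
        return  # ambiguous provenance, e.g. add(z, mm(x, y)): skip the transfer
    src = original_nodes[0]
    for node in replacement_nodes:
        for key in keys:
            if key in src.meta and key not in node.meta:
                node.meta[key] = src.meta[key]
```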

Differential Revision: [D64480755](https://our.internmc.facebook.com/intern/diff/D64480755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138089
Approved by: https://github.com/aakhundov
2024-10-19 16:37:08 +00:00
38ea487338 Re-raise in _run_sympy_handler to reduce log spew (#138356)
Fixes: https://github.com/pytorch/pytorch/issues/138069

I tested this by running `python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_builtins_round_float_ndigits_pos_dynamic_shapes_cpu` before and after the change and verifying no more log spew.

I'm uncertain on if it makes sense to add a test for this PR. Question for reviewers: is there a standard paradigm for testing these log spew based fixed? Happy to add a test if someone can point me towards the right direction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138356
Approved by: https://github.com/ezyang
2024-10-19 16:02:45 +00:00
c0879d0c21 Fix lint
Regression caused by fddabc6e0b, which was force-merged
2024-10-19 08:33:41 -07:00
cyy
cdc9f14227 [4/N] Fix clang-tidy warnings in torch/csrc/api/ (#138382)
Follows #138328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138382
Approved by: https://github.com/ezyang
2024-10-19 13:32:51 +00:00
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
cyy
2f6a70bfea Enable more UBSAN checks (#138288)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138288
Approved by: https://github.com/ezyang
2024-10-19 13:00:26 +00:00
cyy
675e16e137 [3/N] Fix clang-tidy warnings in torch/csrc/api/ (#138328)
Follows #136998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138328
Approved by: https://github.com/ezyang
2024-10-19 07:07:39 +00:00
795255a7c8 Revert "[Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)"
This reverts commit 0c913b35aaea9ca33510239e939957ec5fe66d78.

Reverted https://github.com/pytorch/pytorch/pull/138187 on behalf of https://github.com/yf225 due to linux-focal-rocm6.2-py3.10 / test (distributed, 1, 3, linux.rocm.gpu) test_compiled_autograd_ctx failed ([comment](https://github.com/pytorch/pytorch/pull/138187#issuecomment-2423609108))
2024-10-19 06:12:47 +00:00
de16159e56 [MPS] Fix sliced cast (#138314)
This fixes an internal crash due to an invalid buffer size computation when the sliced API is used

Not sure what was the purpose of
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
As `flattenedShaped` could be computed much more easily as `[srcBuf
length]/src.element_size()`, and even if `srcBuf` is padded it's a safe thing to do.

When someone allocated a buffer to hold, say, uint8 and then view-casted it
to float16, the attempt to compute `baseShape` returned the sizes of the original
tensor in its original data type, rather than its size in the new dtype

Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
2024-10-19 05:17:09 +00:00
0c913b35aa [Traceable FSDP2] Add _compiled_autograd_enabled global state variable (#138187)
After https://github.com/pytorch/pytorch/pull/137821, we will no longer be able to call the Compiled Autograd state getter under Dynamo tracing. One solution is to cache the "Compiled Autograd enabled" state outside of compile for FSDP2, and just read from the cache when we need the check. This is implemented by this PR.

Fixes https://github.com/pytorch/pytorch/issues/138177.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138187
Approved by: https://github.com/xmfan, https://github.com/awgu
ghstack dependencies: #138245, #138174
2024-10-19 04:33:35 +00:00
8f118e53d7 [CI] Fix CompiledDDP failure when the gradient is not contiguous; Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138174)
Summary:
As title

`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` requires bf16, so needs to be run within test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501
ghstack dependencies: #138245

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-19 04:33:35 +00:00
3cfd244495 Add USE_SYSTEM_NVTX option (#138287)
## Summary

We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.

This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).

## Test Plan

The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
2024-10-19 04:26:01 +00:00
a20a17fd6f [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-19 04:12:45 +00:00
88eb15a3e3 [audio hash update] update the pinned audio hash (#138139)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138139
Approved by: https://github.com/pytorchbot
2024-10-19 04:02:21 +00:00
7d076b9e3a updated EC2 fetching of metadata to use IMDSv2 (#138286) 2024-10-18 20:58:47 -07:00
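
For reference, IMDSv2 replaces the unauthenticated metadata GET with a session-token flow; a minimal standard-library sketch of that flow (not necessarily the exact code used in the CI scripts):

```python
import urllib.request

METADATA_HOST = "http://169.254.169.254"

def imds_v2_get(path: str, ttl_seconds: int = 21600) -> str:
    # Step 1: obtain a short-lived session token via a PUT with a TTL header.
    token_req = urllib.request.Request(
        f"{METADATA_HOST}/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    token = urllib.request.urlopen(token_req, timeout=2).read().decode()
    # Step 2: fetch the metadata with the token attached.
    req = urllib.request.Request(
        f"{METADATA_HOST}/latest/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(req, timeout=2).read().decode()

# e.g. imds_v2_get("instance-type") on an EC2 host
```
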
ac7f52b301 Revert "[inductor] add a threshold for membw saving during fusion (#136782)"
This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8.

Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))
2024-10-19 03:43:42 +00:00
fecd370ea1 [c10d] Fix color value for comm split being negative (#137855)
Fixes https://github.com/pytorch/pytorch/issues/137856.

### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
    int64_t split_color{0};
```
When passing this variable to `ncclCommSplit` which accepts `int`, the value may overflow and become negative, as in #137856. But NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).

But that's not all.

### Issue 2
`split_color` is pybind'ed to python frontend. If we just change from `int64_t` to `int` in C++, pybind will complain:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]:     1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because Python `int` represents a wider range than C++ `int`. So we cannot pass hash values -- which are potentially big ints -- from Python to C++. The PR takes the hash value modulo `c_int`'s max value.
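
A hedged sketch of the clamping described above (names are illustrative, not the PR's actual helpers):

```python
import ctypes

_MAX_C_INT = 2 ** (8 * ctypes.sizeof(ctypes.c_int) - 1) - 1  # typically 2**31 - 1

def to_split_color(group_hash: int) -> int:
    # Keep the color non-negative and within C `int` range before handing it to NCCL.
    return group_hash % _MAX_C_INT
```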

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
2024-10-19 03:17:19 +00:00
542f7c8383 Eliminate C10_NODISCARD (#138336)
Test Plan: Sandcastle

Reviewed By: swolchok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336
Approved by: https://github.com/Skylion007
2024-10-19 02:54:06 +00:00
a4b6ef178c [c10d] Reorder cpp stack dump and FR dump and add log prefix to loggings (#138368)
The rationale behind this PR is to:
1. Move the dump of C++ traces after the FR dump, because the FR dump is timed (meaning it will not block forever) while dumping the C++ traces is likely to block, so we swap the order. Ideally we also want to make the C++ stacktrace dump a future wait; if we want to go down this path, we can make that happen in another PR.
2. Add a log prefix to the logs that do not have one yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138368
Approved by: https://github.com/c-p-i-o
2024-10-19 02:43:41 +00:00
ea412d5554 [AOTI] Fix a special case compile time data type codegen for sym int variables (#138106)
Summary:
This change unblocks the CFR AOTI lowering runtime error.

TL;DR:

In this model, one Triton kernel expects a scalar input dtype of i64, but gets an i32. The reason is that `auto` can infer a smaller data type if the variable passed in is, e.g., i32, thus causing a CUDA IMA.
 Original problematic kernel: `triton_poi_fused_add_ge_logical_and_logical_or_lt_46_grid_100`.

This diff manually casts all symbolic arguments to i64 at compile time for i64 Triton kernel inputs, instead of using `auto var_x = {arg}` in the cpp wrapper code.
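
Schematically, the cpp wrapper codegen decision looks like the following hedged Python sketch (illustrative only, not the actual AOTInductor codegen code):

```python
def declare_wrapper_arg(name: str, value_expr: str, is_symbolic_int: bool) -> str:
    # Hedged sketch of the codegen decision: pin the C++ type for symbolic ints
    # feeding an i64 Triton kernel instead of letting `auto` infer a narrower type.
    if is_symbolic_int:
        return f"int64_t {name} = {value_expr};"
    return f"auto {name} = {value_expr};"
```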

Test Plan:
Verified in FLB locally:

```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 TORCH_LOGS="output_code" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCH_SHOW_CPP_STACKTRACES=1 CUDA_LAUNCH_BLOCKING=1 ~/fbsource/buck-out/v2/gen/fbcode/98e643f8bb44fe9d/hpc/new/models/feed/benchmark/__feed_lower_benchmark__/feed_lower_benchmark.par --skip-eager --skip-flop-estimation --lower-backend="AOT_INDUCTOR" --sync-mode=0 --precision bf16 --output-precision bf16  --lower-presets="ifr_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change" --remove-unexpected-type-cast=False --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/924293663/0/gpu_lowering/input.merge"```

Differential Revision: D64490039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138106
Approved by: https://github.com/ColinPeppler
2024-10-19 02:30:53 +00:00
d5035f0aab fix codecache write_atomic path issue on Windows. (#138331)
Fixes #138211

The `Path.rename` function has Windows-specific behavior: it raises `FileExistsError` when the target file already exists.
This behavior does not happen on Linux, so I wrote a small reproduction script to figure out what happens.

After step-tracing the reproduction code:
```python
import os
import shutil
import sys
from pathlib import Path

_IS_WINDOWS = sys.platform == "win32"

def test_case():
    cwd = os.getcwd()
    path1 = os.path.join(cwd, "haha1.txt")
    path2 = Path(os.path.join(cwd, "haha2.txt"))

    try:
        path2.rename(path1)
    except FileExistsError as e_file_exist:
        if _IS_WINDOWS:
            # on Windows file exist is expected: https://docs.python.org/3/library/pathlib.html#pathlib.Path.rename
            shutil.copy2(path2, path1)
            os.remove(path2)
        else:
            raise e_file_exist
    except BaseException as e:
        raise e

    print("run here.")

if __name__ == "__main__":
    test_case()
```
We found that the code `path2.rename(path1)` can break down into:
1. copy file2's content to file1.
2. delete file2.

So, we can implement equivalent code on the Windows path:
```python
shutil.copy2(src=tmp_path, dst=path)
os.remove(tmp_path)
```

So we arrive at the current PR.

TODO: needs a cherry-pick to the release/2.5 branch, CC: @atalman.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138331
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-19 01:27:12 +00:00
949b6f685d Enable -Werror on s390x (#136527)
Enable -Werror on s390x

Example of original issue on s390x:
https://github.com/pytorch/pytorch/actions/runs/11014606340/job/30585632704

Most of the warnings are not specific to s390x, but rather to gcc-13 or gcc-14. To test this on s390x, an image with gcc-13 is needed. On s390x, new regressions are checked on every merge via the trunk workflow.

`-Wdangling-reference` produces either obviously false warnings or suspicious warnings, which on closer inspection look plausibly safe.

With newer gcc, `-Wredundant-move` complains about `std::move(...)` disabling copy elision. But removing `std::move(...)` makes the clang versions in use complain about copying objects when they could be moved. For now, also disable it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136527
Approved by: https://github.com/malfet
2024-10-19 01:18:42 +00:00
4a3c9400fe Update cpuinfo submodule (#138351)
To suppress error on ARM systems where PR_SVE_GET_VL is missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138351
Approved by: https://github.com/Skylion007
2024-10-19 01:12:29 +00:00
ff598f2f4d [DTensorTestbase] Add an optional eager_init flag to with_comms() to support eager init nccl communicator for DeviceMesh test case (#138108)
Add an optional `eager_init` flag to `with_comms`.
When `eager_init` is True and backend is `nccl`, we pass the `device_id` to `init_process_group()` for eager initialization.
Otherwise, `device_id` is still `None` and this goes through the normal lazy call.
Default for `eager_init` is False.
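
A usage sketch, assuming the decorator-factory form of `with_comms` (the test and class names here are hypothetical):

```python
# Hedged usage sketch of how a DeviceMesh test might opt into eager NCCL init.
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)

class DeviceMeshEagerInitTest(DTensorTestBase):
    @with_comms(eager_init=True)  # device_id is passed to init_process_group()
    def test_mesh_with_eager_init(self):
        ...
```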

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138108
Approved by: https://github.com/kwen2501
2024-10-19 01:04:55 +00:00
b3ae1b1b73 [CMake] remove duplicated cmake options for Gloo and C10D (#138318)
Just a trivial fix :P
The CMake options from lines 345 to 357 are identical to those of lines 358 to 369; remove the duplicated lines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138318
Approved by: https://github.com/janeyx99
2024-10-19 00:26:25 +00:00
6647320de2 [inductor] add a threshold for membw saving during fusion (#136782)
Fixes https://github.com/pytorch/pytorch/issues/133242. In that issue, inductor fuses 2 nodes because they access the same scalar tensor. That saving is very small (4 bytes); if we ignore it, by default we cannot fuse. But if loop ordering after fusion kicks in, we can reorder loops and fuse those 2 nodes, which gives 33% memory bandwidth savings.

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-19 00:22:43 +00:00
e8b1409dcf Revert "[user triton] typing triton_kernel_wrap.py (#138230)"
This reverts commit 2f61b69603756c1fcaef71b231e598df31e20f42.

Reverted https://github.com/pytorch/pytorch/pull/138230 on behalf of https://github.com/wdvr due to Reverting this, as it started failing tests on main ([comment](https://github.com/pytorch/pytorch/pull/138230#issuecomment-2423354596))
2024-10-18 23:12:29 +00:00
4632594546 [inductor] Move V.graph.scheduler.current_device to V.graph.current_device (#138252)
There are some places where it would be nice to use this, but the scheduler hasn't yet been created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252
Approved by: https://github.com/eellison
ghstack dependencies: #138170
2024-10-18 23:05:54 +00:00
85a6a782e5 [inductor] Generalize WorkspaceArg for graph-level semaphores (#138170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138170
Approved by: https://github.com/Chillee
2024-10-18 23:05:54 +00:00
13bcb065f5 [compiled autograd] enable some reentrant tests (#137290)
Some seem to fail due to queue_callback usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137290
Approved by: https://github.com/yf225
2024-10-18 22:25:08 +00:00
47e4045566 Revert "[pt2] Log is_forward field to dynamo_compile scuba table (#138097)"
This reverts commit 4e9273c84edafdcfff57521dde6675b967181ba8.

Reverted https://github.com/pytorch/pytorch/pull/138097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it has a land race with https://github.com/pytorch/pytorch/pull/137803 ([comment](https://github.com/pytorch/pytorch/pull/138097#issuecomment-2423297516))
2024-10-18 22:00:40 +00:00
bd7cbddfe3 [CODEOWNERS] Remove aaronenyeshi from Profiler paths (#138346)
As title, remove aaronenyeshi from Profiler paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138346
Approved by: https://github.com/sraikund16
2024-10-18 21:46:00 +00:00
c88b77af9c [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 21:39:39 +00:00
7faa1284ab [ptd][amd] call alltoallv instead of send/recv (#136368)
Summary:
as $title

AMD provides an a2av API; we should just use it instead of implementing PTD's own set of send/recv.
We should not skip 0-byte send/recv within a2av; it may lead to deadlock. See details: https://github.com/ROCm/rccl/pull/1349

Test Plan:
before:

mvai-job will timeout on all2all

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240913-1426-327e119d?job_attempt=1&version=0&env=PRODUCTION

after:

https://www.internalfb.com/mlhub/pipelines/runs/mast/fire-cenzhao-20240919-1932-ebce94e6?job_attempt=0&tab=execution_details&env=PRODUCTION

latest APS job: https://fburl.com/mlhub/vn6dj7zp

Differential Revision: D63076315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136368
Approved by: https://github.com/xw285cornell
2024-10-18 21:31:57 +00:00
5b58697cc7 [Profiler] Clang bugs in Collection [1/n] (#138296)
Summary: I have to keep bypassing issues because of these clang rules. Let's start with all of the bug-type warnings instead of the variable-name ones, because the latter will touch a lot of lines of code and can make things hard to read.

Test Plan: Format tests pass.

Differential Revision: D64411171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138296
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2024-10-18 21:06:50 +00:00
295de00908 [PT2 Compile Events] Revamp PT2 Compile/chromium event logging [1/?] (#138093)
This diff is the starting steps of https://docs.google.com/document/u/2/d/1kAEBt4AyW7HTAhXHbjoz8FBFHNyyEA2Qo2mPn7v3WUQ/edit?usp=drive_web&ouid=113555078003219714709

It implements the following changes:

- Only log spans to scuba, so no start events are ever logged
- Log events as the full event name, without "START" or "END"
- Only log to scuba major phases from chromium events. These are:
  - entire_frame_compile (dynamo)
  - backend_compile (aotdispatch)
  - inductor_compile (inductor)
  - codegen (inductor codegen)

Tlparse chromium events stay basically the same. But I implemented a few changes to clean that up as well:
- When there's a phase name available, log the phase name instead of the function name as the event name. This simplifies the trace so it doesn't have two identical rows. The fn_name is available as metadata on the chromium event, if interested.
- Log new events for pre and post grad passes. These do *not* log to scuba.

By making the phases much simpler in Scuba, with only categories for major phases of PT2 Compilation, we pave the way to add **much** more metadata and information to each individual event type. Diffs for that will come later.

**IMPLEMENTATION NOTES:**
- The logic for `log_chromium_event_internal` (which is the function that logs to Scuba) lives in chromium_events for now, but in the future as we add more metadata, it may belong independently in dynamo_timed or even outside of dynamo_timed. I haven't explored in detail what the refactor will look like. Once we start logging metadata for dynamo, aotdispatch, inductor, I suspect we will call log_pt2_compile_event directly, instead of making chromium event logger handle the pt2_compile_event logic. But that refactor is left for another PR on top of this one.

- There's an interesting gap within the AOT autograd logic between create_aot_dispatcher_function and the pre-grad passes. I'm not sure what we're spending that time doing, but I'll find out with a profile later.

Differential Revision: [D64479033](https://our.internmc.facebook.com/intern/diff/D64479033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138093
Approved by: https://github.com/ezyang
2024-10-18 20:36:08 +00:00
3c7d9d6c7f [dynamo][NFC] Remove unused method InliningInstructionTranslator.check_replace_is_safe (#137906)
This method was no longer needed after #113725; the checking logic is
now in `SideEffects.check_allowed_side_effect`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137906
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
ghstack dependencies: #137905
2024-10-18 20:20:42 +00:00
162eba2dee [dynamo] Remove mutable_local.source and index on VariableTracker rather than MutableLocalBase (#137905)
This patch addresses parts of the side-effect refactor proposed in #133027;
specifically, it does 3 things:

1. Change `SideEffects.store_attr_mutations` and `PyCodegen.tempvars`
   to index on `VariableTracker` rather than `MutableLocalBase`.
2. Remove the `source` field from `MutableSideEffects` and
   `AttributeMutation`, and use `VariableTracker.source` instead.
3. Plumb a `overridden_sources: Dict[Source, Source]` from
   `handle_aliases_for_stolen_lists` to `PyCodegen` so that we don't
   update `VariableTracker.source` in place, while still preserving what
   `handle_aliases_for_stolen_lists` needed (i.e., modifying codegen for
   certain `VariableTracker`).

(1) and (2) are merged in 1 patch because of some dependency between
a. `OutputGraph.handle_aliases_for_stolen_lists`, which iterates over
   `SideEffects.store_attr_mutations.keys()` and potentially updates
   the source field to be completely different.
b. `SideEffects.codegen_update_mutated`, which happens after the above
   and uses `cg(var.mutable_local.source)`.
where if we apply (1) only, (b) breaks, and if we apply (2) only, (a)
breaks.

(3) is needed for correctness, see comments in the PR for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137905
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-10-18 20:20:42 +00:00
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
cd1e9b0e60 [EZ] Remove canary scale config (#138361)
Removing just the LF canary scale config for now to test the changes in https://github.com/pytorch/test-infra/pull/5767

Those changes have been deployed to prod and appear to be working, but this will be the final proof that it is in fact reading the test-config version of scale-config and not the pytorch/pytorch copy.

Note: This will break the Scale config validation workflow on test-infra, but it's worth it since this test will be very short lived and that workflow only runs when someone modifies scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138361
Approved by: https://github.com/wdvr
2024-10-18 20:02:00 +00:00
1ac42b5f3e graph.py: Refine unspec variable finding (#137303)
Add an additional check that scalars wrapped to 0-D tensors by dynamo are actually 0-D.  This fixes a bug where a 1-D tensor was mistakenly converted to a scalar value rather than passed as a pointer.
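
A hedged sketch of the refined check (the function name is illustrative):

```python
import torch

def is_wrapped_scalar(value) -> bool:
    # Hedged sketch of the refined check: only a genuinely 0-D tensor may be
    # converted back to a scalar; a 1-D tensor must remain a pointer argument.
    return isinstance(value, torch.Tensor) and value.ndim == 0
```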

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137303
Approved by: https://github.com/eellison
ghstack dependencies: #135701
2024-10-18 20:00:25 +00:00
d5bb70afe3 [Pipelining] Remove unnecessary {0,1} qualifier from regex (#138271)
There should always be 1 action.  This may be an artifact from trying to
extend the regex to handle the fused SEND_F_RECV_B style actions, which
was abandoned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138271
Approved by: https://github.com/H-Huang
ghstack dependencies: #138142
2024-10-18 19:52:07 +00:00
f23e8a8923 [Pipelining] Fix/improve format_pipeline_order (#138142)
Fix an issue where the format fn modified the original data structure; avoid this.

Change from printing "None" to empty string, for cleaner visualization
of bubbles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138142
Approved by: https://github.com/H-Huang
2024-10-18 19:52:07 +00:00
d512d0e227 Always use aten.constant_pad_nd for mm padding (#137820)
Summary: From experiment, it seems like aten.constant_pad_nd has better QPS compared to torch.cat. The qps gain for ig ctr is ~10%, and ~5% for oc.

Test Plan:
```
buck2 run mode/opt -c fbcode.nvcc_arch=a100 //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/585279927/480/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```
```
buck2 run mode/opt //caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --model-path=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/588102397/1500/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend=AOT_INDUCTOR
```

Differential Revision: D64271583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137820
Approved by: https://github.com/eellison
2024-10-18 19:35:03 +00:00
2f61b69603 [user triton] typing triton_kernel_wrap.py (#138230)
Remove `# mypy: allow-untyped-defs` from triton_kernel_wrap.py, and fixed all the mypy errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138230
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-10-18 19:29:31 +00:00
1f32a1fb80 Replace torch.export default decomp table to be lazily populated (#137650)
In this PR, we implement a lazy dictionary for the export decomp behaviour, for the following reasons:
1. Custom op loading can happen after import time; as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as long as possible.

I intentionally separated out core_aten_decomp so that it does not have any custom CIA ops in this PR, to mitigate the risk of getting reverted. In the future, core_aten_decomp under torch/_decomp will exist as an alias to the official export table (torch.export.default_decompositions).
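
As an illustration of the lazy-materialization idea, here is a minimal hedged sketch of a dictionary that populates itself on first access (not the actual implementation of torch.export.default_decompositions):

```python
# Minimal hedged sketch: defer building the table until first use so that
# decomps registered for late-loaded custom ops can still be picked up.
class LazyDecompTable(dict):
    def __init__(self, materialize_fn):
        super().__init__()
        self._materialize_fn = materialize_fn
        self._materialized = False

    def _materialize(self):
        if not self._materialized:
            self._materialized = True
            self.update(self._materialize_fn())

    def __getitem__(self, key):
        self._materialize()
        return super().__getitem__(key)

    def items(self):
        self._materialize()
        return super().items()

table = LazyDecompTable(lambda: {"aten::foo": "decomp_for_foo"})  # hypothetical entries
print(table["aten::foo"])  # the table is materialized only at this point
```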

Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650
Approved by: https://github.com/justinchuby, https://github.com/bdhirsh
2024-10-18 19:28:52 +00:00
ea8ea2f33f Improve build_with_deb_info (#138290)
Skip over commands that do not have an output file specified.

Recently I've noticed that `generate_torch_version.py` started to run on every rebuild, and this results in a failed plan for deb info rebuilds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138290
Approved by: https://github.com/Skylion007
2024-10-18 18:50:12 +00:00
4e9273c84e [pt2] Log is_forward field to dynamo_compile scuba table (#138097)
Summary: ^^

Test Plan:
Ran a test script out of fbcode: D64350202. Then:

```
(pytorch-3.10_4) devvm2296:~/fbcode  $ scuba -e="select time,co_filename,is_forward from \`dynamo_compile/sandbox\` where is_forward is not null"
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|    time    |                                                                                    co_filename                                                                                    | is_forward |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
| 1729032583 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032583 | null                                                                                                                                                                              |          0 |
| 1729032650 | /data/users/slarsen/fbsource/buck-out/v2/gen/fbcode/1638b36e975169f6/scripts/slarsen/torch_compile_model/__run__/run-inplace#link-tree/scripts/slarsen/torch_compile_model/run.py |          1 |
| 1729032650 | null                                                                                                                                                                              |          0 |
+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
4 row(s) in set (0 warnings, 131 errors, 0.80 sec)
```

Reviewed By: ezyang

Differential Revision: D64438144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138097
Approved by: https://github.com/ezyang
2024-10-18 18:48:52 +00:00
195d0a666b [BE][Ez]: Use interned hardcoded string FURB156 (#138330)
Uses string constants from string module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138330
Approved by: https://github.com/albanD
2024-10-18 18:26:16 +00:00
9c2a80322a Add Programmable Google Search (#137716)
- Adding the code for the programmable Google search
- Adding the CSS overrides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137716
Approved by: https://github.com/seemethere, https://github.com/albanD

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2024-10-18 18:18:16 +00:00
8d869c9ec7 Skip test_circular_dependencies on ROCm (#138312)
The test is flaky on ROCm and has been disabled for quite a while https://github.com/pytorch/pytorch/issues/110040.  The disabled issue was opened and then closed several times, so it's better to close that issue and skip the test here.

(This does not really fix the issue; I just want the test to be skipped on PRs instead of being disabled, and then to close the issue.)
Fixes https://github.com/pytorch/pytorch/issues/110040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138312
Approved by: https://github.com/jithunnair-amd, https://github.com/clee2000
2024-10-18 18:17:48 +00:00
620039c38c [inductor] Respect ir_dataclass(frozen=...) in Python 3.9 (#138247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138247
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-10-18 17:55:12 +00:00
ada7a8c217 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 8cb91109061648497ca09d6f1f9b9e13a2f5557e.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/yf225 due to because https://github.com/pytorch/pytorch/pull/138174 is reverted, we need to revert this too ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2422961292))
2024-10-18 17:51:54 +00:00
59158f640c [dynamo] Support equality comparison between Tensor and None (#138289)
This patch updates the `wrap_fx_proxy_cls` function to allow boolean output when the operation is one of
`supported_const_comparison_op_values`.
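
A hedged usage sketch of what this enables (illustrative only):

```python
import torch

@torch.compile
def add_bias(x, bias):
    # `bias == None` now resolves to a constant bool during tracing instead of
    # causing an unsupported-comparison error (hedged usage sketch).
    if bias == None:  # noqa: E711
        return x + 1
    return x + bias

print(add_bias(torch.ones(3), torch.ones(3)))
```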

Fixes #120907.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138289
Approved by: https://github.com/williamwen42
2024-10-18 17:49:26 +00:00
9ea271d40b Expand doc for bundled autotune cache (#138298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138298
Approved by: https://github.com/ezyang, https://github.com/oulgen
2024-10-18 17:43:47 +00:00
4bba038b2f Add diagonal_copy to torch/_decomp/__init__.py (#136730)
Fixes https://github.com/pytorch/pytorch/issues/117349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136730
Approved by: https://github.com/masnesral
2024-10-18 17:39:17 +00:00
666572d819 Update viable strict workflow (#138262)
Corresponds to https://github.com/pytorch/test-infra/pull/5775

Tested in https://github.com/pytorch/pytorch/actions/runs/11393196544/job/31700963325?pr=138262 by adding my branch to the environment and pointing the workflow at my test-infra branch and commenting out the parts that did the push + upload record to s3

Versioning would have been good for this...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138262
Approved by: https://github.com/huydhn
2024-10-18 17:28:55 +00:00
912ea5601b Move manywheel binary scripts to pytorch (#138103)
PR to remove Manywheel Scripts:
https://github.com/pytorch/builder/pull/2017

Test PR : https://github.com/pytorch/pytorch/pull/138325

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138103
Approved by: https://github.com/malfet
2024-10-18 17:11:28 +00:00
358ff3b731 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 1) (#136069)
[Inductor UT] Generalize Newly introduced inductor UTs for intel GPU
reuse `test/inductor/test_autoheuristic.py`
reuse `test/inductor/test_b2b_gemm.py`
reuse `test/inductor/test_custom_lowering.py`
reuse `test/inductor/test_efficient_conv_bn_eval.py`
reuse `test/inductor/test_group_batch_fusion.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136069
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-18 16:58:09 +00:00
8dd575faf6 [BE] Modernize C10_UNUSED (#138102)
[`[[maybe_unused]]`](https://en.cppreference.com/w/cpp/language/attributes/maybe_unused) is part of C++17 standard

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138102
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet, https://github.com/eqy
2024-10-18 16:33:01 +00:00
de51ed8610 [AOTI] Add C shim for _mkl_linear (#137880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137880
Approved by: https://github.com/desertfire
2024-10-18 16:26:19 +00:00
26ac5671dc Revert "Fix CompiledDDP failure when the gradient is not contiguous (#138174)"
This reverts commit 0ecafda6024f50734118dd794ac71b86c6e6d569.

Reverted https://github.com/pytorch/pytorch/pull/138174 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but I think it fails test_compute_comm_reordering in trunk for rocm and multigpu setup ([comment](https://github.com/pytorch/pytorch/pull/138174#issuecomment-2422818971))
2024-10-18 16:17:54 +00:00
98856f7ea1 Increase max runners available for linux.12xlarge and windows.8xlarge.nvidia.gpu.nonephemeral (#138332)
Related PR on test-infra: https://github.com/pytorch/test-infra/pull/5785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138332
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-10-18 16:17:36 +00:00
af306a392c Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit 7a117f3b3eea4cfeef21da2e3a8a1e39c30fa07d.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to unfortunately the failures on the previous import are still present on the current one D64568703 ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2422789143))
2024-10-18 16:01:01 +00:00
5a81475884 Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321)
### Description:

This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](https://github.com/pytorch/pytorch/pull/134107#discussion_r1800833948).

Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict.

You can find the related docs here:
[Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict).

@janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138321
Approved by: https://github.com/janeyx99
2024-10-18 15:41:43 +00:00
86aefa9405 typing subproc_pool.py (#138032)
Added type annotations to subproc_pool.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138032
Approved by: https://github.com/Skylion007
2024-10-18 15:31:05 +00:00
aa3ae50c07 Fixing MPS conv1d error message for output 2**16 (#134770)
Fixes #134416 by removing the misleading message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134770
Approved by: https://github.com/malfet
2024-10-18 14:13:20 +00:00
c4ed03cea1 Add proper handling for view and factory function for csan (#138236)
In particular, properly handle that some functions only read/write metadata on the Tensor and thus should not be detected as read/write by csan.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138236
Approved by: https://github.com/ngimel
2024-10-18 14:04:18 +00:00
0ff6f7a040 Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)"
This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af.

Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))
2024-10-18 13:21:17 +00:00
e027403dea ILP for Auto SAC (Selective Activation Checkpointing) (#137908)
This PR presents a mixed integer linear programming (MILP) formulation that can be used to determine, under a memory budget, which modules to apply activation checkpointing (AC) to and the amount of activation memory that should be discarded for each module. The MILP uses information collected from MemTracker, Runtime Estimator, and SAC Estimator, introduced in these PRs:
* https://github.com/pytorch/pytorch/pull/124688
* https://github.com/pytorch/pytorch/pull/134243
* https://github.com/pytorch/pytorch/pull/135208

End-to-end example and its sample output:

```
import copy
from typing import Tuple

import torch
from torch._subclasses.fake_tensor import FakeTensorMode

from torch.distributed._tools.ilp_utils import (
    aggregate_stats,
    get_peak_memory_runtime_baseline,
    parse_module_info,
)
from torch.distributed._tools.mem_tracker import _ModState, MemTracker
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.distributed._tools.sac_ilp import sac_milp
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

def _init_model_input_optimizer() -> Tuple[
    torch.nn.Module, torch.optim.Optimizer, torch.Tensor
]:
    bsz = 8
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=8192,
        max_seq_len=1024,
        dim=768,
        dropout_p=0.1,
    )
    with torch.device(torch.cuda.current_device()):
        model = Transformer(model_args)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(
        0,
        model_args.vocab_size,
        (bsz, model_args.max_seq_len),
        device=torch.cuda.current_device(),
    )
    return (model, optimizer, inp)

def _run_and_get_mem_tracker(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> MemTracker:
    mem_tracker = MemTracker()
    mem_tracker.track_external(model, optimizer)
    with mem_tracker as mt:
        for iter_idx in range(2):  # running twice to initialize optimizer
            output = model(inp)
            output.sum().backward()
            if iter_idx == 1:
                last_snapshot = mt.get_tracker_snapshot("current")
            optimizer.step()
            optimizer.zero_grad()
            if iter_idx == 0:
                mt.reset_mod_stats()
    assert last_snapshot is not None
    for mod_stats in mem_tracker.memory_tracking.values():
        if _ModState.POST_BW not in mod_stats.snapshots.keys():
            mod_stats.snapshots.setdefault(_ModState.POST_BW, []).append(
                copy.deepcopy(last_snapshot)
            )
    return mem_tracker

def _run_and_get_runtime_estimator(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    inp: torch.Tensor,
) -> RuntimeEstimator:
    def _run_one_step() -> None:
        output = model(inp)
        output.sum().backward()
        optimizer.step()
        optimizer.zero_grad()

    # Initializing optimizer states and warm-up
    _run_one_step()

    runtime_estimator = RuntimeEstimator()
    with runtime_estimator(estimate_mode_type="operator-level-cost-model"):
        _run_one_step()  # We use only one iteration for estimation
    return runtime_estimator

def _run_and_get_sac_estimator(
    model: torch.nn.Module,
    inp: torch.Tensor,
) -> SACEstimator:
    sac_estimator = SACEstimator()
    with sac_estimator(estimate_mode_type="operator-level-cost-model"):
        loss = model(inp).sum()
    loss.backward()
    return sac_estimator

def main():
    with FakeTensorMode():
        model, optimizer, inp = _init_model_input_optimizer()
        mem_tracker = _run_and_get_mem_tracker(model, optimizer, inp)
        runtime_estimator = _run_and_get_runtime_estimator(model, optimizer, inp)
        sac_estimator = _run_and_get_sac_estimator(model, inp)
        mod_info = aggregate_stats(
            model,
            mem_tracker,
            runtime_estimator,
            sac_estimator,
            torch.device(torch.cuda.current_device()),
        )
        g = parse_module_info(mod_info)

        peak_mem, compute_time = get_peak_memory_runtime_baseline(g)
        print("=== WITHOUT AC ===")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"compute_time: {round(compute_time, 2)} ms")

        ac_decisions, recomputation_time, peak_mem = sac_milp(g, memory_budget=1.75)
        print("=== WITH AC ===")
        print(f"ac_decisions: {ac_decisions}")
        print(f"peak_mem: {round(peak_mem / 2**30, 2)} GiB")
        print(f"recomputation_time: {recomputation_time} ms")

if __name__ == "__main__":
    main()
```

```
=== WITHOUT AC ===
peak_mem: 2.41 GiB
compute_time: 97.97 ms
=== WITH AC ===
ac_decisions: {'Transformer.layers.0': 0.5232, 'Transformer.layers.1': 0.5232, 'Transformer.layers.2': 0.6849, 'Transformer.layers.3': 0.5232}
peak_mem: 1.75 GiB
recomputation_time: 5.92 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137908
Approved by: https://github.com/weifengpy
2024-10-18 12:45:37 +00:00
7b863230ea [Docs] Optimize parameter description to declare allowed type (2/N) (#138152)
Inspired by issues #137422 and #103847

Optimize method parameter types in docs to give users a clearer idea of what is expected to be passed to methods.

Previous PR:
- [x] https://github.com/pytorch/pytorch/pull/137956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138152
Approved by: https://github.com/albanD
2024-10-18 11:18:19 +00:00
354bc3ac11 [dynamo] Remove an unused variable in repro.after_aot (#138094)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138094
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-10-18 09:37:10 +00:00
e1c4548441 [dynamo] Simplify creation of VariableTrackers (#135714)
## `VariableTracker::build()` hides the Builders

### The problem

In the current code, creating a `VariableTracker` involves choosing one of two `Builder` classes and either calling a method, or calling a constructor that creates an object that you immediately call, [like this](083c9149b7/torch/_dynamo/variables/functions.py (L761-L768)).

Variations on this code are repeated in many places.

Moreover, the `Builder` classes have a lot of dependencies, so they have to be loaded late in the import process to avoid circular imports, and they end up being repeatedly imported at local scope.

### The solution

In this commit, the import from `builder` and the logic of choosing and calling the Builder class are hidden in a single static factory method, `VariableTracker.build()`, easier to reason about and to import.

This commit net lowers the total lines of code by over 150 lines by removing repetitive logic and unnecessary local imports.
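
Schematically, the factory looks like the following self-contained sketch, where the stub builder classes stand in for dynamo's real builders and do not reflect its actual API:

```python
# Self-contained hedged sketch of the factory pattern described above.
class SourcelessBuilderStub:
    @staticmethod
    def create(tx, value):
        return ("sourceless", value)

class VariableBuilderStub:
    def __init__(self, tx, source):
        self.tx, self.source = tx, source

    def __call__(self, value):
        return ("sourced", self.source, value)

class VariableTrackerSketch:
    @staticmethod
    def build(tx, value, source=None):
        # One entry point hides the "which builder, and how is it invoked?" decision
        # (and, in the real code, the late import of the builder module).
        if source is None:
            return SourcelessBuilderStub.create(tx, value)
        return VariableBuilderStub(tx, source)(value)
```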

**CHANGES:** Originally the name of the static method was `VariableTracker.create()` but a static method on a derived class, `LazyVariableTracker.create()` now exists with a different signature that's irreconcilable, so the new static method was renamed to `VariableTracker.build()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135714
Approved by: https://github.com/jansel
2024-10-18 09:36:46 +00:00
1581a93e87 [Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245
Approved by: https://github.com/yf225
2024-10-18 09:10:01 +00:00
1a8b4c65ac Fix scatter and gather shape check error message (#138310)
The error message seems incorrect based on the surrounding code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138310
Approved by: https://github.com/Microve, https://github.com/fegin
2024-10-18 07:49:07 +00:00
517012058d Move test_db to training IR (#138251)
Differential Revision: [D64560792](https://our.internmc.facebook.com/intern/diff/D64560792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138251
Approved by: https://github.com/yushangdi
ghstack dependencies: #138249
2024-10-18 07:42:13 +00:00
29264fcbef Move test_verifier to training IR (#138249)
Differential Revision: [D64560351](https://our.internmc.facebook.com/intern/diff/D64560351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138249
Approved by: https://github.com/yushangdi
2024-10-18 07:36:29 +00:00
5d01126616 preserve module signature with multiple calls (#137999)
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.

Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.

Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)

Differential Revision: D64406945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
2024-10-18 07:30:22 +00:00
14e6624473 Update wmic command used in collect_env.py to its counterpart in powershell due to its deprecation (#138297)
As title.
`wmic` is deprecated in Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138297
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-18 07:03:17 +00:00
d116d007ee Add host-side Triton TMA support to Inductor (#137950)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- Due to the Dynamo support implemented in the previous PR, the `tma_descriptor_metadata` dict is delivered to the `triton_kernel_wrap_` lowering and passed to the `ir.UserDefinedTritonKernel` as an additional argument.
- Looking into the `tma_descriptor_metadata`, `ir.UserDefinedTritonKernel` substitutes the corresponding `TensorBox` arguments of the kernel (swapped upstream in Dynamo) by the new `ir.TMADescriptor` nodes implementing TMA descriptors in Inductor IR.
- `ir.TMADescriptor.__init__` provides the wiring between the upstream underlying `ir.TensorBox` and the downstream `ir.UserDefinedTritonKernel` kernel. In particular, we use `ir.NonOwnedLayout` wrapping `ir.ReinterpretView` to avoid the upstream tensor's buffer being deleted prematurely (before the TMA descriptor is used in the Triton kernel).
- Via `ir.TMADescriptor.codegen`, the Triton's `create_{1d,2d}_tma_descriptor` function call is codegened in the wrapper (in the host code).
- A new `TMADescriptorArg` dataclass is added to handle the Triton kernel metadata pertinent to host-side TMA.
- AOT Inductor support will be implemented in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137950
Approved by: https://github.com/eellison
ghstack dependencies: #137677
2024-10-18 06:27:24 +00:00
82443798aa [Distributed] Refactor compress hook to remove duplicated code (#138182)
Fix TODO in code

```python
# TODO: create an internal helper function and extract the duplicate code in FP16_compress and BF16_compress.
```

1. Extract common logic in `fp16_compress_hook` and `bf16_compress_hook` to `_compress_hook` method
2. Let `fp16_compress_hook` and `bf16_compress_hook` invoke `_compress_hook` with a different `dtype` (see the sketch below)

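A minimal sketch of the refactor, assuming the GradBucket/process-group API used by the DDP comm hooks; the real implementation lives in torch.distributed.algorithms.ddp_comm_hooks.default_hooks and may differ in detail:

```python
import torch
import torch.distributed as dist

def _compress_hook(dtype, process_group, bucket):
    # Shared body: cast to the low-precision dtype, allreduce, then cast back.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = group.size()
    buf = bucket.buffer()
    compressed = buf.to(dtype).div_(world_size)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(fut):
        buf.copy_(fut.value()[0])  # copy_ casts back to the original dtype
        return buf

    return fut.then(decompress)

def fp16_compress_hook(process_group, bucket):
    return _compress_hook(torch.float16, process_group, bucket)

def bf16_compress_hook(process_group, bucket):
    return _compress_hook(torch.bfloat16, process_group, bucket)
```
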
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138182
Approved by: https://github.com/awgu
2024-10-18 06:01:15 +00:00
80a58b7207 Use fresh cache directory in test_cudacodecache (#138243)
This test frequently times out flakily, for example, https://github.com/pytorch/pytorch/actions/runs/11377972115/job/31654107609#step:22:2376.  I still couldn't reproduce this behavior locally when running this multiple times and in parallel.  ~~So, I suspect that the error only shows up when other tests are run in parallel.~~

~~I attempted to run this serially in this PR; once it lands, I can monitor trunk to see if this helps.~~

Running serially still ends up timing out https://github.com/pytorch/pytorch/actions/runs/11391445912/job/31697603438, so this is another try with a fresh cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138243
Approved by: https://github.com/clee2000
2024-10-18 05:45:39 +00:00
0b168ceb6d Collect Nvidia libraries with collect_env.py (#138076)
Collect Nvidia libraries to diagnose issues like #133548.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138076
Approved by: https://github.com/malfet
2024-10-18 05:05:00 +00:00
8cb9110906 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan, https://github.com/fduwjj, https://github.com/fegin, https://github.com/kwen2501
2024-10-18 04:58:58 +00:00
a9014d2287 [BE][MPS] Compile without warnings on MacOS15 (#138238)
By guarding the calls to `-[MTLCompileOptions setFastMathEnabled]` with `C10_DIAGNOSTIC_PUSH` and `POP`
and using `-[MTLCompileOptions setMathMode:]` and `-[MTLCompileOptions setMathFloatingPointFunctions:]` on MacOS15
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138238
Approved by: https://github.com/atalman
2024-10-18 04:20:15 +00:00
cc6c248919 [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 2) (#136856)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU
reuse `test/inductor/test_inductor_freezing.py`
reuse `test/inductor/test_layout_optim.py`
reuse `test/inductor/test_loop_ordering.py`
reuse `test/inductor/test_memory_planning.py`
reuse `test/inductor/test_padding.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136856
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/jansel
2024-10-18 03:58:00 +00:00
c3cd9939fc aten | Deduplicate and silence set but unused variable warning. (#138270)
Summary:
It turns out we have two functions named slightly differently that do exactly the same thing.
Also silence the warning if the message is stripped out.

Test Plan: Sandcastle, no behavior change.

Differential Revision: D64566719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138270
Approved by: https://github.com/boguscoder, https://github.com/cyyever
2024-10-18 03:09:46 +00:00
73a153b931 [dynamo] add compiler.set_stance raw function call test and doc example (#138276)
Followup to https://github.com/pytorch/pytorch/pull/137504#issuecomment-2420107198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138276
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-10-18 02:54:22 +00:00
8b426d80dc [hops][refactor] Refactor the aliasing/mutation detection functions (#138234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138234
Approved by: https://github.com/ydwu4
ghstack dependencies: #138231
2024-10-18 02:35:00 +00:00
e714ebf664 [dynamo][testing] Update AOTEagerandRecordGraphs backend (#138231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138231
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/aakhundov
2024-10-18 02:35:00 +00:00
8a5dd7f59b Allow SequentialLR to include ChainedScheduler (#133450)
This fixes #132745 and allows a `SequentialLR` to include schedulers that are compound scheduler types (i.e., a `ChainedScheduler`), which contain a list of schedulers in a `_schedulers` attribute.
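
A minimal usage sketch of the now-allowed nesting (scheduler choices and hyperparameters here are arbitrary):

```python
import torch
from torch.optim.lr_scheduler import (
    ChainedScheduler, ConstantLR, ExponentialLR, LinearLR, SequentialLR,
)

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup = LinearLR(opt, start_factor=0.1, total_iters=5)
chained = ChainedScheduler([ConstantLR(opt, factor=0.5, total_iters=10),
                            ExponentialLR(opt, gamma=0.9)])
# With this change, SequentialLR accepts a ChainedScheduler entry and looks
# into its `_schedulers` list as needed.
sched = SequentialLR(opt, schedulers=[warmup, chained], milestones=[5])

for _ in range(20):
    opt.step()
    sched.step()
```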

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133450
Approved by: https://github.com/janeyx99
2024-10-18 02:29:38 +00:00
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.
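
A minimal usage sketch (requires a PyTorch build with XPU support available at runtime):

```python
import torch

if torch.xpu.is_available():
    print(torch.xpu.get_arch_list())      # AOT target architectures the build supports
    print(torch.xpu.get_gencode_flags())  # the AOT flags PyTorch XPU was built with
```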

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
6d8c9be54b [reland] Add int1 to int7 dtypes (#137928)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Differential Revision: [D64344944](https://our.internmc.facebook.com/intern/diff/D64344944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137928
Approved by: https://github.com/malfet
2024-10-18 02:02:08 +00:00
7365a57dc0 [BC] Add check for core ATen opset schema BC (#137664)
Summary: Based on core ATen opset BC policy: https://dev-discuss.pytorch.org/t/core-aten-opset-backward-forward-compatibility-policy/1772

Enforcing this policy in `check_forward_backward_compatibility.py`. Basically, the script will error out if any BC-breaking schema change occurs to a core ATen operator.

Test Plan:

Run `python test/forward_backward_compatibility/dump_all_function_schemas.py --filename nightly_schemas.txt`

Manually added an argument to the `convolution` schema in `nightly_schemas.txt`; see the following error:

```
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:329] Can NOT find backward compatible schemas after changes for schema aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor from the following candidates:
[
        aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups) -> Tensor
	aten::convolution.out(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, *, Tensor(a!) out) -> Tensor(a!)
]. Please contact PyTorch team to confirm if this BC breaking change is safe or not.
...
[WARNING 2024-10-09 15:54:36,224 check_forward_backward_compatibility.py:342] The PR is introducing backward incompatible changes to core ATen operators. Please contact PyTorch team to confirm whether this change is wanted or not.

Broken ops: [
	aten::convolution(Tensor input, Tensor weight, Tensor? bias, SymInt[] stride, SymInt[] padding, SymInt[] dilation, bool transposed, SymInt[] output_padding, SymInt groups, SymInt new_arg) -> Tensor
]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137664
Approved by: https://github.com/albanD
2024-10-18 01:58:33 +00:00
21a9c06ca9 [c10d] differentiate timeout errors from nccl errors (#138240)
Summary:
Our watchdog does not clearly differentiate timeouts from NCCL errors, in either its logs or its code paths.
It is important for c10d to differentiate the different reasons for watchdog failures (e.g., timeouts vs. NCCL errors) and possibly let users handle the errors differently depending on the type of error.
Test Plan:
UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
2024-10-18 01:36:32 +00:00
95f869c3d7 [pytorch_operator_stats] log if using torchscript runtime (#137986)
Summary: logs if an operator is run with the TorchScript runtime, using a thread_local variable set in `InterpreterState.run()`

Test Plan: buck2 run mode/dev-nosan caffe2/torch/fb/observers:scuba_observer_runner

Reviewed By: zou3519

Differential Revision: D64200781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137986
Approved by: https://github.com/angelayi
2024-10-18 00:55:22 +00:00
ad28565ed7 Use C++17 Convention Methods in PyTorch (#137958)
Detailed Descriptions:
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>`
- and so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137958
Approved by: https://github.com/janeyx99
2024-10-18 00:52:51 +00:00
b7cf8fb800 c10 | Silence 'deprecated-dynamic-exception-spec' warning when importing cxxabi. (#138219)
Summary: cxxabi header specifically from llvm violates this, ignore the warning when including it.

Test Plan: No runtime behavior change, sandcastle only

Differential Revision: D64540217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138219
Approved by: https://github.com/boguscoder
2024-10-18 00:42:45 +00:00
2f91d7c63f [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".
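
A rough sketch of the intended interaction, assuming the stance API from https://github.com/pytorch/pytorch/pull/137504 and the compiled-autograd context manager; exact entry points may differ by version:

```python
import torch
from torch._dynamo import compiled_autograd

def compiler_fn(gm):
    return torch.compile(gm, backend="eager")

torch.compiler.set_stance("force_eager")  # user explicitly asks for eager

x = torch.randn(4, requires_grad=True)
with compiled_autograd.enable(compiler_fn):
    loss = (x.sin() ** 2).sum()
    loss.backward()  # with this change, the backward also falls back to eager
```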

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
2024-10-18 00:13:00 +00:00
6d473e0dda [autolint] move to use a label (#138263)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138263
Approved by: https://github.com/huydhn
2024-10-18 00:12:52 +00:00
a3172809a1 [EZ] Fix typo in Normalization.mm (#138283)
Introduced by 6b76a21ebd
One would likely have to wait 125 years for a MacOS-150 release :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138283
Approved by: https://github.com/kit1980
2024-10-18 00:01:21 +00:00
b14c9b7250 [AMD] Hipify torchaudio_decoder (#138181)
Summary:
X-link: https://github.com/pytorch/audio/pull/3843

Continue to hipify more torchaudio targets.

Test Plan:
CI

  buck build mode/opt-amd-gpu pytorch/audio/src/...

Differential Revision: D64298970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138181
Approved by: https://github.com/houseroad
2024-10-17 23:37:37 +00:00
0ecafda602 Fix CompiledDDP failure when the gradient is not contiguous (#138174)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138174
Approved by: https://github.com/yf225, https://github.com/kwen2501

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-10-17 23:08:24 +00:00
2fc6c32b4c Ensure version file is regenerated at change (#138237)
Fixes an observed error where `version.py` would not be regenerated by CMake without deleting the file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138237
Approved by: https://github.com/Skylion007
2024-10-17 22:46:05 +00:00
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was considered internal-use only, but it is apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
69ba89da11 Fix cuda sanitizer and as_subclass calls (#138218)
This fixes 4 main issues:
- The way the cuda sanitizer handles its state is odd. In particular, because the lifetime of the Mode is linked to the submodule, it might outlive the Python runtime and other loaded modules. On my current version, it even outlives the "sys" module. Since I'm not sure of the impact of changing this lifetime handling, I'm making the exit handler a no-op when Python is already dying, as there is no point cleaning up at that stage.
- Adds a "disable" method to be able to test after the mode is enabled.
- Fix `Tensor.as_subclass()` to properly disable modes when creating the new Tensor object, just like we already do in `make_subclass` and `make_wrapper_subclass`. The change here is just to apply the exact same treatment to it.
- ~Fix `Tensor.as_subclass()` not to propagate autograd as there is no valid backward associated here.~ We have tests that check that this behavior happens, so I guess this is not an obvious bugfix but rather expected behavior. Reverted that change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138218
Approved by: https://github.com/ngimel
2024-10-17 21:18:32 +00:00
b14269dcfb Make Context to be Device-agnostic Step by Step (1/N) (#136519) (#138155)
Summary:
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Original pull request: https://github.com/pytorch/pytorch/pull/136519

Test Plan: contbuild & OSS CI, see 4a8e49389c

Reviewed By: malfet

Differential Revision: D64471142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138155
Approved by: https://github.com/malfet, https://github.com/bobrenjc93
2024-10-17 20:58:56 +00:00
7a117f3b3e Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting for the epilogue, but that led to worse numerics because the bmm in eager would do the epilogue all in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-17 19:24:54 +00:00
54839781ed Update lint failure msg to encourage lintrunner -a locally (#138232)
This is only a minor patch that I hope will change how I talk to contributors when lint fails, so that I can tell them to read the logs about lintrunner. There have been too many times when I have had to click "approve all workflows" just for lint to fail again because the developer is manually applying every fix and using CI to test. I understand there are times when lintrunner doesn't work, but I'd like most contributors to at least give it a try once to start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138232
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-10-17 19:13:55 +00:00
dfb5ac05cc [Record Function] Add Kwargs only USER_SCOPE Macro (#138020)
Summary: Add a macro such that users can easily add a USER annotation with kwargs only

Test Plan: Will use D63801503 to test this E2E. Added unit test as well that makes sure that the kwargs get recorded correctly

Differential Revision: D64420328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138020
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2024-10-17 18:48:49 +00:00
0c76c68d7d [tlparse][AOTAutograd] Rename to aot_inference_graph in tlparse output (#137803)
Compiled Autograd uses this AOT inference path, but it shows up as "aot_forward_graph" in tlparse output, which makes it hard to distinguish from the normal "aot_forward_graph"s that are also in the tlparse output. This PR renames it to "aot_inference_graph", which makes it easier to tell which tlparse graph block is from Compiled Autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137803
Approved by: https://github.com/Microve, https://github.com/bdhirsh, https://github.com/ezyang
2024-10-17 18:44:37 +00:00
d531bd509e [Docs] Fix description in torch.save docs to show default for pickle_protocol instead of variable name (#138153)
Fixes #138013

Replace `DEFAULT_PROTOCOL` with actual default value `2` in `torch.save` method document

Before
![image](https://github.com/user-attachments/assets/cdd77d14-c009-4848-8538-9256bf22c32a)

After
![image](https://github.com/user-attachments/assets/f6b1063d-c955-478a-8d42-702b988426aa)
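
For reference, a call that spells out the documented default explicitly (the file name here is illustrative):

```python
import torch

# Equivalent to the default: pickle protocol 2 is used unless overridden.
torch.save({"w": torch.zeros(3)}, "checkpoint.pt", pickle_protocol=2)
```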

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138153
Approved by: https://github.com/mikaylagawarecki
2024-10-17 18:13:05 +00:00
8abbd1c7c7 Modernize C10_NODISCARD to [[nodiscard]] (#138151)
PyTorch is C++17 now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138151
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 18:07:39 +00:00
6752e7dc3e Moved some of Inductor IR nodes to be frozen (#137859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137859
Approved by: https://github.com/ezyang
2024-10-17 18:04:45 +00:00
0b2c12cb4d Support more foreach ops for tensor beta support (#134170)
Add more foreach ops so we don't have fallbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170
Approved by: https://github.com/eellison
2024-10-17 17:51:31 +00:00
92fdea8a39 remove skips due to https://github.com/pytorch/torchdynamo/issues/1991 (#138133)
Closes https://github.com/pytorch/pytorch/issues/93479. A bunch of other dynamo-wrapped tests also exhibit "torch.* returned non-Tensor output unimplemented" making the issue seem less relevant to me. Some tests are marked as xfail as they fail for other reasons.

If these tests are indeed important, we should create a new issue to track them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138133
Approved by: https://github.com/ezyang
2024-10-17 17:42:46 +00:00
6b76a21ebd [PyTorch] Fix incorrect macOS 15.0 gating in MPS backend (#138022)
The ifdef as written just checks if the macOS 15.0-capable SDK is being used. You also need a runtime gate to make sure macOS 15 is in use.

Differential Revision: [D64429453](https://our.internmc.facebook.com/intern/diff/D64429453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138022
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137722, #138014
2024-10-17 17:35:34 +00:00
d2a6c73235 Revert "[CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)"
This reverts commit 20af56d4359c3f5fed2e8f94e111a8502f2ebeb3.

Reverted https://github.com/pytorch/pytorch/pull/138178 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the new tests are failing inductor distributed jobs ([comment](https://github.com/pytorch/pytorch/pull/138178#issuecomment-2420109501))
2024-10-17 17:32:06 +00:00
2a50d77823 Move test_experimental.py to training IR (#138140)
Differential Revision: [D64510938](https://our.internmc.facebook.com/intern/diff/D64510938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138140
Approved by: https://github.com/avikchaudhuri
2024-10-17 17:30:10 +00:00
ecc5e05854 Refactor NJT min / max seqlen handling for convenience (#138130)
There's an annoying pattern emerging for pulling out the NJT min / max seqlen ints if they exist without computing / caching if they don't. This PR introduces private convenience functions to simplify handling this and avoiding redundant checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138130
Approved by: https://github.com/soulitzer
2024-10-17 17:28:39 +00:00
66478d0cf7 Revert "[compiled autograd] directly use python Logger class in cpp (#137953)"
This reverts commit af916613687d3bcc1d15362ba2fdf9312378c500.

Reverted https://github.com/pytorch/pytorch/pull/137953 on behalf of https://github.com/clee2000 due to breaking builds internally D64479234, I think it makes the build size of a package too large? The logs link to a wiki with instructions of what to do ([comment](https://github.com/pytorch/pytorch/pull/137953#issuecomment-2420086928))
2024-10-17 17:19:36 +00:00
3b0f3059f6 Revert "[Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)"
This reverts commit ebe37b23f11e150cd3afa5464193ee036e15277f.

Reverted https://github.com/pytorch/pytorch/pull/138113 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert https://github.com/pytorch/pytorch/pull/137953, please rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/138113#issuecomment-2420079703))
2024-10-17 17:16:44 +00:00
375dcb960f Revert "Avoid some dangling reference warnings (#132535)"
This reverts commit f3d7a02716d8725dcedff86094bd7e20f73155f1.

Reverted https://github.com/pytorch/pytorch/pull/132535 on behalf of https://github.com/clee2000 due to broke some internal builds D64479234 ([comment](https://github.com/pytorch/pytorch/pull/132535#issuecomment-2419983509))
2024-10-17 16:23:36 +00:00
348f208504 Autocast re-tracibility (#138082)
Summary:
Support autocast re-tracing by giving it the same treatment as set_grad.

In re-tracing, when dynamo encounters an autocast HOP, we want it to trace through `with torch.autocast()` again, and replace the HOP with the traced subgraph.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
```

Differential Revision: D63856081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138082
Approved by: https://github.com/ydwu4
2024-10-17 16:09:11 +00:00
3087b5e431 [cond] support lifted symint inputs in subgraph (#137519)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137519
Approved by: https://github.com/eellison
2024-10-17 16:09:06 +00:00
2414c3f534 AOTI fixes for MI300 lowering (#137939)
Summary:
1) Add sleef back to enable SIMD on AMD
2) Add kpack to the Triton compute_meta for AMD Triton, since there will be user-defined Triton kernels using this for k-dim packing

Test Plan:
```
HIP_VISIBLE_DEVICES=0 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCH_LOGS="output_code,graph_code" buck run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark --  --skip-flop-estimation --skip-trt --skip-ait --enable-aot-inductor --sync-mode=0 --gpu-trace --sample-input-tile-factor=1  --load="manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge" --lowering-input-str='{"serialized_inference_model_input_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/input.merge","serialized_inference_model_output_path":"ads_storage_fblearner/tree/user/facebook/fblearner/predictor/925729118/0/gpu_lowering/mi300_output.merge","submodule_names_to_lower":["merge"],"inductor_lowering_context":{"aot_inductor_lowering_settings":{"use_scripting":true,"preset_lowerer":"ifu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":3,"output_precision":3, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}},"model_entity_id":925729118,"model_snapshot_id":0,"add_sample_inputs":false,"hardware_type":0,"platform_arch":1,"dense_in_place_format":2}' --precision=bf16 2>&1 | tee local_benchmark_log.txt

```

Differential Revision: D64262924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137939
Approved by: https://github.com/frank-wei
2024-10-17 16:09:04 +00:00
502c6183e0 Prevent tuple instances from being weak-referenced. (#137838)
Summary:
Currently, https://fburl.com/code/uka25j1i checks whether the guarded object supports weakref by looking at its `__class__`
```
if hasattr(guarded_object.__class__, "__weakref__") and not isinstance(
    guarded_object, enum.Enum
):
    obj_ref = weakref.ref(guarded_object)
```

However, we have reason to modify this slightly because we use classes that "pretend" to be some other classes (e.g. nn.Parameter). Example https://fburl.com/code/8bcktgoh :
```
class QuantizedWeights:
    # TODO: Ugly trick so torch allows us to replace parameters
    # with our custom weights. Do this properly.
    @property
    def __class__(self) -> Type[nn.parameter.Parameter]:
        return nn.Parameter

    @property
    def grad_fn(self) -> None:
        return None
```

For example, Fp8RowwiseWeights, which inherits from the base class above and also from namedtuple, actually does not have a `__weakref__` attribute, but its "class" will say it does.

I think the easiest change is to use instance-level checking rather than class-level
```
if hasattr(guarded_object, "__weakref__") ...
```

But I'm wondering if this will harm any of the existing behaviors.

I'd appreciate reviews from the experts

(I just added all recommended reviewers since I'm not sure who is the best person to consult...)

Test Plan: CI?

Reviewed By: YJYJLee

Differential Revision: D64140537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137838
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-10-17 16:08:32 +00:00
7e16c9d5f2 include bw_compiler in strobelight profile (#138060)
Summary: title + tlparse will have the phase name.

Test Plan: {F1933087525}

Differential Revision: D64450315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138060
Approved by: https://github.com/ezyang
2024-10-17 16:08:28 +00:00
20af56d435 [CI] Add Compiled DDP and Compiled FSDP2 tests to test_inductor_distributed (#138178)
`test_replicate_with_compiler.py` and `test_fully_shard_compile.py` require bf16, so they need to be run within the test_inductor_distributed job (which uses A10G (SM80) and has bf16 support).

This allows us to migrate distributed jobs to T4 machines in https://github.com/pytorch/pytorch/pull/137161, as the compiled distributed jobs are the only blocking ones now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138178
Approved by: https://github.com/xmfan
2024-10-17 10:51:07 +00:00
8cfe28e4e3 [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.
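
Illustrative check of the capability that eager (and, with this change, Inductor) will use; ATEN_CPU_CAPABILITY must be set in the environment before running, e.g. `ATEN_CPU_CAPABILITY=avx2 python train.py`:

```python
import torch

# Reports the vector ISA level ATen dispatches to, honoring ATEN_CPU_CAPABILITY.
print(torch.backends.cpu.get_cpu_capability())  # e.g. "AVX2" or "AVX512"
```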

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-10-17 09:06:57 +00:00
47077bfcb5 Remove an unused variable in _subclasses.fake_tensor (#138086)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138086
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-17 09:05:25 +00:00
ba10259115 Increase default COMPILE_STROBELIGHT_MAX_STACK_LENGTH to 500 (#138006)
Summary: PT2 call stacks are long; this reduces stack truncation.
<img width="1363" alt="Screenshot 2024-10-15 at 11 35 11 AM" src="https://github.com/user-attachments/assets/d09a8fb5-eafc-4440-ab58-464889dc6df8">
<img width="1373" alt="Screenshot 2024-10-15 at 11 35 26 AM" src="https://github.com/user-attachments/assets/c4c9c245-54d1-4e35-b16f-029ece335e03">

Differential Revision: D64414746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138006
Approved by: https://github.com/bobrenjc93
2024-10-17 07:31:32 +00:00
5b7f4767ff Fix https://github.com/pytorch/pytorch/issues/138062 (#138137)
Fixes https://github.com/pytorch/pytorch/issues/138062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138137
Approved by: https://github.com/mlazos
2024-10-17 07:12:15 +00:00
f3c3f3a3c3 Fix assigning tensor with requires_grad as constant in export (#137997)
When we insert constants into the unlifted graph, we need to detach them if they require grad, but when we detach we need to preserve the original aliasing information.

Differential Revision: [D64406859](https://our.internmc.facebook.com/intern/diff/D64406859/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137997
Approved by: https://github.com/avikchaudhuri
2024-10-17 06:41:10 +00:00
38d9924bfc Disable lint suggestions on my PRs (#138054)
The suggestions unusably clog up early draft PRs that are not necessarily lint clean yet. Making matters worse, even if I fix them I have to manually click through hundreds of comments to "Resolve" them even though I've fixed it. Disabling it on ghstack helps, but I occasionally do standard PRs via fbcode export mechanism. Opt me out.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138054
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/PaliC
2024-10-17 05:28:37 +00:00
af8bd323e8 Remove legacy Caffe2 pthreadpool from CMake (#134936)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
9c084cccfd [Pytorch][ATEN] Enable FP8 concatenate (#138046)
Summary: Float8 is becoming an increasingly popular datatype now that it is well supported on GPUs. This diff enables FP8 to work with `torch.cat`. This is pretty straightforward since memory operations don't vary based on the input dtype, but it can be quite helpful for FP8-based models.

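A minimal sketch of the newly enabled call (assumes a CUDA build where the float8 dtypes are usable):

```python
import torch

a = torch.randn(4, 8, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(4, 8, device="cuda").to(torch.float8_e4m3fn)
out = torch.cat([a, b], dim=0)  # previously unsupported for float8 inputs
print(out.shape, out.dtype)
```
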
Test Plan:
```
buck2 run mode/opt -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12 //caffe2/test:tensor_creation -- -r test_cat_all_dtypes_and_devices
```

Differential Revision: D64443965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138046
Approved by: https://github.com/eqy, https://github.com/qchip, https://github.com/jianyuh
2024-10-17 04:58:54 +00:00
ebd60f4074 update CMAKE_PREFIX_PATH setting command (#134934)
The current command for setting the `CMAKE_PREFIX_PATH` environment variable overwrites any values that were already set. Appending with `:` instead adds the conda env search path to the existing values and avoids library-not-found issues:
`export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}:${CMAKE_PREFIX_PATH}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134934
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-17 04:19:18 +00:00
7db1f0b7b5 Minor assert error message improvement (#138053)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138053
Approved by: https://github.com/Skylion007
2024-10-17 03:54:15 +00:00
ebe37b23f1 [Compiled Autograd] Check Dynamo stance to decide whether to fallback to eager (#138113)
Dynamo stance is recently added in https://github.com/pytorch/pytorch/pull/137504. When Dynamo stance is "force_eager" (user explicitly wants to fall back to eager), we would like Compiled Autograd to fall back to eager as well. This will allow the Traceable FSDP2 use case to work since "eager forward + compiled autograd backward" is not supported for Traceable FSDP2.

In general, if user wants to do "eager forward + compiled autograd backward", they should explicitly run the forward in eager instead of applying compile and then set stance to "force_eager".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138113
Approved by: https://github.com/xmfan
ghstack dependencies: #138105
2024-10-17 03:45:10 +00:00
fe43f72be7 [AOTI] Remove the non-ABI-compatible mode (part 2) (#138047)
Summary: Continue to clean up non-ABI-compatible mode related code.

Differential Revision: [D64444327](https://our.internmc.facebook.com/intern/diff/D64444327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138047
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016, #138009
2024-10-17 02:54:24 +00:00
2e67d7cc35 [AOTI] Remove the non-ABI-compatible mode (part 1) (#138009)
Summary: The ABI-compatible mode has been turned on as default in https://github.com/pytorch/pytorch/pull/136534. Removing the non-ABI-compatible logic to greatly simplify the wrapper codegen logic.

Differential Revision: [D64439676](https://our.internmc.facebook.com/intern/diff/D64439676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138009
Approved by: https://github.com/chenyang78
ghstack dependencies: #137982, #138016
2024-10-17 02:48:26 +00:00
7711f00553 [BE] Delete unused operator!= from the test (#138122)
If method is unused, why not delete it altogether?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138122
Approved by: https://github.com/swolchok
2024-10-17 02:24:48 +00:00
906fe05895 Naive impls for NJT matmul (#138121)
Our matmul support is abysmal - let's at least get this working and do it performantly later.

Bonus: implements `bmm` as well.

jagged <-> padded dense conversions are utilized when possible, and an unbind-based fallback otherwise (the former works with torch.compile and the latter doesn't). Some testing is missing because we don't have factory function support yet :(
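
A rough usage sketch of one covered case (nested jagged tensor times a dense weight); exact dtype/layout coverage is as described above:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
w = torch.randn(8, 16)
out = torch.matmul(nt, w)  # jagged @ dense, via padded-dense or unbind fallback
print(out.is_nested, out.size(-1))  # True, 16
```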
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138121
Approved by: https://github.com/cpuhrsch
2024-10-17 01:31:46 +00:00
b4f7f4bf49 [Docs] Optimize parameter description to declare allowed type (1/N) (#137956)
Inspired by issue #137422 and #103847

Optimize method parameter types in the docs to give users a clearer picture of what is expected to be passed to methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137956
Approved by: https://github.com/albanD
2024-10-17 01:19:55 +00:00
c69f4518ec [SymmetricMemory] fix a race condition in _pipelined_produce_and_all2all that can cause correctness issues for very small chunk_producers (#138126)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138126
Approved by: https://github.com/lessw2020
2024-10-17 01:05:41 +00:00
69e125a7e9 AOTInductor: fixup test (follow-up to #137401) (#137692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137692
Approved by: https://github.com/desertfire
2024-10-17 00:40:21 +00:00
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
504904c9c6 [Traceable FSDP2] Add compiled_autograd_enabled helper function (#138105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138105
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-10-17 00:04:06 +00:00
0e9708f907 tensor constant with wrapped method (#138091)
Summary:
Tensor constants can show up through wrapped methods, so they may not always be found in constant attributes. They nevertheless need to be fakified, and their meta vals need to be found, in order to create graph signatures. Otherwise non-strict export fails.

Longer term maybe we should pull this fakification up in non-strict.

Test Plan: added test

Differential Revision: D64480272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138091
Approved by: https://github.com/tugsbayasgalan
2024-10-17 00:00:04 +00:00
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
5254a0d383 Revert "Dont decompose aten.baddmm in inductor (#137904)"
This reverts commit cef6c3dcb07aafe25d62427e55442a46d7af3500.

Reverted https://github.com/pytorch/pytorch/pull/137904 on behalf of https://github.com/clee2000 due to failing internal tests D64418200, some results not within tolerance? ([comment](https://github.com/pytorch/pytorch/pull/137904#issuecomment-2418122735))
2024-10-16 23:16:44 +00:00
ea2726452a add myself as codeowner in aot_autograd (#138075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138075
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #136670
2024-10-16 22:41:39 +00:00
a682194a11 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
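
An illustrative version of that check (standalone sympy, not the exact Inductor code):

```python
import sympy

def statically_known_one(expr, guard_exprs):
    """Return True if some recorded guard literally states Eq(expr, 1)."""
    for g in guard_exprs:
        if isinstance(g, sympy.Equality) and g.lhs == expr and g.rhs == 1:
            return True
    return False

s2, s3 = sympy.symbols("s2 s3", positive=True, integer=True)
size = 64 // (s2 * s3)            # a symbolic size that may or may not be 1
guards = [sympy.Eq(size, 1)]      # guard produced earlier by fake-tensor prop
print(statically_known_one(size, guards))  # True
```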

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-10-16 22:41:39 +00:00
56379e2c17 Remove an unused variable in _subclasses.fake_impls (#138085)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138085
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-10-16 22:41:04 +00:00
0bfa1bf21d [scan] support closure (#135602)
This PR adds an additional_inputs argument to support closures similar to what we've done for while_loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135602
Approved by: https://github.com/zou3519
ghstack dependencies: #135600, #135601
2024-10-16 22:28:03 +00:00
819d6b139c [scan] flatten subgraph output and make subgraph inputs to be a slice (#135601)
This PR introduces the following changes:
1. Before this PR, the subgraph's output was ([], []); in this PR, we change it to a flattened list for easier codegen and consistency with other control flow operators.

2. Before this PR, the combine_fn of scan took a sliced input but kept the sliced dimension. For example, suppose xs = torch.randn(3, 4, 5) and we scan over dim 0; the combine_fn looked like:
```
# x.shape = (1, 4, 5) instead of (4, 5)
def combine_fn(carry, x):
  ...
```

In this PR, we fix this and also simplify some of the slicing logic.

3. This PR also makes sure we always stack ys on the first dimension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135601
Approved by: https://github.com/zou3519
ghstack dependencies: #135600
2024-10-16 22:28:03 +00:00
0437a22d43 [scan] fix typo in signature and remove wrapper (#135600)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135600
Approved by: https://github.com/zou3519
2024-10-16 22:27:59 +00:00
443472b1ca [AOTI] Remove explicit abi_compatible setting in tests (#138016)
Differential Revision: [D64439674](https://our.internmc.facebook.com/intern/diff/D64439674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138016
Approved by: https://github.com/malfet
ghstack dependencies: #137982
2024-10-16 21:35:46 +00:00
6bc57549f9 [AOTI] Remove non-ABI-compatible tests (#137982)
Summary: Remove non-ABI-compatible mode tests since ABI-compatible has been turned on as default. Also clean up tests that explicitly set ABI-compatible to True.

Differential Revision: [D64439673](https://our.internmc.facebook.com/intern/diff/D64439673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137982
Approved by: https://github.com/malfet
2024-10-16 21:35:46 +00:00
a040c4a260 Use std::move on stringstream to prevent unnecessary copy. (#138065)
- Takes advantage of C++20's improved handling of move semantics for std::basic_stringbuf.
- Reduces unnecessary copying and improves memory efficiency, especially for long formatted strings.

Benchmark(proof of concept): https://quick-bench.com/q/qohAu0ARH3vSDyKVsoKEfXOO6BI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138065
Approved by: https://github.com/Skylion007
2024-10-16 21:35:10 +00:00
b72ff35f22 [c10d][ez] Add more inline comments to CUDAEventCache code (#138079)
Address @kwen2501 's feedback in https://github.com/pytorch/pytorch/pull/138048, add more inline comments to the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138079
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048, #138059
2024-10-16 20:43:28 +00:00
f2c96f5d87 Add AOTI test (#138043)
Summary:
Add back the test that was removed in D63916320.

It should work now as D64361273 added back the workspace change.

Test Plan: CI

Differential Revision: D64442054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138043
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-10-16 20:41:07 +00:00
f95ddf0b31 [c10d] record world size in log (#138044)
Summary:
Record the world size in the log and in the Scuba table.
This helps us quickly figure out if there are flight recorder files missing from some ranks.

Test Plan: Ran locally and noted that size was logged to scuba

Differential Revision: D64442949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138044
Approved by: https://github.com/Skylion007
2024-10-16 20:14:02 +00:00
24ee4af86b Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))
2024-10-16 20:05:38 +00:00
a0a978ce23 [aoti config] add raise_error_on_ignored_optimization (#138035)
Summary: Unfortunately this means adding another config.

Test Plan: ci

Differential Revision: D64437699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138035
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-10-16 18:38:47 +00:00
f1c741dbe9 Fixes GuardOnDataDependentSymNode error in masked_fill (#137060)
Fixes [P1621441513](https://www.internalfb.com/phabricator/paste/view/P1621441513) ([ref to internal post](https://fb.workplace.com/groups/6829516587176185/posts/1051474609896021/?comment_id=1055262166183932&reply_comment_id=1056583932718422))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137060
Approved by: https://github.com/ezyang
2024-10-16 18:16:33 +00:00
f173623bb2 [td] try catch exception, do not run td if not results (#138087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138087
Approved by: https://github.com/wdvr
2024-10-16 18:04:25 +00:00
dabe2a3c3b [Torch] Support meta device in random.fork_rng (#137715)
Summary:
## Why
random.fork_rng doesn't support meta device:
```
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/aps_models/ads/tools/memory_estimator/estimation_dense.py", line 655, in estimate_dense_memory_size
[rank0]:     losses.sum().backward()
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 604, in backward
[rank0]:     return handle_torch_function(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/overrides.py", line 1718, in handle_torch_function
[rank0]:     result = mode.__torch_function__(public_api, types, args, kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/_device.py", line 106, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/_tensor.py", line 613, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/__init__.py", line 347, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/autograd/graph.py", line 825, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1125, in unpack_hook
[rank0]:     frame.recompute_fn(*args)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/utils/checkpoint.py", line 1507, in recompute_fn
[rank0]:     with torch.random.fork_rng(
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/runtime/lib/python3.10/contextlib.py", line 135, in __enter__
[rank0]:     return next(self.gen)
[rank0]:   File "/data/users/lyu1/fbsource/buck-out/v2/gen/fbcode/581363ebaea3320a/aps_models/ads/tools/memory_estimator/__memory_estimator__/memory_estimator-inplace#link-tree/torch/random.py", line 153, in fork_rng
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: torch has no module of `meta`, you should register a module by `torch._register_device_module`.
```

This blocks us from running backward() on a model with checkpointing enabled in meta mode.

## What
This diff handles the case of meta device in random.fork_rng.
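
A minimal sketch of the call path that used to raise; after this change fork_rng should tolerate the meta device (which has no RNG state to fork):

```python
import torch

with torch.random.fork_rng(device_type="meta"):
    x = torch.empty(2, 3, device="meta")
print(x.shape)  # torch.Size([2, 3])
```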

Test Plan: Tested with toy model which has checkpoint on its module: P1641201046

Differential Revision: D64161410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137715
Approved by: https://github.com/kit1980
2024-10-16 18:00:39 +00:00
a47bb4a393 Fix autocast for non-strict export (#137495)
Summary:

add testing for autocast and set_grad nodes for export_for_training. In export_for_training, we do not wrap the autocast and set_grad node in to HOP, but we should still have the set_grad_enabled/autocast nodes.

Add support for autocast in non-strict export. Previously, `_enter_autocast` and `_exit_autocast` nodes didn't show up in the export graph when we used `strict=False`.

- In autocast's enter and exit functions, we dispatch to `PreDispatchTorchFunctionMode.__torch_function__`. If we have PreDispatchTorchFunctionMode in our function_mode_stack, the call stack looks like below. This is mostly the same call stack as strict mode, except strict mode enters [here](https://www.internalfb.com/code/fbsource/[0d4f1135cacdb26c6e01d5dce1ce52a15d61ee48]/xplat/caffe2/torch/_dynamo/variables/ctx_manager.py?lines=806).
```
- torch.amp.autocast.__enter__()'s torch.overrides.handle_torch_function
- torch.fx.experimental.proxy_tensor.TorchFunctionMetadataMode.__torch_function__
- torch.amp._enter_autocast()'s torch.overrides.handle_torch_function
- PreDispatchTorchFunctionMode.__torch_function__
```
- In `PreDispatchTorchFunctionMode.__torch_function__`, we create the autocast nodes.
- To match the strict-mode behavior, we let the input node to the `_exit_autocast` node be the corresponding `_enter_autocast` node. This requires us to maintain a stack in `PreDispatchTorchFunctionMode`.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_autocast
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_with_set_grad
```

Differential Revision: D64016023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137495
Approved by: https://github.com/bdhirsh
2024-10-16 17:39:00 +00:00
7ba706c74e update get start xpu (#137479)
1. Respect the feedback from the community: downgrade "Beta" to "Prototype" for the first XPU release with wheels.
2. Add wheel installation of torchaudio & torchvision for nightly builds on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137479
Approved by: https://github.com/atalman, https://github.com/malfet
2024-10-16 17:36:29 +00:00
7e704c2073 [c10d] Add unit test for CUDAEventCache to ensure caching is working (#138059)
We created a simple test to validate that the cache is working, including the case where the cache is used up. Reverting the fix in (https://github.com/pytorch/pytorch/pull/138040) makes the test fail, as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138059
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040, #138048
2024-10-16 17:34:57 +00:00
dd32a32cb6 Revert "Expose option to disable CRC-32 computation during torch.save (#137735)"
This reverts commit 534fa96f2d9a4feb1dcdfaecb3d73990db60f819.

Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))
2024-10-16 17:03:06 +00:00
2b7c7a20b9 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 16:42:57 +00:00
0a6c40faba Fix constant returning (#137993)
When the constants are used twice in the exported graph (the second one is returned as an output), the constant-lifting pass doesn't account for the second one being the output. This PR fixes that.

Differential Revision: [D64406108](https://our.internmc.facebook.com/intern/diff/D64406108/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137993
Approved by: https://github.com/avikchaudhuri
2024-10-16 16:42:09 +00:00
189c95457d [PyTorch] Don't hardcode 4 * Vec::size() in vectorized_reduction (#138014)
This will break once we support 128-bit vectors, and there's no reason to do it.

Differential Revision: [D64421982](https://our.internmc.facebook.com/intern/diff/D64421982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138014
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #137722
2024-10-16 16:41:59 +00:00
a12c859b00 [PyTorch] Check defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256) instead of defined(CPU_CAPABILITY_NEON) (#137722)
The CPU_CAPABILITY system is for rebuilding kernels multiple times with different vector ISA targets. CPU_CAPABILITY_NEON was not being used for that, just as an extra flag for inductor. As a result, CPU_CAPABILITY_NEON-gated code was unnecessarily unavailable outside inductor. Fixes #137704

Differential Revision: [D64197046](https://our.internmc.facebook.com/intern/diff/D64197046/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137722
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-16 16:41:59 +00:00
361f42bc42 Revert "[compiled autograd] Compiled autograd configs in TLS (#137821)"
This reverts commit 9aba0b91c8df4a15654f9ccc02abca31bdd81650.

Reverted https://github.com/pytorch/pytorch/pull/137821 on behalf of https://github.com/wdvr due to Reverting this for now, it is failing test_public_bindings in trunk ([comment](https://github.com/pytorch/pytorch/pull/137821#issuecomment-2417351788))
2024-10-16 16:38:29 +00:00
af27f7888b [dynamo] Remove an unused variable in AOTDispatchAutograd (#137989)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137989
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-16 16:37:19 +00:00
753ba5d30a Move basic dependencies install to requirements-ci (#138024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138024
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992, #138023
2024-10-16 16:21:33 +00:00
4c8718d8e7 [dynamo] add torch.compiler.set_stance (#137504)
Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771.

Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly.

See added tests for usage examples.
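
A hedged usage sketch based on the description above (the stance strings shown are assumed from the feature description, not copied from the PR):

```python
import torch

@torch.compile
def fn(x):
    return x * 2

torch.compiler.set_stance("force_eager")   # compiled regions now run eagerly
fn(torch.randn(4))
torch.compiler.set_stance("default")       # restore normal compilation
fn(torch.randn(4))
```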

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504
Approved by: https://github.com/yf225, https://github.com/jansel
2024-10-16 16:18:25 +00:00
960c3bff98 [c10d] Refactor CUDAEventCache Create to use deque rather than stack (#138048)
We used a LIFO stack to store the CUDA events in the cache. We somehow like a FIFO deque better, so aside from improving the readability of the code, we use a deque instead. As @wconstab pointed out, both methods are equally correct, because the moment we put an event into the stack/deque it is already ready for reuse; this change is mostly a preference change and is not trying to fix anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138048
Approved by: https://github.com/kwen2501
ghstack dependencies: #138040
2024-10-16 14:44:39 +00:00
932ae131fb Remove an unused variable in _inductor/codegen/simd.py (#138000)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138000
Approved by: https://github.com/Skylion007
2024-10-16 13:54:21 +00:00
f3d7a02716 Avoid some dangling reference warnings (#132535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132535
Approved by: https://github.com/aaronenyeshi
2024-10-16 13:41:12 +00:00
0c63de9755 [dynamo] Remove an unused variable in AutogradFunctionApplyVariable (#137985)
----

* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137985
Approved by: https://github.com/zou3519
2024-10-16 13:08:45 +00:00
15722debfb Remove two unused variables in _functorch/partitioners.py (#137998)
* Extracted from https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137998
Approved by: https://github.com/Skylion007
2024-10-16 10:58:31 +00:00
9aba0b91c8 [compiled autograd] Compiled autograd configs in TLS (#137821)
Multithreading doesn't work yet; this adds Python-side TLS only for the Python-side state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137821
Approved by: https://github.com/jansel, https://github.com/yf225
ghstack dependencies: #137953
2024-10-16 09:28:32 +00:00
af91661368 [compiled autograd] directly use python Logger class in cpp (#137953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137953
Approved by: https://github.com/jansel, https://github.com/yf225
2024-10-16 09:28:32 +00:00
7f88bf96f9 test_execution_trace.py: Use instantiate_device_type_tests to run GPU tests on HPU as well (#133975)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

**CHANGES**

- Add support for HPU devices within the payload function.
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Expand the supported_activities() function to include checks for torch.profiler.ProfilerActivity.HPU.
- Apply skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133975
Approved by: https://github.com/briancoutinho, https://github.com/aaronenyeshi
2024-10-16 07:53:06 +00:00
deaf0418b2 [2/N] Fix clang-tidy warnings in torch/csrc/api/ (#136998)
Follows #134545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136998
Approved by: https://github.com/ezyang
2024-10-16 07:50:59 +00:00
f4158558aa [c10d] disable watchdog thread in blockingWait mode (#138001)
Summary:
Blocking-wait mode is not widely used; it is probably most useful for debugging.
In blockingWait mode, we don't need to enable the watchdog thread to
check for timeouts or NCCL errors, because the main thread throws an
exception if an error happens. It is then obvious to the user which work failed,
and it is the user's responsibility to handle the exception.
Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138001
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #137799
2024-10-16 07:42:22 +00:00
78632b97b1 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))
2024-10-16 07:26:34 +00:00
7480e6938d [inductor] Add LoopBody.op_counts (#137945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137945
Approved by: https://github.com/eellison
ghstack dependencies: #137946
2024-10-16 06:35:10 +00:00
0d7b2118ed [inductor] Refactor triton dtype helpers (#137946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137946
Approved by: https://github.com/eellison
2024-10-16 06:35:10 +00:00
97f7fc1d31 Support retry when building Docker images (#138012)
Similar to https://github.com/pytorch/test-infra/pull/5759, I'm seeing flaky network errors from time to time when building Docker images, for example https://github.com/pytorch/pytorch/actions/runs/11352439248/job/31575206417.

So, add retries to mitigate this class of flaky failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138012
Approved by: https://github.com/atalman
2024-10-16 06:10:41 +00:00
084657e012 [c10d] Fix data corruption bug after CUDAEventCache is enabled (#138040)
Here is why using `CUDAEventCache` causes crashes and data corruption.
1. The deleter does its job and appends the event to the stack.
2. In create, instead of getting a reference, we get a copy of eventsArray_[i] (which is a std::vector). This is bad because we never actually remove the element from the stack: we think we have popped the last one off, but it turns out it is still in the stack, so we end up reusing the same event again and again. Worse, since we keep adding new events to the stack, the stack eventually blows up and a crash happens.

The fix is easy: just take a reference. A local torchtitan run sees a non-NaN loss.

We also want to use a deque instead of a stack and refactor the code a bit to make it more readable (in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138040
Approved by: https://github.com/kwen2501, https://github.com/shuqiangzhang
2024-10-16 05:20:29 +00:00
f43c4d28b8 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere, https://github.com/eqy
2024-10-16 05:03:08 +00:00
60b4858977 [BE][Docker] Don't update scikit-learn (#138023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138023
Approved by: https://github.com/huydhn
ghstack dependencies: #137991, #137992
2024-10-16 05:01:40 +00:00
7f6e85bb93 [BE] Move numpy installation logic to requirements-ci.txt (#137992)
And slightly adjust the versioning logic, as the current one seems to exist to hide version conflicts:
 - 1.21.2 for Python-3.9
 - 1.24.2 for Python-3.10 (to resolve conflict with numba-0.55.2)
 - 1.26.2 for Python-3.11 or 3.12
 - 2.1.2 for Python-3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137992
Approved by: https://github.com/Skylion007, https://github.com/huydhn
ghstack dependencies: #137991
2024-10-16 04:30:29 +00:00
12f4d91e84 Enable Python-3.13 builds on MacOS (#138037)
All logic changes happen in builder repo, namely:
 - a01e87535b
 - bcd0972459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138037
Approved by: https://github.com/huydhn
ghstack dependencies: #138041
2024-10-16 04:24:12 +00:00
66b39fd474 refactor KERNEL_MPS via resuing KERNEL (#137831)
# Motivation
Reuse `KERNEL` to simplify `KERNEL_MPS` for mps autocast code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137831
Approved by: https://github.com/malfet
2024-10-16 03:54:13 +00:00
2c94c54f10 Export XPU libs to be public (#136974)
# Motivation
Export XPU-related libs to be public. Now they are included in `TORCH_LIBRARIES`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136974
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-10-16 03:41:01 +00:00
80f3ee41dc [SymmetricMemory] fix incorrect numel caculations that are using int as std::accumulate's accumulator (#138038)
Fixes https://github.com/pytorch/pytorch/pull/137567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138038
Approved by: https://github.com/weifengpy
2024-10-16 03:34:26 +00:00
75109682b6 [Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783)
NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`; let me know if there are any concerns.

`ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses mostly the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This PR folds that implementation into `Interleaved1F1B` and then removes `ScheduleFlexibleInterleaved1F1B`, since it is confusing to have two schedules with similar names. It also moves the zero-bubble logic into the `ZeroBubble` schedule class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783
Approved by: https://github.com/wconstab
2024-10-16 03:05:14 +00:00
809ff3b274 Add host-side Triton TMA support to Dynamo (#137677)
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:

- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of the time of writing, this is not part of the PT2 OSS Triton pin (although it has been back-ported internally). The OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for what the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.

Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
2024-10-16 02:18:48 +00:00
dd2ae7d0c9 [BE] Use x in [foo, bar] (#138041)
As shorthand for `x == foo or x == bar`
And `x not in [foo, bar]` as shorthand for `x != foo and x != bar`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138041
Approved by: https://github.com/huydhn
2024-10-16 01:57:37 +00:00
64ccebd2e0 update labeler for module: compiled autograd (#137954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137954
Approved by: https://github.com/yf225
2024-10-16 01:56:21 +00:00
aa28062169 [ROCm] TunableOp more unit test follow-up - Part 2 (#134517)
More unit tests to cover TunableOp functionality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134517
Approved by: https://github.com/jeffdaily
2024-10-16 01:49:47 +00:00
7fa7333299 [Distributed][Test] Fix todo in distributed test files (#136836)
Refactor distributed test code:
- Fix TODO: (rohan-varma): remove model
- Fix TODO: add comments for TestTraverse
- Migrate deprecated method calls `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136836
Approved by: https://github.com/kwen2501
2024-10-16 01:15:12 +00:00
a1b22e369b [c10d] add an API to get the future result(success or failure) of a collective and customize error handling (#137799)
Summary:
This PR lets users know exactly which collective call issued from the Python thread is failing, and
lets them customize their own error-handling function instead of the watchdog thread crashing everything.

This is potentially very useful in fault-tolerant training, where we can do an in-process restart.
E.g., when an NCCL error is detected, users can abort comms, re-init comms, go back to the previously checkpointed step, and try again, instead of crashing the whole job.

This allows users to check the status of each collective call,
using the ivalue::Future library in PyTorch core. It also allows users to
attach their customized failure-handling functions via:
work.get_future_result().then(error_handling_func)

Note that the above call is also non-blocking for the CPU thread.
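
A hedged sketch of the flow described above (assumes an initialized default process group; `error_handling_func` is the illustrative name from the summary, and the API is the `get_future_result()` call this PR adds):

```python
import torch
import torch.distributed as dist

def error_handling_func(fut):
    # Inspect the propagated work result (success or failure) and recover,
    # e.g. abort and re-init comms, instead of letting the watchdog crash the job.
    print("collective finished with:", fut.value())

tensor = torch.ones(4, device="cuda")
work = dist.all_reduce(tensor, async_op=True)
work.get_future_result().then(error_handling_func)  # non-blocking on the CPU thread
```
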
Test Plan:
Added a new test, test_get_future_result, to verify that the work result is
correctly propagated to users.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-10-16 00:20:09 +00:00
8d9c9727c0 aten | Fix set but unused variables warning in release builds. (#138008)
Summary: Fixing a warning that happens only in release builds.

Test Plan: Sandcastle + dependent diffs

Reviewed By: boguscoder

Differential Revision: D64415854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138008
Approved by: https://github.com/boguscoder, https://github.com/Skylion007
2024-10-16 00:05:39 +00:00
46ec4ad021 Add code pointer to internal Meta implementation (#137984)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137984
Approved by: https://github.com/albanD
2024-10-15 23:35:22 +00:00
4557f6e339 Revert "[Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)"
This reverts commit bf0b67059882933574f71a3b11b2f0127915ee5b.

Reverted https://github.com/pytorch/pytorch/pull/137669 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing test_public_bindings in trunk, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/137669#issuecomment-2415331274))
2024-10-15 23:22:58 +00:00
19665f4619 [fake_tensor][cache] Supports ops with tuple of output tensors (#137935)
This is needed for invoke_subgraph work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137935
Approved by: https://github.com/masnesral
2024-10-15 22:15:07 +00:00
5d5783a263 Improve the scheduling of _pipelined_multi_all_gather_and_consume (#137850)
```
Parallelization strategy: after each rank copies its shard into its local
p2p buffer, every rank issues independent p2p copy -> shard_consumer
sequences to two streams. In addition to computation/communication
overlapping, the strategy allows for computation/computation overlapping,
greatly reducing quantization inefficiency.

Notation:
- "mv" for the copy to local buffer
- "cp" for p2p copies
- "b" for barriers

Constraints:
- The GPU scheduler may or may not overlap "mv" with the first shard_consumer.
- "cp" from different streams cannot overlap.

Ideal scenario 0 - "mv" overlaps with the first shard_consumer:

stream 0: [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Ideal scenario 1 - "mv" is scheduled before the first shard_consumer:

stream 0:       [ shard_consumer ][ cp ][ shard_consumer ]
stream 1: [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "mv" is scheduled after the first shard_consumer:

stream 0: [ shard_consumer ]               [ cp ][ shard_consumer ]
stream 1:                   [ mv ][b][ cp ][ shard_consumer ]

Suboptimal scenario 0 - "b" is scheduled after the first shard_consumer:

stream 0:       [ shard_consumer ]         [ cp ][ shard_consumer ]
stream 1: [ mv ]                  [b][ cp ][ shard_consumer ]

We haven't yet figured out a way to ensure "mv" and "b" are either
overlapped with or scheduled before the first shard_consumer. Thus, to
prevent suboptimal scenarios, we are giving up the chance to overlap "mv"
and "b" with the first shard_consumer for now.
```

This PR improves the scheduling for mm kernels with high SM utilization. The GPU scheduler tends to not overlap local DtoD copies with such kernels, which leads to suboptimal scheduling. The following is an example of pipelining PyTorch's cutlass-based, row-wise scaling fp8 kernel:

Before this PR:
<img width="298" alt="image" src="https://github.com/user-attachments/assets/81e0a7f4-18ee-47c6-b258-04fdaca7a6a2">

With this PR:
<img width="253" alt="image" src="https://github.com/user-attachments/assets/982de5a8-da1e-4a8f-b67e-c9c869b0a77f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137850
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805, #137836
2024-10-15 21:35:14 +00:00
2ae1a4caa1 Improve the scheduling of _pipelined_produce_and_all2all (#137836)
```
Parallelization strategy: every rank issues independent compute
-> barrier -> p2p copy sequences on two streams. In addition to
computation/communication overlapping, the strategy allows for
computation/computation overlapping, greatly reducing
quantization inefficiency.

Ideally, stream activities would look like this ("b" for
barriers, "cp" for p2p copies):

[rank 0]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b][ cp ][  chunk_producer ][b][ cp ]
stream 1: [  chunk_producer  ][b][ cp ][  chunk_producer  ][b][ cp ]

Note that the barriers synchronize streams with the same ID
across ranks. They don't synchronize streams on the same rank.

Since the work on both streams is independent, there's no
guarantee that the chunk_producer from stream 0 or stream 1 will
be scheduled first. If there is a scheduling mismatch across
ranks, the barrier forces all ranks to wait for the slowest.

When scheduling mismatches occur among ranks, the stream
activities might look like this (note that p2p copies from
different streams cannot overlap with each other):

[rank 0]
stream 0: [  chunk_producer  ][b        ][ cp ][  chunk_producer ][b       ][ cp ]
stream 1:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]

[rank 1]
stream 0:         [  chunk_producer  ][b]      [ cp ][  chunk_producer  ][b]      [ cp ]
stream 1: [  chunk_producer  ][b        ][ cp ][  chunk_producer  ][b      ][ cp ]

To prevent this, we need to ensure that the chunk_producer on
stream 1 gets scheduled first on every rank. Without access to
the underlying kernels, CUDA offers no API to control the
scheduling order of two independent, overlapping kernels. Our
solution is to issue a small sleep kernel in stream 0. The sleep
duration is insignificant, but having an extra task in stream 0
will almost guarantee that the chunk_producer on stream 1 gets
scheduled first. Once the first chunk_producer is scheduled in
the correct order, there's very little room for the scheduling
order of subsequent kernels to be inconsistent across ranks.
```

Currently, we perform stream synchronization to ensure scheduling order. The stream synchronization has no bearing on correctness, but prevents inconsistent scheduling orders across ranks.

Without the stream synchronization, ranks may have inconsistent scheduling order, and the barriers cause all ranks to wait for the slowest rank:
<img width="379" alt="image" src="https://github.com/user-attachments/assets/ffb97e76-7e19-4449-b121-83c32ec3e91d">

With stream synchronization, the inconsistent scheduling order issue is addressed, but we lose compute/compute overlapping (this is the state before this PR):
<img width="378" alt="image" src="https://github.com/user-attachments/assets/4cb76246-625f-4fc1-b49a-823ae46d3f23">

With this PR, we get both consistent scheduling order across ranks and compute/compute overlap:
<img width="327" alt="image" src="https://github.com/user-attachments/assets/51ab1bdc-4f60-46e0-b53c-6d208e2d4888">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137836
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738, #137805
2024-10-15 21:35:14 +00:00
ef541c1a65 [fused_all_gather_scaled_matmul] support rowwise scaling (#137805)
This PR add support for `A_scale` to be row-wise scale. The op can automatically detect whether the row-wise scale is sharded or replicated. When the row-wise scale is sharded, the op would all-gather the scale in a pipelined fashion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137805
Approved by: https://github.com/weifengpy
ghstack dependencies: #137643, #137738
2024-10-15 21:35:14 +00:00
05edaeaded [fused_scaled_matmul_reduce_scatter] support rowwise scaling (#137738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137738
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137643
2024-10-15 21:35:14 +00:00
91bc9dc2c9 [SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643)
Suggested by @lw for better safety/reliability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137643
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-10-15 21:35:14 +00:00
eaec72d1e6 Link directly to new Custom Ops Landing Page (#137933)
e.g., click on first link in https://docs-preview.pytorch.org/pytorch/pytorch/137933/library.html#testing-custom-ops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137933
Approved by: https://github.com/zou3519
2024-10-15 21:18:21 +00:00
aef4317ec8 [c10d] socket: retry connection timeout failures (#138003)
This will retry connection timeout failures up to the timeout duration. Under heavy load the server may not be able to immediately accept the connection. In such a case we do want to retry the connection rather than fall back to IPv4 for the remainder of the connection timeout.

The connection timeout here is not the same as the c10d timeout which appears to be higher. We could adjust the linux timeout directly but using the c10d retry loop keeps things more consistent and gives us things like exponential backoff, logs, etc.

Example failure:
```
 socket.cpp:752] [c10d] The client socket has failed to connect to [...]:29400 (errno: 110 - Connection timed out).
 socket.cpp:752] [c10d] The IPv4 network addresses of (..., 29400) cannot be retrieved (gai error: -2 - Name or service not known).
... repeats ipv4 connection failure
```

From Linux man page: https://man7.org/linux/man-pages/man2/connect.2.html
```
ETIMEDOUT
              Timeout while attempting connection.  The server may be
              too busy to accept new connections.  Note that for IP
              sockets the timeout may be very long when syncookies are
              enabled on the server.
```

Test plan:

CI for backwards compatibility

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138003
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/rsdcastro
2024-10-15 21:17:05 +00:00
bf0b670598 [Dynamo] Disable torch function compilation during guard execution and in compiled bytecode (#137669)
Fixes https://github.com/pytorch/pytorch/issues/114369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137669
Approved by: https://github.com/anijain2305
2024-10-15 20:52:58 +00:00
28a521e29a [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow (size 4) in c10::IValue::IValue() (#137924)
Summary: Calling `pop()` on empty stack

Test Plan: CI

Differential Revision: D64332420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137924
Approved by: https://github.com/Skylion007
2024-10-15 20:42:47 +00:00
3ecec0c90c skip lintrunner install on Windows. (#137981)
`lintrunner` does not support Windows x64. Ref: https://pypi.org/project/lintrunner/#files

When we install the Python dependencies via `pip install -r requirements.txt` on Windows x64, it fails on `lintrunner`.
<img width="887" alt="image" src="https://github.com/user-attachments/assets/e3815177-e893-41ae-96af-8b39d12f74a7">

Solution: skip installing `lintrunner` on Windows.
Reference doc: https://peps.python.org/pep-0508/#environment-markers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137981
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2024-10-15 20:37:26 +00:00
35fc24fbed [PGNCCL] Fix bugs in non-blocking mode (#137741)
### Fix 1: Throw async error during init wait

Previously we just busy wait for `ncclSuccess`, if the nonblocking init encountered error, we never report that. Added detection of async error via `ncclGetAsyncError`.

### Fix 2: Add wait after comm split

```
  // After calling ncclCommSplit in non-blocking mode, we should wait for the
  // source communicator to be out of ncclInProgress state.
  // Reason 1:
  //   it's unsafe to call new operations on the parent comm while it's in
  //   ncclInProgress state.
  // Reason 2:
  //   as of NCCL 2.23, the ptr value of child comm will not be filled until the
  //   state of parent comm is ncclSuccess. This may change in the future. See:
  //   https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, neither does it block till that point.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
2024-10-15 20:35:39 +00:00
370d66d7dd aten/buck | Appropriately convert clang => msvc compiler_flags. (#137944)
Summary:
fPIC is not available in clang on Windows - filter it out.
Also configure the flags appropriately for MSVC.

Reviewed By: rameshviswanathan

Differential Revision: D64365660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137944
Approved by: https://github.com/mwdavis84, https://github.com/ChristianK275, https://github.com/boguscoder
2024-10-15 20:21:01 +00:00
487873f7ca [Inductor]: Support updated Triton AttrsDescriptor (#137757)
The Triton `AttrsDescriptor` object was refactored in https://github.com/triton-lang/triton/pull/4734. These changes add support for the new `AttrsDescriptor` while maintaining backwards compatibility with the existing version. The main changes are different names for the initialization of the descriptor parameters, and creation via a static method instead of the class constructor.

Depends on #137458 which removes some unused logic around the old descriptor. Those changes make this PR cleaner, but if for some reason that old logic is still used I can make adjustments.

Use of the new `AttrsDescriptor` depends on https://github.com/triton-lang/triton/pull/4888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137757
Approved by: https://github.com/jansel
2024-10-15 19:34:59 +00:00
534fa96f2d Expose option to disable CRC-32 computation during torch.save (#137735)
The option only works in open source, not internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735
Approved by: https://github.com/albanD
2024-10-15 19:30:02 +00:00
3cc8c8b944 [FSDP2] Add set_unshard_in_backward(bool) (#137922)
For some expert use cases, the user knows some parameters are not required for backward, so we can skip the unshard in backward. One example is the embedding weight.
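
A hedged usage sketch (the method name comes from this PR's title; the import path and toy model are assumptions, and an initialized process group/device mesh is assumed):

```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(1000, 64)
        self.proj = nn.Linear(64, 64)

    def forward(self, idx):
        return self.proj(self.emb(idx))

model = Tiny()
fully_shard(model.emb)
fully_shard(model)

# The embedding weight is not needed to compute gradients in backward,
# so skip re-unsharding it there.
model.emb.set_unshard_in_backward(False)
```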

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137922
Approved by: https://github.com/weifengpy
2024-10-15 19:11:14 +00:00
60cf72e028 enable auto functionalize v2 by default (#136685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136685
Approved by: https://github.com/zou3519
ghstack dependencies: #137760
2024-10-15 19:04:42 +00:00
05b6200ccd Do not compute base in export mode (#137760)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137760
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-10-15 19:04:42 +00:00
f5e38f65c5 [FlexAttention] Support training bias for eager (#136910) (#137526)
This PR is Part 2 of the implementation started in https://github.com/pytorch/pytorch/pull/136910 and rolls in the updates from https://github.com/pytorch/pytorch/pull/137451. The original was reverted due to calls to `@torch.library` at `import torch` time, so a call was added to register at the first call to `ModIndex`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137526
Approved by: https://github.com/Chillee, https://github.com/zou3519
2024-10-15 18:55:22 +00:00
cd292908e5 Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c48fe8901114aa2b0a9c2d77f915a2ad8ab2098b.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/clee2000 due to breaking some internal tests, probably usages of string_view that need to be changed? ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2414775064))
2024-10-15 18:55:09 +00:00
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
b09d6f3a7d [EZ][BE] Delete 3.8 specific checks (#137991)
As we no longer support 3.8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137991
Approved by: https://github.com/Skylion007
2024-10-15 18:45:49 +00:00
524fe784ec BundledAutotuneCache (take 2) (#137902)
Summary:
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches: on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Attempt 2 of #134959 (D60677499).

Various configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version
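
A hedged sketch of flipping these knobs (the `torch._inductor.config` attribute path is an assumption based on the config name listed above):

```python
import os

# Environment-variable knob listed above
os.environ["TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE"] = "1"

# Equivalent config knob (attribute path assumed from the config name)
import torch._inductor.config as inductor_config
inductor_config.bundled_autotune_remote_cache = True
```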

Test Plan:
unit tests

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

- on a cold run we got 0 hits and 40 misses. On a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0} <<<<<<
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D64336043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137902
Approved by: https://github.com/oulgen
2024-10-15 18:39:47 +00:00
bf77f52895 Fix memory leak on masked Tensor (#137890)
Note that this reverts the change from https://github.com/pytorch/pytorch/pull/137815 as well which is not needed anymore!

Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle goes through the autograd graph, which we cannot traverse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137890
Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007
2024-10-15 18:37:55 +00:00
0b7ef196cd Use filelock to build extension_device backend one at a time (#137930)
Fixes https://github.com/pytorch/pytorch/issues/136125
Fixes https://github.com/pytorch/pytorch/issues/137026
Fixes https://github.com/pytorch/pytorch/issues/137027

The compilation fails during `setUpClass`, so disabling the test doesn't accomplish anything.  My theory for this flaky issue is that `test_open_device_registration` from both `TritonExtensionBackendTests` and `ExtensionBackendTests` runs in parallel, and one is cleaned up while the other is still in flight, causing the flaky failure.

Here is an example failure https://github.com/pytorch/pytorch/actions/runs/11331105492/job/31512603585#step:22:1710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137930
Approved by: https://github.com/malfet
2024-10-15 17:46:28 +00:00
60eb3fccfa Revert "[ONNX] Remove ExportTypes (#137789)"
This reverts commit 3e0b83ad1f0a998ef8a72c5e82d9250ab800cce5.

Reverted https://github.com/pytorch/pytorch/pull/137789 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
2831af39c4 Revert "[ONNX] Remove deprecated export_to_pretty_string (#137790)"
This reverts commit d0628a7e3921639f62d6a6fec9f9f1871e087533.

Reverted https://github.com/pytorch/pytorch/pull/137790 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))
2024-10-15 17:40:06 +00:00
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
d4d687ffb2 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit 4a8e49389c33934234dc89616fd17a58e760e2e7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:16 +00:00
9af4e0d2aa Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit a6eb0205225fce7ba7a75d200566613b84aff4e9.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/clee2000 due to breaking internal tests related to MITA, @ezyang has a forward fix? ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2414588302))
2024-10-15 17:19:15 +00:00
44653895cc override bool(), is_nonzero for real tensor tracing (#136788)
Fixes bool() and is_nonzero() calls for real tensor tracing, non-strict export

Differential Revision: D63482693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136788
Approved by: https://github.com/ezyang
2024-10-15 17:13:44 +00:00
bdbe0cfffa Fix test_binary_ufuncs.py for NumPy 2 (#137937)
Related to #107302

The following tests failed in test_binary_ufuncs.py when testing with NumPy 2.

```
FAILED [0.0050s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support__refs_sub_cpu_float32 - AssertionError
FAILED [0.0048s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_complex64 - AssertionError
FAILED [0.0043s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_scalar_support_sub_cpu_float32 - AssertionError
FAILED [0.0028s] test/test_binary_ufuncs.py::TestBinaryUfuncsCPU::test_shift_limits_cpu_uint8 - OverflowError: Python integer -100 out of bounds for uint8
```

This PR fixes them.

More details:
* `test_shift_limits` failed because `np.left_shift()` and `np.right_shift()` no longer support negative shift values in NumPy 2.
* `test_scalar_support` failed because NumPy 2 changed its dtype promo rules. We special-cased the incompatible cases by changing the expected dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137937
Approved by: https://github.com/albanD
2024-10-15 17:04:24 +00:00
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specialization for  `calc_i0` and `calc_i1` for `BFloat16` and `Half`
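
A hedged check of the expanded dtype support (values and shapes are arbitrary):

```python
import torch

x = torch.linspace(0, 4, steps=9, dtype=torch.bfloat16)
print(torch.special.i0(x))  # already supported for reduced precision
print(torch.special.i1(x))  # now expected to work for BFloat16/Half on CPU as well
```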

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
4abe38bc94 RMSprop docs: add missing input "epsilon" (#137854)
Add the missing "epsilon" input argument to the docs for RMSprop, like in the docs for AdamW: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137854
Approved by: https://github.com/janeyx99
2024-10-15 16:40:42 +00:00
822aa588bc Fix torch_np/test_basic for NumPy 2 (#137814)
Related to #107302

`TestExport.test_exported_objects` in `test/torch_np/test_basic.py` is failing with NumPy 2.
The test is checking if all methods under `torch._numpy` exist in `numpy`.
However, some of them are removed in NumPy 2.

This PR fixes the issue by not checking the removed methods when running with NumPy 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137814
Approved by: https://github.com/albanD
2024-10-15 16:40:28 +00:00
120fbe9caa Update inductor benchmark time to avoid flakiness (#137900)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137900
Approved by: https://github.com/laithsakka
2024-10-15 16:17:04 +00:00
966a1a971e [ROCm] Add AMDSMI support for UUID input (#129741)
Adds support for for using UUIDs for AMDSMI utilities in PyTorch via CUDA_VISIBLE_DEVICES/HIP_VISIBLE_DEVICES.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129741
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2024-10-15 15:56:30 +00:00
17ed403644 [ROCm] Enable test_triton* in test_sparse_csr suite (#137712)
All test_triton* UTs are now passing on ROCm within test_sparse_csr suite. See logs here: https://ossci-raw-job-status.s3.amazonaws.com/log/31376189926

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137712
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-10-15 15:41:21 +00:00
5689e33cfe [Intel GPU] Fix Windows linkage issue due to invisible structured kernel symbols (#137794)
The Intel GPU ATen library (libtorch_xpu) utilizes `torchgen` to generate structured kernels. Currently, the generated structured kernels are decorated with `TORCH_API` to control visibility, while `TORCH_API` is controlled by the `CAFFE2_BUILD_MAIN_LIB` macro. However, we cannot naively enable `CAFFE2_BUILD_MAIN_LIB` for the Intel GPU ATen library, because the macro does not only serve the `TORCH_API` semantics; without it, the `TORCH_API` semantics mean the symbol is hidden.

https://github.com/pytorch/pytorch/blob/main/c10/macros/Export.h#L95-L99

Therefore, we need to use `TORCH_XPU_API` to decorate the generated structured kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137794
Approved by: https://github.com/atalman
ghstack dependencies: #137873
2024-10-15 15:31:37 +00:00
3361908fc5 torch/ao/quantization/utils.py: Moving eps to targeted device to avoid device mismatch issue (#135204)
MOTIVATION

We recently verified some quantization tests on devices other than CPU (e.g., CUDA and Intel Gaudi devices, identified as 'hpu'). We noticed a device mismatch error, as eps is a tensor created on the CPU while the other tensors (min_val_neg, max_val_pos, scale, zero_point) are moved to the targeted _device_.

CHANGES

Move eps to the _device_ of the other tensors.
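
A hedged, self-contained illustration of the fix (variable names follow the commit message; the observer math is simplified for the example):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
min_val_neg = torch.tensor([-1.0], device=device)
max_val_pos = torch.tensor([1.0], device=device)
scale = (max_val_pos - min_val_neg) / 255

# Creating eps on the same device as the other tensors avoids the
# cpu-vs-accelerator mismatch described above.
eps = torch.tensor([torch.finfo(torch.float32).eps], device=min_val_neg.device)
scale = torch.max(scale, eps)
```
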
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135204
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-10-15 14:58:55 +00:00
cef6c3dcb0 Dont decompose aten.baddmm in inductor (#137904)
Previously the decomposition would upcast inputs to fp32. This led to a slowdown compared to eager, which would run in fp16. We also tried keeping the bmm in fp16 and upcasting only for the epilogue, but that led to worse numerics, because the bmm in eager would do the epilogue entirely in fp32 without a downcast in the bmm accumulator.

Fix for https://github.com/pytorch/pytorch/issues/137897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137904
Approved by: https://github.com/ngimel
2024-10-15 14:54:56 +00:00
b7f798caa4 Use C10_UNUSED instead of (void)X (#137239)
Summary:
Auto-generated with
```
buck run //scripts/rbarnes/regex_multiline_replacer:regex_multiline_replacer -- --find '^(\s*for\s*\()(const.*\n)\s*\(void\)[A-Za-z]+;\s*//\s*Suppress.*\s*\n(.*)'  --replace '\1C10_UNUSED \2\3' `find caffe2/ -regex ".*\.\(cpp\|h\)"`
```

Differential Revision: D33432600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137239
Approved by: https://github.com/Skylion007
2024-10-15 14:32:59 +00:00
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
5141ade8e3 [AMD] Do not skip 0-byte send/recv (#137952)
Summary: With https://github.com/ROCm/rccl/pull/1376, we can remove this hack now, and we have verified that we no longer run into hangs.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-xdwang-900def406a?job_attempt=0&version=1&env=PRODUCTION

Differential Revision: D64370817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137952
Approved by: https://github.com/eqy
2024-10-15 09:35:03 +00:00
b7be4b1e48 [AMD] Turn on fast path for index_put (#136136)
Summary:
This slow path is bad because it has a sync point, which makes the CPU really slow. I'm not sure AMD actually needs this with the newer ROCm version.

{F1870213925}

Test Plan: CI

Differential Revision: D62731130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136136
Approved by: https://github.com/danzimm, https://github.com/jeffdaily, https://github.com/eqy
2024-10-15 08:39:17 +00:00
f42d1b6fa1 Fix Intel GPU test failure due to unsupport bool for unfold (#137873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137873
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-10-15 07:58:51 +00:00
cyy
8c860aef0d [Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942)
Reland of #137328, which was reverted due to reverting a dependent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942
Approved by: https://github.com/eqy
2024-10-15 07:47:24 +00:00
56cc22eb01 [CI][Distributed] Not to test distributed_test.py with UCC (#137932)
Some UCC tests became unstable recently, with or without the M60 to T4 upgrade.
See for example: #137855 (without upgrade), #137161 (with upgrade).
So I am extracting the disablement from #137161 here.

Failure signature:
```
RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:496] [Rank 0][ProcessGroupUCC-0][READY]failed to post triggered collective, error code -6: Unhandled error, system error code 0
```

Earlier discussed here:
https://github.com/pytorch/pytorch/pull/137161/files#r1797353294

Cc: @Aidyn-A @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137932
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/eqy
2024-10-15 07:22:57 +00:00
5b442e8e92 Time torch_key computation in overall Dynamo stats (#137877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137877
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-10-15 05:47:19 +00:00
5c3ba6faff Add fbscribelogger to Dynamo benchmark runner (#137867)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137867
Approved by: https://github.com/bobrenjc93
2024-10-15 04:36:41 +00:00
ed94725b8c log ViewAndMutationMeta to trace_structured (#133784)
I ended up bundling it into the existing tlparse logs for the AOT forward graph, since it looked like registering it as a separate artifact requires changes to tlparse itself (maybe that is wrong though?)

Example new fw AOT graph tlparse output for the below code: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp70zKiO/0_0_0/aot_forward_graph_2.txt

```
import torch

@torch.compile
def f(x):
    out1 = torch.view_as_complex(x)
    out2 = torch.view_as_complex(x)
    return out1, out2, x * 2

x_ = torch.randn(4, 2, requires_grad=True, dtype=torch.float64)
out = f(x_)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133784
Approved by: https://github.com/ezyang
2024-10-15 02:49:02 +00:00
cyy
70206499f1 [3/N] Fix extra warnings brought by clang-tidy-17 (#137552)
Follows #137459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137552
Approved by: https://github.com/ezyang
2024-10-15 02:33:44 +00:00
a6eb020522 Make Context to be Device-agnostic Step by Step (2/N) (#136526)
----

- add new method(getDefaultGenerator, getNewGenerator) into AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2024-10-15 01:53:28 +00:00
b34db401f2 Add support for div in tensorify_python_scalars fx pass (#137623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137623
Approved by: https://github.com/ezyang
2024-10-15 01:49:46 +00:00
8316f9b2a0 Fix autograd function calls without context arg (#137809)
Fixes an issue where if the context arg is not provided, Dynamo would throw an arg mismatch error.
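
A hedged sketch of the kind of call this fixes (assuming it refers to the `setup_context` style of `torch.autograd.Function`, whose `forward` takes no ctx argument):

```python
import torch

class Double(torch.autograd.Function):
    @staticmethod
    def forward(x):                 # no ctx argument in this style
        return x * 2

    @staticmethod
    def setup_context(ctx, inputs, output):
        pass

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2

@torch.compile
def f(x):
    return Double.apply(x)

f(torch.randn(3, requires_grad=True))
```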

The skips are there because Dynamo would previously fall back to eager on those tests due to the arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137809
Approved by: https://github.com/drisspg
2024-10-15 01:25:47 +00:00
a89cf2b59a [dynamo] Don't codegen temporary cells for pre-existing cells (#137907)
This patch removes tempvar codegen for the `NewCellVariable` that has
`AttributeMutationExisting`, because these tempvar will never get used.
Note that tempvar codegen for other objects also follow this pattern,
i.e., it only fires on `AttributeMutationNew`.

To visualize, in the following program, we'll see the modified bytecode
contains redundant `make_cell` calls, and stores the result to a local
`tmp_2` which is never used again.

```python
import torch

def test_write_cell():
    count = torch.ones(1)
    def inc():
        nonlocal count
        count = count + 1

    torch.compile()
    def fn():
        inc()

    fn()

test_write_cell()
```

```
$ TORCH_LOGS="bytecode" TORCH_LOGS_FORMAT="short" python test.py

......
    0 COPY_FREE_VARS           1
    2 RESUME                   0
    4 LOAD_GLOBAL              9 (NULL + __compiled_fn_2)
   14 LOAD_DEREF               3 (inc)
   16 LOAD_ATTR                6 (__closure__)
   36 LOAD_CONST               1 (0)
   38 BINARY_SUBSCR
   42 LOAD_ATTR                4 (cell_contents)
   62 CALL                     1
   70 STORE_FAST               0 (graph_out_0)
   72 LOAD_GLOBAL              0 (__import_torch_dot__dynamo_dot_utils)
   82 LOAD_ATTR                3 (NULL|self + make_cell)
  102 CALL                     0
  110 STORE_FAST               2 (tmp_2)
  112 LOAD_FAST                0 (graph_out_0)
  114 LOAD_CONST               1 (0)
  116 BINARY_SUBSCR
  120 LOAD_DEREF               3 (inc)
  122 LOAD_ATTR                6 (__closure__)
  142 LOAD_CONST               1 (0)
  144 BINARY_SUBSCR
  148 STORE_ATTR               2 (cell_contents)
  158 DELETE_FAST              0 (graph_out_0)
  160 RETURN_CONST             0 (None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137907
Approved by: https://github.com/anijain2305
2024-10-15 00:49:45 +00:00
1cf78bbf62 Refactored debug_extra to be on ChoiceCaller (and called description) (#137857)
Before:
<img width="644" alt="image" src="https://github.com/user-attachments/assets/17b0fa8a-37c8-494b-8914-9d42c3db4bef">

After:
<img width="1292" alt="image" src="https://github.com/user-attachments/assets/5ee59747-a34f-4dd6-b943-cb5a53d52080">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137857
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/masnesral
ghstack dependencies: #137768
2024-10-15 00:48:14 +00:00
3630398509 Move symbolic_shapes create_env back to INFO (#137926)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137926
Approved by: https://github.com/Skylion007
2024-10-15 00:37:01 +00:00
406db6a73d Improve ASAN path detection (#137865)
Follows #137335, for better adoption of latest clang to ASAN jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137865
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 23:54:46 +00:00
aef3591998 [Profiler] Add Test for Clear on Fork (#137511)
Summary: Tests Fix Clear On Fork by forking a process after a profile has already been done. Afterwards we check that all the PID/TID are as expected.

Test Plan: Ran buck2 test 'fbcode//mode/dev' fbcode//caffe2/test:profiler -- --exact 'caffe2/test:profiler - test_forked_process (profiler.test_profiler.TestProfiler)'

Differential Revision: D63992036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137511
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-10-14 23:20:33 +00:00
0786b37260 [MPS] Add i0 op (#137849)
More-or-less verbatim copy of 47c8aa8090/aten/src/ATen/native/Math.h (L101)
Plus a bit of MPS boilerplate code

Update test_mps.py to mark kaiser_window and i0 as passing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137849
Approved by: https://github.com/Skylion007
2024-10-14 22:50:01 +00:00
18587f2427 [BE] Use std::enable_if_t in Math.h (#137920)
PyTorch is C++17 project, so let's use some C++17 convenience methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137920
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-14 22:20:09 +00:00
8ac06467d4 Forward fix test (#137910)
Summary: Add back in a deleted file to fix test

It was removed in https://github.com/pytorch/pytorch/pull/137404

Test Plan: `buck2 build --flagfile fbcode//mode/opt fbcode//caffe2/test/cpp/c10d:ProcessGroupGlooAsyncTest` succeeded

Differential Revision: D64341074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137910
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/kit1980
2024-10-14 22:07:29 +00:00
ad134fe038 Skip doc test internally (#137813)
Summary:
there are some path issues when we run the doc tests internally

https://www.internalfb.com/intern/test/281475143872621

Test Plan: sandcastle

Reviewed By: drisspg, msaroufim

Differential Revision: D64255824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137813
Approved by: https://github.com/HDCharles
2024-10-14 21:29:15 +00:00
7911bf591d [CUDA][Inductor] Fix some bfloat16 tests for SM70 (#137675)
Unsure about the `runtime_checks` changes as that's a pure pattern-match and guess

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137675
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-14 20:42:48 +00:00
6016b8a9be Remove CI/CD python 3.8 requirements (#137893)
Python 3.8 is deprecated from CI/CD. There is no reason to keep these pins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137893
Approved by: https://github.com/Skylion007, https://github.com/huydhn, https://github.com/albanD, https://github.com/kit1980
2024-10-14 20:28:48 +00:00
3b7710316c Revert "cublaslt autotuning support for TunableOp (#133896)"
This reverts commit 19bbbef79da8ed32f72d6e76517cb639d5db6c00.

Reverted https://github.com/pytorch/pytorch/pull/133896 on behalf of https://github.com/clee2000 due to this is breaking internal builds, I've copied what I think is the most relevant part of the log below. I believe the job running internally uses an old version of cuda, could you put guards to make sure compilation still works on an older version of cuda/cublaslt? ([comment](https://github.com/pytorch/pytorch/pull/133896#issuecomment-2412180893))
2024-10-14 20:28:09 +00:00
df0c2f5cae Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)"
This reverts commit 25ac5652d003c5526f496bd1e2cdfbe697c58ba4.

Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))
2024-10-14 20:22:26 +00:00
674d59359d [ROCm] Enable dist sharded_tensor test suites (#137724)
The following test suites are enabled on ROCm:
- test_sharded_tensor
- test_sharded_tensor_reshard
- test_sharding_plan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137724
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-10-14 20:20:57 +00:00
39d21ed803 [Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458)
The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](72c9833927)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes.

Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively).

With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458
Approved by: https://github.com/jansel
2024-10-14 20:20:29 +00:00
11e4232b42 Revert "[Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)" (#137891)
This reverts commit e688b78791d01bd91614a61e57726c32beb46ee4.

We're reverting this because:
1) The original PR (#134872) fixed a bug but caused another one. The
   assessment is that the bug it caused is worse than the bug it fixed.
2) it was reverted on the release 2.5 branch, so we want to prevent
   divergence
3) The original author is out-of-office for a while so we don't want the
   divergence to wait until they're back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137891
Approved by: https://github.com/Skylion007
2024-10-14 20:12:58 +00:00
41c4aa9f7a [pipelining] rename prev_/next_stage vars to clarify (#137739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137739
Approved by: https://github.com/H-Huang
2024-10-14 20:12:18 +00:00
78299d75b7 [ScaledMM] More Large shape tuning (#137832)
Fixes a bug in the check from the previous PR. Also, after some more performance tuning at very large sizes, we found that when N > M it is valuable to transpose; otherwise performance is better untransposed:

If you look at the absolute Tflops I think we still have some room for improvement!
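
For concreteness, a toy sketch of the shape heuristic described above (hypothetical helper name, not the actual kernel-selection code):

```python
def use_transposed_scaled_mm(m: int, n: int) -> bool:
    # Per the tuning results above: transpose only when N > M;
    # otherwise the untransposed kernel is faster.
    return n > m
```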
### Perf

Here are some TFLOP deltas at larger sizes where green is the positive gain in TFLops at different values of K

![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_delta_heatmap](https://github.com/user-attachments/assets/dcd009a5-1e4f-449c-b852-a92bb7db66e3)

<details>
<summary>### Different Values of K</summary>
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K24576_tflops_delta_heatmap](https://github.com/user-attachments/assets/8c043f6c-b8aa-48a9-bd5d-3ec6f39818cd)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K16384_tflops_delta_heatmap](https://github.com/user-attachments/assets/41a4b9f4-2749-4a84-b9c7-bddc2c2334c0)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K12288_tflops_delta_heatmap](https://github.com/user-attachments/assets/68d42421-cfa9-4a0a-a5a5-9f6db80bf609)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K8192_tflops_delta_heatmap](https://github.com/user-attachments/assets/c03906a0-5de7-463e-96a8-85f1774b3af6)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K6144_tflops_delta_heatmap](https://github.com/user-attachments/assets/d697b2d0-efc9-4ea8-9002-d517f3abaf50)
![large_shape_old_vs_update_m_greater_n_FP8Kernel_SCALED_MM_K4096_tflops_delta_heatmap](https://github.com/user-attachments/assets/06f8ef5c-277f-45ca-a44f-ed2e54d4133a)
</details>

<details>
<summary>### Absolute Tflops</summary>

## Old
![large_shape_old_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/8872506b-0ff1-400e-8d11-71eff6d8d59a)

## New
![update_m_greater_n_FP8Kernel_SCALED_MM_K32768_tflops_heatmap](https://github.com/user-attachments/assets/9fc9ec24-ff1a-4b47-8934-72d181677d14)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137832
Approved by: https://github.com/vkuzo
2024-10-14 20:02:52 +00:00
d64492e4cb Increase verbosity of inductor cache hit/miss to INFO level (#137876)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137876
Approved by: https://github.com/Skylion007
2024-10-14 19:59:31 +00:00
eqy
914c90dcea [NCCL][CUDA] Set PYTORCH_C10_DRIVER_API_SUPPORTED in ProcessGroupNCCL.cpp compilation (#137828)
Otherwise `expandable_segments()` is hardcoded to false in `CUDAAllocatorConfig.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137828
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-10-14 19:38:23 +00:00
19918a1863 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
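
As a plain-tensor illustration of the mechanism described above (a sketch, not the PR's NJT-specific code): when one output of an `autograd.Function` is unused downstream, the engine materializes correctly-shaped zero grads for it before calling `backward`.

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2, x * 3

    @staticmethod
    def backward(ctx, g1, g2):
        # g2 arrives as engine-materialized zeros because the second output is never used.
        return g1 * 2 + g2 * 3

x = torch.randn(4, requires_grad=True)
a, b = TwoOutputs.apply(x)
a.sum().backward()  # only `a` is used downstream
print(x.grad)       # all 2s: d(2x)/dx plus a zero contribution from the unused output
```

For NJT outputs, allocating those zeros is exactly the step that failed, hence the `zeros_like()` workaround described above.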
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer, https://github.com/huydhn
2024-10-14 19:31:50 +00:00
197601eeea Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
A proposal addressing Issue #1489: **Optimizer should track parameter names and not id.**

(also mentioned here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552))

## Summary
This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id.
Optimizers can be initialized with `named_parameters()` as:
```python
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)
```
This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as:
```
state_dict = {
    'state': {
        0: {'momentum_buffer': tensor(...), ...},
        1: {'momentum_buffer': tensor(...), ...},
    },
    'param_groups': [
        {
            'lr': 0.01,
            'weight_decay': 0,
            ...
            'params': [0, 1],
            'param_names': ['layer.weight', 'layer.bias']  # optional
        }
    ]
}
```
Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored.

## Key Features
#### Named Parameters in Optimizer Initialization:
Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly.
#### Parameter Names in `state_dict`:
The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters.

## Backward Compatibility
#### No Breaking Changes:
This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer.

#### Customization with Hooks:
For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs.

## Documentation Updates
Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively.

## Solution Example:

A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order.
The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading the state dict:
```python
def adapt_state_dict_ids(optimizer, state_dict):
    # assuming a single param group.
    current_state_group = optimizer.state_dict()['param_groups'][0]
    loaded_state_group = state_dict['param_groups'][0]

    # same number of params, same names, only different ordering
    current_state_name_to_id_mapping = {}  # mapping --  param_name: id
    for i, name in enumerate(current_state_group['param_names']):
        current_state_name_to_id_mapping[name] = current_state_group['params'][i]

    # changing the ids of the loaded state dict to match the order of the given state dict.
    for i, name in enumerate(current_state_group['param_names']):
        loaded_state_group['params'][i] = current_state_name_to_id_mapping[name]

    return state_dict
```
In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`.
Both the previous and the current optimizers are required to be initialized with `named_parameters()` to have the 'param_names' key in the dict.
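
A minimal usage sketch for wiring up the `adapt_state_dict_ids` hook above (not part of the original PR text; assumes both optimizers were created from `named_parameters()`):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9)

# `saved_state_dict` would normally come from a checkpoint written by another optimizer
# that was also initialized with named_parameters(), possibly with a different param order.
saved_state_dict = optimizer.state_dict()

optimizer.register_load_state_dict_pre_hook(adapt_state_dict_ids)
optimizer.load_state_dict(saved_state_dict)
```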

### Note
This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-10-14 19:24:44 +00:00
4470339fbb [dynamo] Fix an error in _dynamo.compiled_autograd.reset() (#137889)
----

* From https://github.com/pytorch/pytorch/pull/133492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137889
Approved by: https://github.com/Skylion007
2024-10-14 18:21:18 +00:00
929797dedb Fix test_matmul_offline_tunableop by writing its output files to a temp dir (#137835)
The test is failing (flakily?) on periodic Windows CUDA jobs with the following error:

```
__________ TestLinalgCUDA.test_matmul_offline_tunableop_cuda_float16 __________
Traceback (most recent call last):
  File "C:\actions-runner\_work\pytorch\pytorch\test\test_linalg.py", line 4618, in test_matmul_offline_tunableop
    os.remove(filename)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'tunableop_untuned0.csv'
```

For example, https://github.com/pytorch/pytorch/actions/runs/11292745299/job/31410578167#step:15:15097

The test tried to catch and ignore this, but this is Windows.  So, the fix is to:

1. Ignore if these files couldn't be removed
2. Write them to a temp directory instead; otherwise, [assert_git_not_dirty](https://github.com/pytorch/pytorch/blob/main/.ci/pytorch/test.sh#L286) won't be happy (see the sketch below)
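
A minimal generic sketch of those two fixes (hypothetical file name; not the actual test diff):

```python
import contextlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "tunableop_untuned0.csv")  # hypothetical output path
    # ... run the offline-tuning matmuls that write `filename` here ...
    # Best-effort cleanup: on Windows the file may still be held open by another process.
    with contextlib.suppress(PermissionError, FileNotFoundError):
        os.remove(filename)
```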

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137835
Approved by: https://github.com/atalman
2024-10-14 17:28:33 +00:00
f8a5b7170a Revert "Fix autograd.Function + NJT when an output grad is None (#136875)"
This reverts commit 76ab1ab66560213701943ecde368aedcd5de08e5.

Reverted https://github.com/pytorch/pytorch/pull/136875 on behalf of https://github.com/jbschlosser due to Caused memory leak ([comment](https://github.com/pytorch/pytorch/pull/136875#issuecomment-2411665776))
2024-10-14 16:00:44 +00:00
47bb494e49 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-14 15:37:29 +00:00
f246507f28 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
2024-10-14 15:10:27 +00:00
a77145ae2f Selective Activation Checkpointing (SAC) Estimator for estimating memory and recomputation time trade-offs. (#135208)
This PR adds a Selective Activation Checkpointing (SAC) Estimator, built on top of the `Runtime Estimator`, for estimating memory and recomputation time trade-offs.
It provides a `TorchDispatchMode`-based context manager that estimates the memory and runtime trade-offs of functions or `torch.nn.Modules` for SAC, using the `Runtime Estimator` #134243 under the hood to support two estimation modes: 'operator-level-benchmark' and 'operator-level-cost-model' (roofline model). The SAC Estimator provides detailed statistics and metadata for the operators of each module, including a greedy order for selecting operators to be recomputed/checkpointed and per-module trade-off graphs. This estimator is designed to be used under FakeTensorMode and currently supports estimation of compute time and memory usage.

It's inspired by @fmassa's [XFormers SAC](https://github.com/facebookresearch/xformers/blob/main/xformers/checkpoint.py).

End-to-end example:

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.sac_estimator import SACEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 8, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        inp = torch.randint(
            0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev
        )

        sace = SACEstimator()
        with sace(estimate_mode_type='operator-level-cost-model'):
            loss = model(inp).sum()
        loss.backward()
        sace.pwlf_sac_tradeoff_curve(n_segments=2, save_tradeoff_graphs=True)
        sace.display_modulewise_sac_stats(depth=4, print_tabular=True)
```

  Example AC Stats for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 13 PM](https://github.com/user-attachments/assets/1cf85564-4319-4732-bba1-89d505cda6ab)

Example AC Trade-off for one of the transformer layers:

![Screenshot 2024-10-11 at 10 09 58 PM](https://github.com/user-attachments/assets/5b2f343c-7e73-4c7d-bfea-3dcef2caa362)

Example AC Trade-Off graph one of the transformer layers:

![Transformer layers 3](https://github.com/user-attachments/assets/490d4b37-a916-4298-a14c-f78ffecbbde2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135208
Approved by: https://github.com/weifengpy
2024-10-14 13:56:40 +00:00
0e4d42634e Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-14 10:33:43 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a Python codegen file, similar to `caffe2/perfkernels/hp_emblookup_codegen.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
cyy
c48fe89011 Make c10::string_view an alias of std::string_view (#130417)
In order to facilitate the migration from c10::string_view to std::string_view, the old c10::string_view was renamed to c10::string_view_ext.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-10-14 09:28:04 +00:00
41977a0531 Revert "Port Inductor dataclasses to be kw_only (#137768)"
This reverts commit 65d665bae5b82a54b819c0c4527e7ccf88d19427.

Reverted https://github.com/pytorch/pytorch/pull/137768 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to fail test_loop_ordering in trunk ([comment](https://github.com/pytorch/pytorch/pull/137768#issuecomment-2409203115))
2024-10-13 22:25:19 +00:00
08ce3aac62 Cache some ValueRanges (#137438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137438
Approved by: https://github.com/ezyang
2024-10-13 19:23:34 +00:00
b361cd01f1 profiler: Fix undefined reference to unwind_c in unwind_entry while LTO is enabled (#137862)
With LTO (Link Time Optimization) enabled in CFLAGS, some compilers optimize away and strip the unwind_c function because the compiler cannot resolve the reference correctly, breaking the build with an undefined reference in unwind_entry. Add an attribute to avoid this.

Fixes #121282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137862
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-13 19:04:58 +00:00
c09b567a91 Fixed error string assertion in test_invalid_devices (#137772)
The ROCm distribution returns a different error string for this operation, so the test fails this assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137772
Approved by: https://github.com/Skylion007
2024-10-13 18:10:07 +00:00
65d665bae5 Port Inductor dataclasses to be kw_only (#137768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137768
Approved by: https://github.com/ezyang
2024-10-13 14:55:45 +00:00
cfc5d18aad [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-13 14:42:58 +00:00
b181652f3d [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-13 14:42:58 +00:00
cyy
a90b920284 Install llvm18 packages for ASAN workflows (#137335)
Follows #128763
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137335
Approved by: https://github.com/ezyang
2024-10-13 13:49:38 +00:00
4a8e49389c Make Context to be Device-agnostic Step by Step (1/N) (#136519)
----

- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-13 12:38:02 +00:00
563e9f99c3 Revert "Add device agnostic API for accelerator hooks (#137480)"
This reverts commit 858c91c3d8d9a71c66d0357e51a4bd805f95599f.

Reverted https://github.com/pytorch/pytorch/pull/137480 on behalf of https://github.com/albanD due to break all builds on trunk ([comment](https://github.com/pytorch/pytorch/pull/137480#issuecomment-2408954802))
2024-10-13 12:12:37 +00:00
08576b254b Fix logging in socket.cpp (#137745)
The formatter should avoid throwing exceptions as much as possible.

Fixes https://github.com/pytorch/pytorch/pull/128673#discussion_r1796226656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137745
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2024-10-13 10:38:10 +00:00
fe8d66d9a6 Faster Faster BatchSampler (#137423)
Builds upon #76951.

Benchmarking code is the same as in #76950.

AMD Ryzen Threadripper PRO 3995WX:
```
  batch_size  drop_last      origin     new  speedup
------------  -----------  --------  ------  ---------
           4  True           0.94    0.5706  64.74%
           4  False          0.9745  0.9468  2.93%
           8  True           0.7423  0.3715  99.82%
           8  False          0.7974  0.5666  40.73%
          64  True           0.5394  0.2085  158.76%
          64  False          0.6083  0.2697  125.51%
         640  True           0.5448  0.1985  174.41%
         640  False          0.7085  0.2308  206.91%
        6400  True           0.5554  0.2028  173.88%
        6400  False          0.7711  0.2109  265.60%
       64000  True           0.556   0.2091  165.82%
       64000  False          0.7803  0.2078  275.58%
```

When `drop_last == True`, it uses `zip` to speed things up.
When `drop_last == False`, it uses `itertools` to speed things up.

`itertools` was the fastest way I could find that deals with the last batch if it is smaller than `batch_size`. I have a pure python method too, but it is slower when `batch_size` is 4 or 8, so I have committed the `itertools` version for now.
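
A rough sketch (not the actual PR code) of the two fast paths described above, over a generic index sampler:

```python
from itertools import islice

def fast_batches(sampler, batch_size, drop_last):
    it = iter(sampler)
    if drop_last:
        # zip over batch_size references to the same iterator yields only full batches.
        yield from (list(batch) for batch in zip(*[it] * batch_size))
    else:
        # islice pulls up to batch_size items at a time and keeps the short tail batch.
        while batch := list(islice(it, batch_size)):
            yield batch

print(list(fast_batches(range(10), 4, drop_last=True)))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(list(fast_batches(range(10), 4, drop_last=False)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```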

Happy to chat further about this change :-) I understand you may not want to introduce the `itertools` package into [sampler.py](https://github.com/pytorch/pytorch/blob/main/torch/utils/data/sampler.py).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137423
Approved by: https://github.com/Skylion007
2024-10-13 09:36:03 +00:00
b3af359cba Log WorkNCCL exception string to C10dLogger (#137736)
Summary: In WorkNCCL::handleException, log to c10d logger with `strings["work_nccl_exception"]`.

Test Plan: Test run job to verify NCCL exception is logged.

Differential Revision: D62603322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137736
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-10-13 07:33:05 +00:00
858c91c3d8 Add device agnostic API for accelerator hooks (#137480)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `showConfig` and `deviceSynchronize` method declaration in `AcceleratorHooksInterface`
- Remove unreachable lines of code

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137480
Approved by: https://github.com/albanD, https://github.com/FFFrog
2024-10-13 07:19:32 +00:00
7642f6d047 [AMD] Unify cublaslt and hipblaslt path (#137604)
Differential Revision: D63967918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137604
Approved by: https://github.com/eqy
2024-10-13 07:11:12 +00:00
fa08e924ad Skip test export with fake tensor inputs on cuda devices for Intel GPU (#137847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137847
Approved by: https://github.com/etaf, https://github.com/jansel
2024-10-13 07:07:48 +00:00
e3df636580 Fix -Wsign-compare warning spam in Indexing.cu (#137842)
Detailed Descriptions:

Fix for warning spam like
```
warning: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
```
![image](https://github.com/user-attachments/assets/7be3cfff-c33b-4a6e-b52d-04085e6e1bec)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137842
Approved by: https://github.com/ezyang
2024-10-13 07:03:12 +00:00
1d6932937e [dynamo] fix NamedTupleVariable for PyStructSequence (torch.return_types.*) support (#137776)
PyStructSequence is the C API equivalent of `collections.namedtuple` in Python, but they have different constructors:

```python
tuple = NamedTupleType(*args)
tuple = NamedTupleType._make(args)
tuple = StructSequenceType(args)
```
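
For example (illustrative, not from the PR), `torch.return_types.max` is a PyStructSequence type, so it is constructed from a single sequence rather than unpacked arguments:

```python
import torch

values, indices = torch.max(torch.arange(6.0).reshape(2, 3), dim=1)
result = torch.return_types.max((values, indices))  # single sequence argument
print(result.values, result.indices)
```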

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137776
Approved by: https://github.com/jansel
2024-10-13 06:46:41 +00:00
3050f2e5dd [dynamo] Check nn modules parameters are not overwritten before taking tracing shortcut (#137824)
Fixes https://github.com/pytorch/pytorch/issues/136257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137824
Approved by: https://github.com/jansel
2024-10-13 05:04:28 +00:00
09e2a0d7bc fix PyTorch build with Address Sanitizer enabled (#137446)
**Problem**
Building PyTorch with Address Sanitizer (ASAN) enabled was failing due to a static assertion in KernelFunction_impl.h. The compiler was unable to evaluate FuncPtr::func_ptr() as a constant expression when ASAN was enabled, causing a build error.

```
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o
/usr/bin/ccache /usr/bin/g++-11 -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D_GLIBCXX_SANITIZE_STD_ALLOCATOR -D_GLIBCXX_SANITIZE_VECTOR -Dtorch_cpu_EXPORTS -I/home/abhishekk/stantize/venv/pytorch/build/aten/src -I/home/abhishekk/stantize/venv/pytorch/aten/src -I/home/abhishekk/stantize/venv/pytorch/build -I/home/abhishekk/stantize/venv/pytorch -I/home/abhishekk/stantize/venv/pytorch/cmake/../third_party/benchmark/include -I/home/abhishekk/stantize/venv/pytorch/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/build/third_party/onnx -I/home/abhishekk/stantize/venv/pytorch/nlohmann -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api -I/home/abhishekk/stantize/venv/pytorch/torch/csrc/api/include -I/home/abhishekk/stantize/venv/pytorch/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src/TH -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/aten/src -I/home/abhishekk/stantize/venv/pytorch/build/caffe2/../aten/src -I/home/abhishekk/stantize/venv/pytorch/torch/csrc -I/home/abhishekk/stantize/venv/pytorch/third_party/miniz-2.1.0 -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/include -I/home/abhishekk/stantize/venv/pytorch/third_party/kineto/libkineto/src -I/home/abhishekk/stantize/venv/pytorch/third_party/cpp-httplib -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/.. -I/home/abhishekk/stantize/venv/pytorch/third_party/FXdiv/include -I/home/abhishekk/stantize/venv/pytorch/c10/.. 
-I/home/abhishekk/stantize/venv/pytorch/third_party/pthreadpool/include -I/home/abhishekk/stantize/venv/pytorch/third_party/cpuinfo/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/home/abhishekk/stantize/venv/pytorch/third_party/NNPACK/include -I/home/abhishekk/stantize/venv/pytorch/third_party/FP16/include -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/build/third_party/tensorpipe -I/home/abhishekk/stantize/venv/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/abhishekk/stantize/venv/pytorch/third_party/fmt/include -I/home/abhishekk/stantize/venv/pytorch/third_party/flatbuffers/include -isystem /home/abhishekk/stantize/venv/pytorch/build/third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/gloo -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/abhishekk/stantize/venv/pytorch/third_party/protobuf/src -isystem /home/abhishekk/stantize/venv/pytorch/third_party/XNNPACK/include -isystem /home/abhishekk/stantize/venv/pytorch/cmake/../third_party/eigen -isystem /home/abhishekk/stantize/venv/pytorch/INTERFACE -isystem /home/abhishekk/stantize/venv/pytorch/third_party/nlohmann/include -isystem /home/abhishekk/stantize/venv/pytorch/build/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include -isystem /usr/lib/aarch64-linux-gnu/openmpi/include/openmpi -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_PYTORCH_QNNPACK -DAT_BUILD_ARM_VEC256_WITH_SLEEF -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_SVE_CPU_DEFINITION -DHAVE_SVE256_CPU_DEFINITION -g -fno-omit-frame-pointer -Og -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -fsanitize=address -fno-omit-frame-pointer -fsanitize=undefined -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp.o -c 
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp
In file included from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction.h:260,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:4,
                 from /home/abhishekk/stantize/venv/pytorch/torch/library.h:63,
                 from /home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:3:
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h: In instantiation of ‘static c10::KernelFunction c10::KernelFunction::makeFromUnboxedFunction(FuncPtr) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; bool AllowLegacyTypes = false]’:
/home/abhishekk/stantize/venv/pytorch/torch/library.h:133:59:   required from ‘torch::CppFunction::CppFunction(FuncPtr, std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t>) [with FuncPtr = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>; std::enable_if_t<c10::is_compile_time_function_pointer<FuncPtr>::value, std::nullptr_t> = std::nullptr_t]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:691:17:   required from ‘torch::Library& torch::Library::impl(Name, Func&&, torch::_RegisterOrVerify) & [with Name = const char*; Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/torch/library.h:782:16:   required from ‘torch::Library& torch::Library::impl(torch::detail::SelectiveStr<true>, Func&&) & [with Func = c10::CompileTimeFunctionPointer<c10::intrusive_ptr<at::native::xnnpack::LinearOpContext>(at::Tensor, std::optional<at::Tensor>, const std::optional<c10::Scalar>&, const std::optional<c10::Scalar>&), at::native::xnnpack::internal::linear::createLinearClampPrePackOpContext>]’
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/native/xnnpack/RegisterOpContextClass.cpp:87:9:   required from here
/home/abhishekk/stantize/venv/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:177:39: error: non-constant condition for static assertion
  177 |     static_assert(FuncPtr::func_ptr() != nullptr, "Kernel function cannot be nullptr");
      |                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~
```

**Testing**

- Verified that PyTorch builds successfully with USE_ASAN=ON
- Ran PyTorch test suite to ensure no regressions were introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137446
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-10-13 03:31:54 +00:00
70bd58c35f Revert "Add support for add in tensorify_python_scalars fx pass (#137620)"
This reverts commit 0430e72e755d2c1953917ffb78f00c516eb4bbd5.

Reverted https://github.com/pytorch/pytorch/pull/137620 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
279052ab86 Revert "Add support for sub in tensorify_python_scalars fx pass (#137622)"
This reverts commit b7924610a0c20f72657548acef7743801189444a.

Reverted https://github.com/pytorch/pytorch/pull/137622 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to cause test_torchbind_inductor to fail in trunk 0430e72e75 ([comment](https://github.com/pytorch/pytorch/pull/137620#issuecomment-2408784170))
2024-10-13 02:05:37 +00:00
5fee1ee3f4 [inductor] Refactor generate_workspace_allocation (#137673)
And some other small changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137673
Approved by: https://github.com/Chillee
ghstack dependencies: #137754
2024-10-13 01:25:14 +00:00
5146e6a96d [inductor] Fix reduction_hint sum to single element (#137754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137754
Approved by: https://github.com/Chillee, https://github.com/malfet
2024-10-13 01:08:23 +00:00
b7924610a0 Add support for sub in tensorify_python_scalars fx pass (#137622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137622
Approved by: https://github.com/ezyang
ghstack dependencies: #137620
2024-10-13 00:30:02 +00:00
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
47c8aa8090 Refactor make device agnostic in accelerator hooks (#137558)
Make `AcceleratorHooksInterface` consistent for multiple accelerators
- Add `getDeviceFromPtr` method declaration in `AcceleratorHooksInterface`
- Fix clangtidy warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137558
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2024-10-12 18:13:54 +00:00
0430e72e75 Add support for add in tensorify_python_scalars fx pass (#137620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137620
Approved by: https://github.com/ezyang
ghstack dependencies: #136674, #137588
2024-10-12 17:18:27 +00:00
e89fe0bd6e Updating cuda binary build to get cusparselt from PYPI (#137653)
Fixes #137374
Update 1: this PR requires Meta to upload the PyPI package to download.pytorch.org.
See: ERROR: Could not find a version that satisfies the requirement nvidia-cusparselt-cu12==0.6.2; platform_system == "Linux" and platform_machine == "x86_64" (from torch) (from versions: none)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137653
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/atalman
2024-10-12 16:40:37 +00:00
ed55d356de [alt] fix unroll in successive unflatten (#137646)
We use nn_module_stack in unflatten to recognize when module calls begin and end. However, the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls into a single call. This can cause problems when preserving module call signatures, because the outputs of the successive calls might be concatenated in the single call.

Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)

Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.

Test Plan: Like D64014936

Differential Revision: D64136277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
2024-10-12 15:53:52 +00:00
561f07fae7 Warn users of mkldnn device usage (#137553)
In https://github.com/pytorch/pytorch/issues/136831, a user creates a tensor on the mkldnn device, but mkldnn is no longer used as a device type; only the mkldnn layout is used.

We plan to remove mkldnn-device-related code in a future release. This PR warns users not to use the mkldnn device.
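
A small illustration of the distinction (a sketch, assuming an MKL-DNN-enabled CPU build; the deprecated device-style usage is only shown as a comment):

```python
import torch

x = torch.randn(2, 3)

# Supported: mkldnn is a *layout*; conversion goes through to_mkldnn()/to_dense().
y = x.to_mkldnn()
print(y.layout)            # torch._mkldnn
print(y.to_dense().shape)  # torch.Size([2, 3])

# Deprecated (what this PR warns about): treating "mkldnn" as a *device* type,
# e.g. torch.randn(2, 3, device="mkldnn") as in the linked issue.
```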

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137553
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-12 13:42:12 +00:00
0dbbcfa7ae [Inductor UT] Generalize newly introduced inductor UTs for intel GPU (Part 3) (#136947)
[Inductor UT] Generalize newly introduced inductor UTs for Intel GPU:
- reuse `test/inductor/test_pattern_matcher.py`
- reuse `test/inductor/test_snode_runtime.py`
- reuse `test/inductor/test_unbacked_symints.py`
- fix `test/inductor/test_triton_kernels.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136947
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/jansel
2024-10-12 13:21:20 +00:00
030ba03681 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was the root cause of a
significant overhead ([issue][1]) when running the `AdamW` optimizer step on the PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
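
As a sketch of what these registrations enable (illustrative, not the PR code): the ops can propagate shapes and dtypes on `meta` tensors without touching data, which is what the PyTorch/XLA tracing path relies on.

```python
import torch

a = torch.empty(1024, device="meta")
b = torch.empty(1024, device="meta")
c = torch.empty(1024, device="meta")

print(torch.lerp(a, b, 0.5).shape)              # torch.Size([1024]), still on meta
print(torch.addcmul(a, b, c, value=2.0).shape)  # torch.Size([1024])
print(torch.addcdiv(a, b, c, value=0.5).shape)  # torch.Size([1024])
```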

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-12 12:40:46 +00:00
6001b16597 Add entire _dynamo.config as a json for logging (#137216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137216
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-12 11:48:59 +00:00
a777dea3b3 Remove dtype check on meta device (#136774)
Summary:
# Latest Update

This diff is no longer needed because we did need the check to exist, to make meta behave the same as other devices, see D54526190.

---------------------------------

# Background

T176105639

| case | embedding bag weight | per_sample_weight | fbgemm lookup | forward in meta |
| --- | --- | --- | --- | --- |
| A | fp32 | fp32 | good | good |
| B | fp16 | fp32 | good | failed [check](https://fburl.com/code/k3n3h031) that forces weight dtype == per_sample_weights dtype |
| C | fp16 | fp16 | P1046999270, RuntimeError: "expected scalar type Float but found Half from fbgemm call" | good |
| D | fp32 | fp16 | N/A | N/A |

Currently we are in case A. Users need to add `use_fp32_embedding` in training to force the embedding bag dtype to fp32. However, users actually want case B, using fp16 as the embedding bag weight. When deleting `use_fp32_embedding`, they would fail the [check](https://fburl.com/code/k3n3h031) that forces `weight dtype == per_sample_weights dtype` in meta_registration.

The check is actually not necessary, because the fbgemm backend does support case B. Additionally, later on in `meta_embedding_bag`, `weight` and `per_sample_weights` don't need to be in the same dtype (https://fburl.com/code/q0tho05h, weight is src, per_sample_weights is scale) for `is_fast_path_index_select`.

# This diff
Therefore, this diff removes the unnecessary [check](https://fburl.com/code/k3n3h031) to support case B in the meta forward. With this, users are able to use fp16 as the emb bag dtype without needing to force per_sample_weights to the same dtype in the meta forward (see Test Plan).

# Reference diffs to resolve this issue
Diff 1: D52591217
This passes the embedding bag dtype to feature_processor to make per_sample_weights the same dtype as the emb bag weight. However, `is_meta` also needs to be passed because of case C: fbgemm still does not support per_sample_weights = fp16 (see the table above). Therefore users are forced to make per_sample_weights fp16 only when it is on meta. The solution requires too many hacks.

Diff 2: D53232739
Basically doing the same thing as diff 1 (D52591217), except that the hack is added in the TorchRec library. This adds an if in EBC and PEA: when the emb bag weight is fp16, it forces per_sample_weights to fp16 too. However, it would then hit the fbgemm issue too, and it has broken a bunch of prod models.

Test Plan:
# APS
The following command will run icvr_launcher which triggers ads_launcher and run forward in meta device:
```
buck2 run mode/opt -c python.package_style=inplace //aps_models/ads/icvr:icvr_launcher_publish -- mode=mast_ig_fm_when_combo0_uhm_publish launcher.fbl_entitlement=ads_global_tc_ads_score launcher.data_project=oncall_ads_model_platform launcher.tags=[ads_ranking_taxonomy_exlarge_fm_prod] stages.train=false
```

Result:
 {F1461463993}

Reviewed By: ezyang

Differential Revision: D54175438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136774
Approved by: https://github.com/ezyang
2024-10-12 05:45:21 +00:00
92cc319120 Fix masked tensor test_stack memory leak (#137815)
This test is currently failing in trunk when memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970.  When testing locally, calling `backward` on a masked tensor always causes a memory leak until I clean up the data and the mask manually.  This is probably related to this warning from masked tensor: `UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()`, but I don't know much about the internal details here to go further.  So, let's just fix the test first.
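
For reference, the workaround suggested by that warning looks roughly like this (a sketch, not the PR's actual test change):

```python
import torch
from torch.masked import masked_tensor

data = torch.randn(4, requires_grad=True)
mask = torch.tensor([True, False, True, True])

# Wrap a detached copy so the MaskedTensor does not alias a leaf that requires grad.
mt = masked_tensor(data.clone().detach(), mask, requires_grad=True)
```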

### Testing

```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda
```

passes and doesn't warn about memory leak anymore.

The test itself came from https://github.com/pytorch/pytorch/pull/125262#issuecomment-2344068012
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137815
Approved by: https://github.com/kit1980
2024-10-12 04:30:57 +00:00
c8609cf4b0 [inductor] Update Triton CPU pin (#137778)
This incorporates the fix in
https://github.com/triton-lang/triton/pull/4871.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137778
Approved by: https://github.com/Skylion007
2024-10-12 03:09:09 +00:00
d52b2cf92f [CUDA][SDPA] Fix TF32 handling and bump threshold for multiheadattention test (#137752)
For sm90, main issue was that `torch.testing.assert_close` bypasses the `tf32_on_and_off` tolerance switch decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137752
Approved by: https://github.com/ezyang
2024-10-12 03:05:21 +00:00
2db3f85894 Fixes NumPy 2 test failures in test_torch.py (#137740)
Related to #107302

The breakages are caused by backward incompatibility between NumPy 1 and NumPy 2.
This PR fixes all the corresponding test failures in `test_torch.py`.

1. The dtype of the return value `np.percentile` when passed a `torch.float32` tensor.
NumPy 1: Return value of `np.float64`.
NumPy 2: Return value of `np.float32`.
Solution: Enforce it with `.astype(np.float64)`.

2. The type of `np.gradient()` when returning multiple arrays.
NumPy1: A list of arrays.
NumPy2: A tuple of arrays.
Solution: Cast the tuple to a list.
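
Both adjustments, illustrated on small inputs (a sketch of the pattern, not the actual test diff):

```python
import numpy as np
import torch

t = torch.rand(5, dtype=torch.float32)

# (1) np.percentile keeps float32 under NumPy 2 (it promoted to float64 under NumPy 1),
#     so the value is pinned to float64 explicitly:
p = np.percentile(t.numpy(), 50).astype(np.float64)

# (2) np.gradient returns a tuple of arrays under NumPy 2 (a list under NumPy 1),
#     so the result is cast to a list before comparison:
grads = list(np.gradient(np.arange(12.0).reshape(3, 4)))
print(p.dtype, len(grads))  # float64 2
```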
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137740
Approved by: https://github.com/ezyang
2024-10-12 02:40:17 +00:00
eqy
6be53d52c5 [CUDA][SDPA] Bump tolerances for grad_query in mem_eff test (#137750)
(for sm80)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137750
Approved by: https://github.com/drisspg
2024-10-12 02:15:14 +00:00
67883e70c0 change GPT2ForSequenceClassification inference accuracy tolerance (#136749)
Fixes https://github.com/pytorch/pytorch/issues/123503.

https://github.com/pytorch/pytorch/pull/121866 makes GPT2ForSequenceClassification hit SDPA pattern 18 and then encounter the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance from 4e-3 to 5e-3 to make the check pass. Note that the issue is due to small implementation differences. For example, the SDPA math backend scales q and k before matmul for stability; the flash attention backend, being a different algorithm, has larger diffs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136749
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-12 01:12:28 +00:00
fba2c0a23a Fix comment in ProcessGroupGloo (#137746)
Summary: Algorithm caching was removed in 2018 D13111781

Test Plan: CI

Differential Revision: D64214527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137746
Approved by: https://github.com/Skylion007, https://github.com/wz337
2024-10-12 01:04:41 +00:00
69bcf1035e Updates reference to _runner-determinator.yml workflow, from current version to main version. (#137791)
Updates all references to runner determinator workflow (`_runner-determinator.yml`) from current cloned version to main version.

This enables the team to push updates to this workflow, like bug fixes or improvements, and have them immediately reflected on all open PRs. This avoids potentially breaking situations and enables fast iteration and simple recovery in case of bugs.

From:

```
jobs:
  get-label-type:
    uses: ./.github/workflows/_runner-determinator.yml
```

To:

```
jobs:
  get-label-type:
    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137791
Approved by: https://github.com/malfet, https://github.com/huydhn, https://github.com/zxiiro
2024-10-12 00:18:50 +00:00
e269a5cb09 [TCPStore] Throw value error if passing world_size=0 to TCPStore (#137792)
This fixes https://github.com/pytorch/pytorch/issues/137577.
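
A quick sketch of the new behavior (assuming this PR is applied):

```python
from torch.distributed import TCPStore

try:
    store = TCPStore("localhost", 29500, world_size=0, is_master=True)
except ValueError as err:
    print("rejected:", err)
```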

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137792
Approved by: https://github.com/fegin, https://github.com/H-Huang
ghstack dependencies: #137713, #137721
2024-10-11 23:42:57 +00:00
25ac5652d0 [Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)
Follows #124485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328
Approved by: https://github.com/eqy
2024-10-11 23:23:57 +00:00
8486d3df69 [Profiler] Hide ProfilerStep Alignment behind Experimental Config (#137668)
Summary: Aligning the ProfilerStep# annotation can be useful for visual purposes, but it causes downstream tools like HTA to misreport how long each step took. For this reason, let's give users the option to turn on this alignment manually, and turn it off by default.

Test Plan:
Alignment off:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_11_48.2543945.pt.trace.json.gz&bucket=gpu_traces

Alignment on:

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Oct_09_16_08_27.2518391.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D64146115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137668
Approved by: https://github.com/aaronenyeshi
2024-10-11 22:57:05 +00:00
0121d64aa9 Revert "[AOTI] Handle inplace output in ProxyExecutor (#137660)"
This reverts commit 573101aac3b1addc0a0b945ae09fe9be9034d3a9.

Reverted https://github.com/pytorch/pytorch/pull/137660 on behalf of https://github.com/desertfire due to Fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/137660#issuecomment-2408213485))
2024-10-11 22:54:39 +00:00
c58e5c4efa Revert "[AOTI] Turn on the ABI-compatible mode as default (#136534)"
This reverts commit b0da076f0cd5957c7fe55a58876f3b74babfc1b7.

Reverted https://github.com/pytorch/pytorch/pull/136534 on behalf of https://github.com/desertfire due to The dependent PR https://github.com/pytorch/pytorch/pull/137660 fails in fbcode ([comment](https://github.com/pytorch/pytorch/pull/136534#issuecomment-2408211238))
2024-10-11 22:50:58 +00:00
e3173d8725 [pipelining] Shape Inference (#136912)
Performs shape inference at runtime using user-provided real tensors.
- avoids the need for users to precompute shapes, which is difficult and error-prone
- lets us remove args from the PipelineStage ctor (in a later PR)
- deprecates the existing inference helper in the PipelineStage constructor for several reasons: it's problematic to have to reason about the stage submod being on the right device for shape inference

The current state as of this PR:
- Users should not pass any input or output shapes into PipelineStage ctor, and shape inference will run automatically
- To override shape inference, they can continue to pass input/output args as previously

Currently, this does not add a barrier after shape inference, which essentially pipelines shape inference with the subsequent schedule action for that stage.  If this complicates debugging, we could add a barrier (it comes at a cost, but only during the first step).
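
A hedged sketch of the resulting stage construction (argument names follow the public `torch.distributed.pipelining` docs; the placeholder values and the already-initialized process group are assumptions, not from this PR):

```python
import torch
import torch.nn as nn
from torch.distributed.pipelining import PipelineStage

# Placeholders for illustration; in practice these come from the distributed launcher,
# and torch.distributed must be initialized before the stage is used in a schedule.
rank, world_size, device = 0, 2, torch.device("cpu")

submodule = nn.Linear(16, 16)  # this rank's slice of the model
stage = PipelineStage(
    submodule,
    stage_index=rank,
    num_stages=world_size,
    device=device,
    # no input_args / output_args: shapes are inferred from the first real microbatch
)
```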

Testing:
- Removed input args from all PP test cases, thus exposing them all to shape-inference.
- Verified visually (nvidia-smi) that torchtitan PP 3D test runs shape inference fine without creating extra cuda contexts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136912
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-10-11 22:49:00 +00:00
432c3fe5af Default to use training IR (#137804)
Summary: Since capture_pre_autograd_graph is deprecated and will be deleted soon, we default this option to true.

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D64254236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137804
Approved by: https://github.com/tugsbayasgalan
2024-10-11 22:34:28 +00:00
c254901bdb Have Triton custom extension test use privateuseone device (#137611)
The original PR #122396 used the CPU device since at that point in time
there was no actual Triton CPU backend. After #133408, this is no longer
the case, so we now have multiple backends getting registered for the
CPU. The test still works in OSS but fails internally due to different
test runners initializing the backends in a different order.

This PR doesn't actually end up fixing the test internally because
cpp_extension -- needed to implement the privateuseone device -- isn't
supported there, so we simply skip it instead. However, it still makes the
OSS test independent of initialization order, which is good.

Differential Revision: [D63838169](https://our.internmc.facebook.com/intern/diff/D63838169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137611
Approved by: https://github.com/henrylhtsang
2024-10-11 21:27:29 +00:00
19bbbef79d cublaslt autotuning support for TunableOp (#133896)
Adds support for cublaslt autotuning to TunableOp.

Todo:
- [x] Add and test `ScaledGemmTunableOp`
- [x] Benchmarking numbers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133896
Approved by: https://github.com/eqy, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-10-11 21:16:36 +00:00
1358969fa1 Revert "BundledAutotuneCache (#134959)"
This reverts commit 709021143d9c9aa90df578a2f5abb93a91a4852a.

Reverted https://github.com/pytorch/pytorch/pull/134959 on behalf of https://github.com/albanD due to The newly added test fails on rocm CI ([comment](https://github.com/pytorch/pytorch/pull/134959#issuecomment-2408091754))
2024-10-11 20:43:56 +00:00
74e871355b Add hooks to Scheduler nodes for generating device-specific debug strings (#135015)
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
2024-10-11 20:30:49 +00:00
8543000c27 Search through config changes in compiler bisector (#137346)
Follow-up to https://github.com/pytorch/pytorch/pull/131936.  In the original bisector you'd have to test inline whether we were disabling a component - `if BisectionManager.disable_subsystem("inductor", "post_grad_passes", debug_info)`. This adds a convenient way of testing config changes for root-causing issues. I've added `emulate_precision_casts` and the aot_eager_decomp_partition CSE as initial ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137346
Approved by: https://github.com/zou3519
2024-10-11 20:24:54 +00:00
513563eb09 Fix stack named "queue" in Util::ComputePostOrder (#130526)
This function computes a topological sort using a non-recursive implementation of DFS. Upon first reading, I thought it was using Kahn’s algorithm because it uses a variable called `queue`, but upon closer reading, I noticed this variable is actually used as a stack.

This pull request improves readability by renaming the stack and changing it from `std::vector` to `std::stack`.
Note: this also changes the backing store from an `std::vector` to an `std::deque`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130526
Approved by: https://github.com/alanwaketan, https://github.com/malfet
2024-10-11 20:21:07 +00:00
d0628a7e39 [ONNX] Remove deprecated export_to_pretty_string (#137790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137789
2024-10-11 20:10:04 +00:00
5fca2fd365 Try unify training and inference (#136888)
Previously, inference -> inference IR was going through a separate flow from train -> inference decomposition. This diff unifies them so that we always retrace when decomposing. Joint IR decomp still goes through the old flow (inference -> inference), but that seems OK for now since it is still in the experimental stage.

Differential Revision: [D63062521](https://our.internmc.facebook.com/intern/diff/D63062521/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136888
Approved by: https://github.com/avikchaudhuri
2024-10-11 20:09:58 +00:00
3e0b83ad1f [ONNX] Remove ExportTypes (#137789)
Remove deprecated ExportTypes and the `_exporter_states` module. Only protobuf (default) is supported going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137789
Approved by: https://github.com/titaiwangms
2024-10-11 19:29:52 +00:00
460358a20f Run lint-autoformat only on PRs to main (#137802)
This is mostly to prevent showing up on ghstack PRs, with which code suggestions are not compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137802
Approved by: https://github.com/huydhn
2024-10-11 19:25:34 +00:00
2cb983ab97 [CI] Adds support for selecting experiments for workflows on runner determinator (#137614)
adds a `default` tag to experiment configurations, allowing some experiments to be excluded by default from the random draw:

```
        experiments:
            lf:
                rollout_perc: 25
            otherExp:
                rollout_perc: 25
                default: false
        ---
```

and includes the configuration to filter what experiments are of interest for a particular workflow (comma separated):

```
  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"
```

The end goal is to enable us to run multiple experiments that are independent from one another. For example, while we still run the LF infra experiment, we want to migrate other runners leveraging the current solution. An immediate use case is the A100 instances, which we want to migrate to AWS.

Those new instances will, during the migration period, be labeled both `awsa100.linux.gcp.a100` and `linux.aws.a100`. Once the experiment ends, we will remove the first, confusing one.

```
jobs:
  get-build-label-type:
    name: get-build-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...

  get-test-label-type:
    name: get-test-label-type
    uses: ./.github/workflows/_runner-determinator.yml
    with:
      ...
      check_experiments: "awsa100"

  linux-focal-cuda12_1-py3_10-gcc9-inductor-build:
    name: cuda12.1-py3.10-gcc9-sm80
    uses: ./.github/workflows/_linux-build.yml
    needs:
      - get-build-label-type
      - get-test-label-type
    with:
      runner_prefix: "${{ needs.get-build-label-type.outputs.label-type }}"
      ...
      test-matrix: |
        { include: [
          { config: "inductor_huggingface_perf_compare", shard: 1, num_shards: 1, runner: "${{ needs.get-test-label-type.outputs.label-type }}linux.gcp.a100" },
          ...
        ]}
      ...
```

```
experiments:
    lf:
        rollout_perc: 50
    awsa100:
        rollout_perc: 50
        default: false
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137614
Approved by: https://github.com/malfet
2024-10-11 19:20:02 +00:00
709021143d BundledAutotuneCache (#134959)
Add a cache to combine individual autotune caches into a single cached bundle.  We still rely on the individual autotune caches - on a cache hit we copy the individual results into the local caches so they can be retrieved later.

Various related configs:
env: TORCHINDUCTOR_BUNDLED_AUTOTUNE_REMOTE_CACHE
config: bundled_autotune_remote_cache
jk: pytorch/remote_cache:bundled_autotune_remote_cache_version

Testing:

Manually tested w/ EMU:
```
cd fbcode/accelerators/workloads/models/emu_flash/v1p4
make build_benchmark_model && make save_model_to_path
make test_pt2_latency
```

 - on a cold run we got 0 hits and 40 misses; on a warm run we got 40 hits and 0 misses.
- perf seems a little better - for 8 runs:
  - no bundled cache averaged 14m11s
  - bundled cache averaged 14m6s
  - 125ms saved per cache entry seems reasonable

Cache Metrics for a sample run:
no bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2256, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 7, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:MemcacheCache: {hit: 2256, miss: 0, put: 7, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 7, exception: 0}
  backend:_ManifoldCache: {hit: 40, miss: 0, put: 0, exception: 0}
```
bundled cache:
```
INFO: Cache Metrics:
  FbMemcacheRemoteKernelCache: {hit: 2258, miss: 0, put: 0, exception: 0}
  FbRemoteAutotuneCache: {hit: 0, miss: 0, put: 8, exception: 0}
  FbRemoteBundledAutotuneCache: {hit: 40, miss: 0, put: 0, exception: 0}
  FbRemoteFxGraphCache: {hit: 40, miss: 0, put: 0, exception: 0}
  LocalAutotuneCache: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:MemcacheCache: {hit: 2258, miss: 0, put: 8, exception: 0}
  backend:_LocalAutotuneCacheBackend: {hit: 878, miss: 0, put: 886, exception: 0}
  backend:_ManifoldCache: {hit: 80, miss: 0, put: 0, exception: 0}
```

Differential Revision: D60677499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134959
Approved by: https://github.com/oulgen
2024-10-11 19:12:41 +00:00
b82000c1b3 Removed _compile workaround for create_block_mask (#137477)
I also put in a change so that `create_block_mask` properly handles sizes that are not multiples of BLOCK_SIZE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137477
Approved by: https://github.com/drisspg, https://github.com/BoyuanFeng
2024-10-11 19:04:23 +00:00
2dcd69da50 [inductor] Delete dead code and lints (#137753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137753
Approved by: https://github.com/Chillee
2024-10-11 18:55:08 +00:00
267f82b860 [BE] Format .ci/ / .github/ / benchmarks/ / functorch/ / tools/ / torchgen/ with ruff format (#132577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577
Approved by: https://github.com/malfet
2024-10-11 18:30:26 +00:00
04adb74d08 [inductor][cond] Remove redundant prefix (#137718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137718
Approved by: https://github.com/eellison
ghstack dependencies: #137200
2024-10-11 18:13:18 +00:00
cd02c85ba4 [inductor][subgraph][python-wrapper] Lift subgraph code into functions (#137200)
Earlier, the subgraphs were getting inlined into the output code. This PR lifts each subgraph into a function, and then we just call that function in the output code.

This is the output code for test `test_cond_reintepret_view_inputs_outputs`

Before this PR - https://www.internalfb.com/intern/paste/P1632948905/
With this PR - https://www.internalfb.com/intern/paste/P1632946348/

A relevant snippet from the above paste is

~~~

def false_graph_0(args):
    false_graph_0_arg0_1, false_graph_0_arg1_1, s0 = args
    args.clear()
    s0 = s0
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        false_graph_0_buf0 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        false_graph_0_buf1 = empty_strided_cuda(((-1) + s0, 20), (20, 1), torch.float32)
        # Unsorted Source Nodes: [cond, z1, z2], Original ATen: [aten.sub, aten.add]
        triton_poi_fused_add_sub_1_xnumel = (-20) + (20*s0)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_sub_1.run(false_graph_0_arg0_1, false_graph_0_arg1_1, false_graph_0_buf0, false_graph_0_buf1, triton_poi_fused_add_sub_1_xnumel, grid=grid(triton_poi_fused_add_sub_1_xnumel), stream=stream0)
        del false_graph_0_arg0_1
        del false_graph_0_arg1_1
    return (reinterpret_tensor(false_graph_0_buf0, ((-3) + s0, 20), (20, 1), 40), reinterpret_tensor(false_graph_0_buf1, ((-1) + s0, 16), (20, 1), 4), )

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1 = args
    args.clear()
    s0 = arg0_1
    assert_size_stride(arg1_1, (s0, 20), (20, 1))
    assert_size_stride(arg2_1, (s0, 20), (20, 1))
    assert_size_stride(arg3_1, (), ())
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = [None] * 2
        buf0 = [None] * 2
        if arg3_1.item():
            # subgraph: true_graph_0
            true_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            true_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (true_graph_0_buf0, true_graph_0_buf1) = true_graph_0([true_graph_0_arg0_1, true_graph_0_arg1_1, s0])
            buf0[0] = true_graph_0_buf0
            buf0[1] = true_graph_0_buf1
        else:
            # subgraph: false_graph_0
            false_graph_0_arg0_1 = reinterpret_tensor(arg1_1, ((-1) + s0, 20), (20, 1), 0)
            false_graph_0_arg1_1 = reinterpret_tensor(arg2_1, ((-1) + s0, 20), (20, 1), 0)
            (false_graph_0_buf0, false_graph_0_buf1) = false_graph_0([false_graph_0_arg0_1, false_graph_0_arg1_1, s0])
            buf0[0] = false_graph_0_buf0
            buf0[1] = false_graph_0_buf1
        del arg1_1
        del arg2_1
        del arg3_1
        buf1 = buf0[0]
        buf2 = buf0[1]
        del buf0
    return (buf1, buf2, )

~~~

The key change is to recursively call `codegen` for the subgraph and rely on `SubgraphPythonWrapper` to generate just the subgraph `fn`. The resulting subgraph_code is then inserted into the parent wrapper.

Note that this PR only works for the python wrapper. For the cpp wrapper, we need a lot of refactoring to ensure that we don't duplicate the global variables in the output_code. So, for now, I fall back to the old way of inlining for the cpp wrapper. I am hoping someone with more familiarity with the cpp wrapper can support subgraph lifting (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov).

This work will unblock hierarchical compilation (or cold start compile time work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137200
Approved by: https://github.com/desertfire, https://github.com/eellison
2024-10-11 17:57:10 +00:00
68272ab596 Extend cuda_flip to unsigned types (#137781)
Using AT_DISPATCH_V2

Test plan: `python3 -c "import torch;print(torch.randint(0, 100, (4, 4),  dtype=torch.uint16, device='cuda').flip(0))"`
Fixes https://github.com/pytorch/pytorch/issues/137770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137781
Approved by: https://github.com/Skylion007
2024-10-11 17:02:53 +00:00
4fa46d3bda TunableOp: Performance Improvement (#135371)
This PR reduces the overhead on the CPU side by eliminating the use of c10::str in creating signatures; instead we use the fmt library. TunableOp overhead on the CPU is reduced by around 40%. The improvement is most noticeable on small GEMMs. This PR does not contain any bug fixes or new features.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135371
Approved by: https://github.com/jeffdaily
2024-10-11 16:52:40 +00:00
da578495ca [ROCm] enable gfx110x for hipblaslt (#137317)
Fixes #136347.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137317
Approved by: https://github.com/Skylion007, https://github.com/jithunnair-amd

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-11 16:51:31 +00:00
41ccfc8752 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Differential Revision: [D64184625](https://our.internmc.facebook.com/intern/diff/D64184625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-11 16:50:25 +00:00
a06d49a9f9 bump up add_loop_inductor_gpu expected instruction count. (#137672)
The diff https://github.com/pytorch/pytorch/pull/137117/files increased the instruction count for add_loop_inductor_gpu,
but not enough to fail in that diff; the test is now somewhat flaky.

It failed on a recent merge:
<img width="1351" alt="Screenshot 2024-10-09 at 5 25 57 PM" src="https://github.com/user-attachments/assets/27178f76-c08e-4d13-9ac4-4cd70f146611">

and here is the history
<img width="1047" alt="Screenshot 2024-10-09 at 5 26 07 PM" src="https://github.com/user-attachments/assets/bd563e34-6f7f-461a-ae54-8a616be9bf09">
<img width="777" alt="Screenshot 2024-10-09 at 5 30 19 PM" src="https://github.com/user-attachments/assets/d0a1ca81-2bdb-4cf6-8ac8-ba5971d447bf">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137672
Approved by: https://github.com/masnesral
2024-10-11 16:46:38 +00:00
d41558f8d7 [BE][Ez]: Better error message for CUDNN attention attn_bias (#137702)
Follow-up to #136885. Fixes an edge case in the error condition (it should be an early exit so that expand doesn't ever run into trouble with weird cases: attn_bias with 0, 1, or > 5 dims).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137702
Approved by: https://github.com/eqy
2024-10-11 16:44:46 +00:00
5835b1af10 [FSDP2] Gated dynamo import for torch deploy (#137203)
Differential Revision: [D63777335](https://our.internmc.facebook.com/intern/diff/D63777335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137203
Approved by: https://github.com/wz337
2024-10-11 16:38:19 +00:00
bdb42e7c94 [PGNCCL] Added some missing spaces in barrier msg (#137721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137721
Approved by: https://github.com/kwen2501
ghstack dependencies: #137713
2024-10-11 15:17:25 +00:00
39c5048549 [DeviceMesh] Fixed from_group when passing tensor mesh (#137713)
This fixes https://github.com/pytorch/pytorch/issues/137676. (sorry for messing this up in the original PR 😓 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137713
Approved by: https://github.com/wz337
2024-10-11 14:53:51 +00:00
e30c55ee52 Update maintainers for inductor and x86 CPU (#136839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136839
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet
2024-10-11 07:24:07 +00:00
1c71de5b2c [ScaleMM] Add a shape dependent max_swizzle size (#137681)
# Summary

I started to explore the performance of _scaled_mm against a triton-based persistent TMA kernel for RowWise scaling.
There are more details here: https://github.com/drisspg/transformer_nuggets/pull/36

It clearly showed that there was some room for improvement on larger problem sizes compared to Triton's performance. Note that the Triton kernel only has a 128x128x128 tile shape, whereas scaled_mm has a 64x128x128 tile shape that we use for smaller problem sizes, which may explain some of the perf delta at smaller shapes.

This led to seeing whether we can improve our Triton codegen lowering for _scaled_mm (I think we should still do this: https://github.com/pytorch/pytorch/pull/137517).

In the meantime, @Chillee suggested I make sure swizzling is set for the large matmul shapes.

This PR makes sure that we increase the max_swizzle_size for the large matmuls.

## Performance
Note: red means the Triton-based TMA kernel beats _scaled_mm; blue means _scaled_mm is faster.

On nightly w/ Triton at (2ef33c6c4c3):
![swizzle_tst_8_full_nightly_heatmaps](https://github.com/user-attachments/assets/e92af19b-4e79-4126-b9d0-da039da5363b)

You can see that as M,K,N increase there is a clear win for the Triton Persistent TMA.

After this PR:

![swizzle_tst_8_full_heatmaps](https://github.com/user-attachments/assets/472068b3-45c2-43f8-84d3-b116da7898d5)

For example w/ this change(power limited gpu)

M=16384  K=16384  N=16384
TFlops Before :`985.49`
TFlops After: `1304.69`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137681
Approved by: https://github.com/eqy
2024-10-11 06:44:31 +00:00
4e309899c7 [Quant] Check stride > 0 for QConv and QConvTranspose (#136739)
Fixes #136722
Fixes #136718

By default, it goes to oneDNN, so this PR adds a check to ensure stride > 0. The program now exits with an error message if stride is 0.
FBGEMM and QNNPACK can create modules with stride=0 without error, but the program crashes when calling forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136739
Approved by: https://github.com/jgong5
2024-10-11 05:50:37 +00:00
fe148024fe [c10d][experimental] Add _abort_process_group (#132291)
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797

This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013

## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.

## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.

## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).

In any case, it may not be a bad idea to experiment with the "group abort" feature via a manual API first and then extend it to the automatic mode (watchdog).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
2024-10-11 05:04:17 +00:00
bc232e3c08 Fix custom op bug of clearing dir (#137655)
Previously, when we deleted a custom op on exiting the context manager, we weren't clearing the dir field of the op namespace. As a result, it polluted other tests.

Differential Revision: [D64141465](https://our.internmc.facebook.com/intern/diff/D64141465/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137655
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-10-11 04:32:40 +00:00
ee713f80ed Enable channels_last format for FSDP (#137382)
Enable FSDP to deal with channels_last memory formatted tensors. Preserving channels_last memory format makes FSDP compatible with the best kernels CUDNN offers.

Summary of changes:
1) Store strides information along with shapes
2) Replace calls to flatten() with as_strided(size=(param.numel(),), stride=(1,)) for flattening
3) Replace calls to view() with as_strided with the stored sizes and strides for unflattening
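
A small hedged sketch of the as_strided-based flatten/unflatten described above (a standalone illustration, not the FSDP code itself):

```python
import torch

# Hypothetical illustration: flatten and unflatten a channels_last parameter via
# as_strided, keeping the original sizes/strides so the memory format survives.
param = torch.randn(8, 3, 4, 4).to(memory_format=torch.channels_last)
sizes, strides = param.size(), param.stride()

flat = param.as_strided(size=(param.numel(),), stride=(1,))   # instead of flatten()
restored = flat.as_strided(size=sizes, stride=strides)        # instead of view()

assert restored.is_contiguous(memory_format=torch.channels_last)
assert torch.equal(restored, param)
```
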

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137382
Approved by: https://github.com/awgu
2024-10-11 03:47:16 +00:00
8ee361ed13 fix test_retrace_pre_autograd (#137733)
Test Plan: fixed

Differential Revision: D64200918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137733
Approved by: https://github.com/pianpwk, https://github.com/tugsbayasgalan
2024-10-11 03:46:22 +00:00
8321eec009 [Inductor UT] Generalize device bias code in test_triton_kernels.py (#137585)
[Inductor UT] Generalize device bias code in test_triton_kernels.py introduced by #137020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137585
Approved by: https://github.com/eellison, https://github.com/jansel
2024-10-11 02:00:01 +00:00
8262f6d271 fix test_lazy_module_kwargs (#137705)
Test Plan: fixed

Differential Revision: D64185644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137705
Approved by: https://github.com/tugsbayasgalan
2024-10-11 01:53:10 +00:00
9d4cb0d3eb Fix param and buffer mapping for state_dict when there are state_dict hooks (#137609)
Resolve #137540

Summary:

We might get different state_dict and named_parameters results when the module has registered custom state_dict hooks.
For exported_program's state_dict, we want the state_dict to reflect the actual module hierarchy at runtime, and it might be different from the model's state_dict() output if the model has state_dict hooks.
To do weight swapping, one needs to either re-export or turn-off the hooks when saving model's state_dict().
Previously, ExportedProgram uses nn.Module's state_dict() method to populate its own state_dict, but it doesn't work for some models (e.g. llama3_3_vision) because ExportedProgram's state_dict and an nn.Module's state_dict have some subtle differences semantically.

nn.Module's state_dict is about how the state should be serialized, and it reflects the structure of the original user model code. In contrast, export specializes on a “run” of a model, and its state_dict needs to reflect the runtime module hierarchy.

One example where these two are different is TorchTune's Llama3_2_vision text decoder. Here, a FusionLayer is added as a local optimization and it is not part of the "static model definition". At runtime, we have mod.layers[3].layer.sa_norm.scale.

But in nn.Module's state_dict, the authors of the model added a state_dict hook to remove the "layer" in mod.state_dict() to reflect the static model definition, so we have mod.state_dict()["layers.3.sa_norm.scale"].
In this Diff, we change ExportedProgram to populate its state_dict using named_parameters() and named_buffers() instead. So in ExportedProgram's state_dict, we have "layers.3.layer.sa_norm.scale", which reflects the runtime module hierarchy.
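
A hypothetical minimal sketch of the mismatch described above (the wrapper and hook here are made up, loosely modeled on the FusionLayer example):

```python
import torch.nn as nn

# Hypothetical sketch: a wrapper registers a state_dict hook that strips its own
# "layer." level, so state_dict() keys differ from named_parameters() keys.
class FusionLayer(nn.Module):
    def __init__(self, layer):
        super().__init__()
        self.layer = layer
        self._register_state_dict_hook(self._strip_layer_prefix)

    @staticmethod
    def _strip_layer_prefix(module, state_dict, prefix, local_metadata):
        # rename "<prefix>layer.X" -> "<prefix>X" in place
        for key in list(state_dict.keys()):
            if key.startswith(prefix + "layer."):
                state_dict[prefix + key[len(prefix + "layer."):]] = state_dict.pop(key)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([FusionLayer(nn.Linear(4, 4))])

m = Decoder()
print(list(m.state_dict().keys()))           # ['layers.0.weight', 'layers.0.bias']
print([n for n, _ in m.named_parameters()])  # ['layers.0.layer.weight', 'layers.0.layer.bias']
```
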

Now one problem this presents is weight swapping. Since ExportedProgram's state and the model's state is not the same anymore, weight swapping procedure also needs to change slightly.

In internal Ads and RecSys model deployments, weight swapping is where they have one model that is currently deployed and serving traffic, and they want to swap out the weights with newly trained model weights without having to redo the whole exporting/lowering process and create a new artifact. So they would move the deployed model’s pointer to the state dict over to the new state dict. Because of this, it was previously a requirement that the FQNs match between the exported and the eager model’s state dict.

The new ExportedProgram's state dict still supports weight swapping, but the state_dict to be swapped needs to be obtained from torch.export.exported_program instead of model.state_dict() if the model has state_dict hooks.
The new requirement is that the FQNs match between the exported program's state dict and the state_dict obtained from the `_disabled_load_state_dict_hooks(M)` context manager. One benefit of this new API is that export is now in full control of gathering and updating the model state.
If a model doesn't have any state_dict hooks, one can still use model.state_dict() for weight swapping, so it's BC.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  test_export_for_training_with_state_dict_hooks
```

Differential Revision: D64080561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137609
Approved by: https://github.com/angelayi, https://github.com/pianpwk
2024-10-11 01:33:50 +00:00
a919742149 c10::optional -> std::optional in PyTorch (#137333)
Test Plan: Sandcastle

Differential Revision: D63876535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137333
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-11 00:16:10 +00:00
4fb1fd8a51 Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)"
This reverts commit b6a64dce072240c0b06d2fb03ac81b3ed1b73d92.

Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b55ff476bd Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))
2024-10-10 23:47:25 +00:00
b0da076f0c [AOTI] Turn on the ABI-compatible mode as default (#136534)
Summary: Make AOTI generate ABI-compatible code as default for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136534
Approved by: https://github.com/chenyang78
ghstack dependencies: #137660
2024-10-10 23:44:57 +00:00
ad38bad766 [MPS] Add tri[lu]_indices (#137648)
Requested in https://github.com/pytorch/pytorch/issues/77764#issuecomment-2402365980
Copy-n-paste kernel implementation from 13cf8360d8/aten/src/ATen/native/cuda/TensorFactories.cu (L92)

though use `float` instead of `double` for square root computation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137648
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601, #137647
2024-10-10 23:41:06 +00:00
573101aac3 [AOTI] Handle inplace output in ProxyExecutor (#137660)
Summary: https://github.com/pytorch/pytorch/pull/137401 didn't fix the underlying inplace output issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137660
Approved by: https://github.com/chenyang78
2024-10-10 23:12:46 +00:00
c37bb492da [ONNX] Create an optimize method in ONNXProgram (#137667)
Move optimization from the export call to the `optimize()` method in ONNXProgram.

Users can call `optimize()` before calling `save()` to save the model. Right now, if users set `optimize=True` in `torch.onnx.export`, it will have the same effect as calling `optimize()`, but in the future we can evolve the method to be more flexible (e.g., target-aware, etc.).

Example

```python
onnx_program = torch.onnx.export(..., dynamo=True)
onnx_program.optimize()
onnx_program.save("model.onnx")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137667
Approved by: https://github.com/titaiwangms
ghstack dependencies: #137666
2024-10-10 22:44:19 +00:00
e75984cd31 [ONNX] Use torch_2_6 apis from onnxscript (#137666)
Create an `optimize=False` option in `torch.onnx.export` for model optimization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137666
Approved by: https://github.com/titaiwangms
2024-10-10 22:23:15 +00:00
93bbc8abcc [dynamo, 3.13] use 3.13 multiline traceback in get_instruction_source_311 (#137617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137617
Approved by: https://github.com/jansel
2024-10-10 20:19:27 +00:00
4551a1ee79 [dynamo, 3.13] merge 3.13 FORMAT_* and <=3.12 FORMAT_VALUE (#137656)
This was causing some 3.13 failures locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137656
Approved by: https://github.com/jansel, https://github.com/Skylion007
ghstack dependencies: #137652
2024-10-10 19:53:42 +00:00
6b2c3508f8 [dynamo, 3.13] fix typo in remove_fused_load_store (#137652)
Whoops!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137652
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-10-10 19:53:42 +00:00
9c12198137 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel, Try #2 (#137377)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. First attempt was https://github.com/pytorch/pytorch/pull/136331 .

Differential Revision: [D63923166](https://our.internmc.facebook.com/intern/diff/D63923166/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137377
Approved by: https://github.com/malfet
2024-10-10 19:44:22 +00:00
080f02ac7a [dynamo] do not raise an unimplemented error with boolean masking setitem (#134902)
Cudagraphs break on boolean masking setitem; however, the code runs fine. There is no need to raise an unimplemented error here, since it already warns that it's an incompatible op.

Fixes #134241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134902
Approved by: https://github.com/jansel, https://github.com/henrylhtsang
2024-10-10 19:11:40 +00:00
079f909263 Revert "Make Context to be Device-agnostic Step by Step (1/N) (#136519)"
This reverts commit be0b75256a7e516217b059ef273901b95c022fe7.

Reverted https://github.com/pytorch/pytorch/pull/136519 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:17 +00:00
33e5921e6b Revert "Make Context to be Device-agnostic Step by Step (2/N) (#136526)"
This reverts commit 72ad1b8c6c7c364c1974b82a914876dcdf73af44.

Reverted https://github.com/pytorch/pytorch/pull/136526 on behalf of https://github.com/jovianjaison due to this pr is causing errors internally ([comment](https://github.com/pytorch/pytorch/pull/136519#issuecomment-2405781093))
2024-10-10 18:32:16 +00:00
881a18f25f Set Cuda context in inductor and dont initialize wrong cuda device in fake_tensor (#137603)
Previously we would construct tensors with the "cuda" device, which defaults to device 0 if no CUDA context is set. Fix for https://github.com/pytorch/pytorch/issues/124854
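
A small hedged illustration of the pitfall (requires at least two CUDA devices; not the inductor code itself):

```python
import torch

# Hypothetical illustration: an un-indexed "cuda" tensor lands on the *current*
# device, which is device 0 unless a device/context has been set first.
if torch.cuda.device_count() >= 2:
    t0 = torch.empty(4, device="cuda")      # cuda:0 by default
    with torch.cuda.device(1):
        t1 = torch.empty(4, device="cuda")  # cuda:1, because the current device was set
    print(t0.device, t1.device)
```
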

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137603
Approved by: https://github.com/jansel
2024-10-10 18:25:22 +00:00
dd7c2899bd [dynamo] Properly prune dead cell local variables (#136891)
This patch updates the `prune_dead_locals` logic to do slightly more aggressive pruning of cell local variables in the absence of side effects; e.g., a cell variable can be pruned when its user function(s) will never be used again.

See added tests for examples; note that a few tests in `test/dynamo/test_higher_order_ops.py` also got updated because we are no longer returning the unnecessary graph output.

Fixes #127350, #124653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136891
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
2024-10-10 18:21:24 +00:00
bcfdb72547 Fix dtype test for NumPy 2 (#137532)
Related to #107302

The following test fails with NumPy 2.

```
_________ TestNumPyInteropCPU.test_numpy_array_interface_cpu __________
Traceback (most recent call last):
  File "/usr/local/google/home/haifengj/git/pytorch_np2/test/test_numpy_interop.py", line 415, in test_numpy_array_interface
    wrapped_x = np.array([1, -2, 3, -4], dtype=dtype)
OverflowError: Python integer -2 out of bounds for uint8

To execute this test, run the following from the base repo dir:
    python test/test_numpy_interop.py TestNumPyInteropCPU.test_numpy_array_interface_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

According to the official warning from NumPy 1, assigning a negative value to a `uint8` is deprecated.
The recommended way is to use `np.array([1, -2, 3, -4]).astype(np.uint8)`.
See the following for details.
```
>>> np.array([1, -2, 3, -4], dtype=np.uint8)
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -2 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
<stdin>:1: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of -4 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
array([  1, 254,   3, 252], dtype=uint8)
```

This PR fixes the test failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137532
Approved by: https://github.com/soulitzer
2024-10-10 18:12:25 +00:00
5e73f2d7c0 [PT2][Dynamo][Optimus] Add batch detach, clamp and nan_to_num in pre grad (#137415)
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=4 OC_CAUSE=1 buck2 test '@fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion -- test_math_op_fusion
```

Buck UI: https://www.internalfb.com/buck2/185799e1-6ea8-4bd1-b2e1-0c1a8dd92f89
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2533275044114335
Network: Up: 14KiB  Down: 287B  (reSessionID-d24cee56-2a22-4a90-b4c6-1d0c3ab256c1)
Jobs completed: 8. Time elapsed: 48.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt scripts/shuaiyang:test -- --optimus --flow_id 648108097 2>&1 | tee ~/local_run_shuai_interformer_cmf.txt
```

Counter({'pattern_matcher_nodes': 6626, 'pattern_matcher_count': 6396, 'extern_calls': 5340, 'benchmarking.TritonBenchmarker.benchmark_gpu': 2710, 'normalization_pass': 44, 'fxgraph_cache_miss': 37, 'scmerge_split_removed': 16, 'scmerge_cat_removed': 16, 'unbind_stack_pass': 16, 'batch_aten_mul': 15, 'batch_linear_post_grad': 12, 'batch_linear': 5, 'batch_detach': 4, 'batch_nan_to_num': 4, 'batch_clamp': 4, 'batch_aten_add': 4, 'batch_layernorm': 2, 'scmerge_cat_added': 2, 'batch_sigmoid': 1, 'scmerge_split_sections_removed': 1, 'unbind_stack_to_slices_pass': 1, 'benchmarking.TritonBenchmarker.triton_do_bench': 1, 'scmerge_split_added': 1, 'fxgraph_cache_hit': 1, 'batch_aten_sub': 1})

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/mengluy/2024-10-06-20-53-01/trace.json.gz&bucket=gpu_traces

# e2e

baseline:
f650336422

proposal:

f650336607

### QPS and NE results

 {F1914975940}
{F1914975938}
{F1914975939}
{F1914975945}

> 0.7% QPS gain with NE neutral

### trace analysis

Before
 {F1914990600}

After

{F1914990015}

We reduced the green part of the trace, which was introduced by small nan_to_num kernels.

Differential Revision: D63962711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137415
Approved by: https://github.com/Yuzhen11
2024-10-10 18:11:08 +00:00
cyy
94e12f97dc [Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404)
Follows #137072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137404
Approved by: https://github.com/Skylion007
2024-10-10 18:05:34 +00:00
20815c7cb9 Intel GPU: mode: add XPU to supported devices list (#137575)
The kernel for the `mode` op is being ported in https://github.com/intel/torch-xpu-ops/pull/770; this requires adding XPU to the supported device types.

Additional context: https://github.com/intel/torch-xpu-ops/issues/327

@fengyuan14 @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137575
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Feng Yuan <feng1.yuan@intel.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 17:44:40 +00:00
cdd8fa98c7 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for the python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create an extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
ghstack dependencies: #137161
2024-10-10 17:16:34 +00:00
9690cacd61 [aotinductor] Add helper fn to atomically apply size_hint to an expr w/ unbacked symints (#137537)
### Context
Fixes CUDA IMA in autotune_at_compile_time, where we would generate an example tensor with an incorrect stride.

In the case below, the stride should be (u0 * 128, 128, 1). However, we apply the fallback on the entire expr (i.e. u0 * 128).
```
# buf817 = tensor(size=(s0, u0, 128), stride=(u0 * 128, 128, 1))

buf812 = generate_example_value(
    (64, 8192, 128), (8192, 128, 1), "cuda:0", torch.bfloat16, 0
)
```

The fix is to apply the fallback on each symbol.
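
A hedged sketch of the difference (the symbol and fallback values below are made up for illustration):

```python
import sympy

# Hypothetical sketch: hinting each unbacked symbol and then evaluating the
# expression keeps the stride consistent with the size, unlike hinting the
# whole expression with a single flat fallback value.
u0 = sympy.Symbol("u0", positive=True, integer=True)
stride_expr = u0 * 128
fallback_hint = 8192

whole_expr_hint = fallback_hint                          # old: stride hinted as 8192
per_symbol_hint = stride_expr.subs({u0: fallback_hint})  # new: stride hinted as 8192 * 128
print(whole_expr_hint, per_symbol_hint)                  # 8192 1048576
```
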

### Test
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer python test_aot_inductor.py -k test_stride_with_unbacked_expr_abi_compatible_cuda

========= Invalid __global__ write of size 2 bytes
```

Differential Revision: [D64074561](https://our.internmc.facebook.com/intern/diff/D64074561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137537
Approved by: https://github.com/jingsh
2024-10-10 17:11:24 +00:00
b6a64dce07 Upgrade distributed test to g4dn instances (T4 GPUs) (#137161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161
Approved by: https://github.com/seemethere
2024-10-10 17:11:21 +00:00
034af88c2d Add a microbenchmark for cache read path (#137607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137607
Approved by: https://github.com/jamesjwu
2024-10-10 16:36:18 +00:00
dae60075e0 [BE][MPS] Use Tensor->TensorBase in OperationUtils.h (#137647)
As for the most part those helper method need access to only base class methods.
Also replace spurious `at::` namespace prefixes, i.e. `at::Tensor`->`Tensor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137647
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #137601
2024-10-10 16:11:17 +00:00
bcf15d1bb4 [AOTI] Add error check for parsing error string from error code (#137626)
Currently, there are compilation warnings as below, which are resolved after the fix

```
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp: In function ‘ihipModuleSymbol_t* loadKernel(std::string, const string&, uint32_t, const std::optional<std::__cxx11::basic_string<char> >&)’:
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:482:25: warning: ignoring returned value of type ‘hipError_t’, declared with attribute nodiscard [-Wunused-result]
  482 |     hipDrvGetErrorString(code, &msg);                  \
      |     ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
/tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:519:5: note: in expansion of macro ‘CUDA_DRIVER_CHECK’
  519 |     CUDA_DRIVER_CHECK(hipModuleLoad(&mod, filePath.c_str()));
      |     ^~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:2369:12: note: in call to ‘hipError_t hipDrvGetErrorString(hipError_t, const char**)’, declared here
 2369 | hipError_t hipDrvGetErrorString(hipError_t hipError, const char** errorString);
      |            ^~~~~~~~~~~~~~~~~~~~
In file included from /opt/rocm/include/hip/hip_runtime.h:70,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/device_utils.h:14,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:17,
                 from /pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:13,
                 from /tmp/torchinductor_root/c7t6qm4gf35cxkk5jywa5booovl5n6ivzwdbbs5og7rdemqtgrzh/caoefkofe5jrkuaoch4lfpjwtodlcy4savxgzsxqldkcdof7ifh7.cpp:4:
/opt/rocm/include/hip/hip_runtime_api.h:399:3: note: ‘hipError_t’ declared here
  399 | } hipError_t;
      |   ^~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137626
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2024-10-10 15:14:39 +00:00
575f260229 Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672)
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571

* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've done some comparisons of the existing NEON implementation against the SVE-vectorization-enabled route for `torch.compile`.
Test results are for 8 cores on ARM Neoverse_V1.

<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">

It's worth mentioning that for the standalone `SiLU` op there's a ~1.8x speedup with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
479bd1f300 Hardlock frequent periodic jobs to Meta runners (#137616)
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, once that happened last week we saw a sudden large spike in cost that would have caused us to overuse our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-10-10 12:32:16 +00:00
f69bf005f7 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)"
This reverts commit 4304c68a4c4d742a3ec5266b81f64a85922509c9.

Reverted https://github.com/pytorch/pytorch/pull/137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](https://github.com/pytorch/pytorch/pull/137097#issuecomment-2404573266))
2024-10-10 09:29:05 +00:00
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the layout of the ROCm rpm suite (the header is in include/rccl/rccl.h); however, in the source tree the header is just src/rccl.h. Using rccl/rccl.h makes us find the rpm's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

w/o this diff, it'll show 2.18 nccl version

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
eaab5cf0f9 Fix torch.compile correctness bug on aarch64+sve due to gcc bug (#137606)
Some unit tests relating to argmin_vec/argmax_vec were failing due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Fixes #137597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 08:44:53 +00:00
365722f606 fix test_constant_output (#137547)
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.

Test Plan: fixed test

Differential Revision: D64081786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
2024-10-10 07:48:15 +00:00
4e8997744c [inductor] Enable coordinate descent tuning with max-autotune (#136867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136867
Approved by: https://github.com/Chillee
2024-10-10 07:29:52 +00:00
383eba5229 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` is set and the input is on CUDA.
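
A minimal hedged usage sketch (assumes a CUDA device is available):

```python
import torch

# With deterministic algorithms enabled, CUDA cumsum routes through its
# decomposition, so repeated runs on the same input match exactly.
torch.use_deterministic_algorithms(True)
if torch.cuda.is_available():
    x = torch.randn(1 << 20, device="cuda")
    assert torch.equal(x.cumsum(dim=0), x.cumsum(dim=0))
```
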

Fixes #89492
Fixes #75240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
2024-10-10 06:59:08 +00:00
71010bf097 [Inductor][CPP] generalize the wgt tensor delete (#135101)
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-10 06:01:09 +00:00
ea83c78174 [SymmetricMemory] set the storage_offset of tensors returned by get_buffer() to 0 (#137569)
It seems that there's a bug in `TensorMaker` - it would treat `storage_offset` as a byte offset when calculating the storage size, but as an element offset when setting the tensor's `storage_offset`. This seems to be causing tensors returned by get_buffer() with a non-0 offset to report the wrong storage size.

Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
2024-10-10 05:05:58 +00:00
96bab021c0 ATen | Fix header namespace pollution. (#137619)
Summary: Fixing a warning, so we can enable it globally.

Test Plan: Sandcastle-only, no runtime changes.

Differential Revision: D64122115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137619
Approved by: https://github.com/Skylion007
2024-10-10 05:04:54 +00:00
1aa130e80c Avoid generating as_strided for aliasing views in auto_functionalize_v2 (#137149)
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base,
we regenerate that view by calling aten.alias instead of as_strided, for better performance.
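
A hedged illustration of the equivalence being exploited (a standalone snippet, not the pass itself):

```python
import torch

# When a view's size/stride/storage_offset all match its base, aten.alias
# regenerates the same view that the more general as_strided call would.
base = torch.randn(4, 8)
via_alias = torch.ops.aten.alias(base)
via_as_strided = torch.ops.aten.as_strided(
    base, base.size(), base.stride(), base.storage_offset()
)
assert via_alias.data_ptr() == via_as_strided.data_ptr() == base.data_ptr()
assert via_alias.shape == via_as_strided.shape == base.shape
```
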

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
2024-10-10 05:00:41 +00:00
b5284a01a4 [CPU] remove keyword static for exp_u20 (#137571)
Remove the `static` keyword from the constant vec registers in the exp_u20 implementation. With the bf16 input shape of BertLarge, the SDPA kernel improves from 5.1 ms to 4.7 ms on SPR with 56 threads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
2024-10-10 04:52:04 +00:00
d170c410f2 Clean up op BC check list (#137634)
Summary: Remove some stale items

Test Plan: CI

Differential Revision: D64133246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634
Approved by: https://github.com/hl475
2024-10-10 04:29:21 +00:00
249152475d fix sequence number for group (#134578)
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has 2 sequence number counter, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We need to match collective and p2p with wait separately.
29b5a462dc

Depend on: https://github.com/pytorch/pytorch/pull/135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
5aa9f2b660 Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654)
Fixed an issue where nn.Transformer().generate_square_subsequent_mask() doesn't respect set_default_device() and set_default_dtype().
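
A minimal hedged check of the fixed behavior (dtype-only, so it runs without a GPU):

```python
import torch
import torch.nn as nn

# After the fix, the generated mask should follow the configured default dtype
# (and, analogously, the default device).
torch.set_default_dtype(torch.float64)
mask = nn.Transformer.generate_square_subsequent_mask(5)
print(mask.dtype)  # torch.float64
torch.set_default_dtype(torch.float32)  # restore
```
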

Fixes #137186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654
Approved by: https://github.com/mikaylagawarecki
2024-10-10 03:10:01 +00:00
b9c9f7f0fa Document ROCm environment variables and improve CMake messaging to user (#137308)
Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCM builds.

- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-10 03:08:08 +00:00
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky, with 8% detected variance,
so I set it up with a 20% threshold (roughly 8% x 2, plus margin);
the others are stable within +-1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
f9ed39c989 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
2024-10-10 01:23:30 +00:00
d50d5df2fb Add warning for non static grads in optimizer variable (#137554)
Fixes https://github.com/pytorch/pytorch/issues/112548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554
Approved by: https://github.com/williamwen42
2024-10-10 01:23:21 +00:00
f301f6544b fix bug for fill_empty_deterministic_ not support complex half (#137488)
Fixes #133157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488
Approved by: https://github.com/ezyang
2024-10-10 01:21:32 +00:00
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also adds a signpost log for the benchmarks that pass.
As a test run, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
d9f4a7d3f9 Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-10-10 00:52:50 +00:00
4f45c76806 [PGNCCL] Limit access to ncclComm_ (#137573)
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over using `ncclComm_` directly.

To prevent `ncclComm_` from being directly used outside, e.g. in `ProcessGroupNCCL`, we also move it as a private member of `NCCLComm` class -- the external-facing wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
2024-10-10 00:34:05 +00:00
cyy
0739efbd1f Remove reference of gcc7 from CI scripts (#137339)
Because gcc7 can't be used to build Pytorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-10 00:29:29 +00:00
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU synchronization (#137516)
Summary:
Barrier is essentially intended to block the CPU thread (rather than GPU
streams). Before, we used two stream synchronizations (1. the current stream
blocked by the NCCL stream end event, 2. the CPU thread blocked on the current
stream). This is unnecessary, as we already have CPU-thread-blocking logic in
wait(). Also, adding a barrier-specific code block to the general GPU
synchronize() API is intrusive and confusing.

This PR cleans this.

Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
51c33c0b72 Increase the runner size of AVX* jobs to 4xlarge (#137633)
The failing test was recently moved back from slow and it requires more RAM than is available on a 2xlarge runner. It looks ok to up the instance size to 4xlarge instead. I missed the periodic jobs in https://github.com/pytorch/pytorch/pull/137447

Example periodic failures de4c2a3b4e (test_cpu_repro)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-10-09 23:43:49 +00:00
4304c68a4c In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097
Approved by: https://github.com/angelayi
ghstack dependencies: #137091
2024-10-09 23:34:35 +00:00
6908d8d450 Enable python dispatcher for reinplacing pass (#137091)
Arguably this should be put somewhere higher up in the stack?  Not sure.

Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/

There is a repro but I need to fix more bugs before it can be checked in

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091
Approved by: https://github.com/bdhirsh
2024-10-09 23:34:35 +00:00
31e334ad9e [unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043)
This fixes a compilation error on systems with the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043
Approved by: https://github.com/aaronenyeshi
2024-10-09 23:34:16 +00:00
aafa02506e [CudaDMAConnectivityDetector] improve the detection robustness (#137530)
- Previously, the detection would fail if it ran before the user called APIs such as `torch.cuda.set_device()`, because the detection logic requires NVML initialization. In this PR, we added explicit NVML initialization (which is idempotent).
- Previously, any NVML issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology that assumes no NVLink connectivity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
2024-10-09 23:30:16 +00:00
fbaf9b62de [SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529)
This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
2024-10-09 23:30:16 +00:00
39c5122a4f [IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
2024-10-09 23:30:16 +00:00
e6edfe3928 [SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
2024-10-09 23:30:16 +00:00
b22749712c type _inductor/optimize_indexing.py (#137599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-10-09 23:29:47 +00:00
d67b4f9e5f type _inductor/quantized_lowerings.py (#137598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598
Approved by: https://github.com/Skylion007
2024-10-09 23:29:26 +00:00
9b01d17b8d Use MetaProxy more pervasively (#137588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588
Approved by: https://github.com/ezyang
ghstack dependencies: #136674
2024-10-09 23:22:03 +00:00
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before this change, tests for operators like `eye` or `triu_indices` were essentially testing that the respective CPU operators are stable, as cpu_sample and mps_sample were the same.

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`.

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 -  Allow 1.0  absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
 - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
48fe0d56d6 Type _inductor/exc.py (#137595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595
Approved by: https://github.com/Skylion007
2024-10-09 23:15:06 +00:00
7408742b67 Make ignore_fresh_unbacked_symbols reentrant (#137605)
I have a test but it requires some other feature work that isn't fully baked.  Maybe this will fix an xfail.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605
Approved by: https://github.com/albanD
2024-10-09 23:08:05 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to hit OOM since the application usually needs a large amount of GPU memory, for example when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp:

- record untuned GEMMs to a file.

- a Python API named `tune_gemm_in_file` is added to read the untuned file and tune the GEMMs in it (see the sketch below).
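
A minimal sketch of the workflow described above (the environment-variable names, module path, and file name are assumptions based on existing TunableOp conventions, not taken verbatim from this PR):

```python
import os
import torch

# Step 1 (assumed knobs): run the workload with TunableOp enabled but online tuning off,
# so GEMM shapes are recorded to an "untuned" file instead of being tuned in-process.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "0"
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b  # this GEMM is recorded rather than tuned online

# Step 2: tune every recorded GEMM offline, when the full workload is no longer holding memory.
# tune_gemm_in_file is the API named in this PR; its exact module path here is an assumption.
torch.cuda.tunable.tune_gemm_in_file("tunableop_untuned0.csv")
```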

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
839d3568b0 [compiled autograd] fix -Wuninitialized (#137539)
https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539
Approved by: https://github.com/isuruf, https://github.com/Skylion007
2024-10-09 21:16:26 +00:00
38027b9b47 [SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567)
Fixes https://github.com/pytorch/pytorch/issues/137145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 20:45:57 +00:00
a93ea617b5 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.
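
A minimal sketch of what the new requirement looks like on the user side (mesh sizes, dim names, and the exact `fully_shard` import path are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# 2-D mesh for HSDP: the outer dim replicates, the inner dim shards.
# Naming the dims is now required when passing a 2-D mesh to fully_shard.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

model = nn.Linear(16, 16, device="cuda")
fully_shard(model, mesh=mesh)
```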

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-09 20:35:09 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid torch.compile debugging. You provide a function that returns True on success and False on failure, or you do something out of process and run `bisect_helper good | bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example usage of how to hook it up for aot_eager_decomp_partition and decomposition subsystem is :

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
cfe970260a Clarify opt-einsum usage, fix #127109 (#137596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596
Approved by: https://github.com/albanD
2024-10-09 20:31:24 +00:00
c73d2634b9 Revert "Log chromium event for automatic dynamic reasons (#137491)"
This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92.

Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))
2024-10-09 20:24:12 +00:00
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
572f506f9c [c10d] Improve split_group test (#137572)
Fix 1:
`backend1 = pg._get_backend`, here `pg` should be `ng1`.

Fix 2:
`dist.broadcast` should be called by ranks of subgroup `ng1` only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572
Approved by: https://github.com/Skylion007
2024-10-09 19:43:57 +00:00
70288c3c2d Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444)
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263

Follow ups: Do the same for maia and mtia

## Motivation

With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors  by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.

However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.

## Why is this ok?

The following comment on why numpy is necessary for serialization is legacy

c87c9f0a01/torch/_tensor.py (L303-L312)

We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content

**Instead what now happens is that CPU storage is constructed with data from the file **and then** moved onto backend device.**

Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
2024-10-09 19:35:55 +00:00
aa61e251d4 [FSDP2] Added shard_placement_fn arg (#137496)
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang  has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
2024-10-09 19:13:32 +00:00
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobrenfb.com>

Commandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang with shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack support for other operators on top in subsequent PRs to keep this scoped to the mechanics of the fx pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
f15edb291a type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-09 18:35:28 +00:00
9a957e2842 [NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523)
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.

The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.

Test Plan:
```
        tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
        # Create a list with same number of elements as world size (aka no. of ranks)
        # During allgather this list is going to be populated with tensors from all ranks (aka all gather)
        gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
        # get collective from all ranks
        if i <= 10 or RANK != 0:
            dist.all_gather(gathered_tensors, tensor)
```
My script triggers the flight recorder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283

Differential Revision: D61929401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
2024-10-09 18:19:33 +00:00
394c143e4e [dynamo] Fix error when inlining certain nested closure returned by another function (#137510)
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.

In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as variable of the cell's actual value, rather than a `NewCellVariable`.

Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.

This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.

This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are stored directly into `NestedUserFunctionVariable::closure` upon function creation, at which point Dynamo always has the cell value in `symbolic_locals` to look up.

Fixes #136814.
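
A hypothetical illustration of the kind of pattern this enables (not the actual test case): a closure created in one frame captures a cell there, and is only inlined later from a frame where that cell's value is not a local:

```python
import torch

def make_fn():
    captured = torch.ones(3)      # lives in a cell captured by inner()
    def inner(x):
        return x + captured
    return inner

fn = make_fn()

@torch.compile(backend="eager", fullgraph=True)
def run(x):
    # Dynamo inlines fn here; the captured cell's value is not in scope at this call site.
    return fn(x)

print(run(torch.zeros(3)))
```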

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
2024-10-09 18:13:57 +00:00
018dabff20 [ONNX] Implement patch for jit.isinstance (#137592)
Patch torch.jit.isinstance for users for models to be dynamo exportable. Replaces https://github.com/pytorch/pytorch/pull/137487.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-09 18:06:52 +00:00
ceb2fcc5db [FSDP2] Fixed incorrect tensor meta after .to(dtype) (#137593)
This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes to module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta`'s dtype may have changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593
Approved by: https://github.com/Skylion007
2024-10-09 17:57:11 +00:00
bae8d5853e [TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798)
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
  GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0).  (Size-like symbols: u22)

  ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
  Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

  Potential framework code culprit (scroll up for full backtrace):
    File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
      if M <= 0 or N <= 0:
```
```
    N = prod(inner_dims)  # type: ignore[arg-type]
    M = prod(outer_dims)  # type: ignore[arg-type]
    if M <= 0 or N <= 0:
        return (
            input.new_zeros(input_shape) if output_mask[0] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
        )
```
# changes
* use guard_size_oblivious, since the new_zeros return path is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
    if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```

# past
* found `u22` was introduced at
```
    def _wait_impl(self) -> List[List[int]]:
        # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
        if isinstance(self._splits_awaitable, dist.Work):
            self._splits_awaitable.wait()

        ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here

        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            for i in range(len(ret)):
                for j in range(len(ret[i])):
                    torch._check_is_size(ret[i][j])   # <----------  my question: why the _check_is_size isn't enough??
                    torch._check(ret[i][j] > 0)   # <------ added by diff V1
```

Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```

# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)
if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):

Differential Revision: D63424929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
2024-10-09 17:19:45 +00:00
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
e945b6600d Support 3.8 compile again (#137587)
This is not going to be very reliable since we don't have CI though...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587
Approved by: https://github.com/Skylion007
2024-10-09 15:54:52 +00:00
1d15dd7891 Fix triton_reshape to properly expand Min keyword in triton codegen (#137357)
Summary: Previously, triton_reshape would generate code with the `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand the `Min` keyword using `<`.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```

Differential Revision: D63850158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2024-10-09 15:53:45 +00:00
de4c2a3b4e Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253)
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
2024-10-09 08:06:16 +00:00
ac8954d1ca [pattern match][SDPA] remove contiguous in sdpa replacement (#136930)
Fixes a perf issue that was found internally.
In that case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in the model code. However, before entering SDPA it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous to the inputs in the replacement. This is not necessary, as the contiguous call doesn't exist in the pattern. Furthermore, it could sometimes cause perf issues. Anyway, we can do the additional contiguous in the kernel implementation if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5
2024-10-09 07:52:38 +00:00
72ad1b8c6c Make Context to be Device-agnostic Step by Step (2/N) (#136526)
- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #136519
2024-10-09 07:34:30 +00:00
a02093e824 fix test_export_constraints_error_not_in_range (#137500)
Test Plan: fixed

Differential Revision: D64052011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500
Approved by: https://github.com/tugsbayasgalan
2024-10-09 05:48:14 +00:00
abb00efc14 Add torch.squeeze parameter description to declare allowed type (#137485)
Fixes #137422

Add a parameter type definition to the API docs to clarify the allowed value types and stop users from passing `None` as the `dim` value directly (accepted forms are sketched after the snippet below).

```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```
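
For reference, a short sketch of the accepted call forms (the tuple-of-dims form is available on recent versions):

```python
>>> x.squeeze()            # squeeze all size-1 dims
>>> x.squeeze(dim=1)       # squeeze a specific dim
>>> x.squeeze(dim=(0, 1))  # tuple of dims
```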

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
2024-10-09 05:29:13 +00:00
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. Maybe this change would also fix that issue (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
4628fcf41a Fix ir._WaitKernel (#137401)
In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to
an incorrect outputs list.  Add the correct set of outputs, as done in
ir._CollectiveKernel.create_out_of_place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401
Approved by: https://github.com/desertfire
ghstack dependencies: #136924
2024-10-09 04:02:30 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than it helps, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
d3edf4ebf4 [SymmetricMemoryOps] implement two-shot all-reduce (#137473)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
2024-10-09 03:49:42 +00:00
82e55b624f [SymmetricMemoryOps] implement one_shot_all_reduce (#137472)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
2024-10-09 03:49:42 +00:00
5d83ee3e32 [SymmetricMemoryOps] refine cross-device barriers (#137471)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 03:49:42 +00:00
5f1759a025 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-09 02:29:40 +00:00
d5785d4295 [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-09 02:29:40 +00:00
0a304d9048 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-09 02:29:40 +00:00
b3f30c9bc3 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo's autograd function handling; that fixed a graph break, which in turn caused the autograd test failures (skipping those for now; will file an issue).

Previously those tests were in essence running in eager, because dynamo would fall back due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-09 02:29:40 +00:00
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
108b469f78 [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-09 02:29:40 +00:00
e41dffbedd [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true Python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side-effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
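
A hypothetical sketch of the kind of program this handles (the mode and names below are illustrative, not taken from the PR's tests):

```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        return func(*args, **kwargs)

@torch.compile(backend="eager")
def f(x):
    with NoopMode():
        y = x.sin()
        torch._dynamo.graph_break()  # triggers the resume-function handling described above
        return y.cos()

print(f(torch.randn(4)))
```
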
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-09 02:29:40 +00:00
0b8048c78a Fix AOTI CPP GEMM Template issue without freezing (#136421)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the used weight was before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-10-09 02:19:07 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init to be device-agnostic and move it to AcceleratorHooksInterface
- refactoring context related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
384ddab294 [c10d] fix sequence numbers for coalesced operations (#135132)
Summary:
We were erroneously incrementing seq_collective for p2p operations.
FIxes issue #134833

Test Plan:
Unit tests.
TODO: add more unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132
Approved by: https://github.com/fduwjj
2024-10-09 01:38:12 +00:00
8cbb58cff6 [inductor] Limit cpu copies in autotuning to CUDA devices (#137509)
Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509
Approved by: https://github.com/int3, https://github.com/eellison
2024-10-09 01:31:58 +00:00
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)
Summary: making it so that `config.activation_memory_budget_solver` can be set to a callable, which is then invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
23c531b3e9 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-10-09 00:19:03 +00:00
de7f32a205 openreg add pin_memory (#135339)
According to the `Next steps` section in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and a HostAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339
Approved by: https://github.com/albanD
2024-10-09 00:07:59 +00:00
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
2a1829d728 Error message for allow_in_graph decorator and arbitrary function combo (#135972)
Fixes #103615

Quick error message for non-allowed allow_in_graph decorator and arbitrary function combo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972
Approved by: https://github.com/anijain2305
2024-10-08 22:48:38 +00:00
4aed81c0db Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cuBLAS. I guess this is better than a copy? Not sure. We could also introduce a layout that denotes a fixed shape and stride for which we control the allocation:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-08 22:36:46 +00:00
02013da038 Lift restriction on training IR for unflatten (#137470)
Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470
Approved by: https://github.com/avikchaudhuri
2024-10-08 22:30:24 +00:00
81c8a8ada6 [ONNX] Bump onnxscript in CI (#137497)
To 0.1.0.dev20241008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497
Approved by: https://github.com/titaiwangms
2024-10-08 21:56:30 +00:00
76ab1ab665 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
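
An illustrative, simplified sketch of the failure pattern (hypothetical; not the actual test): an autograd.Function whose second NJT output is unused, so the engine has to materialize zeros for its grad:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.sin(), x.cos()  # second output goes unused downstream

    @staticmethod
    def backward(ctx, g1, g2):
        # g2 is the engine-allocated zeros for the unused output; with this PR it is
        # built via zeros_like(), so NJT shapes (nested ints) no longer break zeros().
        return g1 + g2

nt = torch.nested.nested_tensor(
    [torch.randn(2, 3), torch.randn(4, 3)], layout=torch.jagged, requires_grad=True
)
used, _unused = TwoOutputs.apply(nt)
used.backward(torch.ones_like(used))
```
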
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
2024-10-08 21:01:36 +00:00
5e3e1c0151 Revert "[FSDP2] Required mesh_dim_names for HSDP (#137436)"
This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279.

Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))
2024-10-08 20:50:49 +00:00
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
d34b617bb9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)"
This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480.

Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
8c937445ee Revert "[Dynamo] Remove ignored modes workaround (#135502) (#137115)"
This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803.

Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
cc75ac084f Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-08 20:17:03 +00:00
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
3c1ab93678 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-08 19:53:12 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
Compile time benchmark for the min-cut partitioner. I'm hoping that this is a reasonable benchmark (a rough sketch of such a model follows the list) because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple
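
A rough sketch of a model with those properties (names and sizes are illustrative, not the benchmark's actual code):

```python
import torch

def make_fn(num_layers=16, dim=64):
    # one input, many weights used sequentially, mixing matmul with sin
    weights = [torch.randn(dim, dim, requires_grad=True) for _ in range(num_layers)]
    def fn(x):
        for w in weights:
            x = torch.sin(x @ w)
        return x.sum()
    return fn

fn = make_fn()
loss = torch.compile(fn)(torch.randn(64, 64))
loss.backward()  # exercises the partitioner's recompute-vs-save decisions
```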

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
1aac1ffce1 Don't generate implicit value ranges for missing symbols. (#136667)
Instead, fall back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.

```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```
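
A minimal sketch of the lookup-with-fallback idea (the names below are illustrative, not the actual ShapeEnv internals):

```python
from typing import Callable, Dict, Tuple

Range = Tuple[float, float]

def bound_for(symbol: str,
              ranges: Dict[str, Range],
              missing_handler: Callable[[str], Range]) -> Range:
    # Only symbols that actually have a value range live in the dict;
    # anything else (e.g. nested-int symbols) is answered by the handler
    # instead of pre-populating an implicit range for every symbol.
    if symbol in ranges:
        return ranges[symbol]
    return missing_handler(symbol)

ranges = {"s0": (1.0, 4096.0)}
default = lambda s: (-float("inf"), float("inf"))
print(bound_for("s0", ranges, default))  # known value range
print(bound_for("u0", ranges, default))  # falls back to the handler
```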

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
2024-10-08 18:12:57 +00:00
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.
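
A minimal usage sketch of the new API (plain Python ints here; under dynamic shapes the elements would be symbolic sizes, which is where the single n-ary node matters):

```python
import torch

sizes = [3, 5, 7, 11]         # stand-ins for symbolic sizes like s0, s1, ...
total = torch.sym_sum(sizes)  # one n-ary sum instead of a chain of binary adds
print(total)                  # 26

# the quadratic pattern this avoids:
# total = 0
# for s in sizes:
#     total = total + s
```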

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
3bf6594d13 Log compile ids to pt2_remote_cache and pt2_compile_events (#137431)
Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431
Approved by: https://github.com/oulgen
2024-10-08 18:04:48 +00:00
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
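
A hedged illustration of the new validation (the exact exception type and message may differ from what is shown here):

```python
import torch

x = torch.randn(3)
print(torch.linalg.vector_norm(x, ord=2))    # numeric ords are accepted as before

try:
    torch.linalg.vector_norm(x, ord="2")     # a non-numeric ord is now rejected up front
except (TypeError, RuntimeError) as e:
    print(f"{type(e).__name__}: {e}")
```
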
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
d87835ac32 [Profiler] Clear Out Dangling AppendOnlyLists (#137450)
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when profiler is exiting.

Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully

Differential Revision: D64010911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
2024-10-08 17:48:59 +00:00
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d826074546558f6665a4c118335a7725503cac.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
a8047564ff Revert "[FlexAttention] Support training bias for eager (#136910)"
This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f.

Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))
2024-10-08 17:29:02 +00:00
0b5ade8a12 Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)"
This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3.

Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))
2024-10-08 17:26:19 +00:00
2570d77a26 Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354)"
This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26.

Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))
2024-10-08 17:22:40 +00:00
76c5bdd2cc Revert "[Dynamo] Handle extracted unbound tensor methods (#137227)"
This reverts commit 14eabd69152e31d059444310979625542db2aece.

Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))
2024-10-08 17:12:41 +00:00
c88c0e6c65 Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)"
This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc.

Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))
2024-10-08 17:09:26 +00:00
cc10ef4645 Revert "[Dynamo] add flex attention mode test (#137121)"
This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a.

Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))
2024-10-08 17:03:34 +00:00
11192ceca4 Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451)"
This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae.

Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))
2024-10-08 17:00:59 +00:00
8184e202d8 Update mutation checking in pattern matcher (#137448)
Fix for https://github.com/pytorch/pytorch/issues/137229

The current mutation checking is complicated because it works on pre-grad IR. When pre-grad IR has been traced to OpOverloads, checking is much easier. I am also special-casing the auto functional HOP, although I discussed with @zou3519 that it would be nice to have a way of querying HOPs that mimic schemas.
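
A hedged sketch of why checking is easier at the OpOverload level: every overload carries a schema that records mutation and aliasing directly (illustrative only, not the pattern matcher's actual code):

```python
import torch

def op_mutates(op: torch._ops.OpOverload) -> bool:
    schema = op._schema
    return schema.is_mutable or any(
        arg.alias_info is not None and arg.alias_info.is_write
        for arg in schema.arguments
    )

print(op_mutates(torch.ops.aten.add.Tensor))   # False: functional op
print(op_mutates(torch.ops.aten.add_.Tensor))  # True: in-place op
```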

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448
Approved by: https://github.com/zou3519
2024-10-08 16:56:40 +00:00
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
7267363844 [ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340)
Works around #136543.

This fix solves the issue only in the context of the ONNX exporter, but the issue happens in other contexts as well.

The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.
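
A hedged eager-mode illustration of the pattern in question (the exporter hits it through `run_decompositions` rather than eager execution, so this is only meant to show why the extra node helps):

```python
import torch

x = torch.randn(2, 3, 4)
t = x.transpose(1, 2)          # non-contiguous strides: shape (2, 4, 3)
# t.view(2, 12)                # the bare view(transpose(x)) pattern is stride-sensitive
y = t.flatten(1).view(2, 12)   # flattening first normalizes the layout, so the view succeeds
print(y.shape)                 # torch.Size([2, 12])
```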

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-08 16:45:59 +00:00
5fb30df7d6 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.
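
A hedged sketch of what the HSDP setup now looks like (mesh sizes and dim names are illustrative; this needs an initialized process group, e.g. launched via torchrun):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# a 2D mesh for HSDP must now be built with explicit mesh_dim_names
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
model = torch.nn.Linear(16, 16).cuda()
fully_shard(model, mesh=mesh)  # replicate across dim 0, shard across dim 1
```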

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-08 16:31:18 +00:00
0bfedb13e7 Remove aoti_torch_zero_ codegen (#137371)
Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798.

Test Plan: CI

Differential Revision: D63916320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371
Approved by: https://github.com/jingsh
2024-10-08 15:57:41 +00:00
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` error. Maybe this would also fix that issue (pending CI testing).
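
A hedged sketch of the parametrization pattern being applied (not the actual test body): each combination becomes its own test case that can pass, fail, or be marked slow independently.

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class ExampleLSTMTests(TestCase):
    @parametrize("batch_first", [True, False])
    @parametrize("bidirectional", [True, False])
    def test_lstm_variant(self, batch_first, bidirectional):
        lstm = torch.nn.LSTM(4, 8, batch_first=batch_first, bidirectional=bidirectional)
        inp = torch.randn(2, 3, 4) if batch_first else torch.randn(3, 2, 4)
        out, _ = lstm(inp)
        self.assertEqual(out.dim(), 3)

instantiate_parametrized_tests(ExampleLSTMTests)

if __name__ == "__main__":
    run_tests()
```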

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
a77bb8527c Make index check in applySelect support deferred runtime assert (#137046)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046
Approved by: https://github.com/albanD
2024-10-08 14:31:47 +00:00
9b2e453e24 Migrate ARM64 Linux binary jobs to runner determinator (#136666)
Updates ARM64 Linux binary jobs to use the runner determinator.

Issue: pytorch/ci-infra#265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666
Approved by: https://github.com/ZainRizvi
2024-10-08 12:14:06 +00:00
76dca1fef3 [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295)
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need for a private synchronization function.

In the longer term, we would like to give users the flexibility of bypassing the watchdog thread and handling collective errors themselves.
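
For context, a hedged sketch of the user-facing call path whose internals this PR splits apart (requires an initialized NCCL process group, e.g. launched via torchrun):

```python
import torch
import torch.distributed as dist

def allreduce_async(t: torch.Tensor) -> torch.Tensor:
    work = dist.all_reduce(t, async_op=True)
    # wait() covers both the GPU-stream synchronization and any CPU-side
    # blocking/error handling; this PR separates those two code paths internally
    work.wait()
    return t
```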

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
2024-10-08 08:53:47 +00:00
9f9d252971 [FlexAttention] only calculate grads for buffers that require_grad (#137451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451
Approved by: https://github.com/Chillee
2024-10-08 07:36:38 +00:00
59cdd8ddf1 Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396
Approved by: https://github.com/albanD
2024-10-08 06:49:04 +00:00
493d0eeef3 Revert "Add support for cat memory planning mms with max autotune (#132554)"
This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176.

Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))
2024-10-08 06:21:06 +00:00
8ca15e87f5 Update torchbind expecttest from landrace (#137453)
Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly)

Attempted to revert the stack but there were conflicts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453
Approved by: https://github.com/huydhn
2024-10-08 06:01:29 +00:00
bb31e3f57e Add original forward names to schema so that prettify pass works (#136887)
When we run_decomp, we retrace if it is training IR. As a result, we need to reliably store the original forward names when we run decomp.

Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887
Approved by: https://github.com/angelayi
2024-10-08 04:21:02 +00:00
46525abb71 OpenReg: support multiple executors (#136249)
In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into driver/executor; however, the current executor stands for all devices and allocates memory all together. In order to better simulate device behavior, we here support multiple executors, with each executor standing for one device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249
Approved by: https://github.com/FFFrog, https://github.com/albanD
2024-10-08 01:37:08 +00:00
395e098209 type _dynamo/mutation_guard.py (#137350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350
Approved by: https://github.com/Skylion007
2024-10-08 00:04:34 +00:00
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
2c0b11c79b forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435)
Summary: It seems like the import path is different between FBCode and OSS. Wondering how to consolidate them.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend

Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```

Differential Revision: D63991961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
2024-10-07 23:44:04 +00:00
812f286d4a Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711)
This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; the other definition is here:

5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711
Approved by: https://github.com/soulitzer
2024-10-07 23:19:06 +00:00
d558ec0730 Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a Layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-07 22:49:29 +00:00
01bf350967 Fix bmm_sparse_cuda illegal memory access (#131977)
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch

ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```

## Details

In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1                <- start_idx, mid_idx
1 : 1                <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
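
A hedged, simplified Python rendering of a bounds-safe version of that search (the real code is a CUDA kernel; this only mirrors how keeping `start_idx <= end_idx` prevents a negative `mid_idx`):

```python
def search_end_matrix_index(indices_1d, target_mat_num):
    """Return the first position whose batch index is >= target_mat_num."""
    start_idx, end_idx = 0, len(indices_1d)
    while start_idx < end_idx:                 # invariant: 0 <= start_idx <= end_idx
        mid_idx = (start_idx + end_idx) // 2   # can never become negative
        if indices_1d[mid_idx] < target_mat_num:
            start_idx = mid_idx + 1
        else:
            end_idx = mid_idx
    return start_idx

print(search_end_matrix_index([1], 0))  # 0: batch 0 has no non-zero entries
```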

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
2024-10-07 22:47:34 +00:00
a6707a7303 [dynamo] log all graph breaks to graph_breaks logging artifact (#137244)
We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244
Approved by: https://github.com/jansel
2024-10-07 22:34:27 +00:00
a9f7b905de type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-07 21:57:06 +00:00
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for rocm 6.0+ . since  #122431 landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
46abaa3b0f Increase parallelnative shards to 4 (#137433)
The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-10-07 21:06:34 +00:00
c87c9f0a01 [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701
Approved by: https://github.com/eellison
2024-10-07 19:47:04 +00:00
900f57216f [dynamo] Log a summary of frames Dynamo traced (#137297)
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.

Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])
    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```

Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```

Fixes #118262.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
2024-10-07 19:44:41 +00:00
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
08b84afda9 [inductor] Fix alignment hint for WorkspaceArg (#137429)
Alignment hints refer to the base ptr, not the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429
Approved by: https://github.com/eellison
2024-10-07 19:32:33 +00:00
fe44b6a67f Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))
2024-10-07 18:59:41 +00:00
144665d772 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-07 18:55:26 +00:00
d255b34c0a [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-07 18:55:26 +00:00
14eabd6915 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-07 18:55:26 +00:00
68151fd288 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-07 18:55:26 +00:00
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
b1fd7708bd [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-07 18:55:26 +00:00
51bc839b94 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true Python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
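
A hedged sketch of the "no-op enter, normal exit" polyfill idea (names are illustrative; the actual polyfill lives in Dynamo's sources):

```python
from torch.overrides import TorchFunctionMode

class NoOpEnterTorchFunctionMode:
    """Wraps a mode that the side-effects bytecode has already pushed:
    entering again would double-push it, so __enter__ does nothing, while
    __exit__ still pops the mode so the resume function unwinds correctly."""

    def __init__(self, mode: TorchFunctionMode):
        self._mode = mode

    def __enter__(self):
        return self._mode  # already on the torch function mode stack; do not push again

    def __exit__(self, exc_type, exc_value, traceback):
        return self._mode.__exit__(exc_type, exc_value, traceback)  # normal exit pops it
```
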
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-07 18:55:26 +00:00
ff95ff5d38 type _dynamo/profiler.py (#137351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351
Approved by: https://github.com/Skylion007
2024-10-07 18:54:33 +00:00
aa145dead8 [FSDP2] Fixed mistargeted backward prefetch (#137348)
If there is an `unshard` (top half) without a matching `wait_for_unshard` (bottom half), then the next iteration's `unshard` will be a no-op. This can unexpectedly fail to propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that pending `unshard` at the end of backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
2024-10-07 18:10:09 +00:00
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
ceb4ed8450 [AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323)
Summary:
- Added more types for debug value dumping.

- Add a test case for symint inputs to the debug printer. In real prod model use cases, being able to examine the values of "unbacked symints" (those 'u0', 's0', etc.) can be helpful.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```

Differential Revision: D63864708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
2024-10-07 17:41:40 +00:00
04e48ac562 [inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198
Approved by: https://github.com/jansel
ghstack dependencies: #137191, #137193
2024-10-07 17:20:58 +00:00
e2b72348d0 [inductor] Reuse the subgraph if accessed via same get_attr node (#137193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193
Approved by: https://github.com/jansel
ghstack dependencies: #137191
2024-10-07 17:20:58 +00:00
7a5eaecd92 [inductor] Correctly keep track of the graph_input_names (#137191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191
Approved by: https://github.com/jansel
2024-10-07 17:20:53 +00:00
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
33461592e2 [TLParse] Include cache hit/miss/bypass in the report name (#137282)
Makes it easier to tell at a glance.

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html

<img width="398" alt="image" src="https://github.com/user-attachments/assets/7ed111cb-46d8-4442-a1b2-037d0a8decd8">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282
Approved by: https://github.com/jamesjwu
2024-10-07 16:00:00 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it cache hits even with local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
e20e7a8c38 Fixed developer setup issue in open_registration_extension (#137355)
This PR fixes an issue where, when running `python setup.py develop`, the `open_registration_extension` self-contained example will not build due to the following error:

```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
2024-10-07 15:25:37 +00:00
8c3ab21490 multiprocessing.spawn: allow a grace period when shutdown (#131278)
When one process fails, the others are immediately killed. This prevents them from doing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
2024-10-07 12:37:34 +00:00
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.
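
A hedged usage illustration of the ops named in the title (dtype chosen arbitrarily; the first part needs a CUDA device):

```python
import torch

t_cuda = torch.ones(4).to(torch.float8_e4m3fn).cuda()
print(t_cuda[0].item())                                   # item() on a CUDA fp8 scalar

t_cpu = torch.ones(4).to(torch.float8_e4m3fn)
sel = torch.index_select(t_cpu, 0, torch.tensor([0, 2]))  # index_select on CPU fp8
sel.fill_(0.5)                                            # fill_ on CPU fp8
print(sel)
```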

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
6097 changed files with 380101 additions and 161287 deletions

View File

@ -1 +1 @@
6.1.1
6.5.0

View File

@ -1,23 +0,0 @@
[pt]
is_oss=1
[buildfile]
name = BUCK.oss
includes = //tools/build_defs/select.bzl
[repositories]
bazel_skylib = third_party/bazel-skylib/
ovr_config = .
[download]
in_build = true
[cxx]
cxxflags = -std=c++17
ldflags = -Wl,--no-undefined
should_remap_host_platform = true
cpp = /usr/bin/clang
cc = /usr/bin/clang
cxx = /usr/bin/clang++
cxxpp = /usr/bin/clang++
ld = /usr/bin/clang++

View File

@ -0,0 +1,19 @@
# Aarch64 (ARM/Graviton) Support Scripts
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is designed to support CD operations within the PyPI manylinux aarch64 container and is executed in the container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script "assumes" the PyTorch repo is located at ```/pytorch``` and will put the wheels into ```/artifacts```.
### Usage
```DESIRED_PYTHON=<PythonVersion> aarch64_ci_build.sh```
__NOTE:__ CI build is currently __EXPERIMENTAL__
## Build_aarch64_wheel.py
This app allows a person to build using AWS EC2 resources and requires AWS CLI and Boto3 with AWS credentials to support building EC2 instances for the wheel builds. It can be used in a CodeBuild CD or from a local system.
### Usage
```build_aarch64_wheel.py --key-name <YourPemKey> --use-docker --python 3.8 --branch <RCtag>```

View File

@ -0,0 +1,32 @@
#!/bin/bash
set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh
###############################################################################
# Run aarch64 builder python
###############################################################################
cd /
# adding safe directory for git as the permissions will be
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files
USE_PRIORITIZED_TEXT_FOR_LD=1 python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

View File

@ -0,0 +1,21 @@
#!/bin/bash
set -eux -o pipefail
# This script is used to prepare the Docker container for aarch64_ci_wheel_build.py python script
# By creating symlinks from desired /opt/python to /usr/local/bin/
NUMPY_VERSION=2.0.2
if [[ "$DESIRED_PYTHON" == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then
NUMPY_VERSION=2.1.2
fi
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2
for tool in python python3 pip pip3 ninja scons patchelf; do
ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;
done
python --version

View File

@ -0,0 +1,237 @@
#!/usr/bin/env python3
# encoding: UTF-8
import os
import shutil
from subprocess import check_call, check_output
def list_dir(path: str) -> list[str]:
"""'
Helper for getting paths for Python
"""
return check_output(["ls", "-1", path]).decode().split("\n")
def build_ArmComputeLibrary() -> None:
"""
Using ArmComputeLibrary for aarch64 PyTorch
"""
print("Building Arm Compute Library")
acl_build_flags = [
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = "ComputeLibrary"
os.makedirs(acl_install_dir)
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", "-j8", f"build_dir=/{acl_install_dir}/build"]
+ acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
"""
folder = os.path.dirname(wheel_path)
wheelname = os.path.basename(wheel_path)
os.mkdir(f"{folder}/tmp")
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
"/usr/local/cuda/lib64/libcudart.so.12",
"/usr/local/cuda/lib64/libcufft.so.11",
"/usr/local/cuda/lib64/libcusparse.so.12",
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
"/usr/local/cuda/lib64/libcudnn_ops.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
"/lib64/libgomp.so.1",
"/usr/lib64/libgfortran.so.5",
"/acl/build/libarm_compute.so",
"/acl/build/libarm_compute_graph.so",
]
if enable_cuda:
libs_to_copy += [
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
]
# Copy libraries to unzipped_folder/a/lib
for lib_path in libs_to_copy:
lib_name = os.path.basename(lib_path)
shutil.copy2(lib_path, f"{folder}/tmp/torch/lib/{lib_name}")
os.system(
f"cd {folder}/tmp/torch/lib/; "
f"patchelf --set-rpath '$ORIGIN' --force-rpath {folder}/tmp/torch/lib/{lib_name}"
)
os.mkdir(f"{folder}/cuda_wheel")
os.system(f"cd {folder}/tmp/; zip -r {folder}/cuda_wheel/{wheelname} *")
shutil.move(
f"{folder}/cuda_wheel/{wheelname}",
f"{folder}/{wheelname}",
copy_function=shutil.copy2,
)
os.system(f"rm -rf {folder}/tmp/ {folder}/cuda_wheel/")
def complete_wheel(folder: str) -> str:
"""
Complete wheel build and put in artifact location
"""
wheel_name = list_dir(f"/{folder}/dist")[0]
if "pytorch" in folder and not enable_cuda:
print("Repairing Wheel with AuditWheel")
check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
repaired_wheel_name = list_dir(f"/{folder}/wheelhouse")[0]
print(f"Moving {repaired_wheel_name} wheel to /{folder}/dist")
os.rename(
f"/{folder}/wheelhouse/{repaired_wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
f"/{folder}/dist/{repaired_wheel_name}", f"/artifacts/{repaired_wheel_name}"
)
return repaired_wheel_name
def parse_arguments():
"""
Parse inline arguments
"""
from argparse import ArgumentParser
parser = ArgumentParser("AARCH64 wheels python CD")
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")
parser.add_argument("--test-only", type=str)
parser.add_argument("--enable-mkldnn", action="store_true")
parser.add_argument("--enable-cuda", action="store_true")
return parser.parse_args()
if __name__ == "__main__":
"""
Entry Point
"""
args = parse_arguments()
enable_mkldnn = args.enable_mkldnn
enable_cuda = args.enable_cuda
branch = check_output(
["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
if override_package_version is not None:
version = override_package_version
build_vars += (
f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "
)
elif branch in ["nightly", "main"]:
build_date = (
check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")
.decode()
.replace("-", "")
)
version = (
check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]
)
if enable_cuda:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "
else:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
elif branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
print("build pytorch with mkldnn+acl backend")
build_vars += (
"USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
"ACL_ROOT_DIR=/acl "
"LD_LIBRARY_PATH=/pytorch/build/lib:/acl/build:$LD_LIBRARY_PATH "
"ACL_INCLUDE_DIR=/acl/build "
"ACL_LIBRARY=/acl/build "
)
if enable_cuda:
build_vars += "BLAS=NVPL "
else:
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/OpenBLAS "
else:
print("build pytorch without mkldnn backend")
os.system(f"cd /pytorch; {build_vars} python3 setup.py bdist_wheel")
if enable_cuda:
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

File diff suppressed because it is too large.

View File

@ -0,0 +1,87 @@
#!/usr/bin/env python3
import os
import shutil
import sys
from subprocess import check_call
from tempfile import TemporaryDirectory
from auditwheel.elfutils import elf_file_filter
from auditwheel.lddtree import lddtree
from auditwheel.patcher import Patchelf
from auditwheel.repair import copylib
from auditwheel.wheeltools import InWheelCtx
def replace_tag(filename):
with open(filename) as f:
lines = f.read().split("\\n")
for i, line in enumerate(lines):
if not line.startswith("Tag: "):
continue
lines[i] = line.replace("-linux_", "-manylinux2014_")
print(f"Updated tag from {line} to {lines[i]}")
with open(filename, "w") as f:
f.write("\\n".join(lines))
class AlignedPatchelf(Patchelf):
def set_soname(self, file_name: str, new_soname: str) -> None:
check_call(
["patchelf", "--page-size", "65536", "--set-soname", new_soname, file_name]
)
def replace_needed(self, file_name: str, soname: str, new_soname: str) -> None:
check_call(
[
"patchelf",
"--page-size",
"65536",
"--replace-needed",
soname,
new_soname,
file_name,
]
)
def embed_library(whl_path, lib_soname, update_tag=False):
patcher = AlignedPatchelf()
out_dir = TemporaryDirectory()
whl_name = os.path.basename(whl_path)
tmp_whl_name = os.path.join(out_dir.name, whl_name)
with InWheelCtx(whl_path) as ctx:
torchlib_path = os.path.join(ctx._tmpdir.name, "torch", "lib")
ctx.out_wheel = tmp_whl_name
new_lib_path, new_lib_soname = None, None
for filename, _ in elf_file_filter(ctx.iter_files()):
if not filename.startswith("torch/lib"):
continue
libtree = lddtree(filename)
if lib_soname not in libtree["needed"]:
continue
lib_path = libtree["libs"][lib_soname]["path"]
if lib_path is None:
print(f"Can't embed {lib_soname} as it could not be found")
break
if lib_path.startswith(torchlib_path):
continue
if new_lib_path is None:
new_lib_soname, new_lib_path = copylib(lib_path, torchlib_path, patcher)
patcher.replace_needed(filename, lib_soname, new_lib_soname)
print(f"Replacing {lib_soname} with {new_lib_soname} for {filename}")
if update_tag:
# Add manylinux2014 tag
for filename in ctx.iter_files():
if os.path.basename(filename) != "WHEEL":
continue
replace_tag(filename)
shutil.move(tmp_whl_name, whl_path)
if __name__ == "__main__":
embed_library(
sys.argv[1], "libgomp.so.1", len(sys.argv) > 2 and sys.argv[2] == "--update-tag"
)

View File

@ -1,47 +1,39 @@
ARG CUDA_VERSION=10.2
ARG CUDA_VERSION=12.4
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM centos:7 as base
FROM amd64/almalinux:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum update -y
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum -y update
RUN yum -y install epel-release
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel openssl-devel yum-utils autoconf automake make gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN yum --enablerepo=extras install -y epel-release
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum install -y autoconf aclocal automake make sudo
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
RUN rm -rf /usr/local/cuda-*
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf
FROM base as openssl
# Install openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as conda
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
@ -49,7 +41,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=10.2
ARG CUDA_VERSION=12.4
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
@ -70,6 +62,10 @@ FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
ENV DESIRED_CUDA=12.6
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
@ -79,6 +75,7 @@ FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
COPY --from=cuda12.6 /usr/local/cuda-12.6 /usr/local/cuda-12.6
# Final step
FROM ${BASE_TARGET} as final
@ -91,7 +88,8 @@ COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
ENV PATH /opt/conda/bin:$PATH
ENV PATH /opt/conda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
COPY --from=mnist /usr/local/mnist /usr/local/mnist
RUN rm -rf /usr/local/cuda
RUN chmod o+rw /usr/local

View File

@ -48,10 +48,10 @@ esac
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=9" \
--build-arg "DEVTOOLSET_VERSION=11" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/conda/Dockerfile" \
-f "${TOPDIR}/.ci/docker/almalinux/Dockerfile" \
${TOPDIR}/.ci/docker/
)

View File

@ -1 +0,0 @@
<manifest package="org.pytorch.deps" />

View File

@ -1,66 +0,0 @@
buildscript {
ext {
minSdkVersion = 21
targetSdkVersion = 28
compileSdkVersion = 28
buildToolsVersion = '28.0.3'
coreVersion = "1.2.0"
extJUnitVersion = "1.1.1"
runnerVersion = "1.2.0"
rulesVersion = "1.2.0"
junitVersion = "4.12"
}
repositories {
google()
mavenLocal()
mavenCentral()
jcenter()
}
dependencies {
classpath 'com.android.tools.build:gradle:4.1.2'
classpath 'com.vanniktech:gradle-maven-publish-plugin:0.14.2'
}
}
repositories {
google()
jcenter()
}
apply plugin: 'com.android.library'
android {
compileSdkVersion rootProject.compileSdkVersion
buildToolsVersion rootProject.buildToolsVersion
defaultConfig {
minSdkVersion minSdkVersion
targetSdkVersion targetSdkVersion
}
sourceSets {
main {
manifest.srcFile 'AndroidManifest.xml'
}
}
}
dependencies {
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'androidx.appcompat:appcompat:1.0.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'
implementation 'com.google.code.findbugs:jsr305:3.0.1'
implementation 'com.facebook.soloader:nativeloader:0.10.5'
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion
implementation 'androidx.test.ext:junit:' + rootProject.extJUnitVersion
implementation 'androidx.test:rules:' + rootProject.rulesVersion
implementation 'androidx.test:runner:' + rootProject.runnerVersion
}

View File

@ -1,5 +0,0 @@
0.7b
manylinux_2_17
rocm6.2
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291

View File

@ -1,4 +1,8 @@
#!/bin/bash
# The purpose of this script is to:
# 1. Extract the set of parameters to be used for a docker build based on the provided image name.
# 2. Run docker build with the parameters found in step 1.
# 3. Run the built image and print out the expected and actual versions of packages installed.
set -ex
@ -86,30 +90,20 @@ CMAKE_VERSION=3.18.5
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
@ -134,36 +128,6 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
@ -179,6 +143,80 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9
@ -193,48 +231,6 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
@ -244,16 +240,6 @@ case "$image" in
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
ANDROID=yes
ANDROID_NDK_VERSION=r21e
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
@ -287,25 +273,33 @@ case "$image" in
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.1
ROCM_VERSION=6.2.4
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2
ROCM_VERSION=6.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
@ -318,6 +312,17 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2025.0-py3)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
XPU_VERSION=2025.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -380,7 +385,7 @@ case "$image" in
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
@ -388,7 +393,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
@ -414,9 +419,6 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -429,9 +431,6 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
@ -508,8 +507,6 @@ docker build \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
--build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
--build-arg "ANDROID=${ANDROID}" \
--build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
--build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
--build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
--build-arg "SWIFTSHADER=${SWIFTSHADER}" \
@ -517,7 +514,7 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906;gfx90a}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \

View File

@ -113,13 +113,6 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -1 +1 @@
cd1c833b079adb324871dcbbe75b43d42ffc0ade
01a22b6f16d117454b7d21ebdc691b0785b84a7f

View File

@ -0,0 +1 @@
v2.21.5-1

View File

@ -0,0 +1 @@
v2.26.2-1

View File

@ -1 +1 @@
ac3470188b914c5d7a5058a7e28b9eb685a62427
5d535d7a2d4b435b1b5c1177fd8f04a12b942b9a

View File

@ -1 +1 @@
6a333f1b05671f6fada4ba7bbfae4a02a9d96f4f
c7711371cace304afe265c1ffa906415ab82fc66

View File

@ -1 +1 @@
91b14bf5593cf58a8541f3e6b9125600a867d4ef
0bcc8265e677e5321606a3311bf71470f14456a8

View File

@ -1 +1 @@
cf34004b8a67d290a962da166f5aa2fc66751326
96316ce50fade7e209553aba4898cd9b82aab83b

View File

@ -1,7 +1,7 @@
set -euo pipefail
readonly version=v24.04
readonly src_host=https://review.mlplatform.org/ml
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary
# Clone ACL

View File

@ -1,112 +0,0 @@
#!/bin/bash
set -ex
[ -n "${ANDROID_NDK}" ]
_https_amazon_aws=https://ossci-android.s3.amazonaws.com
apt-get update
apt-get install -y --no-install-recommends autotools-dev autoconf unzip
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
pushd /tmp
curl -Os --retry 3 $_https_amazon_aws/android-ndk-${ANDROID_NDK}-linux-x86_64.zip
popd
_ndk_dir=/opt/ndk
mkdir -p "$_ndk_dir"
unzip -qo /tmp/android*.zip -d "$_ndk_dir"
_versioned_dir=$(find "$_ndk_dir/" -mindepth 1 -maxdepth 1 -type d)
mv "$_versioned_dir"/* "$_ndk_dir"/
rmdir "$_versioned_dir"
rm -rf /tmp/*
# Install OpenJDK
# https://hub.docker.com/r/picoded/ubuntu-openjdk-8-jdk/dockerfile/
sudo apt-get update && \
apt-get install -y openjdk-8-jdk && \
apt-get install -y ant && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer;
# Fix certificate issues, found as of
# https://bugs.launchpad.net/ubuntu/+source/ca-certificates-java/+bug/983302
sudo apt-get update && \
apt-get install -y ca-certificates-java && \
apt-get clean && \
update-ca-certificates -f && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer;
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
# Installing android sdk
# https://github.com/circleci/circleci-images/blob/staging/android/Dockerfile.m4
_tmp_sdk_zip=/tmp/android-sdk-linux.zip
_android_home=/opt/android/sdk
rm -rf $_android_home
sudo mkdir -p $_android_home
curl --silent --show-error --location --fail --retry 3 --output /tmp/android-sdk-linux.zip $_https_amazon_aws/android-sdk-linux-tools3859397-build-tools2803-2902-platforms28-29.zip
sudo unzip -q $_tmp_sdk_zip -d $_android_home
rm $_tmp_sdk_zip
sudo chmod -R 777 $_android_home
export ANDROID_HOME=$_android_home
export ADB_INSTALL_TIMEOUT=120
export PATH="${ANDROID_HOME}/tools:${ANDROID_HOME}/tools/bin:${ANDROID_HOME}/platform-tools:${PATH}"
echo "PATH:${PATH}"
# Installing Gradle
echo "GRADLE_VERSION:${GRADLE_VERSION}"
_gradle_home=/opt/gradle
sudo rm -rf $gradle_home
sudo mkdir -p $_gradle_home
curl --silent --output /tmp/gradle.zip --retry 3 $_https_amazon_aws/gradle-${GRADLE_VERSION}-bin.zip
sudo unzip -q /tmp/gradle.zip -d $_gradle_home
rm /tmp/gradle.zip
sudo chmod -R 777 $_gradle_home
export GRADLE_HOME=$_gradle_home/gradle-$GRADLE_VERSION
alias gradle="${GRADLE_HOME}/bin/gradle"
export PATH="${GRADLE_HOME}/bin/:${PATH}"
echo "PATH:${PATH}"
gradle --version
mkdir /var/lib/jenkins/gradledeps
cp build.gradle /var/lib/jenkins/gradledeps
cp AndroidManifest.xml /var/lib/jenkins/gradledeps
pushd /var/lib/jenkins
export GRADLE_LOCAL_PROPERTIES=gradledeps/local.properties
rm -f $GRADLE_LOCAL_PROPERTIES
echo "sdk.dir=/opt/android/sdk" >> $GRADLE_LOCAL_PROPERTIES
echo "ndk.dir=/opt/ndk" >> $GRADLE_LOCAL_PROPERTIES
chown -R jenkins /var/lib/jenkins/gradledeps
chgrp -R jenkins /var/lib/jenkins/gradledeps
sudo -H -u jenkins $GRADLE_HOME/bin/gradle -Pandroid.useAndroidX=true -p /var/lib/jenkins/gradledeps -g /var/lib/jenkins/.gradle --refresh-dependencies --debug --stacktrace assemble
chown -R jenkins /var/lib/jenkins/.gradle
chgrp -R jenkins /var/lib/jenkins/.gradle
popd
rm -rf /var/lib/jenkins/.gradle/daemon
# Cache vision models used by the test
source "$(dirname "${BASH_SOURCE[0]}")/cache_vision_models.sh"

View File

@ -1,23 +0,0 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.gz'
# This read command always returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects
curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"
ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)
if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then
echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"
echo " which does not match the expected value ${SHA256}."
exit
fi
tar xf "${TARBALL}" && rm -rf "${TARBALL}"

View File

@ -32,8 +32,12 @@ install_ubuntu() {
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not rely on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
@ -76,7 +80,8 @@ install_ubuntu() {
vim \
unzip \
gpg-agent \
gdb
gdb \
bc
# Should resolve various apt package repository cert issues
# see: https://github.com/pytorch/pytorch/issues/65931

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use the libs and headers from OpenSSL 1.1 installed in `install_openssl.sh`
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/pytorch/sccache
git clone https://github.com/mozilla/sccache -b v0.9.1
cd sccache
echo "Building sccache"
cargo build --release
@ -19,6 +19,10 @@ install_ubuntu() {
rm -rf sccache
apt-get remove -y cargo rustc
apt-get autoclean && apt-get clean
echo "Downloading old sccache binary from S3 repo for PCH builds"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache-0.2.14a
chmod 755 /opt/cache/bin/sccache-0.2.14a
}
install_binary() {
@ -32,22 +36,42 @@ sed -e 's|PATH="\(.*\)"|PATH="/opt/cache/bin:\1"|g' -i /etc/environment
export PATH="/opt/cache/bin:$PATH"
# Setup compiler cache
if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
# TODO: Install the pre-built binary from S3, as building from source
# (https://github.com/pytorch/sccache) has started failing mysteriously,
# with the sccache server unable to start with the following error:
# sccache: error: Invalid argument (os error 22)
install_binary
fi
install_ubuntu
chmod a+x /opt/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/opt/cache/bin/$1"
if [ $1 == "gcc" ]; then
# Do not call sccache recursively when dumping preprocessor arguments
# For some reason this is very important for the first cached nvcc invocation
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
# sccache does not support -E flag, so we need to call the original compiler directly in order to avoid calling this wrapper recursively
for arg in "\$@"; do
if [ "\$arg" = "-E" ]; then
exec $(which $1) "\$@"
fi
done
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which $1) "\$@"
else
exec $(which $1) "\$@"
fi
EOF
else
cat >"/opt/cache/bin/$1" <<EOF
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache $(which $1) "\$@"
else
exec $(which $1) "\$@"
fi
EOF
fi
chmod a+x "/opt/cache/bin/$1"
}
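# Hypothetical call sites (the actual ones are outside this hunk): a stub like the
# one above is written for each compiler that should be routed through sccache, e.g.
#   write_sccache_stub gcc
#   write_sccache_stub g++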
@ -88,7 +112,7 @@ if [ -n "$ROCM_VERSION" ]; then
TOPDIR=$(dirname $OLDCOMP)
WRAPPED="$TOPDIR/original/$COMPNAME"
mv "$OLDCOMP" "$WRAPPED"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" > "$OLDCOMP"
printf "#!/bin/sh\nexec sccache $WRAPPED \"\$@\"" >"$OLDCOMP"
chmod a+x "$OLDCOMP"
}

View File

@ -20,9 +20,10 @@ if [ -n "$CLANG_VERSION" ]; then
fi
sudo apt-get update
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
if [[ $CLANG_VERSION == 18 ]]; then
apt-get install -y --no-install-recommends libomp-18-dev
if [[ $CLANG_VERSION -ge 18 ]]; then
apt-get install -y libomp-${CLANG_VERSION}-dev libclang-rt-${CLANG_VERSION}-dev clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
else
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
fi
# Install dev version of LLVM.

View File

@ -25,7 +25,8 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
SCRIPT_FOLDER="$( cd "$(dirname "$0")" ; pwd -P )"
source "${SCRIPT_FOLDER}/common_utils.sh"
pushd /tmp
wget -q "${BASE_URL}/${CONDA_FILE}"
@ -65,23 +66,10 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
NUMPY_VERSION=1.24.4
else
NUMPY_VERSION=1.26.2
fi
conda_install "openblas==0.3.29=*openmp*"
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
NUMPY_VERSION=1.26.0
else
NUMPY_VERSION=1.21.2
fi
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -92,19 +80,18 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# following builds that we know should use conda. Specifically, Ubuntu bionic
# and focal cannot find conda mkl with stock cmake, so we need a cmake from conda
if [ -n "${CONDA_CMAKE}" ]; then
conda_install cmake
conda_install cmake==3.31.2
fi
# Magma package names are concatenation of CUDA major and minor ignoring revision
# I.e. magma-cuda102 package corresponds to CUDA_VERSION=10.2 and CUDA_VERSION=10.2.89
# Magma is installed from a tarball in the ossci-linux bucket into the conda env
if [ -n "$CUDA_VERSION" ]; then
conda_install magma-cuda$(TMP=${CUDA_VERSION/./};echo ${TMP%.*[0-9]}) -c pytorch
${SCRIPT_FOLDER}/install_magma_conda.sh $(cut -f1-2 -d'.' <<< ${CUDA_VERSION}) ${ANACONDA_PYTHON_VERSION}
fi
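# Illustrative comparison (not part of the original script): for a hypothetical
# CUDA_VERSION=12.6.3, the old conda package name and the new helper argument
# are derived as follows:
#   TMP=${CUDA_VERSION/./}; echo ${TMP%.*[0-9]}    # -> 126   (magma-cuda126)
#   cut -f1-2 -d'.' <<< "${CUDA_VERSION}"          # -> 12.6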
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then
apt-get update

View File

@ -70,7 +70,7 @@ function do_cpython_build {
# install setuptools, since it is required to use distutils on Python 3.12
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
ln -sf ${prefix} /opt/python/${abi_tag}
}
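# Illustrative result (values assumed, not from this diff): for a CPython 3.10
# build the abi_tag computed above is "cp310-cp310", so the prefix is exposed as
# /opt/python/cp310-cp310.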
function build_cpython {

View File

@ -2,8 +2,8 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.1.0.70
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
@ -16,17 +16,6 @@ function install_cusparselt_040 {
rm -rf tmp_cusparselt
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
@ -38,7 +27,20 @@ function install_cusparselt_062 {
rm -rf tmp_cusparselt
}
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
CUDNN_VERSION=9.1.0.70
NCCL_VERSION=v2.21.5-1
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
@ -71,40 +73,8 @@ function install_118 {
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
@ -137,6 +107,39 @@ function install_124 {
ldconfig
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
chmod +x cuda_12.6.3_560.35.05_linux.run
./cuda_12.6.3_560.35.05_linux.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
@ -168,37 +171,6 @@ function prune_118 {
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
@ -227,22 +199,92 @@ function prune_124 {
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.7.1.26
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
chmod +x cuda_12.8.0_570.86.10_linux.run
./cuda_12.8.0_570.86.10_linux.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;
esac
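# Example invocation (as in the Dockerfile stages later in this diff):
#   bash ./install_cuda.sh 12.6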

View File

@ -3,35 +3,36 @@
set -ex
NCCL_VERSION=v2.21.5-1
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.8.0.87
function install_cusparselt_062 {
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.3.2-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.3.2-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
function install_128 {
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run
chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run
./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
@ -44,47 +45,16 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_062
install_cusparselt_063
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;

View File

@ -4,7 +4,11 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:2} == "12" ]]; then
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.7.1.26_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"

View File

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
@ -13,17 +21,11 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
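# Worked example (illustrative): CUDA_VERSION=12.6.3 gives ${CUDA_VERSION:0:4}="12.6",
# which matches ^12\.[5-8]$ and selects the 0.6.3.2 archive; CUDA_VERSION=11.8.0
# gives "11.8" and falls through to the 0.4.0.7 branch.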
tar xf ${CUSPARSELT_NAME}.tar.xz

View File

@ -36,25 +36,24 @@ install_conda_dependencies() {
}
install_pip_dependencies() {
pushd executorch/.ci/docker
# Install PyTorch CPU build beforehand to avoid installing the much bigger CUDA
# binaries later; ExecuTorch only needs the CPU build
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Install all Python dependencies
pip_install -r requirements-ci.txt
pushd executorch
as_jenkins bash install_executorch.sh
# A workaround: ExecuTorch has moved to numpy 2.0, which is not compatible with the
# numba and scipy versions currently used in PyTorch CI
conda_run pip uninstall -y numba scipy
popd
}
setup_executorch() {
pushd executorch
# Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate
as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}

View File

@ -7,14 +7,20 @@ source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local commit
commit=$(get_pinned_commit huggingface)
pip_install pandas==2.0.3
pip_install "git+https://github.com/huggingface/transformers@${commit}"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas==2.0.3
# TODO (huydhn): There is no torchvision release on 3.13 at the time of writing, so
# I'm using the nightly build here instead. We just need the package to be able to
# install TIMM. Remove this once vision has a release on 3.13
if [[ "${ANACONDA_PYTHON_VERSION}" == "3.13" ]]; then
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu124
fi
pip_install "git+https://github.com/huggingface/pytorch-image-models@${commit}"
# Clean up
conda_run pip uninstall -y cmake torch torchvision triton

View File

@ -3,8 +3,6 @@
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
@ -17,7 +15,7 @@ function do_install() {
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
# Script that replaces the previous conda-package-based magma install
set -eou pipefail
function do_install() {
cuda_version_nodot=${1/./}
anaconda_python_version=$2
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
anaconda_dir="/opt/conda/envs/py_${anaconda_python_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://ossci-linux.s3.us-east-1.amazonaws.com/${magma_archive}
tar -xvf "${magma_archive}"
mv include/* "${anaconda_dir}/include/"
mv lib/* "${anaconda_dir}/lib"
popd
)
}
do_install $1 $2
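# Example invocation (values assumed): install magma built for CUDA 12.6 into
# the conda env used for Python 3.10:
#   bash install_magma_conda.sh 12.6 3.10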

View File

@ -16,7 +16,7 @@ case "$ID" in
ubuntu)
IS_UBUNTU=1
;;
centos)
centos|almalinux)
IS_UBUNTU=0
;;
*)
@ -43,12 +43,6 @@ else
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
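# Illustrative encoding: ROCM_VERSION=6.2.4 yields ROCM_INT=6*10000+2*100+4=60204,
# which is the form the version-range checks below compare against.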
# Install custom MIOpen + COMgr for ROCm >= 4.0.1
if [[ $ROCM_INT -lt 40001 ]]; then
echo "ROCm version < 4.0.1; will not install custom MIOpen"
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
@ -66,55 +60,27 @@ else
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"
fi
# MIOPEN_USE_HIP_KERNELS is a Workaround for COMgr issues
MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_USE_COMGR=ON
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60300 ]]; then
echo "ROCm 6.3+ MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60204 ]]; then
MIOPEN_BRANCH="release/rocm-rel-6.2-staging"
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then
echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.6-staging"
elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"
elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.4-staging"
elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.3-staging"
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.2-staging"
elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.1-staging"
elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.0-staging"
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
echo "ROCm ${ROCM_VERSION} does not need any patches, do not build from source"
exit 0
fi
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get remove -y miopen-hip
else
yum remove -y miopen-hip
# Workaround: the almalinux manylinux image already ships sqlite3.pc and cget doesn't like that
rm -rf /usr/local/lib/pkgconfig/sqlite3.pc
# Versioned package name needs regex match
# Use --noautoremove to prevent other rocm packages from being uninstalled
yum remove -y miopen-hip* --noautoremove
fi
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
@ -122,16 +88,7 @@ pushd MIOpen
# remove .git to save disk space since the CI runner was running out of it
rm -rf .git
# Don't build CK to save docker build time
if [[ $ROCM_INT -ge 60200 ]]; then
sed -i '/composable_kernel/d' requirements.txt
fi
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
sed -i '/rocMLIR/d' requirements.txt
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then
sed -i '/llvm-project-mlir/d' requirements.txt
fi
sed -i '/composable_kernel/d' requirements.txt
## MIOpen minimum requirements
cmake -P install_deps.cmake --minimum
@ -153,7 +110,7 @@ cd build
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \
${MIOPEN_CMAKE_COMMON_FLAGS} \
${MIOPEN_CMAKE_DB_FLAGS} \
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}"
make MIOpen -j $(nproc)
# Build MIOpen package

View File

@ -4,10 +4,15 @@ set -ex
[ -n "$NINJA_VERSION" ]
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
arch=$(uname -m)
if [ "$arch" == "aarch64" ]; then
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux-aarch64.zip"
else
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
fi
pushd /tmp
wget --no-verbose --output-document=ninja-linux.zip "$url"
unzip ninja-linux.zip -d /usr/local/bin
rm -f ninja-linux.zip
popd
popd

View File

@ -31,15 +31,15 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.GPTJForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gptj");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="

View File

@ -62,6 +62,22 @@ install_ubuntu() {
sqlite3 $kdb "PRAGMA journal_mode=off; PRAGMA VACUUM;"
done
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b rocm-6.3.x
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-6.3-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
make -j
cp hipamd/lib/libamdhip64.so.6.3.* /opt/rocm/lib/libamdhip64.so.6.3.*
popd
rm -rf HIP clr
fi
# Cleanup
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

View File

@ -12,7 +12,7 @@ case "$ID" in
apt-get install -y libpciaccess-dev pkg-config
apt-get clean
;;
centos)
centos|almalinux)
yum install -y libpciaccess-devel pkgconfig
;;
*)
@ -115,7 +115,7 @@ index a5007ffc..13fa07fc 100644
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
+ //fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}

View File

@ -3,6 +3,18 @@
set -ex
# Magma build scripts need `python`
ln -sf /usr/bin/python3 /usr/bin/python
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
almalinux)
yum install -y gcc-gfortran
;;
*)
echo "No preinstalls to build magma..."
;;
esac
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}

View File

@ -60,15 +60,15 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
else
pip_install -e .
pip_install .
fi
if [ -n "${CONDA_CMAKE}" ]; then

View File

@ -8,6 +8,12 @@ else
with_cuda=no
fi
if [[ -d "/opt/rocm" ]]; then
with_rocm=/opt/rocm
else
with_rocm=no
fi
function install_ucx() {
set -ex
git clone --recursive https://github.com/openucx/ucx.git
@ -19,6 +25,7 @@ function install_ucx() {
./configure --prefix=$UCX_HOME \
--enable-mt \
--with-cuda=$with_cuda \
--with-rocm=$with_rocm \
--enable-profiling \
--enable-stats
time make -j
@ -36,12 +43,29 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
if [[ -n "$ROCM_VERSION" ]]; then
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
HIP_OFFLOAD="$HIP_OFFLOAD --offload-arch=$arch"
done
else
HIP_OFFLOAD="all-arch-no-native"
fi
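# Illustrative expansion (values assumed): with PYTORCH_ROCM_ARCH="gfx90a;gfx942"
# the loop above collects HIP_OFFLOAD="--offload-arch=gfx90a --offload-arch=gfx942";
# on non-ROCm builds HIP_OFFLOAD is set to "all-arch-no-native".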
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
--with-nvcc-gencode="${NVCC_GENCODE}" \
--with-rocm=$with_rocm \
--with-rocm-arch="${HIP_OFFLOAD}"
time make -j
sudo make install

View File

@ -2,6 +2,13 @@
set -ex
# Since version 24.04 the system ships with an 'ubuntu' user that has id 1000
# We need a work-around to free up id 1000 for use by this script
if [[ $UBUNTU_VERSION == 24.04 ]]; then
# touch is used to suppress a harmless error message
touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
fi
# Mirror jenkins user in container
# jenkins user as ec2-user should have the same user-id
echo "jenkins:x:1000:1000::/var/lib/jenkins:" >> /etc/passwd

View File

@ -24,10 +24,10 @@ function install_ubuntu() {
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
| gpg --dearmor > /usr/share/keyrings/oneapi-archive-keyring.gpg.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg.gpg] \
https://apt.repos.intel.com/${XPU_REPO_NAME} all main" \
| tee /etc/apt/sources.list.d/oneAPI.list
# Update the packages list and repository index
apt-get update
@ -41,14 +41,16 @@ function install_ubuntu() {
libegl-mesa0 libegl1-mesa libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri \
libglapi-mesa libgles2-mesa-dev libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers \
mesa-vdpau-drivers mesa-vulkan-drivers va-driver-all vainfo hwinfo clinfo
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
apt-get install -y intel-ocloc
fi
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
else
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl=2025.0.1-6"
fi
apt-get install -y ${XPU_PACKAGES}
# Cleanup
apt-get autoclean && apt-get clean
@ -58,13 +60,13 @@ function install_ubuntu() {
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
if [[ ! " 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8 which used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.6"
VERSION_ID="8.8"
fi
dnf install -y 'dnf-command(config-manager)'
@ -72,16 +74,21 @@ function install_rhel() {
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
tee > /etc/yum.repos.d/oneAPI.repo << EOF
[oneAPI]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev
baseurl=https://yum.repos.intel.com/${XPU_REPO_NAME}
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Install Intel Support Packages
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl-2025.0.1-6"
fi
yum install -y ${XPU_PACKAGES}
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
@ -96,8 +103,6 @@ EOF
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
# Cleanup
dnf clean all
@ -119,7 +124,7 @@ function install_sles() {
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
zypper addrepo https://yum.repos.intel.com/${XPU_REPO_NAME} oneAPI
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
@ -131,7 +136,7 @@ function install_sles() {
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
zypper install -y ${XPU_PACKAGES}
}
@ -142,6 +147,13 @@ if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
XPU_DRIVER_VERSION=""
fi
XPU_REPO_NAME="intel-for-pytorch-gpu-dev"
XPU_PACKAGES="intel-for-pytorch-gpu-dev-0.5 intel-pti-dev-0.9"
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_REPO_NAME="oneapi"
XPU_PACKAGES="intel-deep-learning-essentials-2025.0"
fi
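# Net effect (illustrative): images built with XPU_VERSION=2025.0 install
# intel-deep-learning-essentials-2025.0 from the "oneapi" repo, while all other
# XPU images keep intel-for-pytorch-gpu-dev-0.5 and intel-pti-dev-0.9.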
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in

View File

@ -56,16 +56,21 @@ RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cuda as cuda12.6
RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
@ -87,13 +92,6 @@ RUN apt-get update -y && \
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
# Install patchelf

View File

@ -39,17 +39,7 @@ case ${GPU_ARCH_TYPE} in
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)

View File

@ -25,7 +25,8 @@ ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install cuda and cudnn
ARG CUDA_VERSION

View File

@ -144,6 +144,10 @@ COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
@ -194,10 +198,3 @@ ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -1,153 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

View File

@ -1,5 +1,4 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=amd64/almalinux:8
FROM quay.io/pypa/manylinux_2_28_x86_64 as base
@ -117,30 +116,49 @@ COPY --from=jni /usr/local/include/jni.h /usr/local/
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
ENV PATH /opt/conda/bin:$PATH
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# Install setuptools and wheel for python 3.12/3.13
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;
# cmake-3.18.4 from pip
# cmake-3.18.4 from pip; force in case cmake3 already exists
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
ln -sf /usr/local/bin/cmake /usr/bin/cmake3
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM cpu_final as rocm_final
ARG ROCM_VERSION=6.0
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ARG DEVTOOLSET_VERSION=11
ENV LDFLAGS="-Wl,-rpath=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64 -Wl,-rpath=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib"
# Somewhere in ROCm stack, we still use non-existing /opt/rocm/hip path,
# below workaround helps avoid error
ENV ROCM_PATH /opt/rocm
# cmake-3.28.4 from pip to get enable_language(HIP)
# and avoid 3.21.0 cmake+ninja issues with ninja inserting "-Wl,--no-as-needed" in LINK_FLAGS for static linker
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
ENV MKLROOT /opt/intel
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
@ -150,8 +168,7 @@ ENV XPU_DRIVER_TYPE ROLLING
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# Install setuptools and wheel for python 3.13
RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
ADD ./common/install_xpu.sh install_xpu.sh
ENV XPU_VERSION 2025.0
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -38,6 +38,12 @@ RUN yum install -y \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
@ -48,6 +54,11 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/op
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as final
# remove unnecessary python versions
@ -55,3 +66,5 @@ RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -61,7 +61,7 @@ RUN git config --global --add safe.directory "*"
# NOTE: Need a better way to get this library, as Ubuntu's package can be removed or changed by the vendor
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-4ubuntu2_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/

View File

@ -1,17 +1,20 @@
FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
FROM quay.io/pypa/manylinux_2_28_s390x as base
# Language variables
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
ARG DEVTOOLSET_VERSION=13
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN apt update ; apt upgrade -y
RUN apt install -y \
build-essential \
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
sudo \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
@ -24,19 +27,40 @@ RUN apt install -y \
util-linux \
wget \
which \
xz-utils \
xz \
yasm \
less \
zstd \
libgomp \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-c++ \
gcc-toolset-${DEVTOOLSET_VERSION}-binutils \
gcc-toolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
cmake \
python3 \
python3-dev \
python3-setuptools \
python3-yaml \
python3-typing-extensions \
libblas-dev \
libopenblas-dev \
liblapack-dev \
libatlas-base-dev
rust \
cargo \
llvm-devel \
libzstd-devel \
python3.12-devel \
python3.12-setuptools \
python3.12-pip \
python3-virtualenv \
python3.12-pyyaml \
python3.12-numpy \
python3.12-wheel \
python3.12-cryptography \
blas-devel \
openblas-devel \
lapack-devel \
atlas-devel \
libjpeg-devel \
libxslt-devel \
libxml2-devel \
openssl-devel \
valgrind
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
@ -44,14 +68,8 @@ RUN apt install -y \
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# installed python doesn't have development parts. Rebuild it from scratch
RUN /bin/rm -rf /opt/_internal /opt/python /usr/local/*/*
# EPEL for cmake
FROM base as patchelf
@ -64,10 +82,43 @@ FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
ENV SSL_CERT_FILE=
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM openssl as final
FROM base as final
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
RUN alternatives --set python /usr/bin/python3.12
RUN alternatives --set python3 /usr/bin/python3.12
RUN pip-3.12 install typing_extensions
ENTRYPOINT []
CMD ["/bin/bash"]
# install test dependencies:
# - grpcio requires system openssl, bundled crypto fails to build
# - ml_dtypes 0.4.0 requires some fixes provided in later commits to build
RUN dnf install -y \
protobuf-devel \
protobuf-c-devel \
protobuf-lite-devel \
wget \
patch
RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio==1.65.4
RUN cd ~ && \
git clone https://github.com/jax-ml/ml_dtypes && \
cd ml_dtypes && \
git checkout v0.4.0 && \
git submodule update --init --recursive && \
wget https://github.com/jax-ml/ml_dtypes/commit/b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
wget https://github.com/jax-ml/ml_dtypes/commit/d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
patch -p1 < b969f76914d6b30676721bc92bf0f6021a0d1321.patch && \
patch -p1 < d4e6d035ecda073eab8bcf60f4eef572ee7087e6.patch && \
python3 setup.py bdist_wheel && \
pip3 install dist/*.whl && \
rm -rf ml_dtypes

View File

@ -48,7 +48,7 @@ case ${GPU_ARCH_TYPE} in
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
@ -61,7 +61,7 @@ case ${GPU_ARCH_TYPE} in
cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=redhat/ubi9
GPU_IMAGE=s390x/almalinux:8
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
@ -87,22 +87,18 @@ case ${GPU_ARCH_TYPE} in
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm)
rocm|rocm-manylinux_2_28)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
DEVTOOLSET_VERSION="9"
if [ ${GPU_ARCH_TYPE} == "rocm-manylinux_2_28" ]; then
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
TARGET=xpu_final
@ -125,11 +121,14 @@ fi
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
fi
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
@ -141,7 +140,7 @@ fi
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GITHUB_REF=${GITHUB_REF:-"dev"}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}

View File

@ -16,37 +16,27 @@ CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131
AUTOCONF_ROOT=autoconf-2.69
AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel"
if [ "$(uname -m)" != "s390x" ] ; then
PYTHON_COMPILE_DEPS="${PYTHON_COMPILE_DEPS} db4-devel"
else
PYTHON_COMPILE_DEPS="${PYTHON_COMPILE_DEPS} libdb-devel"
fi
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Get build utilities
MY_DIR=$(dirname "${BASH_SOURCE[0]}")
source $MY_DIR/build_utils.sh
if [ "$(uname -m)" != "s390x" ] ; then
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file cmake28 \
kernel-devel-`uname -r` \
${PYTHON_COMPILE_DEPS}
else
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"
# Development tools and libraries
apt install -y bzip2 make git patch unzip diffutils \
automake which file cmake \
linux-headers-virtual \
${PYTHON_COMPILE_DEPS}
fi
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file \
${PYTHON_COMPILE_DEPS}
# Install newest autoconf
build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH
@ -92,16 +82,13 @@ ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel
# Clean up development headers and other unnecessary stuff for
# final image
if [ "$(uname -m)" != "s390x" ] ; then
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
else
apt purge -y ${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
fi
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} || true > /dev/null 2>&1
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
# we don't need libpython*.a, and they're many megabytes
find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f
# Strip what we can -- and ignore errors, because this just attempts to strip

View File

@ -3,7 +3,7 @@
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

View File

@ -1,10 +1,12 @@
# cf. https://github.com/pypa/manylinux/issues/53
import sys
from urllib.request import urlopen
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
@ -12,14 +14,8 @@ if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
EXC = OSError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)

View File

@ -5,7 +5,7 @@
#Pinned versions: 1.6
#test that import:
boto3==1.19.12
boto3==1.35.42
#Description: AWS SDK for python
#Pinned versions: 1.19.12, 1.16.34
#test that import:
@ -30,13 +30,13 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.2.1
expecttest==0.3.0
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#Pinned versions: 0.3.0
#test that import:
fbscribelogger==0.1.6
fbscribelogger==0.1.7
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:
@ -90,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.11.2
mypy==1.14.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
#Pinned versions: 1.14.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -118,7 +118,7 @@ numba==0.55.2 ; python_version == "3.10"
#numpy
#Description: Provides N-dimensional arrays and linear algebra
#Pinned versions: 1.20
#Pinned versions: 1.26.2
#test that import: test_view_ops.py, test_unary_ufuncs.py, test_type_promotion.py,
#test_type_info.py, test_torch.py, test_tensorexpr_pybind.py, test_tensorexpr.py,
#test_tensorboard.py, test_tensor_creation_ops.py, test_static_runtime.py,
@ -128,6 +128,12 @@ numba==0.55.2 ; python_version == "3.10"
#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
#test_binary_ufuncs.py
numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
pandas==2.0.3; python_version < "3.13"
pandas==2.2.3; python_version >= "3.13"
#onnxruntime
#Description: scoring engine for Open Neural Network Exchange (ONNX) models
@ -139,9 +145,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.12.1
optree==0.13.0
#Description: A library for tree manipulation
#Pinned versions: 0.12.1
#Pinned versions: 0.13.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
@ -152,7 +158,7 @@ optree==0.12.1
#test_pointwise_ops.py, test_dtensor_ops.py, test_torchinductor.py, test_fx.py,
#test_fake_tensor.py, test_mps.py
pillow==10.3.0
pillow==11.0.0
#Description: Python Imaging Library fork
#Pinned versions: 10.3.0
#test that import:
@ -187,6 +193,11 @@ pytest-rerunfailures>=10.3
#Pinned versions:
#test that import:
pytest-subtests==0.13.1
#Description: plugin for subtest support
#Pinned versions:
#test that import:
#pytest-benchmark
#Description: fixture for benchmarking code
#Pinned versions: 3.2.3
@ -234,7 +245,7 @@ scikit-image==0.22.0 ; python_version >= "3.10"
#test that import:
scipy==1.10.1 ; python_version <= "3.11"
scipy==1.12.0 ; python_version == "3.12"
scipy==1.14.1 ; python_version >= "3.12"
# Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python
#Pinned versions: 1.10.1
@ -253,7 +264,7 @@ tb-nightly==2.13.0a20230426
#test that import:
# needed by torchgen utils
typing-extensions
typing-extensions>=4.10.0
#Description: type hints for python
#Pinned versions:
#test that import:
@ -269,26 +280,21 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#test that import:
#lintrunner is supported on aarch64-linux only from 0.12.4 version
lintrunner==0.12.5
lintrunner==0.12.7
#Description: all about linters!
#Pinned versions: 0.12.5
#Pinned versions: 0.12.7
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
#test that import:
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.4
jinja2==3.1.6
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@ -298,42 +304,42 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.12.2.0
z3-solver==4.12.6.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: Also included in .ci/docker/requirements-docs.txt
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
pywavelets==1.7.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0
lxml==5.3.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
sympy==1.13.3
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
onnx==1.17.0
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
onnxscript==0.2.2
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -342,3 +348,32 @@ parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1 ; python_version >= "3.8"
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
astunparse
PyYAML
pyzstd
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0 ; python_version >= "3.8"
#Description: required for testing ILP formulation under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py
dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:

View File

@ -14,7 +14,8 @@ matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
tensorboard==2.13.0
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 2.13.0

View File

@ -1 +1 @@
3.1.0
3.3.0

View File

@ -30,7 +30,8 @@ ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
# Install gcc
ARG GCC_VERSION
@ -80,6 +81,8 @@ RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt

View File

@ -14,21 +14,20 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -39,6 +38,11 @@ ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
@ -85,6 +89,32 @@ COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
@ -107,18 +137,17 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Install Open MPI for ROCm
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

View File

@ -36,7 +36,8 @@ ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
COPY ./common/install_magma_conda.sh install_magma_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
RUN if [ -n "${UNINSTALL_DILL}" ]; then pip uninstall -y dill; fi
# Install gcc
@ -87,19 +88,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install Android NDK
ARG ANDROID
ARG ANDROID_NDK
ARG GRADLE_VERSION
COPY ./common/install_android.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
COPY ./android/AndroidManifest.xml AndroidManifest.xml
COPY ./android/build.gradle build.gradle
RUN if [ -n "${ANDROID}" ]; then bash ./install_android.sh; fi
RUN rm install_android.sh cache_vision_models.sh common_utils.sh
RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
# (optional) Install Vulkan SDK
ARG VULKAN_SDK_VERSION
COPY ./common/install_vulkan_sdk.sh install_vulkan_sdk.sh

10
.ci/libtorch/build.sh Normal file
View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
# This is mostly just a shim to manywheel/build.sh
# TODO: Make this a dedicated script to build just libtorch
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
USE_CUSPARSELT=0 BUILD_PYTHONLESS=1 DESIRED_PYTHON="3.9" ${SCRIPTPATH}/../manywheel/build.sh

2
.ci/magma/.gitignore vendored Normal file
View File

@ -0,0 +1,2 @@
output/
magma-cuda*/

49
.ci/magma/Makefile Normal file
View File

@ -0,0 +1,49 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_CUDA ?= 11.8
DESIRED_CUDA_SHORT = $(subst .,,$(DESIRED_CUDA))
PACKAGE_NAME = magma-cuda
CUDA_ARCH_LIST ?= -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
-w /builder \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda128
all: magma-cuda126
all: magma-cuda124
all: magma-cuda118
.PHONY:
clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda128:
$(DOCKER_RUN)
.PHONY: magma-cuda126
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
$(DOCKER_RUN)
.PHONY: magma-cuda124
magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37
magma-cuda118:
$(DOCKER_RUN)

50
.ci/magma/README.md Normal file
View File

@ -0,0 +1,50 @@
# Magma
This folder contains the scripts and configurations to build magma, statically linked for various versions of CUDA.
## Building
Look in the `Makefile` for available targets to build. To build any target, for example `magma-cuda118`, run
```
# Using `docker`
make magma-cuda118
# Using `podman`
DOCKER_CMD=podman make magma-cuda118
```
This spawns a `pytorch/manylinux2_28-builder:cuda<version>-main` docker image (see `DOCKER_RUN` in the `Makefile`), which has the required `devtoolset` and CUDA versions installed.
Within the docker image, it runs `build_magma.sh` with the correct environment variables set, which packages the necessary files
into a tarball with the following structure:
```
.
├── include # header files
├── lib # libmagma.a
├── info
│ ├── licenses # license file
│ └── recipe # build script and patches
```
More specifically, `build_magma.sh` copies the relevant files from the `package_files` directory depending on the CUDA version.
The resulting binaries end up in the `output` folder.
## Pushing
Packages can be uploaded to an S3 bucket using:
```
aws s3 cp output/*/magma-cuda*.bz2 <bucket-with-path>
```
If you do not have upload permissions, please ping @seemethere or @soumith to gain access
## New versions
New CUDA versions can be added by creating a new make target for the desired version. For CUDA version NN.n, the target should be named `magma-cudaNNn`.
Make sure to set the appropriate environment variables (e.g., `DESIRED_CUDA`, `CUDA_ARCH_LIST`) in the `Makefile` accordingly, and check `build_magma.sh` to ensure the logic for copying over the files remains correct; a sketch of such a target is shown below.
New patches can be added by editing the `Makefile` and `build_magma.sh` the same way `getrf_nbparam.patch` is implemented.
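For illustration only, a hypothetical target for a future CUDA 12.9 package could mirror the existing `magma-cuda128` target; the version number, arch list, and builder image tag here are assumptions, not a tested configuration:

```
.PHONY: magma-cuda129
magma-cuda129: DESIRED_CUDA := 12.9
magma-cuda129: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda129:
	$(DOCKER_RUN)
```

Remember to also add the new target to the `all:` list and to confirm that the matching `pytorch/manylinux2_28-builder:cuda12.9-main` image exists before building.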

50
.ci/magma/build_magma.sh Executable file
View File

@ -0,0 +1,50 @@
#!/usr/bin/env bash
set -eou pipefail
# Environment variables
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
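# Illustrative standalone invocation (values are examples only; in CI these
# variables are set by the Makefile's DOCKER_RUN inside the builder image):
#   DESIRED_CUDA=12.6 PACKAGE_NAME=magma-cuda126 bash magma/build_magma.sh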
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
MAGMA_VERSION=2.6.1
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma/package_files # source patches and metadata
PACKAGE_DIR=${ROOT_DIR}/magma/${PACKAGE_NAME} # build workspace
PACKAGE_OUTPUT=${ROOT_DIR}/magma/output # where tarballs are stored
PACKAGE_BUILD=${PACKAGE_DIR}/build # where the content of the tarball is prepared
PACKAGE_RECIPE=${PACKAGE_BUILD}/info/recipe
PACKAGE_LICENSE=${PACKAGE_BUILD}/info/licenses
mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RECIPE} ${PACKAGE_LICENSE}
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
curl -LO http://icl.utk.edu/projectsfiles/magma/downloads/magma-${MAGMA_VERSION}.tar.gz
tar zxf magma-${MAGMA_VERSION}.tar.gz
sha256sum --check < ${PACKAGE_FILES}/magma-${MAGMA_VERSION}.sha256
popd
# Apply patches and build
pushd ${PACKAGE_DIR}/magma-${MAGMA_VERSION}
patch < ${PACKAGE_FILES}/CMake.patch
patch < ${PACKAGE_FILES}/cmakelists.patch
patch -p0 < ${PACKAGE_FILES}/thread_queue.patch
patch -p1 < ${PACKAGE_FILES}/getrf_shfl.patch
patch -p1 < ${PACKAGE_FILES}/getrf_nbparam.patch
# The build.sh script expects to be executed from the sources root folder
INSTALL_DIR=${PACKAGE_BUILD} ${PACKAGE_FILES}/build.sh
popd
# Package recipe, license and tarball
# Folder and package name are backward compatible for the build workflow
cp ${PACKAGE_FILES}/build.sh ${PACKAGE_RECIPE}/build.sh
cp ${PACKAGE_FILES}/thread_queue.patch ${PACKAGE_RECIPE}/thread_queue.patch
cp ${PACKAGE_FILES}/cmakelists.patch ${PACKAGE_RECIPE}/cmakelists.patch
cp ${PACKAGE_FILES}/getrf_shfl.patch ${PACKAGE_RECIPE}/getrf_shfl.patch
cp ${PACKAGE_FILES}/getrf_nbparam.patch ${PACKAGE_RECIPE}/getrf_nbparam.patch
cp ${PACKAGE_FILES}/CMake.patch ${PACKAGE_RECIPE}/CMake.patch
cp ${PACKAGE_FILES}/magma-${MAGMA_VERSION}.sha256 ${PACKAGE_RECIPE}/magma-${MAGMA_VERSION}.sha256
cp ${PACKAGE_DIR}/magma-${MAGMA_VERSION}/COPYRIGHT ${PACKAGE_LICENSE}/COPYRIGHT
pushd ${PACKAGE_BUILD}
tar cjf ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2 include lib info
echo Built in ${PACKAGE_OUTPUT}/linux-64/${PACKAGE_NAME}-${MAGMA_VERSION}-1.tar.bz2
popd

View File

@ -0,0 +1,40 @@
--- CMake.src.cuda 2023-03-29 10:05:32.136954140 +0000
+++ CMake.src.cuda 2023-03-29 10:05:50.281318043 +0000
@@ -283,10 +283,10 @@
magmablas/zgeadd.cu
magmablas/zgeadd2.cu
magmablas/zgeam.cu
-magmablas/zgemm_fermi.cu
+#magmablas/zgemm_fermi.cu
magmablas/zgemm_reduce.cu
magmablas/zgemv_conj.cu
-magmablas/zgemv_fermi.cu
+#magmablas/zgemv_fermi.cu
magmablas/zgerbt.cu
magmablas/zgerbt_kernels.cu
magmablas/zgetmatrix_transpose.cpp
@@ -1009,18 +1009,18 @@
magmablas/sgeam.cu
magmablas/dgeam.cu
magmablas/cgeam.cu
-magmablas/sgemm_fermi.cu
-magmablas/dgemm_fermi.cu
-magmablas/cgemm_fermi.cu
+#magmablas/sgemm_fermi.cu
+#magmablas/dgemm_fermi.cu
+#magmablas/cgemm_fermi.cu
magmablas/sgemm_reduce.cu
magmablas/dgemm_reduce.cu
magmablas/cgemm_reduce.cu
magmablas/sgemv_conj.cu
magmablas/dgemv_conj.cu
magmablas/cgemv_conj.cu
-magmablas/sgemv_fermi.cu
-magmablas/dgemv_fermi.cu
-magmablas/cgemv_fermi.cu
+#magmablas/sgemv_fermi.cu
+#magmablas/dgemv_fermi.cu
+#magmablas/cgemv_fermi.cu
magmablas/sgerbt.cu
magmablas/dgerbt.cu
magmablas/cgerbt.cu

View File

@ -0,0 +1,12 @@
CUDA__VERSION=$(nvcc --version|sed -n 4p|cut -f5 -d" "|cut -f1 -d",")
if [ "$CUDA__VERSION" != "$DESIRED_CUDA" ]; then
echo "CUDA Version is not $DESIRED_CUDA. CUDA Version found: $CUDA__VERSION"
exit 1
fi
mkdir build
cd build
cmake .. -DUSE_FORTRAN=OFF -DGPU_TARGET="All" -DCMAKE_INSTALL_PREFIX="$INSTALL_DIR" -DCUDA_ARCH_LIST="$CUDA_ARCH_LIST"
make -j$(getconf _NPROCESSORS_CONF)
make install
cd ..

View File

@ -0,0 +1,388 @@
diff --git a/CMakeLists.txt b/CMakeLists.txt
index d5d8d87d..8a507334 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -3,7 +3,7 @@ cmake_minimum_required( VERSION 2.8.1 )
# ----------------------------------------
# to disable Fortran, set this to "off"
# see also -DADD_ below
-option( USE_FORTRAN "Fortran is required for some tester checks, but can be disabled with reduced functionality" ON )
+option( USE_FORTRAN "Fortran is required for some tester checks, but can be disabled with reduced functionality" OFF )
if (USE_FORTRAN)
project( MAGMA C CXX Fortran )
@@ -75,6 +75,8 @@ else()
message( WARNING "The compiler ${CMAKE_CXX_COMPILER} doesn't support the -std=c++11 flag. Some code may not compile.")
endif()
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++ -fno-exceptions")
+
CHECK_C_COMPILER_FLAG("-std=c99" COMPILER_SUPPORTS_C99)
if (COMPILER_SUPPORTS_C99)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -std=c99")
@@ -101,15 +103,15 @@ endif()
# ----------------------------------------
-# locate OpenMP
-find_package( OpenMP )
-if (OPENMP_FOUND)
- message( STATUS "Found OpenMP" )
- message( STATUS " OpenMP_C_FLAGS ${OpenMP_C_FLAGS}" )
- message( STATUS " OpenMP_CXX_FLAGS ${OpenMP_CXX_FLAGS}" )
- set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}" )
- set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}" )
-endif()
+# # locate OpenMP
+# find_package( OpenMP )
+# if (OPENMP_FOUND)
+# message( STATUS "Found OpenMP" )
+# message( STATUS " OpenMP_C_FLAGS ${OpenMP_C_FLAGS}" )
+# message( STATUS " OpenMP_CXX_FLAGS ${OpenMP_CXX_FLAGS}" )
+# set( CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}" )
+# set( CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}" )
+# endif()
if (MAGMA_ENABLE_CUDA)
# ----------------------------------------
@@ -132,7 +134,7 @@ if (MAGMA_ENABLE_CUDA)
set( NV_SM "" )
set( NV_COMP "" )
- set(CUDA_SEPARABLE_COMPILATION ON)
+ set(CUDA_SEPARABLE_COMPILATION OFF)
# nvcc >= 6.5 supports -std=c++11, so propagate CXXFLAGS to NVCCFLAGS.
# Older nvcc didn't support -std=c++11, so previously we disabled propagation.
@@ -294,11 +296,18 @@ if (MAGMA_ENABLE_CUDA)
message( STATUS " compile for CUDA arch 8.0 (Ampere)" )
endif()
+ if ( ${GPU_TARGET} MATCHES "All")
+ set( MIN_ARCH 370)
+ SET( NV_SM ${CUDA_ARCH_LIST})
+ SET( NV_COMP "")
+ endif()
+
if (NOT MIN_ARCH)
message( FATAL_ERROR "GPU_TARGET must contain one or more of Fermi, Kepler, Maxwell, Pascal, Volta, Turing, Ampere, or valid sm_[0-9][0-9]" )
endif()
- set( CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -Xcompiler -fPIC ${NV_SM} ${NV_COMP} ${FORTRAN_CONVENTION} )
+ set( CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -DHAVE_CUBLAS -Xfatbin -compress-all -Xcompiler -fPIC -std=c++11 ${NV_SM} ${NV_COMP} ${FORTRAN_CONVENTION} )
+ MESSAGE(STATUS "CUDA_NVCC_FLAGS: ${CUDA_NVCC_FLAGS}")
#add_definitions( "-DMAGMA_HAVE_CUDA -DMAGMA_CUDA_ARCH_MIN=${MIN_ARCH}" )
set(MAGMA_HAVE_CUDA "1")
set(MAGMA_CUDA_ARCH_MIN "${MIN_ARCH}")
@@ -413,7 +422,7 @@ set_property(CACHE BLA_VENDOR PROPERTY STRINGS
set( LAPACK_LIBRARIES "" CACHE STRING "Libraries for LAPACK and BLAS, to manually override search" )
if (LAPACK_LIBRARIES STREQUAL "")
message( STATUS "Searching for BLAS and LAPACK. To override, set LAPACK_LIBRARIES using ccmake." )
- find_package( LAPACK )
+ # find_package( LAPACK )
# force showing updated LAPACK_LIBRARIES in ccmake / cmake-gui.
set( LAPACK_LIBRARIES ${LAPACK_LIBRARIES} CACHE STRING "Libraries for LAPACK and BLAS, to manually override search" FORCE )
else()
@@ -552,12 +561,12 @@ if (WIN32)
#message( "libmagma_all_f ${libmagma_all_f}" )
# on Windows, Fortran files aren't compiled if listed here...
- cuda_add_library( magma ${libmagma_all_cpp} )
+ cuda_add_library( magma STATIC ${libmagma_all_cpp} OPTIONS --compiler-options "-fPIC")
target_link_libraries( magma
${LAPACK_LIBRARIES}
${CUDA_CUDART_LIBRARY}
${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
+ # ${CUDA_cusparse_LIBRARY}
)
# no Fortran files at the moment (how to test libmagma_all_f is not empty?),
@@ -575,13 +584,13 @@ if (WIN32)
else()
# Unix doesn't seem to have a problem with mixing C, CUDA, and Fortran files
if (MAGMA_ENABLE_CUDA)
- cuda_add_library( magma ${libmagma_all} )
+ cuda_add_library( magma STATIC ${libmagma_all} OPTIONS --compiler-options "-fPIC")
target_link_libraries( magma
${blas_fix}
${LAPACK_LIBRARIES}
${CUDA_CUDART_LIBRARY}
${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
+ # ${CUDA_cusparse_LIBRARY}
)
else()
find_package( hipBLAS )
@@ -614,138 +623,139 @@ else()
endif()
endif()
add_custom_target( lib DEPENDS magma )
-
-
-# ----------------------------------------
-# compile lapacktest library
-# If use fortran, compile only Fortran files, not magma_[sdcz]_no_fortran.cpp
-# else, compile only C++ files, not Fortran files
-if (USE_FORTRAN)
- foreach( filename ${liblapacktest_all} )
- if (filename MATCHES "\\.(f|f90|F90)$")
- list( APPEND liblapacktest_all_f ${filename} )
- endif()
- endforeach()
- add_library( lapacktest ${liblapacktest_all_f} )
-else()
- # alternatively, use only C/C++/CUDA files, including magma_[sdcz]_no_fortran.cpp
- foreach( filename ${liblapacktest_all} )
- if (filename MATCHES "\\.(c|cu|cpp)$")
- list( APPEND liblapacktest_all_cpp ${filename} )
- endif()
- endforeach()
- add_library( lapacktest ${liblapacktest_all_cpp} )
-endif()
-target_link_libraries( lapacktest
- ${blas_fix}
- ${LAPACK_LIBRARIES}
-)
-
-
-# ----------------------------------------
-# compile tester library
-add_library( tester ${libtest_all} )
-target_link_libraries( tester
- magma
- lapacktest
- ${blas_fix}
- ${LAPACK_LIBRARIES}
-)
+set_target_properties(magma PROPERTIES POSITION_INDEPENDENT_CODE ON)
+
+
+# # ----------------------------------------
+# # compile lapacktest library
+# # If use fortran, compile only Fortran files, not magma_[sdcz]_no_fortran.cpp
+# # else, compile only C++ files, not Fortran files
+# if (USE_FORTRAN)
+# foreach( filename ${liblapacktest_all} )
+# if (filename MATCHES "\\.(f|f90|F90)$")
+# list( APPEND liblapacktest_all_f ${filename} )
+# endif()
+# endforeach()
+# add_library( lapacktest ${liblapacktest_all_f} )
+# else()
+# # alternatively, use only C/C++/CUDA files, including magma_[sdcz]_no_fortran.cpp
+# foreach( filename ${liblapacktest_all} )
+# if (filename MATCHES "\\.(c|cu|cpp)$")
+# list( APPEND liblapacktest_all_cpp ${filename} )
+# endif()
+# endforeach()
+# add_library( lapacktest ${liblapacktest_all_cpp} )
+# endif()
+# target_link_libraries( lapacktest
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# )
+
+
+# # ----------------------------------------
+# # compile tester library
+# add_library( tester ${libtest_all} )
+# target_link_libraries( tester
+# magma
+# lapacktest
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# )
# ----------------------------------------
# compile MAGMA sparse library
# sparse doesn't have Fortran at the moment, so no need for above shenanigans
-if (MAGMA_ENABLE_CUDA)
- include_directories( sparse/include )
- include_directories( sparse/control )
-else()
- include_directories( sparse_hip/include )
- include_directories( sparse_hip/control )
-endif()
-include_directories( testing )
-
-if (MAGMA_ENABLE_CUDA)
- cuda_add_library( magma_sparse ${libsparse_all} )
- target_link_libraries( magma_sparse
- magma
- ${blas_fix}
- ${LAPACK_LIBRARIES}
- ${CUDA_CUDART_LIBRARY}
- ${CUDA_CUBLAS_LIBRARIES}
- ${CUDA_cusparse_LIBRARY}
- )
-else()
- add_library( magma_sparse ${libsparse_all} )
- target_link_libraries( magma_sparse
- magma
- ${blas_fix}
- ${LAPACK_LIBRARIES}
- hip::device
- roc::hipblas
- roc::hipsparse
- )
-endif()
-add_custom_target( sparse-lib DEPENDS magma_sparse )
-
-
-# ----------------------------------------
-# compile each tester
-
-# save testers to testing/
-# save tester lib files to testing_lib/ to avoid cluttering lib/
-set( CMAKE_RUNTIME_OUTPUT_DIRECTORY testing )
-set( CMAKE_ARCHIVE_OUTPUT_DIRECTORY testing_lib )
-set( CMAKE_LIBRARY_OUTPUT_DIRECTORY testing_lib )
-
-# skip Fortran testers, which require an extra file from CUDA
-foreach( filename ${testing_all} )
- if (filename MATCHES "\\.(c|cu|cpp)$")
- list( APPEND testing_all_cpp ${filename} )
- endif()
-endforeach()
-foreach( TEST ${testing_all_cpp} )
- string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
- string( REGEX REPLACE "testing/" "" EXE ${EXE} )
- #message( "${TEST} --> ${EXE}" )
- add_executable( ${EXE} ${TEST} )
- target_link_libraries( ${EXE} tester lapacktest magma )
- list( APPEND testing ${EXE} )
-endforeach()
-add_custom_target( testing DEPENDS ${testing} )
-
-
-# ----------------------------------------
-# compile each sparse tester
-
-if (MAGMA_ENABLE_CUDA)
- set(SPARSE_TEST_DIR "sparse/testing")
-else()
- set(SPARSE_TEST_DIR "sparse_hip/testing")
-endif()
-
-
-set( CMAKE_RUNTIME_OUTPUT_DIRECTORY "${SPARSE_TEST_DIR}" )
-cmake_policy( SET CMP0037 OLD)
-foreach( TEST ${sparse_testing_all} )
- string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
- string( REGEX REPLACE "${SPARSE_TEST_DIR}/" "" EXE ${EXE} )
- #message( "${TEST} --> ${EXE}" )
- add_executable( ${EXE} ${TEST} )
- target_link_libraries( ${EXE} magma_sparse magma )
- list( APPEND sparse-testing ${EXE} )
-endforeach()
-add_custom_target( sparse-testing DEPENDS ${sparse-testing} )
+# if (MAGMA_ENABLE_CUDA)
+# include_directories( sparse/include )
+# include_directories( sparse/control )
+# else()
+# include_directories( sparse_hip/include )
+# include_directories( sparse_hip/control )
+# endif()
+# include_directories( testing )
+
+# if (MAGMA_ENABLE_CUDA)
+# cuda_add_library( magma_sparse ${libsparse_all} )
+# target_link_libraries( magma_sparse
+# magma
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# ${CUDA_CUDART_LIBRARY}
+# ${CUDA_CUBLAS_LIBRARIES}
+# ${CUDA_cusparse_LIBRARY}
+# )
+# else()
+# add_library( magma_sparse ${libsparse_all} )
+# target_link_libraries( magma_sparse
+# magma
+# ${blas_fix}
+# ${LAPACK_LIBRARIES}
+# hip::device
+# roc::hipblas
+# roc::hipsparse
+# )
+# endif()
+# add_custom_target( sparse-lib DEPENDS magma_sparse )
+
+
+# # ----------------------------------------
+# # compile each tester
+
+# # save testers to testing/
+# # save tester lib files to testing_lib/ to avoid cluttering lib/
+# set( CMAKE_RUNTIME_OUTPUT_DIRECTORY testing )
+# set( CMAKE_ARCHIVE_OUTPUT_DIRECTORY testing_lib )
+# set( CMAKE_LIBRARY_OUTPUT_DIRECTORY testing_lib )
+
+# # skip Fortran testers, which require an extra file from CUDA
+# foreach( filename ${testing_all} )
+# if (filename MATCHES "\\.(c|cu|cpp)$")
+# list( APPEND testing_all_cpp ${filename} )
+# endif()
+# endforeach()
+# foreach( TEST ${testing_all_cpp} )
+# string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
+# string( REGEX REPLACE "testing/" "" EXE ${EXE} )
+# #message( "${TEST} --> ${EXE}" )
+# add_executable( ${EXE} ${TEST} )
+# target_link_libraries( ${EXE} tester lapacktest magma )
+# list( APPEND testing ${EXE} )
+# endforeach()
+# add_custom_target( testing DEPENDS ${testing} )
+
+
+# # ----------------------------------------
+# # compile each sparse tester
+
+# if (MAGMA_ENABLE_CUDA)
+# set(SPARSE_TEST_DIR "sparse/testing")
+# else()
+# set(SPARSE_TEST_DIR "sparse_hip/testing")
+# endif()
+
+
+# set( CMAKE_RUNTIME_OUTPUT_DIRECTORY "${SPARSE_TEST_DIR}" )
+# cmake_policy( SET CMP0037 OLD)
+# foreach( TEST ${sparse_testing_all} )
+# string( REGEX REPLACE "\\.(cpp|f90|F90)" "" EXE ${TEST} )
+# string( REGEX REPLACE "${SPARSE_TEST_DIR}/" "" EXE ${EXE} )
+# #message( "${TEST} --> ${EXE}" )
+# add_executable( ${EXE} ${TEST} )
+# target_link_libraries( ${EXE} magma_sparse magma )
+# list( APPEND sparse-testing ${EXE} )
+# endforeach()
+# add_custom_target( sparse-testing DEPENDS ${sparse-testing} )
# ----------------------------------------
# what to install
-install( TARGETS magma magma_sparse ${blas_fix}
+install( TARGETS magma ${blas_fix}
RUNTIME DESTINATION bin
LIBRARY DESTINATION lib
ARCHIVE DESTINATION lib )
-file( GLOB headers include/*.h sparse/include/*.h "${CMAKE_BINARY_DIR}/include/*.h" )
+file( GLOB headers include/*.h "${CMAKE_BINARY_DIR}/include/*.h" )
if (USE_FORTRAN)
install( FILES ${headers} ${modules}
DESTINATION include )
@@ -769,9 +779,9 @@ else()
"${blas_fix_lib} ${LAPACK_LIBS} hip::device roc::hipblas roc::hipsparse" )
endif()
set( MAGMA_REQUIRED "" )
-configure_file( "${pkgconfig}.in" "${pkgconfig}" @ONLY )
-install( FILES "${CMAKE_BINARY_DIR}/${pkgconfig}"
- DESTINATION lib/pkgconfig )
+# configure_file( "${pkgconfig}.in" "${pkgconfig}" @ONLY )
+# install( FILES "${CMAKE_BINARY_DIR}/${pkgconfig}"
+# DESTINATION lib/pkgconfig )
# ----------------------------------------
get_directory_property( compile_definitions COMPILE_DEFINITIONS )


@@ -0,0 +1,40 @@
diff --git a/control/get_batched_crossover.cpp b/control/get_batched_crossover.cpp
index 4ec57306..912f8608 100644
--- a/control/get_batched_crossover.cpp
+++ b/control/get_batched_crossover.cpp
@@ -119,7 +119,7 @@ void magma_get_spotrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_zgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 64;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -127,7 +127,7 @@ void magma_get_zgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_cgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -135,7 +135,7 @@ void magma_get_cgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_dgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}
@@ -143,7 +143,7 @@ void magma_get_dgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_
void magma_get_sgetrf_batched_nbparam(magma_int_t n, magma_int_t *nb, magma_int_t *recnb)
{
*nb = 128;
- *recnb = 32;
+ *recnb = 16;
return;
}


@@ -0,0 +1,15 @@
diff --git a/src/zgetrf_batched.cpp b/src/zgetrf_batched.cpp
index 24a65a90..884d9352 100644
--- a/src/zgetrf_batched.cpp
+++ b/src/zgetrf_batched.cpp
@@ -116,7 +116,9 @@ magma_zgetrf_batched(
return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
}
else{
- return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
+ // magma_cgetrf_batched_smallsq_shfl is broken, therefore let's call noshfl version for arch < 700
+ // return magma_zgetrf_batched_smallsq_shfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
+ return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );
}
#else
return magma_zgetrf_batched_smallsq_noshfl( m, dA_array, ldda, ipiv_array, info_array, batchCount, queue );


@@ -0,0 +1 @@
6cd83808c6e8bc7a44028e05112b3ab4e579bcc73202ed14733f66661127e213 magma-2.6.1.tar.gz


@@ -0,0 +1,20 @@
--- control/thread_queue.cpp 2016-08-30 06:37:49.000000000 -0700
+++ control/thread_queue.cpp 2016-10-10 19:47:28.911580965 -0700
@@ -15,7 +15,7 @@
{
if ( err != 0 ) {
fprintf( stderr, "Error: %s (%d)\n", strerror(err), err );
- throw std::exception();
+ // throw std::exception();
}
}
@@ -172,7 +172,7 @@
check( pthread_mutex_lock( &mutex ));
if ( quit_flag ) {
fprintf( stderr, "Error: push_task() called after quit()\n" );
- throw std::exception();
+ // throw std::exception();
}
q.push( task );
ntask += 1;

.ci/manywheel/LICENSE Normal file

@@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2016 manylinux
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

.ci/manywheel/build.sh Executable file

@@ -0,0 +1,28 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
case "${GPU_ARCH_TYPE:-BLANK}" in
BLANK)
# Legacy behavior for CircleCI
bash "${SCRIPTPATH}/build_cuda.sh"
;;
cuda)
bash "${SCRIPTPATH}/build_cuda.sh"
;;
rocm)
bash "${SCRIPTPATH}/build_rocm.sh"
;;
cpu | cpu-cxx11-abi | cpu-s390x)
bash "${SCRIPTPATH}/build_cpu.sh"
;;
xpu)
bash "${SCRIPTPATH}/build_xpu.sh"
;;
*)
echo "Un-recognized GPU_ARCH_TYPE '${GPU_ARCH_TYPE}', exiting..."
exit 1
;;
esac


@@ -0,0 +1,498 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -ex
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
source ${SOURCE_DIR}/set_desired_python.sh
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
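# Illustrative usage: `retry yum install -q -y zip openssl` makes up to five
# attempts, sleeping 1, 2, 4 and 8 seconds between retries.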
PLATFORM="manylinux2014_x86_64"
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# We use the package name to test the package by passing this to 'pip install'
# This is the env variable that setup.py uses to name the package. Note that
# pip 'normalizes' the name first by changing all - to _
if [[ -z "$TORCH_PACKAGE_NAME" ]]; then
TORCH_PACKAGE_NAME='torch'
fi
if [[ -z "$TORCH_NO_PYTHON_PACKAGE_NAME" ]]; then
TORCH_NO_PYTHON_PACKAGE_NAME='torch_no_python'
fi
TORCH_PACKAGE_NAME="$(echo $TORCH_PACKAGE_NAME | tr '-' '_')"
TORCH_NO_PYTHON_PACKAGE_NAME="$(echo $TORCH_NO_PYTHON_PACKAGE_NAME | tr '-' '_')"
echo "Expecting the built wheels to all be called '$TORCH_PACKAGE_NAME' or '$TORCH_NO_PYTHON_PACKAGE_NAME'"
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
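# e.g. (illustrative) PYTORCH_BUILD_VERSION=2.7.0 with PYTORCH_BUILD_NUMBER=2 yields 2.7.0.post2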
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
mkdir -p /tmp/$WHEELHOUSE_DIR
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=$($PATCHELF_BIN --version)
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
case ${DESIRED_PYTHON} in
cp31*)
retry pip install -q --pre numpy==2.1.0
;;
# Should catch 3.9+
*)
retry pip install -q --pre numpy==2.0.2
;;
esac
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
fi
# This value comes from binary_linux_build.sh (and should only be set to true
# for master / release branches)
BUILD_DEBUG_INFO=${BUILD_DEBUG_INFO:=0}
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
echo "Building wheel and debug info"
else
echo "BUILD_DEBUG_INFO was not set, skipping debug info"
fi
if [[ "$DISABLE_RCCL" = 1 ]]; then
echo "Disabling NCCL/RCCL in pyTorch"
USE_RCCL=0
USE_NCCL=0
USE_KINETO=0
else
USE_RCCL=1
USE_NCCL=1
USE_KINETO=1
fi
echo "Calling setup.py bdist at $(date)"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "Calling setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist_wheel for split build (BUILD_LIBTORCH_WHL)"
echo "Calling setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
time EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR --cmake
echo "Finished setup.py bdist_wheel for split build (BUILD_PYTHON_ONLY)"
else
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
fi
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages
if [[ -n "$BUILD_PYTHONLESS" ]]; then
# Now build pythonless libtorch
# Note - just use whichever python we happen to be on
python setup.py clean
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
mkdir -p build
pushd build
echo "Calling tools/build_libtorch.py at $(date)"
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
python ../tools/build_libtorch.py
echo "Finished tools/build_libtorch.py at $(date)"
popd
mkdir -p libtorch/{lib,bin,include,share}
cp -r build/build/lib libtorch/
# for now, the headers for the libtorch package will just be copied in
# from one of the wheels (this is from when this script built multiple
# wheels at once)
ANY_WHEEL=$(ls /tmp/$WHEELHOUSE_DIR/torch*.whl | head -n1)
unzip -d any_wheel $ANY_WHEEL
if [[ -d any_wheel/torch/include ]]; then
cp -r any_wheel/torch/include libtorch/
else
cp -r any_wheel/torch/lib/include libtorch/
fi
cp -r any_wheel/torch/share/cmake libtorch/share/
rm -rf any_wheel
echo $PYTORCH_BUILD_VERSION > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
fi
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# auditwheel repair doesn't work correctly and is buggy
# so manually do the work of copying dependency libs and patchelfing
# and fixing RECORDS entries correctly
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
# Do not rename nvrtc-builtins.so as they are dynamically loaded
# by libnvrtc.so
# Similarly don't mangle libcudnn and libcublas library names
if [[ $BASENAME == "libnvrtc-builtins.s"* || $BASENAME == "libcudnn"* || $BASENAME == "libcublas"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
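# Illustrative: fname_with_sha256 lib/libfoo.so.1 -> lib/libfoo-1a2b3c4d.so.1
# (hash = first 8 hex chars of the file's sha256; libnvrtc-builtins*, libcudnn*
# and libcublas* names are left untouched)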
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
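# Illustrative: fname_without_so_number libamdhip64.so.6 -> libamdhip64.so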
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# the RECORD file itself gets an entry with empty hash and size fields
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}
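# Illustrative RECORD entries produced above (hypothetical paths):
#   "torch/lib/libfoo.so",sha256=<urlsafe-base64-digest>,123456
#   "torch-2.x.dist-info/RECORD",,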
replace_needed_sofiles() {
find $1 -name '*.so*' | while read sofile; do
origname=$2
patchedname=$3
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
}
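# Illustrative: replace_needed_sofiles torch libnvrtc.so.12 libnvrtc-1a2b3c4d.so.12
# rewrites the DT_NEEDED entry in every *.so* under torch/ that links against libnvrtc.so.12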
echo 'Built this wheel:'
ls /tmp/$WHEELHOUSE_DIR
mkdir -p "/$WHEELHOUSE_DIR"
mv /tmp/$WHEELHOUSE_DIR/torch*linux*.whl /$WHEELHOUSE_DIR/
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
mv /tmp/$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/ || true
fi
if [[ -n "$BUILD_PYTHONLESS" ]]; then
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
fi
rm -rf /tmp/$WHEELHOUSE_DIR
rm -rf /tmp_dir
mkdir /tmp_dir
pushd /tmp_dir
for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.whl /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
if [[ -d torch ]]; then
PREFIX=torch
else
PREFIX=libtorch
fi
if [[ $pkg != *"without-deps"* ]]; then
# copy over needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies
elif [[ "$DESIRED_CUDA" == *"xpu"* ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
replace_needed_sofiles $PREFIX ${DEPS_SONAME[i]} ${patched[i]}
# do the same for caffe2, if it exists
if [[ -d caffe2 ]]; then
replace_needed_sofiles caffe2 ${DEPS_SONAME[i]} ${patched[i]}
fi
done
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'}"
$PATCHELF_BIN --set-rpath ${C_SO_RPATH:-'$ORIGIN:$ORIGIN/lib'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to ${LIB_SO_RPATH:-'$ORIGIN'}"
$PATCHELF_BIN --set-rpath ${LIB_SO_RPATH:-'$ORIGIN'} ${FORCE_RPATH:-} $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# apply the manylinux_2_28 platform tag; this needs to happen before regenerating the RECORD
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
wheel_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/WHEEL/g')
sed -i -e s#linux_x86_64#"${PLATFORM}"# $wheel_file;
fi
# regenerate the RECORD file with new hashes
record_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g')
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
: > "$record_file"
# generate records for folders in wheel
find * -type f | while read fname; do
make_wheel_record "$fname" >>"$record_file"
done
fi
if [[ $BUILD_DEBUG_INFO == "1" ]]; then
pushd "$PREFIX/lib"
# Duplicate library into debug lib
cp libtorch_cpu.so libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch_cpu.so
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
# Zip up debug info
mkdir -p /tmp/debug
mv libtorch_cpu.so.dbg /tmp/debug/libtorch_cpu.so.dbg
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch_cpu.so)
pushd /tmp
PKG_NAME=$(basename "$pkg" | sed 's/\.whl$//g')
zip /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip /tmp/debug/libtorch_cpu.so.dbg
cp /tmp/debug-whl-libtorch-"$PKG_NAME"-"$CRC32".zip "$PYTORCH_FINAL_PACKAGE_DIR"
popd
popd
fi
# Rename wheel for Manylinux 2_28
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
pkg_name=$(echo $(basename $pkg) | sed -e s#linux_x86_64#"${PLATFORM}"#)
zip -rq $pkg_name $PREFIX*
rm -f $pkg
mv $pkg_name $(dirname $pkg)/$pkg_name
else
# zip up the wheel back
zip -rq $(basename $pkg) $PREFIX*
# remove original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
fi
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
if [[ -n "$BUILD_PYTHONLESS" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
else
cp /$WHEELHOUSE_DIR/torch*.whl "$PYTORCH_FINAL_PACKAGE_DIR"
fi
fi
# remove stuff before testing
rm -rf /opt/rh
if ls /usr/local/cuda* >/dev/null 2>&1; then
rm -rf /usr/local/cuda*
fi
# Test that all the wheels work
if [[ -z "$BUILD_PYTHONLESS" ]]; then
export OMP_NUM_THREADS=4 # on NUMA machines this takes too long
pushd $PYTORCH_ROOT/test
# Install the wheel for this Python version
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip uninstall -y "$TORCH_NO_PYTHON_PACKAGE_NAME" || true
fi
pip uninstall -y "$TORCH_PACKAGE_NAME"
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pip install "$TORCH_NO_PYTHON_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
fi
pip install "$TORCH_PACKAGE_NAME" --no-index -f /$WHEELHOUSE_DIR --no-dependencies -v
# Print info on the libraries installed in this wheel
# Rather than adjusting the find command to skip non-library files with an embedded *.so* in their name,
# we add || true to the ldd command, since this is only for reporting purposes.
installed_libraries=($(find "$pydir/lib/python${py_majmin}/site-packages/torch/" -name '*.so*'))
echo "The wheel installed all of the libraries: ${installed_libraries[@]}"
for installed_lib in "${installed_libraries[@]}"; do
ldd "$installed_lib" || true
done
# Run the tests
echo "$(date) :: Running tests"
pushd "$PYTORCH_ROOT"
LD_LIBRARY_PATH=/usr/local/nvidia/lib64 \
"${PYTORCH_ROOT}/.ci/pytorch/run_tests.sh" manywheel "${py_majmin}" "$DESIRED_CUDA"
popd
echo "$(date) :: Finished tests"
fi

.ci/manywheel/build_cpu.sh Executable file

@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -ex
export TH_BINARY_BUILD=1
export USE_CUDA=0
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
WHEELHOUSE_DIR="wheelhousecpu"
LIBTORCH_HOUSE_DIR="libtorch_housecpu"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhousecpu"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_housecpu"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source ${SOURCE_DIR}/${BUILD_SCRIPT}

.ci/manywheel/build_cuda.sh Normal file

@@ -0,0 +1,309 @@
#!/usr/bin/env bash
set -ex
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P ))"
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
export NCCL_ROOT_DIR=/usr/local/cuda
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # dont install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine CUDA version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `CUDA_VERSION`,
# because in some cases a single Docker image can have multiple CUDA versions
# on it, and `nvcc --version` might not show the CUDA version we want.
if [[ -n "$DESIRED_CUDA" ]]; then
# If the DESIRED_CUDA already matches the format that we expect
if [[ ${DESIRED_CUDA} =~ ^[0-9]+\.[0-9]+$ ]]; then
CUDA_VERSION=${DESIRED_CUDA}
else
# cu90, cu92, cu100, cu101
if [[ ${#DESIRED_CUDA} -eq 4 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:1}.${DESIRED_CUDA:3:1}"
elif [[ ${#DESIRED_CUDA} -eq 5 ]]; then
CUDA_VERSION="${DESIRED_CUDA:2:2}.${DESIRED_CUDA:4:1}"
fi
fi
echo "Using CUDA $CUDA_VERSION as determined by DESIRED_CUDA"
else
CUDA_VERSION=$(nvcc --version|grep release|cut -f5 -d" "|cut -f1 -d",")
echo "CUDA $CUDA_VERSION Detected"
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
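# e.g. (illustrative) CUDA_VERSION=12.6 -> cuda_version_nodot=126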
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
11.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};3.7;9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
*)
echo "unknown cuda version $CUDA_VERSION"
exit 1
;;
esac
export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}
echo "${TORCH_CUDA_ARCH_LIST}"
# Package directories
WHEELHOUSE_DIR="wheelhouse$cuda_version_nodot"
LIBTORCH_HOUSE_DIR="libtorch_house$cuda_version_nodot"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$cuda_version_nodot"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$cuda_version_nodot"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
)
DEPS_SONAME=(
"libgomp.so.1"
)
# CUDA 11.8 has to ship libcusparseLt.so.0 with the binary
# since nvidia-cusparselt-cu11 is not available on PyPI
if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
DEPS_SONAME+=(
"libcusparseLt.so.0"
)
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcusparseLt.so.0"
)
fi
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.12"
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../cusparselt/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
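# the array is joined into a single ':'-separated rpath string, e.g.
# '$ORIGIN/../../nvidia/cublas/lib:$ORIGIN/../../nvidia/cuda_cupti/lib:...'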
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
elif [[ $CUDA_VERSION == "11.8" ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
# Bundle ptxas into the wheel, see https://github.com/pytorch/pytorch/pull/119750
export BUILD_BUNDLE_PTXAS=1
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcudnn_adv.so.9"
"/usr/local/cuda/lib64/libcudnn_cnn.so.9"
"/usr/local/cuda/lib64/libcudnn_graph.so.9"
"/usr/local/cuda/lib64/libcudnn_ops.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9"
"/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9"
"/usr/local/cuda/lib64/libcudnn_heuristic.so.9"
"/usr/local/cuda/lib64/libcudnn.so.9"
"/usr/local/cuda/lib64/libcublas.so.11"
"/usr/local/cuda/lib64/libcublasLt.so.11"
"/usr/local/cuda/lib64/libcudart.so.11.0"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.11.2" # this is not a mistake, it links to more specific cuda version
"/usr/local/cuda/lib64/libnvrtc-builtins.so.11.8"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
"libcudnn_cnn.so.9"
"libcudnn_graph.so.9"
"libcudnn_ops.so.9"
"libcudnn_engines_runtime_compiled.so.9"
"libcudnn_engines_precompiled.so.9"
"libcudnn_heuristic.so.9"
"libcudnn.so.9"
"libcublas.so.11"
"libcublasLt.so.11"
"libcudart.so.11.0"
"libnvToolsExt.so.1"
"libnvrtc.so.11.2"
"libnvrtc-builtins.so.11.8"
)
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
'$ORIGIN/../../nvidia/cublas/lib'
'$ORIGIN/../../nvidia/cuda_cupti/lib'
'$ORIGIN/../../nvidia/cuda_nvrtc/lib'
'$ORIGIN/../../nvidia/cuda_runtime/lib'
'$ORIGIN/../../nvidia/cudnn/lib'
'$ORIGIN/../../nvidia/cufft/lib'
'$ORIGIN/../../nvidia/curand/lib'
'$ORIGIN/../../nvidia/cusolver/lib'
'$ORIGIN/../../nvidia/cusparse/lib'
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
export USE_STATIC_NCCL=0
export USE_SYSTEM_NCCL=1
export ATEN_STATIC_CUDA=0
export USE_CUDA_STATIC_LINK=0
export USE_CUPTI_SO=1
export NCCL_INCLUDE_DIR="/usr/local/cuda/include/"
export NCCL_LIB_DIR="/usr/local/cuda/lib64/"
fi
else
echo "Unknown cuda version $CUDA_VERSION"
exit 1
fi
# run_tests.sh requires DESIRED_CUDA to know what tests to exclude
export DESIRED_CUDA="$cuda_version_nodot"
# Switch `/usr/local/cuda` to the desired CUDA version
rm -rf /usr/local/cuda || true
ln -s "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
# Switch `/usr/local/magma` to the desired CUDA version
rm -rf /usr/local/magma || true
ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma
export CUDA_VERSION=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev) # 10.0.130
export CUDA_VERSION_SHORT=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev | cut -f1,2 -d".") # 10.0
export CUDNN_VERSION=$(ls /usr/local/cuda/lib64/libcudnn.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev)
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}


@@ -0,0 +1,353 @@
#!/usr/bin/env bash
# meant to be called only from the neighboring build.sh and build_cpu.sh scripts
set -e -o pipefail
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
if [[ -n "$BUILD_PYTHONLESS" && -z "$LIBTORCH_VARIANT" ]]; then
echo "BUILD_PYTHONLESS is set, so need LIBTORCH_VARIANT to also be set"
echo "LIBTORCH_VARIANT should be one of shared-with-deps shared-without-deps static-with-deps static-without-deps"
exit 1
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# TODO move this into the Docker images
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
retry apt-get update
retry apt-get -y install zip openssl
fi
# Version: setup.py uses $PYTORCH_BUILD_VERSION.post$PYTORCH_BUILD_NUMBER if
# PYTORCH_BUILD_NUMBER > 1
build_version="$PYTORCH_BUILD_VERSION"
build_number="$PYTORCH_BUILD_NUMBER"
if [[ -n "$OVERRIDE_PACKAGE_VERSION" ]]; then
# This will be the *exact* version, since build_number<1
build_version="$OVERRIDE_PACKAGE_VERSION"
build_number=0
fi
if [[ -z "$build_version" ]]; then
build_version=1.0.0
fi
if [[ -z "$build_number" ]]; then
build_number=1
fi
export PYTORCH_BUILD_VERSION=$build_version
export PYTORCH_BUILD_NUMBER=$build_number
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib:$CMAKE_LIBRARY_PATH"
export CMAKE_INCLUDE_PATH="/opt/intel/include:$CMAKE_INCLUDE_PATH"
# set OPENSSL_ROOT_DIR=/opt/openssl if it exists
if [[ -e /opt/openssl ]]; then
export OPENSSL_ROOT_DIR=/opt/openssl
export CMAKE_INCLUDE_PATH="/opt/openssl/include":$CMAKE_INCLUDE_PATH
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
fi
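# e.g. (illustrative) DESIRED_PYTHON=3.10 -> cp310-cp310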
pydir="/opt/python/$DESIRED_PYTHON"
export PATH="$pydir/bin:$PATH"
export PATCHELF_BIN=/usr/local/bin/patchelf
patchelf_version=`$PATCHELF_BIN --version`
echo "patchelf version: " $patchelf_version
if [[ "$patchelf_version" == "patchelf 0.9" ]]; then
echo "Your patchelf version is too old. Please use version >= 0.10."
exit 1
fi
########################################################
# Compile wheels as well as libtorch
#######################################################
if [[ -z "$PYTORCH_ROOT" ]]; then
echo "Need to set PYTORCH_ROOT env variable"
exit 1
fi
pushd "$PYTORCH_ROOT"
python setup.py clean
retry pip install -qr requirements.txt
retry pip install -q numpy==2.0.1
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
export _GLIBCXX_USE_CXX11_ABI=1
else
export _GLIBCXX_USE_CXX11_ABI=0
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
echo "Calling build_amd.py at $(date)"
python tools/amd_build/build_amd.py
# TODO remove this work-around once pytorch sources are updated
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling setup.py install at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"
fi
(
set -x
mkdir -p build
# TODO: Remove the CFLAGS work-around once https://github.com/pytorch/pytorch/issues/55952 is closed
time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
CFLAGS='-Wno-deprecated-declarations' \
BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
python setup.py install
mkdir -p libtorch/{lib,bin,include,share}
# Make debug folder separate so it doesn't get zipped up with the rest of
# libtorch
mkdir debug
# Copy over all lib files
cp -rv build/lib/* libtorch/lib/
cp -rv build/lib*/torch/lib/* libtorch/lib/
# Copy over all include files
cp -rv build/include/* libtorch/include/
cp -rv build/lib*/torch/include/* libtorch/include/
# Copy over all of the cmake files
cp -rv build/lib*/torch/share/* libtorch/share/
# Split libtorch into debug / release version
cp libtorch/lib/libtorch_cpu.so libtorch/lib/libtorch_cpu.so.dbg
# Keep debug symbols on debug lib
strip --only-keep-debug libtorch/lib/libtorch_cpu.so.dbg
# Remove debug info from release lib
strip --strip-debug libtorch/lib/libtorch_cpu.so
# Add a debug link to the release lib to the debug lib (debuggers will then
# search for symbols in a file called libtorch_cpu.so.dbg in some
# predetermined locations) and embed a CRC32 of the debug library into the .so
cd libtorch/lib
objcopy libtorch_cpu.so --add-gnu-debuglink=libtorch_cpu.so.dbg
cd ../..
# Move the debug symbols to its own directory so it doesn't get processed /
# zipped with all the other libraries
mv libtorch/lib/libtorch_cpu.so.dbg debug/libtorch_cpu.so.dbg
echo "${PYTORCH_BUILD_VERSION}" > libtorch/build-version
echo "$(pushd $PYTORCH_ROOT && git rev-parse HEAD)" > libtorch/build-hash
)
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
LIBTORCH_ABI="cxx11-abi-"
else
LIBTORCH_ABI=
fi
(
set -x
mkdir -p /tmp/$LIBTORCH_HOUSE_DIR
# objcopy embeds a CRC32 checksum into libtorch_cpu.so above, so add that to the name here
CRC32=$(objcopy --dump-section .gnu_debuglink=>(tail -c4 | od -t x4 -An | xargs echo) libtorch/lib/libtorch_cpu.so)
# Zip debug symbols
zip /tmp/$LIBTORCH_HOUSE_DIR/debug-libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION-$CRC32.zip debug/libtorch_cpu.so.dbg
# Zip and copy libtorch
zip -rq /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip libtorch
cp /tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-$PYTORCH_BUILD_VERSION.zip \
/tmp/$LIBTORCH_HOUSE_DIR/libtorch-$LIBTORCH_ABI$LIBTORCH_VARIANT-latest.zip
)
popd
#######################################################################
# ADD DEPENDENCIES INTO THE WHEEL
#
# auditwheel repair doesn't work correctly and is buggy
# so manually do the work of copying dependency libs and patchelfing
# and fixing RECORDS entries correctly
######################################################################
fname_with_sha256() {
HASH=$(sha256sum $1 | cut -c1-8)
DIRNAME=$(dirname $1)
BASENAME=$(basename $1)
if [[ $BASENAME == "libnvrtc-builtins.so" || $BASENAME == "libcudnn"* ]]; then
echo $1
else
INITNAME=$(echo $BASENAME | cut -f1 -d".")
ENDNAME=$(echo $BASENAME | cut -f 2- -d".")
echo "$DIRNAME/$INITNAME-$HASH.$ENDNAME"
fi
}
fname_without_so_number() {
LINKNAME=$(echo $1 | sed -e 's/\.so.*/.so/g')
echo "$LINKNAME"
}
make_wheel_record() {
FPATH=$1
if echo $FPATH | grep RECORD >/dev/null 2>&1; then
# the RECORD file itself gets an entry with empty hash and size fields
echo "\"$FPATH\",,"
else
HASH=$(openssl dgst -sha256 -binary $FPATH | openssl base64 | sed -e 's/+/-/g' | sed -e 's/\//_/g' | sed -e 's/=//g')
FSIZE=$(ls -nl $FPATH | awk '{print $5}')
echo "\"$FPATH\",sha256=$HASH,$FSIZE"
fi
}
echo 'Built this package:'
(
set -x
mkdir -p /$LIBTORCH_HOUSE_DIR
mv /tmp/$LIBTORCH_HOUSE_DIR/*.zip /$LIBTORCH_HOUSE_DIR
rm -rf /tmp/$LIBTORCH_HOUSE_DIR
)
TMP_DIR=$(mktemp -d)
trap "rm -rf ${TMP_DIR}" EXIT
pushd "${TMP_DIR}"
for pkg in /$LIBTORCH_HOUSE_DIR/libtorch*.zip; do
# if the glob didn't match anything
if [[ ! -e $pkg ]]; then
continue
fi
rm -rf tmp
mkdir -p tmp
cd tmp
cp $pkg .
unzip -q $(basename $pkg)
rm -f $(basename $pkg)
PREFIX=libtorch
if [[ $pkg != *"without-deps"* ]]; then
# copy over needed dependent .so files over and tag them with their hash
patched=()
for filepath in "${DEPS_LIST[@]}"; do
filename=$(basename $filepath)
destpath=$PREFIX/lib/$filename
if [[ "$filepath" != "$destpath" ]]; then
cp $filepath $destpath
fi
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
else
patchedpath=$(fname_with_sha256 $destpath)
fi
patchedname=$(basename $patchedpath)
if [[ "$destpath" != "$patchedpath" ]]; then
mv $destpath $patchedpath
fi
patched+=("$patchedname")
echo "Copied $filepath to $patchedpath"
done
echo "patching to fix the so names to the hashed names"
for ((i=0;i<${#DEPS_LIST[@]};++i)); do
find $PREFIX -name '*.so*' | while read sofile; do
origname=${DEPS_SONAME[i]}
patchedname=${patched[i]}
if [[ "$origname" != "$patchedname" ]] || [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
set +e
origname=$($PATCHELF_BIN --print-needed $sofile | grep "$origname.*")
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "patching $sofile entry $origname to $patchedname"
$PATCHELF_BIN --replace-needed $origname $patchedname $sofile
fi
fi
done
done
# copy over needed auxiliary files
for ((i=0;i<${#DEPS_AUX_SRCLIST[@]};++i)); do
srcpath=${DEPS_AUX_SRCLIST[i]}
dstpath=$PREFIX/${DEPS_AUX_DSTLIST[i]}
mkdir -p $(dirname $dstpath)
cp $srcpath $dstpath
done
fi
# set RPATH of _C.so and similar to $ORIGIN, $ORIGIN/lib
find $PREFIX -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN:$ORIGIN/lib'
$PATCHELF_BIN --set-rpath '$ORIGIN:$ORIGIN/lib' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# set RPATH of lib/ files to $ORIGIN
find $PREFIX/lib -maxdepth 1 -type f -name "*.so*" | while read sofile; do
echo "Setting rpath of $sofile to " '$ORIGIN'
$PATCHELF_BIN --set-rpath '$ORIGIN' $sofile
$PATCHELF_BIN --print-rpath $sofile
done
# regenerate the RECORD file with new hashes
record_file=`echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/RECORD/g'`
if [[ -e $record_file ]]; then
echo "Generating new record file $record_file"
rm -f $record_file
# generate records for folders in wheel
find * -type f | while read fname; do
echo $(make_wheel_record $fname) >>$record_file
done
fi
# zip up the wheel back
zip -rq $(basename $pkg) $PREFIX*
# replace original wheel
rm -f $pkg
mv $(basename $pkg) $pkg
cd ..
rm -rf tmp
done
# Copy wheels to host machine for persistence before testing
if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
cp /$LIBTORCH_HOUSE_DIR/libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
cp /$LIBTORCH_HOUSE_DIR/debug-libtorch*.zip "$PYTORCH_FINAL_PACKAGE_DIR"
fi

.ci/manywheel/build_rocm.sh Executable file

@@ -0,0 +1,268 @@
#!/usr/bin/env bash
set -ex
export ROCM_HOME=/opt/rocm
export MAGMA_HOME=$ROCM_HOME/magma
# TODO: libtorch_cpu.so is broken when building with Debug info
export BUILD_DEBUG_INFO=0
# TODO Are these all used/needed?
export TH_BINARY_BUILD=1
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1
export ATEN_STATIC_CUDA=1
export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # dont install test binaries into site-packages
# Set RPATH instead of RUNPATH when using patchelf to avoid LD_LIBRARY_PATH override
export FORCE_RPATH="--force-rpath"
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Determine ROCm version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `ROCM_VERSION`
if [[ -n "$DESIRED_CUDA" ]]; then
if ! echo "${DESIRED_CUDA}"| grep "^rocm" >/dev/null 2>/dev/null; then
export DESIRED_CUDA="rocm${DESIRED_CUDA}"
fi
# rocm3.7, rocm3.5.1
ROCM_VERSION="$DESIRED_CUDA"
echo "Using $ROCM_VERSION as determined by DESIRED_CUDA"
else
echo "Must set DESIRED_CUDA"
exit 1
fi
# Package directories
WHEELHOUSE_DIR="wheelhouse$ROCM_VERSION"
LIBTORCH_HOUSE_DIR="libtorch_house$ROCM_VERSION"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhouse$ROCM_VERSION"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_house$ROCM_VERSION"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
# To make version comparison easier, create an integer representation.
ROCM_VERSION_CLEAN=$(echo ${ROCM_VERSION} | sed s/rocm//)
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION_CLEAN})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
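# e.g. (illustrative) rocm6.2.1 -> ROCM_INT=60201, rocm6.3 -> ROCM_INT=60300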
# Required ROCm libraries
ROCM_SO_FILES=(
"libMIOpen.so"
"libamdhip64.so"
"libhipblas.so"
"libhipfft.so"
"libhiprand.so"
"libhipsolver.so"
"libhipsparse.so"
"libhsa-runtime64.so"
"libamd_comgr.so"
"libmagma.so"
"librccl.so"
"librocblas.so"
"librocfft.so"
"librocm_smi64.so"
"librocrand.so"
"librocsolver.so"
"librocsparse.so"
"libroctracer64.so"
"libroctx64.so"
"libhipblaslt.so"
"libhiprtc.so"
)
if [[ $ROCM_INT -ge 60100 ]]; then
ROCM_SO_FILES+=("librocprofiler-register.so")
fi
if [[ $ROCM_INT -ge 60200 ]]; then
ROCM_SO_FILES+=("librocm-core.so")
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
LIBNUMA_PATH="/usr/lib64/libnuma.so.1"
LIBELF_PATH="/usr/lib64/libelf.so.1"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBTINFO_PATH="/usr/lib64/libtinfo.so.5"
else
LIBTINFO_PATH="/usr/lib64/libtinfo.so.6"
fi
LIBDRM_PATH="/opt/amdgpu/lib64/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/opt/amdgpu/lib64/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBSUITESPARSE_CONFIG_PATH="/lib64/libsuitesparseconfig.so.4"
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBCHOLMOD_PATH="/lib64/libcholmod.so.2"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.3"
else
LIBCHOLMOD_PATH="/lib64/libcholmod.so.3"
# Below libs are direct dependencies of libsatlas
LIBGFORTRAN_PATH="/lib64/libgfortran.so.5"
fi
# Below libs are direct dependencies of libcholmod
LIBAMD_PATH="/lib64/libamd.so.2"
LIBCAMD_PATH="/lib64/libcamd.so.2"
LIBCCOLAMD_PATH="/lib64/libccolamd.so.2"
LIBCOLAMD_PATH="/lib64/libcolamd.so.2"
LIBSATLAS_PATH="/lib64/atlas/libsatlas.so.3"
# Below libs are direct dependencies of libsatlas
LIBQUADMATH_PATH="/lib64/libquadmath.so.0"
fi
MAYBE_LIB64=lib64
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
LIBNUMA_PATH="/usr/lib/x86_64-linux-gnu/libnuma.so.1"
LIBELF_PATH="/usr/lib/x86_64-linux-gnu/libelf.so.1"
if [[ $ROCM_INT -ge 50300 ]]; then
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.6"
else
LIBTINFO_PATH="/lib/x86_64-linux-gnu/libtinfo.so.5"
fi
LIBDRM_PATH="/usr/lib/x86_64-linux-gnu/libdrm.so.2"
LIBDRM_AMDGPU_PATH="/usr/lib/x86_64-linux-gnu/libdrm_amdgpu.so.1"
if [[ $ROCM_INT -ge 60100 && $ROCM_INT -lt 60300 ]]; then
# Below libs are direct dependencies of libhipsolver
LIBCHOLMOD_PATH="/lib/x86_64-linux-gnu/libcholmod.so.3"
# Below libs are direct dependencies of libcholmod
LIBSUITESPARSE_CONFIG_PATH="/lib/x86_64-linux-gnu/libsuitesparseconfig.so.5"
LIBAMD_PATH="/lib/x86_64-linux-gnu/libamd.so.2"
LIBCAMD_PATH="/lib/x86_64-linux-gnu/libcamd.so.2"
LIBCCOLAMD_PATH="/lib/x86_64-linux-gnu/libccolamd.so.2"
LIBCOLAMD_PATH="/lib/x86_64-linux-gnu/libcolamd.so.2"
LIBMETIS_PATH="/lib/x86_64-linux-gnu/libmetis.so.5"
LIBLAPACK_PATH="/lib/x86_64-linux-gnu/liblapack.so.3"
LIBBLAS_PATH="/lib/x86_64-linux-gnu/libblas.so.3"
# Below libs are direct dependencies of libblas
LIBGFORTRAN_PATH="/lib/x86_64-linux-gnu/libgfortran.so.5"
LIBQUADMATH_PATH="/lib/x86_64-linux-gnu/libquadmath.so.0"
fi
MAYBE_LIB64=lib
fi
OS_SO_PATHS=($LIBGOMP_PATH $LIBNUMA_PATH\
$LIBELF_PATH $LIBTINFO_PATH\
$LIBDRM_PATH $LIBDRM_AMDGPU_PATH\
$LIBSUITESPARSE_CONFIG_PATH\
$LIBCHOLMOD_PATH $LIBAMD_PATH\
$LIBCAMD_PATH $LIBCCOLAMD_PATH\
$LIBCOLAMD_PATH $LIBSATLAS_PATH\
$LIBGFORTRAN_PATH $LIBQUADMATH_PATH\
$LIBMETIS_PATH $LIBLAPACK_PATH\
$LIBBLAS_PATH)
OS_SO_FILES=()
for lib in "${OS_SO_PATHS[@]}"
do
file_name="${lib##*/}" # Substring removal of path to get filename
OS_SO_FILES[${#OS_SO_FILES[@]}]=$file_name # Append lib to array
done
# rocBLAS library files
ROCBLAS_LIB_SRC=$ROCM_HOME/lib/rocblas/library
ROCBLAS_LIB_DST=lib/rocblas/library
ARCH=$(echo $PYTORCH_ROCM_ARCH | sed 's/;/|/g') # Replace the ;-separated arch list with | for grep alternation
ARCH_SPECIFIC_FILES=$(ls $ROCBLAS_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $ROCBLAS_LIB_SRC | grep -v gfx)
ROCBLAS_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
# hipblaslt library files
HIPBLASLT_LIB_SRC=$ROCM_HOME/lib/hipblaslt/library
HIPBLASLT_LIB_DST=lib/hipblaslt/library
ARCH_SPECIFIC_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -E $ARCH)
OTHER_FILES=$(ls $HIPBLASLT_LIB_SRC | grep -v gfx)
HIPBLASLT_LIB_FILES=($ARCH_SPECIFIC_FILES $OTHER_FILES)
# ROCm library files
ROCM_SO_PATHS=()
for lib in "${ROCM_SO_FILES[@]}"
do
file_path=($(find $ROCM_HOME/lib/ -name "$lib")) # First search in lib
if [[ -z $file_path ]]; then
if [ -d "$ROCM_HOME/lib64/" ]; then
file_path=($(find $ROCM_HOME/lib64/ -name "$lib")) # Then search in lib64
fi
fi
if [[ -z $file_path ]]; then
file_path=($(find $ROCM_HOME/ -name "$lib")) # Then search in ROCM_HOME
fi
if [[ -z $file_path ]]; then
echo "Error: Library file $lib is not found." >&2
exit 1
fi
ROCM_SO_PATHS[${#ROCM_SO_PATHS[@]}]="$file_path" # Append lib to array
done
DEPS_LIST=(
${ROCM_SO_PATHS[*]}
${OS_SO_PATHS[*]}
)
DEPS_SONAME=(
${ROCM_SO_FILES[*]}
${OS_SO_FILES[*]}
)
DEPS_AUX_SRCLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_SRC/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_SRC/}"
"/opt/amdgpu/share/libdrm/amdgpu.ids"
)
DEPS_AUX_DSTLIST=(
"${ROCBLAS_LIB_FILES[@]/#/$ROCBLAS_LIB_DST/}"
"${HIPBLASLT_LIB_FILES[@]/#/$HIPBLASLT_LIB_DST/}"
"share/libdrm/amdgpu.ids"
)
# MIOpen library files
MIOPEN_SHARE_SRC=$ROCM_HOME/share/miopen/db
MIOPEN_SHARE_DST=share/miopen/db
MIOPEN_SHARE_FILES=($(ls $MIOPEN_SHARE_SRC | grep -E $ARCH))
DEPS_AUX_SRCLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${MIOPEN_SHARE_FILES[@]/#/$MIOPEN_SHARE_DST/})
# RCCL library files
RCCL_SHARE_SRC=$ROCM_HOME/share/rccl/msccl-algorithms
RCCL_SHARE_DST=share/rccl/msccl-algorithms
RCCL_SHARE_FILES=($(ls $RCCL_SHARE_SRC))
DEPS_AUX_SRCLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_SRC/})
DEPS_AUX_DSTLIST+=(${RCCL_SHARE_FILES[@]/#/$RCCL_SHARE_DST/})
echo "PYTORCH_ROCM_ARCH: ${PYTORCH_ROCM_ARCH}"
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source $SCRIPTPATH/${BUILD_SCRIPT}

.ci/manywheel/build_xpu.sh Executable file

@@ -0,0 +1,108 @@
#!/usr/bin/env bash
set -ex
export TH_BINARY_BUILD=1
export USE_CUDA=0
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build()
CMAKE_ARGS=()
fi
if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
# These are passed to tools/build_pytorch_libs.sh::build_caffe2()
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
source /opt/intel/oneapi/umf/latest/env/vars.sh
export USE_STATIC_MKL=1
WHEELHOUSE_DIR="wheelhousexpu"
LIBTORCH_HOUSE_DIR="libtorch_housexpu"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
if [[ -z "$BUILD_PYTHONLESS" ]]; then
PYTORCH_FINAL_PACKAGE_DIR="/remote/wheelhousexpu"
else
PYTORCH_FINAL_PACKAGE_DIR="/remote/libtorch_housexpu"
fi
fi
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"CentOS Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
fi
DEPS_LIST=(
"$LIBGOMP_PATH"
"/opt/intel/oneapi/compiler/latest/lib/libOpenCL.so.1"
)
DEPS_SONAME=(
"libgomp.so.1"
"libOpenCL.so.1"
)
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with xpu support package libs."
DEPS_LIST+=(
"/opt/intel/oneapi/compiler/latest/lib/libsycl.so.8"
"/opt/intel/oneapi/compiler/latest/lib/libur_loader.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libur_adapter_level_zero.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libur_adapter_opencl.so.0"
"/opt/intel/oneapi/compiler/latest/lib/libsvml.so"
"/opt/intel/oneapi/compiler/latest/lib/libirng.so"
"/opt/intel/oneapi/compiler/latest/lib/libimf.so"
"/opt/intel/oneapi/compiler/latest/lib/libintlc.so.5"
"/opt/intel/oneapi/pti/latest/lib/libpti_view.so.0.10"
"/opt/intel/oneapi/umf/latest/lib/libumf.so.0"
"/opt/intel/oneapi/tcm/latest/lib/libhwloc.so.15"
)
DEPS_SONAME+=(
"libsycl.so.8"
"libur_loader.so.0"
"libur_adapter_level_zero.so.0"
"libur_adapter_opencl.so.0"
"libsvml.so"
"libirng.so"
"libimf.so"
"libintlc.so.5"
"libpti_view.so.0.10"
"libumf.so.0"
"libhwloc.so.15"
)
else
echo "Using xpu runtime libs from pypi."
XPU_RPATHS=(
'$ORIGIN/../../../..'
)
XPU_RPATHS=$(IFS=: ; echo "${XPU_RPATHS[*]}")
export C_SO_RPATH=$XPU_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$XPU_RPATHS':$ORIGIN'
export FORCE_RPATH="--force-rpath"
fi
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"
if [[ -z "$BUILD_PYTHONLESS" ]]; then
BUILD_SCRIPT=build_common.sh
else
BUILD_SCRIPT=build_libtorch.sh
fi
source ${SOURCE_DIR}/${BUILD_SCRIPT}


@@ -0,0 +1,30 @@
#!/usr/bin/env bash
# Require only one python installation
if [[ -z "$DESIRED_PYTHON" ]]; then
echo "Need to set DESIRED_PYTHON env variable"
exit 1
fi
# If given a python version like 3.6m or 2.7mu, convert this to the format we
# expect. The binary CI jobs pass in python versions like this; they also only
# ever pass one python version, so we assume that DESIRED_PYTHON is not a list
# in this case
if [[ -n "$DESIRED_PYTHON" && $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
python_digits="$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
py_majmin="${DESIRED_PYTHON}"
DESIRED_PYTHON="cp${python_digits}-cp${python_digits}t"
elif [[ -n "$DESIRED_PYTHON" && "$DESIRED_PYTHON" != cp* ]]; then
python_nodot="$(echo $DESIRED_PYTHON | tr -d m.u)"
DESIRED_PYTHON="cp${python_nodot}-cp${python_nodot}"
if [[ ${python_nodot} -ge 310 ]]; then
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:2}"
else
py_majmin="${DESIRED_PYTHON:2:1}.${DESIRED_PYTHON:3:1}"
fi
fi
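# e.g. (illustrative) DESIRED_PYTHON=3.13t -> cp313-cp313t (py_majmin=3.13t),
#      DESIRED_PYTHON=3.11  -> cp311-cp311  (py_majmin=3.11)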
pydir="/opt/python/$DESIRED_PYTHON"
export DESIRED_PYTHON_BIN_DIR="${pydir}/bin"
export PATH="$DESIRED_PYTHON_BIN_DIR:$PATH"
echo "Will build for Python version: ${DESIRED_PYTHON}"

.ci/manywheel/test_wheel.sh Executable file

@@ -0,0 +1,26 @@
#!/usr/bin/env bash
set -e
yum install -y wget git
rm -rf /usr/local/cuda*
# Install Anaconda
if ! ls /py
then
echo "Miniconda needs to be installed"
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p /py
else
echo "Miniconda is already installed"
fi
export PATH="/py/bin:$PATH"
# Anaconda token
if ls /remote/token
then
source /remote/token
fi
conda install -y conda-build anaconda-client


@@ -1,6 +1,6 @@
#!/bin/bash
set -ex
set -ex -o pipefail
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
@@ -87,7 +87,7 @@ else
# Workaround required for MKL library linkage
# https://github.com/pytorch/pytorch/issues/119557
if [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
if [[ "$ANACONDA_PYTHON_VERSION" = "3.12" || "$ANACONDA_PYTHON_VERSION" = "3.13" ]]; then
export CMAKE_LIBRARY_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/"
export CMAKE_INCLUDE_PATH="/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/include/"
fi
@ -173,12 +173,13 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then
export MAX_JOBS=$(($(nproc) - 1))
fi
fi
@ -191,7 +192,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ "$TORCH_CUDA_ARCH_LIST" == *"8.6"* || "$TORCH_CUDA_ARCH_LIST" == *"8.0"* ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
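For reference, the arithmetic behind the new guard and the reduced job count, shown as an illustration only (the variable name "arch" stands in for TORCH_CUDA_ARCH_LIST on a hypothetical 16-core runner):
# Illustration of the bc comparison and the job-count math used above (not part of the script)
arch=8.6
echo "${arch} >= 8.0" | bc            # prints 1, so the FlashAttention branch is taken
echo $(( $(nproc --ignore=2) / 3 ))   # with 16 cores this gives 14/3 = 4 for MAX_JOBS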
@ -203,10 +204,12 @@ if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export LDSHARED="clang --shared"
export USE_CUDA=0
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export USE_CUDA=1
fi
export USE_ASAN=1
export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"
export REL_WITH_DEB_INFO=1
export UBSAN_FLAGS="-fno-sanitize-recover=all"
unset USE_LLVM
fi
@ -218,10 +221,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
@ -230,9 +229,9 @@ if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
# Do not change workspace permissions for ROCm CI jobs
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -249,10 +248,9 @@ if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
fi
if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
set -e -o pipefail
get_bazel
install_sccache_nvcc_for_bazel
# Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
# the runner
@ -278,18 +276,16 @@ else
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *s390x* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
python -mpip install numpy==2.0.2
fi
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake
python3 tools/packaging/split_wheel.py bdist_wheel
else
WERROR=1 python setup.py bdist_wheel
fi
@ -382,8 +378,10 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.
@ -400,9 +398,7 @@ if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]];
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
fi
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
# don't do this for bazel or s390x as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
print_sccache_stats
fi

.ci/pytorch/check_binary.sh Executable file
View File

@ -0,0 +1,394 @@
#!/bin/bash
# shellcheck disable=SC2086,SC2006,SC2207,SC2076,SC2155,SC2046,SC1091,SC2143
# TODO: Re-enable shellchecks above
set -eux -o pipefail
# This script checks the following things on binaries
# 1. The gcc abi matches DESIRED_DEVTOOLSET
# 2. MacOS binaries do not link against OpenBLAS
# 3. There are no protobuf symbols of any sort anywhere (turned off, because
# this is currently not true)
# 4. Standard Python imports work
# 5. MKL is available everywhere except for MacOS wheels
# 6. XNNPACK is available everywhere except for MacOS wheels
# 7. CUDA is setup correctly and does not hang
# 8. Magma is available for CUDA builds
# 9. CuDNN is available for CUDA builds
#
# This script needs the env variables DESIRED_PYTHON, DESIRED_CUDA,
# DESIRED_DEVTOOLSET and PACKAGE_TYPE
#
# This script expects PyTorch to be installed into the active Python (the
# Python returned by `which python`). Or, if this is testing a libtorch
# Pythonless binary, then it expects to be in the root folder of the unzipped
# libtorch package.
if [[ -z ${DESIRED_PYTHON:-} ]]; then
export DESIRED_PYTHON=${MATRIX_PYTHON_VERSION:-}
fi
if [[ -z ${DESIRED_CUDA:-} ]]; then
export DESIRED_CUDA=${MATRIX_DESIRED_CUDA:-}
fi
if [[ -z ${DESIRED_DEVTOOLSET:-} ]]; then
export DESIRED_DEVTOOLSET=${MATRIX_DESIRED_DEVTOOLSET:-}
fi
if [[ -z ${PACKAGE_TYPE:-} ]]; then
export PACKAGE_TYPE=${MATRIX_PACKAGE_TYPE:-}
fi
# The install root depends on both the package type and the os
# All MacOS packages use conda, even for the wheel packages.
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
# NOTE: Only $PWD works on both CentOS and Ubuntu
export install_root="$PWD"
else
if [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+)t ]]; then
# For python versions of the form maj.min followed by "t" (e.g. 3.13t), keep the original version
py_dot="$DESIRED_PYTHON"
elif [[ $DESIRED_PYTHON =~ ([0-9].[0-9]+) ]]; then
# Strip everything but major.minor from DESIRED_PYTHON version
py_dot="${BASH_REMATCH[0]}"
else
echo "Unexpected ${DESIRED_PYTHON} format"
exit 1
fi
export install_root="$(dirname $(which python))/../lib/python${py_dot}/site-packages/torch/"
fi
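# Illustration (not part of the script): for a wheel package with DESIRED_PYTHON=3.11 and
# `which python` resolving to /opt/python/cp311-cp311/bin/python, install_root works out to
#   /opt/python/cp311-cp311/bin/../lib/python3.11/site-packages/torch/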
###############################################################################
# Setup XPU ENV
###############################################################################
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
set +u
# Refer https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
source /opt/intel/oneapi/compiler/latest/env/vars.sh
source /opt/intel/oneapi/pti/latest/env/vars.sh
fi
###############################################################################
# Check GCC ABI
###############################################################################
# NOTE [ Building libtorch with old vs. new gcc ABI ]
#
# Packages built with one version of ABI could not be linked against by client
# C++ libraries that were compiled using the other version of ABI. Since both
# gcc ABIs are still common in the wild, we need to support both ABIs. Currently:
#
# - All the nightlies built on CentOS 7 + devtoolset7 use the old gcc ABI.
# - All the nightlies built on Ubuntu 16.04 + gcc 5.4 use the new gcc ABI.
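# (Illustration, not part of the script: the ABI a given torch build was compiled with can
# also be queried from Python, e.g. `python -c "import torch; print(torch.compiled_with_cxx11_abi())"`.)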
echo "Checking that the gcc ABI is what we expect"
if [[ "$(uname)" != 'Darwin' ]]; then
function is_expected() {
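# Echoes 1 when the observed CXX11-ABI flag ("$1") matches the expectation: cxx11-abi and
# ROCm builds expect the flag to be on, everything else expects it to be off or unset.
# Empty output signals a mismatch.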
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* || "$DESIRED_CUDA" == *"rocm"* ]]; then
if [[ "$1" -gt 0 || "$1" == "ON " ]]; then
echo 1
fi
else
if [[ -z "$1" || "$1" == 0 || "$1" == "OFF" ]]; then
echo 1
fi
fi
}
# First we check that the env var in TorchConfig.cmake is correct
# We search for D_GLIBCXX_USE_CXX11_ABI=1 in torch/TorchConfig.cmake
torch_config="${install_root}/share/cmake/Torch/TorchConfig.cmake"
if [[ ! -f "$torch_config" ]]; then
echo "No TorchConfig.cmake found!"
ls -lah "$install_root/share/cmake/Torch"
exit 1
fi
echo "Checking the TorchConfig.cmake"
cat "$torch_config"
# The sed call below is
# don't print lines by default (only print the line we want)
# -n
# execute the following expression
# e
# replace lines that match with the first capture group and print
# s/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p
# any characters, D_GLIBCXX_USE_CXX11_ABI=, exactly one any character, a
# quote, any characters
# Note that exactly one character is expected after the '='. In the case that the
# variable is not set the '=' will be followed by a '"' immediately and the
# line will fail the match and nothing will be printed; this is what we
# want. Otherwise it will capture the 0 or 1 after the '='.
# /.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/
# replace the matched line with the capture group and print
# /\1/p
actual_gcc_abi="$(sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p' < "$torch_config")"
if [[ "$(is_expected "$actual_gcc_abi")" != 1 ]]; then
echo "gcc ABI $actual_gcc_abi not as expected."
exit 1
fi
# We also check that there are [not] cxx11 symbols in libtorch
#
echo "Checking that symbols in libtorch.so have the right gcc abi"
python3 "$(dirname ${BASH_SOURCE[0]})/smoke_test/check_binary_symbols.py"
echo "cxx11 symbols seem to be in order"
fi # if on Darwin
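As a quick illustration of the sed extraction described above (sample input lines only; the real file is generated at build time):
# prints 1: the flag is set, so the single character after '=' is captured
echo 'set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=1")' \
  | sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p'
# prints nothing: the '=' is immediately followed by '"', so the line fails to match (unset case)
echo 'set(TORCH_CXX_FLAGS "-D_GLIBCXX_USE_CXX11_ABI=")' \
  | sed -ne 's/.*D_GLIBCXX_USE_CXX11_ABI=\(.\)".*/\1/p'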
###############################################################################
# Check for no OpenBLAS
# TODO Check for no Protobuf symbols (not finished)
# Print *all* runtime dependencies
###############################################################################
# We have to loop through all shared libraries for this
if [[ "$(uname)" == 'Darwin' ]]; then
all_dylibs=($(find "$install_root" -name '*.dylib'))
for dylib in "${all_dylibs[@]}"; do
echo "All dependencies of $dylib are $(otool -L $dylib) with rpath $(otool -l $dylib | grep LC_RPATH -A2)"
# Check that OpenBlas is not linked to on Macs
echo "Checking the OpenBLAS is not linked to"
if [[ -n "$(otool -L $dylib | grep -i openblas)" ]]; then
echo "ERROR: Found openblas as a dependency of $dylib"
echo "Full dependencies is: $(otool -L $dylib)"
exit 1
fi
# Check for protobuf symbols
#proto_symbols="$(nm $dylib | grep protobuf)" || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $dylib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
else
all_libs=($(find "$install_root" -name '*.so'))
for lib in "${all_libs[@]}"; do
echo "All dependencies of $lib are $(ldd $lib) with runpath $(objdump -p $lib | grep RUNPATH)"
# Check for protobuf symbols
#proto_symbols=$(nm $lib | grep protobuf) || true
#if [[ -n "$proto_symbols" ]]; then
# echo "ERROR: Detected protobuf symbols in $lib"
# echo "Symbols are $proto_symbols"
# exit 1
#fi
done
fi
setup_link_flags () {
REF_LIB="-Wl,-R${install_root}/lib"
if [[ "$(uname)" == 'Darwin' ]]; then
REF_LIB="-Wl,-rpath ${install_root}/lib"
fi
ADDITIONAL_LINKER_FLAGS=""
if [[ "$(uname)" == 'Linux' ]]; then
ADDITIONAL_LINKER_FLAGS="-Wl,--no-as-needed"
fi
C10_LINK_FLAGS=""
if [ -f "${install_root}/lib/libc10.so" ] || [ -f "${install_root}/lib/libc10.dylib" ]; then
C10_LINK_FLAGS="-lc10"
fi
TORCH_CPU_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cpu.so" ] || [ -f "${install_root}/lib/libtorch_cpu.dylib" ]; then
TORCH_CPU_LINK_FLAGS="-ltorch_cpu"
fi
TORCH_CUDA_LINK_FLAGS=""
if [ -f "${install_root}/lib/libtorch_cuda.so" ] || [ -f "${install_root}/lib/libtorch_cuda.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda"
elif [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] && [ -f "${install_root}/lib/libtorch_cuda_cpp.so" ] || \
[ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ] && [ -f "${install_root}/lib/libtorch_cuda_cu.dylib" ]; then
TORCH_CUDA_LINK_FLAGS="-ltorch_cuda_cpp -ltorch_cuda_cu"
fi
}
TEST_CODE_DIR="$(dirname $(realpath ${BASH_SOURCE[0]}))/test_example_code"
build_and_run_example_cpp () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=1
else
GLIBCXX_USE_CXX11_ABI=0
fi
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
./$1
}
build_example_cpp_with_incorrect_abi () {
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
GLIBCXX_USE_CXX11_ABI=0
else
GLIBCXX_USE_CXX11_ABI=1
fi
set +e
setup_link_flags
g++ ${TEST_CODE_DIR}/$1.cpp -I${install_root}/include -I${install_root}/include/torch/csrc/api/include -D_GLIBCXX_USE_CXX11_ABI=$GLIBCXX_USE_CXX11_ABI -std=gnu++17 -L${install_root}/lib ${REF_LIB} ${ADDITIONAL_LINKER_FLAGS} -ltorch $TORCH_CPU_LINK_FLAGS $TORCH_CUDA_LINK_FLAGS $C10_LINK_FLAGS -o $1
ERRCODE=$?
set -e
if [ "$ERRCODE" -eq "0" ]; then
echo "Building example with incorrect ABI didn't throw error. Aborting."
exit 1
else
echo "Building example with incorrect ABI throws expected error. Proceeding."
fi
}
###############################################################################
# Check simple Python/C++ calls
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
# NS: Set LD_LIBRARY_PATH for CUDA builds, but perhaps it should be removed
if [[ "$DESIRED_CUDA" == "cu"* ]]; then
export LD_LIBRARY_PATH=/usr/local/cuda/lib64
fi
build_and_run_example_cpp simple-torch-test
# `_GLIBCXX_USE_CXX11_ABI` is always ignored by gcc in devtoolset7, so we test
# the expected failure case for Ubuntu 16.04 + gcc 5.4 only.
if [[ "$DESIRED_DEVTOOLSET" == *"cxx11-abi"* ]]; then
build_example_cpp_with_incorrect_abi simple-torch-test
fi
else
pushd /tmp
python -c 'import torch'
popd
fi
###############################################################################
# Check torch.git_version
###############################################################################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd /tmp
python -c 'import torch; assert torch.version.git_version != "Unknown"'
python -c 'import torch; assert torch.version.git_version != None'
popd
fi
###############################################################################
# Check for MKL
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that MKL is available"
build_and_run_example_cpp check-torch-mkl
elif [[ "$(uname -m)" != "arm64" && "$(uname -m)" != "s390x" ]]; then
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]]; then
if [[ "$(uname -m)" == "aarch64" ]]; then
echo "Checking that MKLDNN is available on aarch64"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
popd
else
echo "Checking that MKL is available"
pushd /tmp
python -c 'import torch; exit(0 if torch.backends.mkl.is_available() else 1)'
popd
fi
fi
fi
###############################################################################
# Check for XNNPACK
###############################################################################
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
echo "Checking that XNNPACK is available"
build_and_run_example_cpp check-torch-xnnpack
else
if [[ "$(uname)" != 'Darwin' || "$PACKAGE_TYPE" != *wheel ]] && [[ "$(uname -m)" != "s390x" ]]; then
echo "Checking that XNNPACK is available"
pushd /tmp
python -c 'import torch.backends.xnnpack; exit(0 if torch.backends.xnnpack.enabled else 1)'
popd
fi
fi
###############################################################################
# Check CUDA configured correctly
###############################################################################
# Skip these for Windows machines without GPUs
if [[ "$OSTYPE" == "msys" ]]; then
GPUS=$(wmic path win32_VideoController get name)
if [[ ! "$GPUS" == *NVIDIA* ]]; then
echo "Skip CUDA tests for machines without a Nvidia GPU card"
exit 0
fi
fi
# Test that CUDA builds are setup correctly
if [[ "$DESIRED_CUDA" != 'cpu' && "$DESIRED_CUDA" != 'xpu' && "$DESIRED_CUDA" != 'cpu-cxx11-abi' && "$DESIRED_CUDA" != *"rocm"* && "$(uname -m)" != "s390x" ]]; then
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
build_and_run_example_cpp check-torch-cuda
else
pushd /tmp
echo "Checking that CUDA archs are setup correctly"
timeout 20 python -c 'import torch; torch.randn([3,5]).cuda()'
# These have to run after CUDA is initialized
echo "Checking that magma is available"
python -c 'import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)'
echo "Checking that CuDNN is available"
python -c 'import torch; exit(0 if torch.backends.cudnn.is_available() else 1)'
# Validates that the build is free of the linker regressions reported in https://github.com/pytorch/pytorch/issues/57744
echo "Checking that exception handling works"
python -c "import torch; from unittest import TestCase;TestCase().assertRaises(RuntimeError, lambda:torch.eye(7, 7, device='cuda:7'))"
echo "Checking that basic RNN works"
python ${TEST_CODE_DIR}/rnn_smoke.py
echo "Checking that basic CNN works"
python "${TEST_CODE_DIR}/cnn_smoke.py"
echo "Test that linalg works"
python -c "import torch;x=torch.rand(3,3,device='cuda');print(torch.linalg.svd(torch.mm(x.t(), x)))"
popd
fi # if libtorch
fi # if cuda
##########################
# Run parts of smoke tests
##########################
if [[ "$PACKAGE_TYPE" != 'libtorch' ]]; then
pushd "$(dirname ${BASH_SOURCE[0]})/smoke_test"
python -c "from smoke_test import test_linalg; test_linalg()"
if [[ "$DESIRED_CUDA" == *cuda* ]]; then
python -c "from smoke_test import test_linalg; test_linalg('cuda')"
fi
popd
fi
###############################################################################
# Check PyTorch supports TCP_TLS gloo transport
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" != 'libtorch' ]]; then
GLOO_CHECK="import torch.distributed as dist
try:
dist.init_process_group('gloo', rank=0, world_size=1)
except RuntimeError as e:
print(e)
"
RESULT=`GLOO_DEVICE_TRANSPORT=TCP_TLS MASTER_ADDR=localhost MASTER_PORT=63945 python -c "$GLOO_CHECK"`
GLOO_TRANSPORT_IS_NOT_SUPPORTED='gloo transport is not supported'
if [[ "$RESULT" =~ "$GLOO_TRANSPORT_IS_NOT_SUPPORTED" ]]; then
echo "PyTorch doesn't support TLS_TCP transport, please build with USE_GLOO_WITH_OPENSSL=1"
exit 1
fi
fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
###############################################################################
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
popd
fi
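Spelled out, the one-liner above amounts to the following (illustrative only; it relies on the same private _PYBIND11_BUILD_ABI attribute):
python - <<'EOF'
import sys
import torch
# cxx11-ABI builds pass as-is; pre-cxx11 builds must carry the pybind11 ABI tag
# expected for gcc7-compatible binaries
if torch.compiled_with_cxx11_abi():
    sys.exit(0)
sys.exit(0 if torch._C._PYBIND11_BUILD_ABI == "_cxxabi1011" else 1)
EOF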

View File

@ -6,6 +6,12 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
# Save the absolute path in case later we chdir (as occurs in the gpu perf test)
script_dir="$( cd "$(dirname "${BASH_SOURCE[0]}")" || exit ; pwd -P )"
if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
# This is really weird, but newer sccache somehow produces broken binary
# see https://github.com/pytorch/pytorch/issues/139188
sudo mv /opt/cache/bin/sccache-0.2.14a /opt/cache/bin/sccache
fi
if which sccache > /dev/null; then
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true

View File

@ -3,7 +3,7 @@
# Common setup for all Jenkins scripts
# shellcheck source=./common_utils.sh
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
set -ex
set -ex -o pipefail
# Required environment variables:
# $BUILD_ENVIRONMENT (should be set by your Docker image)

View File

@ -81,14 +81,15 @@ function pip_install_whl() {
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
pip install --progress-bar off "$@" || pip install --progress-bar off "$@" || pip install --progress-bar off "$@" ||\
pip install "$@" || pip install "$@" || pip install "$@"
pip_install_pkg="python3 -m pip install --progress-bar off"
${pip_install_pkg} "$@" || \
${pip_install_pkg} "$@" || \
${pip_install_pkg} "$@"
}
function pip_uninstall() {
# uninstall 2 times
pip uninstall -y "$@" || pip uninstall -y "$@"
pip3 uninstall -y "$@" || pip3 uninstall -y "$@"
}
function get_exit_code() {
@ -104,32 +105,12 @@ function get_bazel() {
# version of Bazelisk to fetch the platform specific version of
# Bazel to use from .bazelversion.
retry curl --location --output tools/bazel \
https://raw.githubusercontent.com/bazelbuild/bazelisk/v1.16.0/bazelisk.py
https://raw.githubusercontent.com/bazelbuild/bazelisk/v1.23.0/bazelisk.py
shasum --algorithm=1 --check \
<(echo 'd4369c3d293814d3188019c9f7527a948972d9f8 tools/bazel')
<(echo '01df9cf7f08dd80d83979ed0d0666a99349ae93c tools/bazel')
chmod u+x tools/bazel
}
# This function is bazel specific because of the bug
# in the bazel that requires some special paths massaging
# as a workaround. See
# https://github.com/bazelbuild/bazel/issues/10167
function install_sccache_nvcc_for_bazel() {
sudo mv /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc-real
# Write the `/usr/local/cuda/bin/nvcc`
cat << EOF | sudo tee /usr/local/cuda/bin/nvcc
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache /usr/local/cuda/bin/nvcc "\$@"
else
exec external/local_cuda/cuda/bin/nvcc-real "\$@"
fi
EOF
sudo chmod +x /usr/local/cuda/bin/nvcc
}
function install_monkeytype {
# Install MonkeyType
pip_install MonkeyType
@ -179,7 +160,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.25"
pip_install --user "tlparse==0.3.30"
PATH="$(python -m site --user-base)/bin:$PATH"
}
@ -188,30 +169,40 @@ function install_torchrec_and_fbgemm() {
torchrec_commit=$(get_pinned_commit torchrec)
local fbgemm_commit
fbgemm_commit=$(get_pinned_commit fbgemm)
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
fbgemm_commit=$(get_pinned_commit fbgemm_rocm)
fi
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_uninstall fbgemm-gpu-nightly
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
fi
}
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.7 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"
@ -235,11 +226,22 @@ function checkout_install_torchbench() {
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
function install_torchao() {
local commit
commit=$(get_pinned_commit torchao)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/ao.git@${commit}"
}
function print_sccache_stats() {
echo 'PyTorch Build Statistics'
sccache --show-stats

View File

@ -40,7 +40,7 @@ echo "Building PyTorch C++ API docs..."
rm -rf cppdocs
git clone https://github.com/pytorch/cppdocs
set -ex
set -ex -o pipefail
# Generate ATen files
pushd "${pt_checkout}"

View File

@ -45,8 +45,7 @@ def create_cert(path, C, ST, L, O, key):
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc)
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
@ -91,8 +90,7 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.now(timezone.utc)
+ timedelta(days=10)
datetime.now(timezone.utc) + timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())

View File

@ -5,7 +5,7 @@ pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
set -ex -o pipefail
version=${DOCS_VERSION:-nightly}
echo "version: $version"

View File

@ -6,7 +6,7 @@
# return the same thing, e.g. checks for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
set -ex
set -ex -o pipefail
install_binary() {
echo "Downloading sccache binary from S3 repo"

View File

@ -1,4 +1,5 @@
#!/bin/bash
set -x
# shellcheck disable=SC2034
# shellcheck source=./macos-common.sh
@ -17,6 +18,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -148,9 +152,146 @@ test_jit_hooks() {
assert_git_not_dirty
}
torchbench_setup_macos() {
git clone --recursive https://github.com/pytorch/vision torchvision
git clone --recursive https://github.com/pytorch/audio torchaudio
pushd torchvision
git fetch
git checkout "$(cat ../.github/ci_commit_pins/vision.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
popd
pushd torchaudio
git fetch
git checkout "$(cat ../.github/ci_commit_pins/audio.txt)"
git submodule update --init --recursive
python setup.py clean
python setup.py develop
popd
# Shellcheck doesn't like it when you pass no arguments to a function that can take args. See https://www.shellcheck.net/wiki/SC2120
# shellcheck disable=SC2119,SC2120
checkout_install_torchbench
}
conda_benchmark_deps() {
conda install -y astunparse numpy scipy ninja pyyaml setuptools cmake typing-extensions requests protobuf numba cython scikit-learn
conda install -y -c conda-forge librosa
}
test_torchbench_perf() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
echo "Setup complete, launching torchbench training performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
echo "Launching torchbench inference performance run"
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Pytorch benchmark on mps device completed"
}
test_torchbench_smoketest() {
print_cmake_info
echo "Launching torchbench setup"
conda_benchmark_deps
# shellcheck disable=SC2119,SC2120
torchbench_setup_macos
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local backend=eager
local dtype=notset
local device=mps
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
touch "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
echo "Setup complete, launching torchbench training performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --training --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_training_${device}_performance.csv"
done
echo "Launching torchbench inference performance run"
for model in hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152; do
PYTHONPATH="$(pwd)"/torchbench python benchmarks/dynamo/torchbench.py \
--performance --only "$model" --backend "$backend" --inference --devices "$device" \
--output "$TEST_REPORTS_DIR/inductor_${backend}_torchbench_${dtype}_inference_${device}_performance.csv"
done
echo "Pytorch benchmark on mps device completed"
}
test_hf_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
torchbench_setup_macos
echo "Launching HuggingFace training perf run"
python "$(pwd)"/benchmarks/dynamo/huggingface.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/hf_training.csv
echo "Launching HuggingFace inference perf run"
python "$(pwd)"/benchmarks/dynamo/huggingface.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/hf_inference.csv
echo "HuggingFace benchmark on mps device completed"
}
test_timm_perf() {
print_cmake_info
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
conda_benchmark_deps
torchbench_setup_macos
echo "Launching timm training perf run"
python "$(pwd)"/benchmarks/dynamo/timm_models.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/timm_training.csv
echo "Launching timm inference perf run"
python "$(pwd)"/benchmarks/dynamo/timm_models.py --backend eager --device mps --performance --training --output="${TEST_REPORTS_DIR}"/timm_inference.csv
echo "timm benchmark on mps device completed"
}
install_tlparse
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
if [[ $TEST_CONFIG == *"perf_all"* ]]; then
test_torchbench_perf
test_hf_perf
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_torchbench"* ]]; then
test_torchbench_perf
elif [[ $TEST_CONFIG == *"perf_hf"* ]]; then
test_hf_perf
elif [[ $TEST_CONFIG == *"perf_timm"* ]]; then
test_timm_perf
elif [[ $TEST_CONFIG == *"perf_smoketest"* ]]; then
test_torchbench_smoketest
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_libtorch

Some files were not shown because too many files have changed in this diff.