Compare commits

...

2365 Commits

Author SHA1 Message Date
27e9ca5d36 [MKLDNN] Check that strides are positive (#153092)
[MKLDNN] Check that strides are positive (#151848)

For pooling ops. Prevents a division-by-zero when the stride argument is invalid

Fixes https://github.com/pytorch/pytorch/issues/149274
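
A hedged repro sketch of the failure mode described above; the exact op and arguments from the linked issue may differ. A zero stride reaching the MKLDNN pooling kernels used to crash with a division-by-zero and should now surface as a regular error:

```python
import torch
import torch.nn.functional as F

if torch.backends.mkldnn.is_available():
    x = torch.randn(1, 3, 8, 8)
    try:
        # to_mkldnn() routes pooling through the MKLDNN kernels on CPU builds.
        F.max_pool2d(x.to_mkldnn(), kernel_size=2, stride=0)
    except (RuntimeError, ValueError) as e:
        print("rejected as expected:", e)
```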

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151848
Approved by: https://github.com/atalman

(cherry picked from commit 6f327128a99debfb2312ee256523ad6b62f763d6)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-05-07 16:45:25 -07:00
dab8130f4f [vec128] Fix fmsub NEON definition (#153093)
[vec128] Fix fmsub NEON definition (#152075)

As reported in https://github.com/pytorch/pytorch/issues/149292: according to the manual, `vfmsq_f32` implements `c - a * b` rather than `a * b - c`, so its call must be prefixed with `vnegq_f32`.

Also, adjust the tests to use OpMath for FMA computation to avoid accuracy error accumulation due to non-fused multiply-and-add over lower precision dtypes

Note that `Vectorized::fmsub` is not currently instantiated anywhere, so it could safely remain broken
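
For context, a plain-tensor illustration of the semantics in question (not the NEON code itself): `fmsub(a, b, c)` is expected to compute `a * b - c`, while the intrinsic's `c - a * b` is exactly its negation, which is why negating the `vfmsq_f32` result restores the intended value.

```python
import torch

a, b, c = (torch.randn(4) for _ in range(3))
expected = a * b - c          # what fmsub should return
vfmsq    = c - a * b          # what the intrinsic actually computes
print(torch.allclose(expected, -vfmsq))  # True: negating the intrinsic's result fixes it
```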

TODO:
 - Enable C++ testing on MacOS and/or aarch64 platforms (right now Mac tests are built without C++ tests)

Fixes https://github.com/pytorch/pytorch/issues/149292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152075
Approved by: https://github.com/swolchok
ghstack dependencies: #151955

(cherry picked from commit 2ea865339185b4b59324b4a494d088d0a5aeab88)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-05-07 18:05:06 -04:00
20d62a8d25 Fix tensorpipe compilation with clang-17 (#153091)
Fix tensorpipe compilation with clang-17 (#151344)

By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  241 |     index_ = value_.template Construct(std::forward<Args>(args)...);
      |                              ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  258 |     if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
      |                          ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
  265 |     if (!value_.template Assign(index_, std::forward<T>(value))) {
      |                          ^
3 errors generated.
```

Fixes https://github.com/pytorch/pytorch/issues/151316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere

(cherry picked from commit 331423e5c24170b218e743b3392acbad4480340d)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-05-07 17:07:42 -04:00
cd885e7c9a [Cherry Pick] Remove cuda dependencies from non cuda builds #152333 (#153089)
* Remove cuda dependencies from non cuda builds

* regenerate

* regenerate
2025-05-07 17:06:04 -04:00
99847860ea [BE] Move all lint runners to 24.04 (#153080)
* [BE] Move all lint runners to 24.04 (#150427)

As Ubuntu-20 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101
This forces older python version to be 3.8
Delete all linux-20.04 runners from the lintrunner.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427
Approved by: https://github.com/seemethere

(cherry picked from commit 48af2cdd270c275acccc4a94b04e4ccdb64d557a)

* Update few more cases
2025-05-07 13:38:42 -07:00
24b0c4abfc [ATen][CUDA] Optimize 128 bit vectorization (#152967)
[ATen][CUDA] Optimize 128 bit vectorization (#148320)

Fixes #147376.
As per request: https://github.com/pytorch/pytorch/pull/145746#pullrequestreview-2642118301
This PR prevents sm80 or older from using vec8 kernels due to long compilation times and large binary size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 72337bdcf2f86eb72f289fdbd7eb63fd664aaa86)

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
2025-05-07 11:09:39 -07:00
2dc4b15cf3 [dynamo][super variable] Fix bug to use correct source (#152774) 2025-05-06 14:25:49 -04:00
cd6037ed4b [cudagraphs] Fix issue in collecting static_input_idxs (#152768)
[cudagraphs] Fix issue in collecting static_input_idxs (#152287)

related to https://github.com/pytorch/pytorch/issues/152275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152287
Approved by: https://github.com/bdhirsh, https://github.com/eellison


(cherry picked from commit 4a63cab624a798929191779743c62ed65926de58)

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-05-06 14:13:32 -04:00
1341794745 Gracefully handle optree less than minimum version, part 2 (#151323)
Gracefully handle optree less than minimum version, part 2 (#151257)

If optree is less than the minimum version, we should pretend it doesn't
exist.

The problem right now is:
- Install optree==0.12.1
- `import torch._dynamo`
- This raise an error "min optree version is 0.13.0"

The fix is to pretend optree doesn't exist if it is less than the min
version.

There are ways to clean up this PR more (e.g. have a single source of
truth for the version, some of the variables are redundant), but I am
trying to reduce the risk as much as possible for this to go into 2.7.
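
For context, a minimal sketch (with illustrative names, not the actual torch internals) of the "pretend it doesn't exist" gating described above:

```python
# Hypothetical sketch of gating an optional dependency on a minimum version.
MIN_OPTREE_VERSION = (0, 13, 0)

try:
    import optree
    _version = tuple(int(p) for p in optree.__version__.split(".")[:3])
    HAS_OPTREE = _version >= MIN_OPTREE_VERSION
except ImportError:
    optree = None
    HAS_OPTREE = False

if not HAS_OPTREE:
    optree = None  # treat a too-old optree the same as a missing one
```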

Test Plan:

I verified the above problem was fixed. Also tried some other things,
like the following, which now gives the expected behavior.
```py
>>> import torch
>>> import optree
>>> optree.__version__
'0.12.1'
>>> import torch._dynamo
>>> import torch._dynamo.polyfills.pytree
>>> import torch.utils._pytree
>>> import torch.utils._cxx_pytree
ImportError: torch.utils._cxx_pytree depends on optree, which is an optional
dependency of PyTorch. To use it, please upgrade your optree package to >= 0.13.0
```

I also audited all non-test callsites of optree and torch.utils._cxx_pytree.
Follow along with me:

optree imports
- torch.utils._cxx_pytree. This is fine.
- [guarded by check] f76b7ef33c/torch/_dynamo/polyfills/pytree.py (L29-L31)

_cxx_pytree imports
- [guarded by check] torch.utils._pytree (changed in this PR)
- [guarded by check] torch/_dynamo/polyfills/pytree.py (changed in this PR)
- [guarded by try-catch] f76b7ef33c/torch/distributed/_functional_collectives.py (L17)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_op_schema.py (L15)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/_dispatch.py (L35)
- [guarded by try-catch] f76b7ef33c/torch/_dynamo/variables/user_defined.py (L94)
- [guarded by try-catch] f76b7ef33c/torch/distributed/tensor/experimental/_func_map.py (L14)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151257
Approved by: https://github.com/malfet, https://github.com/XuehaiPan

(cherry picked from commit f1f18c75c9fc85df3cba8fe38582b1ddeefb270a)

Co-authored-by: rzou <zou3519@gmail.com>
2025-04-15 15:56:51 -07:00
073912749d Gracefully handle optree less than minimum version (#150977)
Gracefully handle optree less than minimum version (#150956)

Summary:
- We are saying the minimum version of pytree that PyTorch can use is
  0.13.0
- If a user imports torch.utils._cxx_pytree, it will raise an
  ImportError if optree doesn't exist or exists and is less than the
  minimum version.

Fixes https://github.com/pytorch/pytorch/issues/150889. There are
actually two parts to that issue:
1. dtensor imports torch.utils._cxx_pytree, but the optree installed in
   the environment might be too old. Instead, raising ImportError in
   torch.utils._cxx_pytree solves the issue.
2. We emit an "optree too low version" warning. I've deleted the
   warning in favor of the more explicit ImportError.
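
A hedged sketch of the import-time check described above, with illustrative names rather than the actual torch.utils._cxx_pytree code:

```python
# Hypothetical module-level guard: fail loudly with an actionable message
# instead of emitting a warning.
import importlib.util

MIN_OPTREE = "0.13.0"

def _check_optree():
    if importlib.util.find_spec("optree") is None:
        raise ImportError(
            "torch.utils._cxx_pytree depends on optree, which is an optional "
            f"dependency of PyTorch. Please install optree >= {MIN_OPTREE} to use it."
        )
    import optree
    if tuple(int(p) for p in optree.__version__.split(".")[:3]) < tuple(
        int(p) for p in MIN_OPTREE.split(".")
    ):
        raise ImportError(f"Please upgrade your optree package to >= {MIN_OPTREE}")

try:
    _check_optree()
except ImportError as e:
    print(e)
```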

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150956
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/XuehaiPan

(cherry picked from commit 061832bc7a6711daaaf2bca12c2140bd8dea7eb5)

Co-authored-by: rzou <zou3519@gmail.com>
2025-04-10 10:39:40 -04:00
0c236f3c72 Update triton wheel build, setuptools pin (#150953)
Update triton wheel build, setuptools pin (#150931)

Observing failure in release workflow:
https://github.com/pytorch/pytorch/actions/runs/14346340202/job/40216804374

```
Traceback (most recent call last):
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 11, in <module>
    from setuptools.command.bdist_wheel import bdist_wheel as bdist_wheel
ModuleNotFoundError: No module named 'setuptools.command.bdist_wheel'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tmppwpqef_x/triton/python/setup.py", line 27, in <module>
    from wheel.bdist_wheel import bdist_wheel
  File "/opt/python/cp311-cp311/lib/python3.11/site-packages/wheel/bdist_wheel.py", line 13, in <module>
    raise ImportError(ERROR) from exc
ImportError: The 'wheel.bdist_wheel' module has been removed.
Please update your setuptools to v70.1 or later.
If you're explicitly importing 'wheel.bdist_wheel', please update your import to point to 'setuptools.command.bdist_wheel' instead.
```
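
For context, a hedged sketch of the import-compatibility shim that sidesteps this removal (newer setuptools ships `bdist_wheel` itself); the actual fix in this commit is the setuptools pin described in the title.

```python
# Hypothetical compatibility shim: prefer the setuptools location of bdist_wheel,
# fall back to the legacy wheel location on older toolchains.
try:
    from setuptools.command.bdist_wheel import bdist_wheel
except ImportError:
    from wheel.bdist_wheel import bdist_wheel

print(bdist_wheel)
```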

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150931
Approved by: https://github.com/Skylion007

(cherry picked from commit d0e34822663b759f17ef5e6ec574cbf820c23b85)

Co-authored-by: atalman <atalman@fb.com>
2025-04-10 10:39:03 -04:00
c7ff78dfc0 Fix inplacing with multiple, fused uses (#150892)
Fix inplacing with multiple, fused uses (#150845)

We had `can_inplace` defined on a single use. When that buffer has multiple uses inside a fused node, we need to check if the other accesses have the same index. Otherwise we may read memory that has already been written to from inplacing.
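
A small plain-tensor illustration (not the inductor check itself) of why all fused uses must access the same index before the buffer's storage can be reused in place:

```python
import torch

a = torch.arange(6, dtype=torch.float32)

# Same-index use: every output element reads only a[i], so writing the result
# back into a's storage cannot clobber anything a later element still needs.
same_index = a * 2 + a

# Shifted use: output element i reads both a[i] and a[i + 1]; if results were
# written into a's storage as they are produced, some reads could observe
# already-overwritten values, so in-place reuse must be rejected here.
shifted = a[:-1] + a[1:]

print(same_index)
print(shifted)
```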

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150845
Approved by: https://github.com/zou3519, https://github.com/exclamaforte, https://github.com/atalman, https://github.com/jansel

(cherry picked from commit 27ded359a5dcbe8f92e01a24bec258bbfe1a73d6)

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-04-08 20:35:02 -04:00
894909a613 Revert "[CUDA] Only use vec128 if CUDA version is newer than 12.8" (#150855)
Revert "[CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150818)"

This reverts commit 3f236f19032ff6424160018c024478c83b6ad6b9.
2025-04-08 18:49:02 -04:00
ef2b1390ed [Manylinux 2.28] Correct Linux aarch64 cuda binaries wheel name (#150820)
[Manylinux 2.28] Correct Linux aarch64 cuda binaries wheel name (#150786)

Related to: https://github.com/pytorch/pytorch/issues/149044#issuecomment-2784044555
For CPU binaries we run auditwheel; however, for CUDA binaries auditwheel produces invalid results. Hence we need to rename the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150786
Approved by: https://github.com/malfet

(cherry picked from commit 836955bdbdeb299e6937065299564fb44ec422c2)

Co-authored-by: atalman <atalman@fb.com>
2025-04-07 23:07:41 -04:00
3f236f1903 [CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150818)
[CUDA] Only use vec128 if CUDA version is newer than 12.8 (#150705)

By addressing a feedback requested at https://github.com/pytorch/pytorch/pull/145746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150705
Approved by: https://github.com/atalman

(cherry picked from commit 5228986c395dc79f90d2a2b991deea1eef188260)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-04-07 23:06:01 -04:00
35f1e76212 Reland of "[ROCm] change preferred blas lib defaults (#150249)" (#150707)
Revert "Revert "[ROCm] change preferred blas lib defaults (#150249)" (#150658)"

This reverts commit 06c6a81a987e271d35a5da9501b4a17915bb8206.
2025-04-04 19:34:45 -04:00
a6321d6227 Revert "Dont exclude constant_pad_nd in prologue fusion" (#150699)
Revert "Dont exclude constant_pad_nd in prologue fusion (#150145)"

This reverts commit 6569576c4ecfb9b094a3b8a0b3db7c6e8b48f49d.
2025-04-04 15:51:44 -04:00
1cc51c640a [CUDA][avgpool2d] Fix backward launch bounds again for sm100, sm120 (#150676)
[CUDA][avgpool2d] Fix backward launch bounds again for `sm100`, `sm120` (#150640)

`__CUDA_ARCH__` is not visible in host code, which causes incorrect launch bounds and `too many resources requested for launch` on blackwell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150640
Approved by: https://github.com/malfet, https://github.com/drisspg, https://github.com/atalman

(cherry picked from commit 09c4da9325595f0091c81f5c47fc4ee1df0c4094)

Co-authored-by: Eddie Yan <eddiey@nvidia.com>
2025-04-04 07:09:23 -07:00
28ca4dd77d update get start xpu document for v2.7 (#150633)
update get start xpu document for v2.7 (#150397)

update get start xpu document for v2.7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150397
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 96f35f55e2676cfa76c28fb8f88e9f3cde08c59c)

Co-authored-by: ZhaoqiongZ <106125927+ZhaoqiongZ@users.noreply.github.com>
2025-04-03 20:40:52 -04:00
06c6a81a98 Revert "[ROCm] change preferred blas lib defaults (#150249)" (#150658)
This reverts commit 8b6bc59e9552689e115445649b76917b9487a181.
2025-04-03 20:39:27 -04:00
3b61d5d4e3 Update expected results for pr_time_benchmarks (#150620) 2025-04-03 10:14:13 -04:00
8b6bc59e95 [ROCm] change preferred blas lib defaults (#150249)
* [ROCm] change preferred blas lib defaults (#150212)

Fixes #148883
Fixes #150155

Also adds at::BlasBackend::Default. Instinct cards prefer hipBLASLt, everything else prefers rocBLAS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150212
Approved by: https://github.com/jeffdaily

(cherry picked from commit 7a470c932060190b314fe18bc1cec75335e4831f)

* add unit test for preferred_blas_library settings

---------

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-03 09:26:58 -04:00
c2ccaa3c21 [inductor] Fix inductor windows linker error (#150447)
[inductor] Fix inductor windows linker error (#150256)

Fixes #149889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150256
Approved by: https://github.com/anijain2305, https://github.com/eellison

(cherry picked from commit 37ebb0b56a3af1a5e8083337b4d670fc70fe23a3)

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-04-02 17:36:04 -07:00
6569576c4e Dont exclude constant_pad_nd in prologue fusion (#150145)
Dont exclude constant_pad_nd in prologue fusion (#149947)

Originally, I excluded constant_pad_nd from fusing to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. Also includes a fix for making a single, contiguous dep for prologues.

For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.

```
import torch
import torch.nn.functional as F
torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100

K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(input, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg

(cherry picked from commit 4c57aec5b9a37e23caedfe305fb4577e26254123)

Co-authored-by: eellison <elias.ellison@gmail.com>
2025-04-02 20:24:52 -04:00
5416dff2b2 [Release/2.7][MPS] Warn that torch.compile is a prototype (#150550)
And reference https://github.com/pytorch/pytorch/issues/150121
2025-04-02 14:55:19 -07:00
791265114e Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)" (#150572)
This reverts commit 5d4e7d58b42623a9024a84f0050967ff0318dcdb.
2025-04-02 14:47:28 -07:00
7ad8bc7e8b [Windows][inductor] fix blank space break windows file path (#150448)
[Windows][inductor] fix blank space break windows file path (#149388)

Fixes #149310

From origin error message:
```cmd
Command:
cl /I C:/Program Files/Python310/Include /I c:/code/.env/lib/site-packages/torch/include /I c:/code/.env/lib/site-packages/torch/include/torch/csrc/api/include /I c:/code/.env/lib/site-packages/torch/include/TH /I c:/code/.env/lib/site-packages/torch/include/THC /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D C10_USING_CUSTOM_GENERATED_MACROS /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp /LD /FeC:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.pyd /link /LIBPATH:c:/code/.env/Scripts/libs /LIBPATH:c:/code/.env/lib/site-packages/torch/lib torch.lib torch_cpu.lib torch_python.lib sleef.lib

Output:
Microsoft (R) C/C++ Optimizing Compiler Version 19.43.34809 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
cl : Command line warning D9024 : unrecognized source file type 'Files/Python310/Include', object file assumed
coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp
C:/Users/user/AppData/Local/Temp/torchinductor_user/ou/coubnfnqsm2gbdzdytufv46jotd6sxsnnhgldiw45pl5yjq5nbvz.cpp(21): fatal error C1083: Cannot open include file: 'Python.h': No such file or directory
```
Python is installed under the `C:/Program Files/Python310` path, and the blank space breaks the file path.

Solution:
Add quotes around Windows file paths; after that:
```cmd
cl /I "C:/Users/Xuhan/.conda/envs/new_build/Include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include" /I "C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/include/torch/csrc/api/include"  /D TORCH_INDUCTOR_CPP_WRAPPER /D STANDALONE_TORCH_HEADER /D  C10_USING_CUSTOM_GENERATED_MACROS /D CPU_CAPABILITY_AVX512  /DLL /MD /O2 /std:c++20 /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc /openmp /openmp:experimental  C:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.cpp  /arch:AVX512  /FeC:/Users/Xuhan/AppData/Local/Temp/tmp1wsj0m8r/za/czarp3ly5c22ge3hydvnzvad4cjimyr3hkwvofodxqffgil7frfd.pyd /LD /link /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/libs" /LIBPATH:"C:/Users/Xuhan/.conda/envs/new_build/lib/site-packages/torch/lib"  "torch.lib" "torch_cpu.lib" "torch_python.lib" "sleef.lib"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149388
Approved by: https://github.com/jansel

(cherry picked from commit bc1b8730a45e659dca83ec83995c17d4eec9c869)

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-04-02 13:21:44 -07:00
f2ee3f4847 [BE] Fix triton windows build (#150547)
[BE] Fix triton windows build (#150512)

Fixes #150480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150512
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
(cherry picked from commit 8102272d8c5b5a3063446ec67877eea495e6d323)

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2025-04-02 09:54:45 -07:00
dfd39fe14f [cherry-pick] [CI] Disable some tests that are failing in periodic #150059 (#150327)
* [CI] Disable some tests that are failing in periodic (#150059)

Disabling some tests to restore periodic

nogpu avx512 timeout:
59f14d19ae (38492953496-box)

profiler failure: 7ae0ce6360 (38461255009-box)

test_accelerator failure:
87bfd66c3c (39476723746-box)
origin: 146098

test_overrides failure:
bf752c36da (39484562957-box)
origin: 146098

inductor cpu repro:
bb9c426024 (38447525659-box)

functorch eager transforms:
8f858e226b (39488068620-box)
f2cea01f71 (39555064878)
b5281a4a18 (39599355600)
either 148288 or 148261?

2ec9aceaeb/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150059
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet

* disable_CompiledOptimizerParityTests

* Update test/inductor/test_compiled_optimizers.py

---------

Co-authored-by: Catherine Lee <csl@fb.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-01 23:05:14 -07:00
b766c0200a [Cherry-pick] Make PyTorch buildable with cmake-4 (#150460)
* [Cmake] Make PyTorch buildable by CMake-4.x (#150203)

By turning on compatibility mode for protobuf, nnpack, PSimd and FP16, ittapi, TensorPipe and Gloo
Update CMake requirements

 Revert 0ece461ccafe5649d2d0f058ff5477765fd56499 and b0901d62ae2c2e909f91401eacebf3731df20cbe to test that it actually works

TODO:
  - Update/get rid of those libraries

Fixes https://github.com/pytorch/pytorch/issues/150149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150203
Approved by: https://github.com/clee2000

(cherry picked from commit 493c7fa66f82cf781ee0f9d0cc9e305688f0a286)

* Make PyTorch buildable by CMake-4.x on s390x (#150294)

This is a continuation of
https://github.com/pytorch/pytorch/pull/150203
that fixes nightly build on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150294
Approved by: https://github.com/malfet

(cherry picked from commit ab342d3793472c65aaa0b007ca13a98fc9206dc5)

---------

Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
2025-04-01 19:37:52 -07:00
a3cd7b0cc4 [MPS] tril op not handling infs correctly (#150479)
[MPS] tril op not handling infs correctly (#149866)

Fixes #149813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet

(cherry picked from commit ba46643df181f37efe594f9dd77b45436e08e6ec)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-04-01 16:30:38 -07:00
8522972133 torch.backends.mkldnn.flags() CM should not warn (#150416)
`torch.backends.mkldnn.flags()` CM should not warn (#150358)

By returning `None` rather than `False` from `THPModule_allowTF32OneDNN` when USE_XPU is not defined

Added regression test

Fixes https://github.com/pytorch/pytorch/issues/149829
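
A hedged regression-style sketch of the behavior being fixed: entering the `torch.backends.mkldnn.flags()` context manager should not emit a warning.

```python
import warnings
import torch

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with torch.backends.mkldnn.flags():   # should be silent after this fix
        pass

print("warnings emitted:", [str(w.message) for w in caught])  # expected: []
```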

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150358
Approved by: https://github.com/atalman

(cherry picked from commit 6470b373c16017f5cb8f1aa4060bb60632b18160)

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-04-01 08:53:27 -07:00
c4b98c8364 [Build] Fix XPU builds inside venv (#150301)
Update the torch-xpu-ops commit to [3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5](3ee2bd2f13)

 - Fix the build error when users build torch XPU through a Python virtual environment. It was because torch-xpu-ops uses `${PYTHON_EXECUTABLE}` to get the Python path; however, `${PYTHON_EXECUTABLE}` is the system Python path, while the PyTorch root cmake uses `Python_EXECUTABLE` ([Here](420a9be743/tools/setup_helpers/cmake.py (L310))) https://github.com/intel/torch-xpu-ops/issues/1461
 - code diff (026b2c8c7c..3ee2bd2f13)
   - base commit: 026b2c8c7c92a7b2cec5d26334006e3423251cc6
   - new commit: 3ee2bd2f13e1ed17a685986ff667a58bed5f2aa5

(cherry picked from commit f74d5d576aedf053b7574f3eb06d12417d80625a)

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2025-04-01 08:22:00 -07:00
d10ffd76db [Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#150395)
[Doc] Update CMAKE_PREFIX_PATH for XPU windows README (#148863)

We found that `pip install cmake` and `conda install cmake` have different behavior.
The reason is that the pip-installed one doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` for alignment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet

Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
(cherry picked from commit ce52674b7651921630019de62323ee0bfd69516d)

Co-authored-by: Stonepia <tong.su@intel.com>
2025-04-01 10:56:57 -04:00
53a13e553d Enabling xpu in OffsetBasedRNGTracker. (#150389)
Enabling xpu in OffsetBasedRNGTracker. (#148360)

Otherwise torch.distributed breaks on xpu devices.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148360
Approved by: https://github.com/zhangxiaoli73, https://github.com/guangyey, https://github.com/gujinghui, https://github.com/XilunWu, https://github.com/kwen2501

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
(cherry picked from commit f0e1a0838c1245a8763d1c67318b23940a3e9246)

Co-authored-by: _githubsgi <zozoxoxo897@gmail.com>
2025-04-01 10:54:51 -04:00
5745d6a770 [ROCm] cmake 4 workaround for hiprtc (#150361)
[ROCm] cmake 4 workaround for hiprtc (#150324)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150324
Approved by: https://github.com/jeffdaily, https://github.com/atalman, https://github.com/malfet

(cherry picked from commit 423e4a4568958845da52808e50d1cdd2ba7fa48d)

Co-authored-by: Faa Diallo <Faa.Diallo@amd.com>
2025-04-01 10:44:03 -04:00
60ddcd803e Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590) (#150352)
Revert "[PGNCCL] Launch kernel on current stream & remove `record_stream` entirely (#148590)"

This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.
2025-03-31 15:25:34 -07:00
f2b3b5c453 [MPS] Fix dot/mm for conj_tensors (#150237)
[MPS] Fix dot/mm for conj_tensors (#150157)

- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve conj property when gathering the views, that fixes `cov` operator
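
A hedged repro sketch for the `dot` case from the list above; it needs an MPS-capable machine and compares against the CPU result.

```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(4, dtype=torch.complex64)
    y = torch.randn(4, dtype=torch.complex64)
    cpu = torch.dot(x.conj(), y)
    mps = torch.dot(x.to("mps").conj(), y.to("mps")).cpu()
    print(torch.allclose(cpu, mps))  # expected True once conjugation is keyed correctly
```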

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci

(cherry picked from commit 7c65911b11fc1cc7d93045f4cf923058e8a27782)

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
2025-03-31 16:13:11 -04:00
71fa7def26 Fix #149806 : Fix path lookup in _preload_cuda_deps (#150068)
Fix #149806 : Fix path lookup in _preload_cuda_deps (#149808)

@pytorchbot label "bug"

Fixes #149806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149808
Approved by: https://github.com/jansel

(cherry picked from commit 68b327341c748c869fdd7cb51cd05ab8ad6caaac)

Co-authored-by: Divain <fegnouche@hotmail.fr>
2025-03-31 13:07:56 -07:00
1a6c192dc4 Use schema as source of truth + support ones_like/empty_like (#149775)
Use schema as source of truth + support ones_like/empty_like (#149052)

This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema as the source of truth, which is important as IValue types are overloaded and can convert incorrectly due to this ambiguity. For example, a MemoryFormat will look like an int and get converted to an int64_t instead of a MemoryFormat!

(b) This PR expands support for many more types to encompass way more schemas, e.g., Optional, Device, dtype, etc. The main win from this PR is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
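
For illustration (using the public Python bindings, not the C shim itself), the op schema carries each argument's declared type, e.g. that `memory_format` is a MemoryFormat and not a plain int:

```python
import torch

op = torch.ops.aten.ones_like.default
# The schema, not the runtime value, records each argument's declared type,
# so an int-looking value cannot be misinterpreted.
for arg in op._schema.arguments:
    print(arg.name, arg.type)
```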

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD

(cherry picked from commit 988827cdfb6d5946049cac7141a5ca04f2177c0a)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-31 11:03:26 -07:00
e691e92297 Update Doc for Intel XPU Profiling (#150272)
Update Doc for Intel XPU Profiling (#134515)

Updated below two pages for Intel XPU
https://pytorch.org/docs/stable/torch.compiler_profiling_torch_compile.html
https://pytorch.org/docs/stable/profiler.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134515
Approved by: https://github.com/dvrogozh, https://github.com/malfet

(cherry picked from commit 7aacbab0b32596a3c334dca5d488e4620b79bb5e)

Co-authored-by: Louie Tsai <louie.tsai@intel.com>
2025-03-31 09:42:08 -07:00
2b73f403c7 Pin cmake to 3.31.2 for windows conda install (#150223)
Pin cmake to 3.31.2 for windows conda install (#150185)

Trying to fix nightly failures
Cmake 4.0 update https://pypi.org/project/cmake/4.0.0/ broke nightly builds
You can see it here: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=cuda11_8-build
and here: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=
This fixes Windows builds; Linux and macOS were already fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150185
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi

(cherry picked from commit b0901d62ae2c2e909f91401eacebf3731df20cbe)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-28 14:28:41 -07:00
697cd9bbb1 [inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149993)
[inductor][triton 3.3] Fix cpp_wrapper w/ TMA in triton 3.3 (#149973)

Fixes #148938

Context:

In triton 3.3, triton kernels expect a global scratch space arg to be passed in. This is fixed in #148051, which fixed most of the AOTI/cpp_wrapper failures; the fix is to inject a (null) global scratch space arg passed as an argument to all kernels.

But in the case of TMA, we need to call a non-triton-generated function - init1DTMADescriptor. The same `generate_args_decl` function used for calling triton kernels (and modified in #148051 to insert a global scratch space) is used to prepare the arguments to init1DTMADescriptor, and so it had an extra global scratch space arg. Then we'd get a null pointer passed into init1DTMADescriptor, resulting in an IMA later on when the kernel uses the TMA descriptor.

This PR: adds an option to `generate_args_decl` to specify whether this is a triton kernel (in which case we should add the global scratch space arg) or not (when we shouldn't add the extra arg).

Note: this doesn't appear in CI because we don't run these tests with Hopper machines in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149973
Approved by: https://github.com/drisspg

(cherry picked from commit a8d0c5c92818186119d4a94d98999acc3f549a7e)

Co-authored-by: David Berard <dberard@fb.com>
2025-03-28 16:40:47 -04:00
64ca70f83c Pin cmake==3.31.6 (#150193)
Pin cmake==3.31.6 (#150158)

I'm not sure if this is the right thing to do, but cmake 4.0.0 got released on PyPI and our builds are failing with it.

Example:
aa70d62041 (39555975425-box)

I guess we have to go change all the cmake_minimum_required to >=3.5?

Backwards compat is still failing because it's building with the base commit, which this PR can't really change until it gets merged, but at least the manywheel binary builds got past where they were originally failing.

Also pin the conda installation, but the most recent version on conda is 3.31.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150158
Approved by: https://github.com/cyyever, https://github.com/malfet

(cherry picked from commit 0ece461ccafe5649d2d0f058ff5477765fd56499)

Co-authored-by: Catherine Lee <csl@fb.com>
2025-03-28 09:09:29 -07:00
1b84fd1503 Enable fast path for qlinear (static/dynamic) and qadd for AArch64 through ACL directly. (#149435)
* Enable fast qlinear static/dynamic path for AArch64 through ACL directly (#148585)

This enables a fast path for eager mode static/dynamic quantization for AArch64 through Arm Compute Library (ACL) directly.

Context: PRs #126687, #139887 enabled an optimized implementation for `qlinear` and `qlinear_dynamic` for aarch64 through `ideep → oneDNN → ACL` which improved performance by ~10x compared to the previous implementation.
However, the current `qlinear` and `qlinear_dynamic` path (`ideep → oneDNN → ACL`) suffers from high overhead due to the API friction between the stateless oneDNN API and the stateful ACL low-precision GEMM (`lowp_gemm`) API - for example, ACL's `lowp_gemm` objects cache information like weights reduction or weights in optimized memory format which oneDNN does not allow due to its stateless nature.
Hence, ACL currently runs a (redundant) sum of columns and pre-transposition (to the gemm kernel's optimal format) for each GEMM operation.
This PR addresses the sub-optimalities above by integrating ACL directly with `qlinear` and `qlinear_dynamic`.

- **For `qlinear_dynamic` (dynamically quantized matmuls):**

This PR yields an **average speedup** (averaged over context lengths of 2^3 up to 2^9) of **~50%** for `bert-base-uncased`, `bert-large-uncased`, `roberta-base`, and `distilbert-base-uncased` with 16 threads on a Neoverse-V1 (with transformers==4.48) for the benchmarking script below:
```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
from transformers import AutoModel, AutoConfig
import time
import numpy as np
from argparse import ArgumentParser

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__(description="huggingface model")
        self.add_argument("--context_length",
                            help="context length - number of input tokens",
                            type=int,
                            default=64
        )
        self.add_argument("--model",
                            help="model checkpoint - i.e. 'bert-base-uncased'",
                            type=str,
                            default=None)
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()
    model_name = args.model
    config = AutoConfig.from_pretrained(model_name)
    batch_size = 1
    model = AutoModel.from_pretrained(model_name)
    model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    model.eval()
    inputs = torch.randint(config.vocab_size, (batch_size, args.context_length), dtype=torch.long, device="cpu")
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("Model = ", model_name)
    print("Context Length = ", args.context_length)
    print("Min (ms) = ", min(times))
    print("Mean (ms) = ", np.mean(times))
```

- **For `qlinear` (statically quantized matmuls):**

This PR yields an **average speedup of 2x for signed activations (`s8s8s8`) and 95x for unsigned activations (u8s8u8)** on a Neoverse-V1 with 16 threads for the benchmarking script below.
The averages are over all combinations of `M = [8, 16, ..., 512]`, `K = [768, 1024, 2048, 4096]`, `N = [768, 1024, 2048, 4096]`.
The astronomical speedup for unsigned activations is because oneDNN v3.7 does not have an optimized implementation for `u8s8u8` on AArch64.

```
# SPDX-FileCopyrightText: Copyright 2025 Arm Limited and/or its affiliate <open-source-office@arm.com>
# SPDX-License-Identifier: BSD-3-Clause
import torch
import torch.nn as nn
from torch.quantization import QConfig
from torch.ao.quantization.observer import HistogramObserver, default_weight_observer
import torch
import torch.nn as nn
import numpy as np
import random
from argparse import ArgumentParser
import time

class ModelArgumentParser(ArgumentParser):
    def __init__(self) -> None:
        super().__init__()
        self.add_argument("--M",
                            help="M dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--K",
                            help="K dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--N",
                            help="N dimension",
                            type=int,
                            default=64
        )
        self.add_argument("--signed_input",
                            help="Use (signed) torch.qint8 for inputs instead of (unsigned) torch.quint8",
                            action="store_true"
        )
        self.add_argument("--seed",
                          help="Random seed",
                          type=int,
                          default=42
        )
        self.add_argument("--iters",
                          help="benchmark iterations",
                          default=500)

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

class LinearModel(nn.Module):
    def __init__(self, K, N):
        super(LinearModel, self).__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(K, N)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc(x)
        x = self.dequant(x)
        return x

def quantize_model(model, args):
    qconfig = QConfig(
            activation=HistogramObserver.with_args(reduce_range=False,
            dtype=torch.qint8 if args.signed_input else torch.quint8),
            weight=default_weight_observer,
    )
    # Prepare the model for static quantization
    # Specify quantization configurations
    model.qconfig = qconfig
    model_prepared = torch.quantization.prepare(model)

    # Calibrate the model with sample inputs
    # Example input data for calibration
    with torch.no_grad():
        sample_data = torch.randn(args.M, args.K)
        model_prepared(sample_data)
    # Convert the prepared model to a quantized model
    model_quantized = torch.quantization.convert(model_prepared)
    return model_quantized

if __name__ == "__main__":
    parser = ModelArgumentParser()
    args = parser.parse_args()

    set_seed(args.seed)
    model_fp32 = LinearModel(args.K, args.N)
    model_quantized = quantize_model(model_fp32, args)

    inputs = torch.randn(args.M, args.K)
    times = []
    with torch.no_grad():
        # warmup
        for _ in range(10):
            model_quantized(inputs)
        # benchmark
        for _ in range(args.iters):
            s = time.time_ns()
            model_quantized(inputs)
            times.append((time.time_ns() - s) / 1e6)

    print("M,K,N,signed = ", args.M, args.K, args.N, args.signed_input)
    print("Min Times (ms) = ", min(times))
    print("Mean Times (ms) = ", np.mean(times))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148585
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 08a644a4c4a0f74cf3277e85e265a44a192079c5)

* Enable qint8 and quint8 add for AArch64 using ACL directly (#148653)

This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
Relative performance improvement using OMP_NUM_THREADS=1 is ~15x, using OMP_NUM_THREADS=32 it’s ~5.4x.
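
A hedged micro-example of the eager-mode quantized add path this speeds up (shapes and qparams are arbitrary):

```python
import torch

scale, zero_point = 0.05, 0
a = torch.quantize_per_tensor(torch.randn(1024, 1024), scale, zero_point, torch.qint8)
b = torch.quantize_per_tensor(torch.randn(1024, 1024), scale, zero_point, torch.qint8)
out = torch.ops.quantized.add(a, b, scale, zero_point)
print(out.shape, out.dtype)
```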

Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585

(cherry picked from commit 6c2db8fab047b8a1d671c3c8dfbdd4c478c6d2e3)

* [Build] Guard per-op headers in ACLUtils.cpp (#149417)

To fix internal build failures, where per-op headers are not generated.
We really should have lint for something like that.

Test Plan: CI

Reviewed By: izaitsevfb

Differential Revision: D71406882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149417
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb

(cherry picked from commit 5db3a4ac88ad9a3062a9f64dc64741b820208a91)

---------

Co-authored-by: Nikita Shulga <nshulga@meta.com>
2025-03-28 07:51:16 -07:00
6b27e11a5b [Release-only] Pin intel-oneapi-dnnl to 2025.0.1-6 (#150132)
[CI] Fix the XPU CI build environment
2025-03-27 16:07:33 -07:00
18a926f547 update release 2.7 xla pin (#150126)
* update release 2.7 xla pin

* fix

* fix
2025-03-27 18:29:03 -04:00
ecd434bea9 Revert "Parallelize sort" (#150128)
Revert "Parallelize sort (#149765)"

This reverts commit 8d2186cd7952336d4f8b3f73648a5c0714a832b9 as it causes inductor test regression, see 5bed3fafc7/1
2025-03-27 12:04:15 -07:00
5bed3fafc7 [ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149432)
[ROCm] Fixes and improvements to CUDA->HIP flag conversion for CPP extensions (#149245)

Fixes https://github.com/ROCm/hip/issues/3764.

Fixes and improvements to CUDA->HIP flag conversion for CPP extensions

- Log flag conversion for debugging purposes.
- Fix cases where it should not touch the -I flags or cases where CUDA appears more than once by replacing only the first instance.
- Fix case where nvcc key may not exist
- Fix case where hipify should ignore flag values and only touch the flag itself

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149245
Approved by: https://github.com/jeffdaily

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
(cherry picked from commit c0566e0dbf42f633624adb02015742509edcb444)

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2025-03-26 17:18:20 -07:00
9b4f085526 [MPS] fix attention enable_gqa crash on mps (#150067)
[MPS] fix attention enable_gqa crash on mps (#149147)

Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet

(cherry picked from commit dd6e9df3d00851c44fb76341f3113fe9223dcfca)

Co-authored-by: Isalia20 <irakli.salia854@gmail.com>
2025-03-26 17:17:11 -07:00
d29e4c81d9 update aotinductor doc for XPU support (#149935)
update aotinductor doc for XPU support (#149299)

as title. Since the AOTInductor feature starting from 2.7 works on Intel GPU, add the related contents into its doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149299
Approved by: https://github.com/guangyey, https://github.com/desertfire

(cherry picked from commit 4ea580568a27e281b96d26d9380c786c2e2116e6)

Co-authored-by: Jing Xu <jing.xu@intel.com>
2025-03-26 18:42:56 -05:00
8d2186cd79 Parallelize sort (#149765)
Parallelize sort (#149505)

PR #142391 erroneously used `USE_OMP` instead of `USE_OPENMP`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149505
Approved by: https://github.com/fadara01, https://github.com/Skylion007

(cherry picked from commit 842d51500be144d53f4d046d31169e8f46c063f6)

Co-authored-by: Annop Wongwathanarat <annop.wongwathanarat@arm.com>
2025-03-26 16:47:46 -05:00
b04d8358d9 ci/docker: use NCCL 2.26.2-1 (#149874)
ci/docker: use NCCL 2.26.2-1 (#149778)

Related to #149153

This updates some build scripts to hopefully fix the nightly builds which are somehow building against nccl 2.25.1 and using 2.26.2 from pip.

Test plan:

After merging rerun nightly linux jobs and validate that nccl version matches
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149778
Approved by: https://github.com/Skylion007, https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
(cherry picked from commit ddc0fe903f3043246103d71b60a4fff0aeeef9e8)

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 16:36:35 -04:00
d80afc07f0 [cherry-pick] Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540 (#149624)
Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)

1. Use NCCL_VERSION=v2.26.2-1. Fixes the nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: https://github.com/pytorch/pytorch/pull/149351
TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds

3. Cleanup: Remove older CUDA versions for aarch64 builds. CUDA 12.6 was removed by: https://github.com/pytorch/pytorch/pull/148895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
2025-03-26 13:33:54 -07:00
84210a82ef [cherry-pick] nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351) (#149546)
* nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351)

Fixes #149153

Yaml generated from:

```
python .github/scripts/generate_ci_workflows.py
```

Test plan:

Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7

```
rm -rf third_party/nccl
python setup.py develop
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet

* fixed_regenerations

---------

Co-authored-by: Tristan Rice <rice@fn.lc>
2025-03-26 13:32:51 -07:00
4268b2f40a [MPSInductor] Move threadfence at the right location (#150037)
[MPSInductor] Move threadfence at the right location (#149437)

Not sure how it worked in the past, but the fence should be before the first read from shared memory, not after it.
This bug was exposed by https://github.com/pytorch/pytorch/pull/148969 which removed unnecessary barrier before calling `threadgroup_reduce` functions
Test plan:
```
% python3 generate.py --checkpoint_path checkpoints/stories15M/model.pth --prompt "Once upon a time" --device mps --compile
```
Before that it produced gibberish, now it works fine
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149437
Approved by: https://github.com/manuelcandales, https://github.com/dcci

(cherry picked from commit 61a64c20c402e61027dad4a9e7a192ec0971d1d6)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-26 11:17:09 -07:00
12a6d2a0b8 Add triton as dependency to CUDA aarch64 build (#149945)
Add triton as dependency to CUDA aarch64 build (#149584)

Aarch64 Triton build was added by: https://github.com/pytorch/pytorch/pull/148705
Hence add the proper constraint to the CUDA 12.8 Aarch64 build

Please note we want to still use:
```platform_system == 'Linux' and platform_machine == 'x86_64'```
For all other builds.

Since these are prototype binaries only used by the CUDA 12.8 Linux aarch64 build, which we would like to serve from download.pytorch.org

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149584
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 9b1127437e6ccf0c55a87607d9f551cc6424ca67)

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-26 07:28:27 -07:00
464432ec47 Automate stable CUDA update and linter using min Python version (#149981)
Automate stable CUDA update and linter using min Python version (#148912)

1. Fixes: https://github.com/pytorch/pytorch/issues/145571 . CUDA Stable is the same CUDA version that is published to PyPI; it is also used to set the Metadata section in the rest of the whl scripts and to tag the Docker releases with the latest tag.
2. Updates the min Python version used in the linter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148912
Approved by: https://github.com/Skylion007, https://github.com/malfet

(cherry picked from commit 29fd875bc125582f29abbdf5559d3941899680be)

Co-authored-by: atalman <atalman@fb.com>
2025-03-26 07:03:35 -07:00
1f612dafb5 Removing doc references to PRE_CXX11_ABI. (#149952)
Removing doc references to PRE_CXX11_ABI. (#149756)

Fixes #149550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149756
Approved by: https://github.com/svekars, https://github.com/atalman

(cherry picked from commit 43ee67e8dc6827fbb7d12a5950ddf6a5c80771dc)

Co-authored-by: Alanna Burke <burkealanna@meta.com>
2025-03-25 14:04:12 -07:00
f63def6ac7 [XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149714)
[XPU] Update triton commit to fix level_zero not found by env var LEVEL_ZERO_V1_SDK_PATH. (#149511)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149511
Approved by: https://github.com/EikanWang

(cherry picked from commit ee6a0291653cd507d59bda6bcf5d848099a804d1)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-25 16:04:51 -04:00
3a8e623a9b op should NOT be static in aoti_torch_call_dispatcher (#149644)
op should NOT be static in aoti_torch_call_dispatcher (#149208)

aoti_torch_call_dispatcher is meant to call different ops, so the op must not be static. Otherwise, every call to this API will call the first op that was ever called, which is not the intended behavior of any human being.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149208
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/malfet

(cherry picked from commit 740ce0fa5f8c7e9e51422b614f8187ab93a60b8b)

Co-authored-by: Jane Xu <janeyx@meta.com>
2025-03-25 16:03:19 -04:00
bf727425a0 Symintify transpose_ (#149632)
Symintify transpose_ (#149057)

Fixes https://github.com/pytorch/pytorch/issues/148702
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149057
Approved by: https://github.com/yushangdi

(cherry picked from commit 8d7c430e84f4ad439ebdc81f9ab496a3665033a4)

Co-authored-by: angelayi <yiangela7@gmail.com>
2025-03-25 15:59:42 -04:00
8c7dbc939f [cherry-pick] [CI] Don't clean workspace when fetching repo (#147994) (#149129)
Revert "[CI] Don't clean workspace when fetching repo (#147994)"

This reverts commit e5fef8a08ebb8548e8413ae54ef0ad9a11f1f4c0.
2025-03-25 15:25:19 -04:00
644fdbad95 [Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149631)
[Intel GPU][PT2E] bugfix: use zero-point to decide conv src zp mask (#149473)

# Motivation
The PR fixes a bug that wrongly decides the zero-point mask setting. Specifically, the zero-point was deemed always non-zero because the scale was used for the check. Fortunately, the bug only affects performance; accuracy is not affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149473
Approved by: https://github.com/EikanWang, https://github.com/guangyey

(cherry picked from commit d67c1a027e61bd68908bc4c8e5275a983521366c)

Co-authored-by: ZhiweiYan-96 <zhiwei.yan@intel.com>
2025-03-25 10:07:25 -05:00
fb027c5692 [cherry-pick] Update ExecuTorch pin (#149539) (#149630)
Update ExecuTorch pin (#149539)

Latest commit in https://hud.pytorch.org/hud/pytorch/executorch/viable%2Fstrict/1?per_page=50

Follow-up to https://github.com/pytorch/pytorch/issues/144480#issuecomment-2731150636

Also, need to incorporate change from https://github.com/pytorch/executorch/pull/8817

Test Plan:

Monitor  linux-jammy-py3-clang12-executorch test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149539
Approved by: https://github.com/larryliu0820

(cherry picked from commit bc86b6c55a4f7e07548a92fe7c9b52ad2c88af35)
2025-03-25 10:03:16 -05:00
3b87bd8b82 Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#149878)
Fix atomic operation compatibility for ARMv8-A (Raspberry Pi 4) by adjusting compilation flags (#148070)

**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with march=armv8.2-a+sve
* This led to an issue when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.

**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang promote it correctly, resolving the compatibility issue with ARMv8-A while still working correctly for SVE as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.

Test plan:
 - Allocate `a1.4xlarge` on AWS
 - Run following script using wheel produced by this PR
 ```python
import torch
def f(x):
    return x.sin() + x.cos()

print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```

Fixes #146792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet

(cherry picked from commit 09f7f62cfebb0067b93d227c13fe9a94b51af762)

Co-authored-by: maajidkhann <maajidkhan.n@fujitsu.com>
2025-03-24 13:57:16 -07:00
89b098a677 Add release branch push triggers to inductor-rocm-mi300.yml (#149871)
Add release branch push triggers to inductor-rocm-mi300.yml (#149672)

In similar vein as https://github.com/pytorch/pytorch/pull/149517

When we added the rocm-mi300.yml earlier this year, we had lower capacity and we were just pipecleaning the workflow, so we set the trigger to only respond to pushes to main branch. But now we have more stability as well as capacity, and we would really like to ensure that the release branch is being tested on MI300s as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149672
Approved by: https://github.com/jeffdaily

(cherry picked from commit 1eab841185cb2d68a11e2e0604fd96d110778960)

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-03-24 13:24:10 -07:00
4cc4302b32 Do not depend on numpy during the import (#149731)
Do not depend on numpy during the import (#149683)

But a good followup would be to use torch primitives instead of numpy here
Fixes https://github.com/pytorch/pytorch/issues/149681

Test plan: Monkey-patch 2.7.0-rc and run `python -c "import torch;print(torch.compile(lambda x:x.sin() + x.cos())(torch.rand(32)))"`
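
A hedged sketch of the deferred-import pattern that avoids depending on numpy at `import torch` time; the helper name is hypothetical.

```python
# Hypothetical helper: numpy is only imported when the function actually runs,
# so environments without numpy can still `import torch` successfully.
def _percentile(values, q):
    import numpy as np  # deferred import
    return float(np.percentile(np.asarray(values), q))
```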

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149683
Approved by: https://github.com/seemethere

(cherry picked from commit 68dfd44e50f59c53698a24985039a27351862963)

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-21 11:13:52 -07:00
c632e4fdb8 [ONNX] Expose verification utilities (#149375)
* [ONNX] Expose verification utilities (#148603)

Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms

(cherry picked from commit ebabd0efdddd91e11364e42227b746c419a39be4)

* [ONNX] Update types in VerificationInfo (#149377)

torch.types.Number was rendered as is in the documentation and can be confusing. We write the original types instead to reduce confusion for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms

---------

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-03-20 08:14:38 -07:00
b23bfae9f7 Add AOTI shim for _weight_int4pack_mm_cpu_tensor (#149031) (#149386)
**Summary**
Previous implementation of shim did not align with the design and it was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the files of MKLDNN backend and re-enable the CPP wrapper UT.

**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel, https://github.com/EikanWang, https://github.com/desertfire
2025-03-20 08:08:09 -07:00
1b8f496f87 Pin auditwheel to 6.2.0 (#149525) 2025-03-19 16:13:43 -07:00
c236b602ff Add release branch push triggers to rocm-mi300.yml (#149526) 2025-03-19 16:12:59 -07:00
6926f30654 BC fix for AOTIModelPackageLoader() constructor defaults (#149214)
BC fix for AOTIModelPackageLoader() constructor defaults (#149082)

The default value for `run_single_threaded` was wrongly specified in the .cpp file instead of the header, breaking C++-side instantiation of `AOTIModelPackageLoader` with no arguments. This PR fixes this and adds a test for the use case of running with `AOTIModelPackageLoader` instead of `AOTIModelContainerRunner` on the C++ side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149082
Approved by: https://github.com/desertfire

(cherry picked from commit 5e1b715dda813d8c545378291261b565649df8e5)

Co-authored-by: Joel Schlosser <jbschlosser@meta.com>
2025-03-17 17:16:54 -05:00
483980d7f3 [AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149242)
[AOTI][XPU] Fix: model_container_runner_xpu.cpp is not built into libtorch_xpu.so (#149175)

The missing model_container_runner_xpu.cpp causes a compilation failure when users build a C++ inference application on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149175
Approved by: https://github.com/jansel

(cherry picked from commit 9ad6265d044075d1ceb27cf0f2af7495e586003c)

Co-authored-by: xinan.lin <xinan.lin@intel.com>
2025-03-17 17:15:39 -05:00
7173a73cf4 [regression] Fix pin_memory() when it is called before device lazy initialization. (#149183)
[regression] Fix pin_memory() when it is called before device lazy initialization. (#149033)

PR #145752 added a check in isPinnedPtr to verify that a device is initialized before checking whether the tensor is pinned. That PR also added a lazy initialization trigger when at::empty is called with the pinned param set to true. However, when the tensor is created first and then pinned in a separate pin_memory() call, lazy device init is not triggered, so is_pinned always returns false.

With this PR, the lazy initialization is moved to the getPinnedMemoryAllocator function, which ensures the device is initialized before we pin a tensor.
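
A minimal sketch of the two code paths described above (assuming a build with an accelerator that supports pinned memory; before this fix, only the second path could misreport):

```py
import torch

# Path 1: pinned at creation time -- at::empty with pin_memory=True triggers
# the lazy device initialization added in #145752.
a = torch.empty(4, pin_memory=True)
assert a.is_pinned()

# Path 2: tensor created first, then pinned in a separate call. Before this fix,
# is_pinned() could report False here because the device was never initialized.
b = torch.empty(4).pin_memory()
assert b.is_pinned()
```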

Fixes #149032

@ngimel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149033
Approved by: https://github.com/ngimel, https://github.com/albanD

(cherry picked from commit 420a9be743f8dd5d6296a32a1351c1baced12f1f)

Co-authored-by: Bartlomiej Stemborowski <bstemborowskix@habana.ai>
2025-03-17 15:37:24 -04:00
7bab7354df [cherry-pick] Revert #148823 - Make dynamism code robust to NotImplementedException (#149160)
Revert "Make dynamism code robust to NotImplementedException (#148823)"

This reverts commit 60576419a2a5cc09e4a92be870fda8f3fc305ddc.

Reverting from RC since it was reverted from the main branch
2025-03-14 10:50:24 -05:00
b1940b5867 Remove runtime dependency on packaging (#149125)
Remove runtime dependency on packaging (#149092)

It looks like after https://github.com/pytorch/pytorch/pull/148924
we are seeing this error in nightly tests:
https://github.com/pytorch/pytorch/actions/runs/13806023728/job/38616861623

```
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/pattern_matcher.py", line 79, in <module>
    from .lowering import fallback_node_due_to_unsupported_type
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/lowering.py", line 7024, in <module>
    from . import kernel
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/__init__.py", line 1, in <module>
    from . import mm, mm_common, mm_plus_mm
  File "/Users/runner/work/_temp/anaconda/envs/test_conda_env/lib/python3.13/site-packages/torch/_inductor/kernel/mm.py", line 6, in <module>
    from packaging.version import Version
ModuleNotFoundError: No module named 'packaging'
```

Hence we remove the runtime dependency on packaging, since it may not be installed by default.
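
A hedged sketch of one way to compare versions without `packaging` (illustrative only; not necessarily the exact approach taken in the PR):

```py
def version_tuple(version: str) -> tuple:
    """Turn a string like '3.3.0+gitabc' into (3, 3, 0), keeping only the leading numeric parts."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)

# Example: gate a code path on a Triton-style version string without importing packaging.
assert version_tuple("3.3.0+gitabc") >= (3, 3)
```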

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149092
Approved by: https://github.com/drisspg, https://github.com/davidberard98

(cherry picked from commit 65d19a5699afbb0b123b6b264188f5610b925c5e)

Co-authored-by: atalman <atalman@fb.com>
2025-03-13 12:28:50 -04:00
abebbd5113 [Profiler][HPU] Fix incorrect availabilities for HPU (#149115)
[Profiler][HPU] Fix incorrect availabilities for HPU (#148663)

Fixes #148661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148663
Approved by: https://github.com/jeromean, https://github.com/albanD

(cherry picked from commit 75c8b7d9725af59d8348379a4165d1252c4ac208)

Co-authored-by: wdziurdz <witold.dziurdz@intel.com>
2025-03-13 08:09:45 -07:00
cdd7a2c72b [RELEASE ONLY CHANGES] Apply release only changes to release 2.7 (#149056)
* [RELEASE ONLY CHANGES] Apply release only changes to release 2.7

* fix_lint_workflow

* docker_release

* fix_check_binary
2025-03-12 15:44:15 -04:00
d94ea2647c [inductor] Fix profiler tests with latest Triton (#149059)
[inductor] Fix profiler tests with latest Triton (#149025)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149025
Approved by: https://github.com/yanboliang

(cherry picked from commit 488c4480f97e9d537905b0fbcb7236f88dca47f7)

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-03-12 15:21:51 -04:00
924a247fbb [MPS] Enable angle and atan2 for torch.long (#149017)
This check was added by https://github.com/pytorch/pytorch/pull/85817, which introduced no unit tests, and its content seems to be entirely unrelated to the title/subject of that PR. Anyway, it now seems to work fine on macOS 13+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149017
Approved by: https://github.com/dcci
2025-03-12 04:48:52 +00:00
7b78a2c415 [MPSInductor] Fix argmin/argmax long reductions (#149021)
By adding an additional indexes array for aggregates and populating it when performing partial reductions.

And with that I can finally `torch.compile` TinyStories and get 600+ tokens/sec vs <200 on eager

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149021
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004, #149020
2025-03-12 04:39:29 +00:00
758522d56a [MPSInductor][EZ] Fix argmin/max signatures (#149020)
threadgroup_argmin used to return the input type, which is wrong; it should have returned `int` or `long`

Change the signatures of both threadgroup_argmin and threadgroup_argmax to return int; as the group size is small, there is no need to carry over large integers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149020
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975, #149004
2025-03-12 04:39:29 +00:00
fe22db9cc3 [MPSInductor] Fix min/max reductions over large dims (#149004)
Simple followup after sum/prod

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149004
Approved by: https://github.com/jansel
ghstack dependencies: #148969, #148975
2025-03-12 04:39:19 +00:00
clr
2a7e997b3f test/dynamo/test_utils: Fix one broken test on different python versions (#148987)
We correctly handled different Python versions in the explicit ir_nodes test, but
didn't handle them in the dynamo_timed test. Just explicitly delete the fields
there so the dynamo_timed test passes on all Python versions.

(I noticed it breaking on 3.13).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148987
Approved by: https://github.com/jansel
2025-03-12 02:11:08 +00:00
e40a9e602b Add the max_autotune tests in the periodic jobs. (#143560)
To promptly detect issues with max_autotune, such as [#143102](https://github.com/pytorch/pytorch/issues/143102), add the max_autotune tests to the periodic CI to track the accuracy regularly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143560
Approved by: https://github.com/leslie-fang-intel, https://github.com/desertfire
2025-03-12 01:47:46 +00:00
60576419a2 Make dynamism code robust to NotImplementedException (#148823)
In prod many models have `@property` methods that raise
NotImplementedError. This PR updates our dynamism code to be more robust
to these types of models.
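
An illustrative sketch of the failure mode described above (the model and property names are hypothetical):

```py
import torch

class MyModel(torch.nn.Module):
    @property
    def feature_dim(self):
        raise NotImplementedError("subclasses must define this")

    def forward(self, x):
        return x * 2

# Dynamism/marking code that introspects module attributes has to tolerate such
# properties, e.g. by guarding the attribute access instead of crashing:
m = MyModel()
try:
    _ = m.feature_dim
except NotImplementedError:
    pass  # treat the attribute as unavailable
```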

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148823
Approved by: https://github.com/laithsakka
2025-03-12 01:01:57 +00:00
46f096bba6 Explicitly set use-ephemeral runners for windows nightly cpu test jobs (#149001)
This PR migrated Windows builds to use ephemeral runners: https://github.com/pytorch/pytorch/pull/134463; however, it missed test jobs.

Explicitly set use-ephemeral runners for Windows nightly CPU tests.
Please note these jobs should already be using ephemeral runners after https://github.com/pytorch/test-infra/pull/6377 (recently migrated)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149001
Approved by: https://github.com/malfet
2025-03-11 23:51:39 +00:00
5b60749e9e [cudagraph] add log for skip reasons (#148797)
Summary: Add skip reasons to dynamo_compile so we can know popular skip reasons for cudagraph

Test Plan: {F1975906635}

Differential Revision: D70820791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148797
Approved by: https://github.com/masnesral
2025-03-11 23:31:48 +00:00
98a2d905bf [MPSInductor] Fix large prod and sum reductions (#148975)
After this change, if the reduction dimension is larger than `max_threadgroup_size`, a `for` loop is emitted from `codegen_iteration_ranges_entry` and wrapped up in `codegen_body()`
I.e. after this change the following command
```
% TORCH_LOGS=output_code python -c "import torch;print(torch.compile(lambda x:(x[0::2].sin()+(x[1::2] + .4).cos()).sum(dim=0) - 3.14)(torch.rand(4096, device='mps')))" 2>&1|cut -c 86-
```
will emit the following shader
```metal
#include <c10/metal/random.h>
#include <c10/metal/special_math.h>
#include <c10/metal/utils.h>
#include <c10/metal/reduction_utils.h>
kernel void generated_kernel(
    device float* out_ptr1,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    threadgroup float tmp_acc_0[1024];
    tmp_acc_0[r0_index] = 0;
    for(auto r0_0_cnt = 0; r0_0_cnt < 2; ++r0_0_cnt) {
        int r0_0 = 2 * r0_index + r0_0_cnt;
        if (r0_0 >= 2047) break;
        auto tmp0 = in_ptr0[2*r0_0];
        auto tmp2 = in_ptr0[1 + 2*r0_0];
        auto tmp1 = metal::precise::sin(tmp0);
        auto tmp3 = 0.4;
        auto tmp4 = tmp2 + tmp3;
        auto tmp5 = metal::precise::cos(tmp4);
        auto tmp6 = tmp1 + tmp5;
        tmp_acc_0[r0_index] += tmp6;
    }
    auto tmp7 = c10::metal::threadgroup_sum(tmp_acc_0, 1024);
    auto tmp8 = 3.14;
    auto tmp9 = tmp7 - tmp8;
    out_ptr1[0] = static_cast<float>(tmp9);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148975
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #148969
2025-03-11 22:46:41 +00:00
2dcdb4ba78 [ez] include config as part of __all__ in torch.compiler (#148978)
Right now we are susceptible to a race condition: if torch.compiler.config is not implicitly imported via dynamo/builder.py, we will throw an error when trying to set compiler configs. This fixes it by including config in `__all__`.

Previous
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'torch.compiler' has no attribute 'config'
```

Now
```
>>> import torch
>>> torch.compiler.config.dynamic_sources = "L['kwargs']['float_features']"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148978
Approved by: https://github.com/bdhirsh, https://github.com/laithsakka
2025-03-11 21:58:38 +00:00
a6459afb0e [dynamic shapes] add backed_size_oblivious option (#148696)
Adds option `torch.fx.experimental._config.backed_size_oblivious = True` to allocate `[0, inf]` instead of `[2, inf]` ranges for size-backed symbols, opting into size-oblivious semantics for them.

Helps in a number of cases like
- Keeps `[0, inf]` bounds for unbacked symbols when we make an unbacked -> backed replacement
- More sound handling for 0/1 inputs at runtime when we lower from export
- Avoids ends-of-bounds, sys.maxsize constraint violations for exporting with named Dims (https://github.com/pytorch/pytorch/issues/146315, https://github.com/pytorch/pytorch/issues/146046)

May look towards turning this on globally for export.
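
A minimal usage sketch of the option described above (the model, shapes, and dim name are illustrative):

```py
import torch
from torch.export import Dim, export
import torch.fx.experimental._config as fx_config

fx_config.backed_size_oblivious = True  # option added in this PR

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# With the option enabled, the backed symbol for the named Dim gets a [0, inf]
# range rather than [2, inf], with size-oblivious semantics.
ep = export(M(), (torch.randn(3, 4),), dynamic_shapes={"x": {0: Dim("batch")}})
```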

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148696
Approved by: https://github.com/bobrenjc93
2025-03-11 21:52:34 +00:00
53a1a022a9 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 21:49:46 +00:00
b98af95401 Fix DCP link (#148974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148974
Approved by: https://github.com/svekars
2025-03-11 21:26:37 +00:00
6119ffc711 [ROCm][TunableOp] Fix TunableOp BLAS logging for online tuning case. (#148979)
In a previous PR https://github.com/pytorch/pytorch/pull/147034, there was a bad merge at the last minute.
BLAS logging works for offline tuning, but does not currently work for online tuning.

This PR fixes BLAS logging for online tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148979
Approved by: https://github.com/jeffdaily
2025-03-11 21:20:04 +00:00
e5fef8a08e [CI] Don't clean workspace when fetching repo (#147994)
Tested on 874c5dc4c98cc63a06bfc900d03683b02f110d7c
Also tested on https://github.com/pytorch/pytorch/actions/runs/13798178199/job/38594767529?pr=148995#step:4:12

Don't remove the workspace when fetching.  The checkout action performs git clean -ffdx to remove untracked files and files in gitignore

This is probably not going to be useful if we switch entirely to ephemeral runners but w/e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147994
Approved by: https://github.com/malfet, https://github.com/atalman
2025-03-11 21:10:56 +00:00
72d9f88ef2 [release] Move triton pin to latest triton release/3.3.x (#148971)
This branch contains latest AMD cherry-picks:
https://github.com/triton-lang/triton/pull/6171
https://github.com/triton-lang/triton/pull/6165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148971
Approved by: https://github.com/danzimm
2025-03-11 21:10:42 +00:00
e6ef0620cc Add shim.h C API to call dispatcher on our own aten ops (#148832)
This PR still needs testing through some cpp extension

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148832
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #148124
2025-03-11 21:02:04 +00:00
cf19efd3d9 Support basic TorchBind in aot_compile and aoti_compile_and_package (#148506)
Summary:
**Codegen**

- Skip some codegen parts for torchbind (such as arg declaration) because they are loaded in the proxy executor, so we do not need to declare torchbind args in cpp code
- Added a helper method to get the schema of CallTorchBind HOP. The returned schema is only the schema of `obj.method()`.

**Serialization**
Add support for torchbind object in serialization

- For CallTorchBind HOP, we need to handle it specially because of its schema. The output serialized args are in the format of `(obj, method, *args, **kwargs)`.
- it.TorchBindObject inputs are serialized to `as_custom_obj` Argument.

**Packaging**

Add torchbind objects file and `custom_objs_config.json` file to generated files output of `aot_compile`.

The json file is stored in the `data/aotinductor/<model_name>` folder in pt2 archive.

The torchbind objects are stored in the data/constants/ folder in the pt2 archive.
The torchbind object filenames have the format `f"{CUSTOM_OBJ_FILENAME_PREFIX}{custom_obj_idx}"`, e.g. `custom_obj_0`.
CustomClassHolder objects implement their own pickle methods.

Note that this `custom_objs_config.json` file is different from the `model_constants_config.json` file produced in package_sigmoid(). The keys in `custom_objs_config` directly correspond to the arg name in extern nodes json.
The key in `model_constants_config.json` produced by `package_sigmoid` is the attribute name in the user mode code.

This is required for both internal and OSS torchbind support.
For OSS torchbind support, we also need to package torchbind_constants into the .pt2 output.

**Work Left**
We still need to add torchbind support in ProxyExecutor for inductor.aoti_load_package to work. See other diffs in the stack.

Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r aot_compile
```

Differential Revision: D69490718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148506
Approved by: https://github.com/angelayi
2025-03-11 20:55:18 +00:00
f69e58e8e8 [CI] Update crossvit_9_240 as pass (#148989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148989
Approved by: https://github.com/ZainRizvi
2025-03-11 20:54:39 +00:00
b54cf1a281 Revert "[logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)"
This reverts commit 73c8068cf889829fb811fc75baac03163c9a42ee.

Reverted https://github.com/pytorch/pytorch/pull/148693 on behalf of https://github.com/ZainRizvi due to This is breaking lint on trunk. Please rebase these changes before merging them back in. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13796723235/job/38590020554) [HUD commit link](73c8068cf8) ([comment](https://github.com/pytorch/pytorch/pull/148693#issuecomment-2715671875))
2025-03-11 20:50:23 +00:00
c18858d633 [MPS] Make torch.mps.compile_shader public (#148972)
It was a private method in 2.6, but nothing changed in its API for 2.7
and it will likely remain the same in 2.8, so it's time to remove the underscore
from its name

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148972
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/seemethere, https://github.com/albanD, https://github.com/dcci
2025-03-11 20:20:58 +00:00
abcec55532 gracefully handle tokenize.TokenError in funcname parser. Adds support for non-Python source (#148737)
This change allows defining python functions in non-Python source and having them be compiled by torch.compile. The existing implementation already returns None for the case where the file couldn't be read, so returning None (by making an empty funcname cache) makes sense for non-Python source code too.
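
A hedged sketch of the behavior described above (the helper name and return type are illustrative, not the actual internal API):

```py
import io
import tokenize

def safe_funcname_cache(source: str) -> dict:
    """Map line numbers to enclosing function names; empty for untokenizable source."""
    cache = {}
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            ...  # populate the line -> enclosing-function-name mapping here
    except tokenize.TokenError:
        return {}  # source that fails to tokenize as Python: behave as if nothing was found
    return cache
```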

Example [basilisp](https://github.com/basilisp-lang/basilisp):
```clojure
(import torch)
(import [torch.nn.functional :as F])
(torch/rand 10)

(defn f {:decorators [torch/compile]} [x]
  (* (F/relu x) x))

(f (-> (torch/randn 100)
       (.cuda)))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148737
Approved by: https://github.com/williamwen42
2025-03-11 19:49:28 +00:00
73c8068cf8 [logging] Set compile_id in the CachingAutotuner during compilation so we have it for dynamo_timed logging (#148693)
Summary: This is a simpler alternative to https://github.com/pytorch/pytorch/pull/146455, where we can stick the compileId (and forward/backward bool) in the CachingAutotuner so that we have it for logging `benchmark_all_configs`. Recall that the first attempt put the compileId in the inductor_meta and that interfered with caching.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/e71yn6uc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/4ageghhv
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/4fgv1itq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148693
Approved by: https://github.com/eellison
2025-03-11 19:38:40 +00:00
5b8da17681 [cutlass backend] Add addmm and bmm tests for AOTI (#148929)
Still to do:
1. Expand addmm tests to cover all 4 shapes
2. Add dynamic shape support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148929
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-03-11 19:38:24 +00:00
7b2ecb80eb [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148928)
Differential Revision: D70908557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148928
Approved by: https://github.com/angelayi
2025-03-11 19:36:30 +00:00
61f9b50e09 [ROCm] Fix TORCH_CHECK for hdim 512 support added in AOTriton 0.9b (#148967)
Fixes #148850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148967
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-03-11 19:21:10 +00:00
971606befa Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t))` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/atalman
2025-03-11 19:12:46 +00:00
4d10da731b [ROCm] CK Memory-Efficient Attention (attention bias support) (#147778)
Implements CK as the backend for memory efficient attention with a couple caveats:

- Still enabled via `torch.backends.cuda.preferred_rocm_fa_library("ck")`
- Does NOT support Nested Tensors

Using the mem_eff path allows us to use attention bias with a CK sdpa backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147778
Approved by: https://github.com/houseroad
2025-03-11 19:02:59 +00:00
a1cb67b69e [ROCm] Improve backwards indexing when stride is not one (#147630)
Improve backwards indexing when stride is not one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147630
Approved by: https://github.com/jeffdaily
2025-03-11 19:02:48 +00:00
daff65d671 Correctly propagate exception to parent tx (#146502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146502
Approved by: https://github.com/anijain2305, https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #146504, #146499
2025-03-11 18:55:45 +00:00
fb53e9e514 Add __context/cause/suppress_context/traceback__ to Exception (#146499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146499
Approved by: https://github.com/zou3519, https://github.com/anijain2305
ghstack dependencies: #146504
2025-03-11 18:55:45 +00:00
4e7d264cf8 Introduce UserDefinedExceptionClassVariable (#146504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146504
Approved by: https://github.com/anijain2305
2025-03-11 18:55:45 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
c916a8efc5 Revert "Use the device interface for detecting Triton availability (#139171)"
This reverts commit 940b60db974f08a31c746eec2f9c399fc8a861ee.

Reverted https://github.com/pytorch/pytorch/pull/139171 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. @jansel can you please help get these changes working? See D70946254 for more details. To validate the fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/139171#issuecomment-2715392451))
2025-03-11 18:49:21 +00:00
57ee821a41 fix dynamo ide (#148849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148849
Approved by: https://github.com/bobrenjc93
2025-03-11 18:43:30 +00:00
883fb78c7e Update jinja2 version in requirements-gha-cache.txt
As the previous version is vulnerable to CVE-2025-27516.
This closes the Dependabot report.
2025-03-11 11:42:38 -07:00
5ee9dbc0a1 Bump jinja2 from 3.1.5 to 3.1.6 in /.ci/docker (#148812)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-11 11:39:55 -07:00
cyy
a5f6b24d87 Remove outdated skipIfRocmVersionLessThan decorations (#148941)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148941
Approved by: https://github.com/jeffdaily
2025-03-11 18:37:40 +00:00
ef6296e7f2 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net, rather than a service guarantee; see the discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460) and the usage sketch after this list.
5. Profiles in async_op=False mode will look different -- collective kernels now show up on the same line as compute kernels.
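
A minimal usage sketch of the two modes discussed in items 1 and 4 above (assumes an initialized NCCL process group, e.g. launched with `torchrun --nproc_per_node=2`; it shows the user-facing API, not the ProcessGroupNCCL internals):

```py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.ones(8, device="cuda")

# async_op=False: with this change the collective launches on the current stream,
# so no extra trampoline stream or event sync is involved.
dist.all_reduce(t, async_op=False)

# async_op=True: the returned Work should be waited on; forgetting to call wait()
# now relies on the watchdog safety net described in item 4.
work = dist.all_reduce(t, async_op=True)
work.wait()

dist.destroy_process_group()
```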

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70937982](https://our.internmc.facebook.com/intern/diff/D70937982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-11 18:36:12 +00:00
b366f33606 [MPSInductor] Prep for multistage reductions (#148969)
----

- Move reduction variable initialization from `loads` to  `indexing_code`
- Move barriers from `codegen_kernel` to `reduction` and only use them for `any` reductions (as other reduction ops do  barriers explicitly inside the respective reduction functions)
- Use `self.compute` instead of `self.body` for all compute operations

Checked that number of before/after failures stays at `164 failed, 616 passed, 53 skipped`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148969
Approved by: https://github.com/dcci
2025-03-11 18:35:23 +00:00
dcc502f376 [ROCm][TunableOp] Add bias data type to params signature. (#146227)
Add bias vector data type in TunableOp params signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146227
Approved by: https://github.com/jeffdaily
2025-03-11 18:31:22 +00:00
52acc1f955 [DSD] Update the document to mention the limitation of set_optimizer_state_dict (#148918)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/140898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148918
Approved by: https://github.com/fduwjj, https://github.com/mori360
ghstack dependencies: #148825
2025-03-11 18:24:12 +00:00
e0d4c43ad1 Add env for disabling meta reference on functionalization. (#148822)
Fix: https://github.com/pytorch/xla/issues/8755

This PR introduces `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE`
environment variable. Setting this variable makes it so the
functionalization kernels won't run the meta reference, which is used to
propagate expected sizes and strides.

Currently, PyTorch/XLA doesn't actually propagate the correct strides
to its tensors. It was also shown that calling these meta functions may
incur significant overhead.

Running the provided minimal reproducer (see issue), we see a speedup
close to 4.3x:

- Baseline: 0.0747s
- `XLA_DISABLE_FUNCTIONALIZATION=1`: 0.0159s
- `TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1`: 0.0175s

In summary, this PR:

- Creates the `disable_meta_reference()` function, which checks whether
  the environment variable is set
- Modifies codegen for functionalization kernels, adding the call to
  `disable_meta_reference()` function to the appropriate conditions
- Creates a new bash function for running `lazy/test_ts_opinfo.py` with
  the environment variable set
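
A hedged Python sketch of the `disable_meta_reference()` helper mentioned in the summary above (illustrative only; the actual implementation lives in C++/codegen):

```py
import functools
import os

@functools.lru_cache(maxsize=1)
def disable_meta_reference() -> bool:
    # Functionalization kernels skip the meta reference when this returns True.
    value = os.environ.get("TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE", "0")
    return value not in ("", "0")
```
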
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148822
Approved by: https://github.com/bdhirsh
2025-03-11 16:13:35 +00:00
09029010e5 [inductor] Fix create_specialize_impl error in latest Triton (#148933)
```py
$ python test/inductor/test_triton_kernels.py KernelTests.test_triton_kernel_2d_autotune_grad_False_dynamic_True_backend_inductor_grid_type_1
WARNING:torch._dynamo:Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 715, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 289, in generate_ttir
    specialization = _get_specialization(ordered_args.values())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jansel/pytorch/torch/_higher_order_ops/triton_kernel_wrap.py", line 262, in _get_specialization
    specialize_impl = triton.runtime.jit.create_specialize_impl()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: create_specialize_impl() missing 1 required positional argument: 'specialize_extra'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148933
Approved by: https://github.com/yanboliang, https://github.com/davidberard98
2025-03-11 15:54:47 +00:00
16560d4e8f Revert "Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)"
This reverts commit 0fa0a740958ffc474843ceb1d19ee43c4bff4c09.

Reverted https://github.com/pytorch/pytorch/pull/148875 on behalf of https://github.com/ZainRizvi due to That torch.version failure you got in CI was a legitimate failure and is now breaking trunk. [GH job link](https://github.com/pytorch/pytorch/actions/runs/13778023702/job/38534207536) [HUD commit link](0fa0a74095) ([comment](https://github.com/pytorch/pytorch/pull/148875#issuecomment-2714757288))
2025-03-11 15:27:25 +00:00
3945954741 Bump triton pin. Add aarch64 triton build (#148705)
1. Bumps pin for triton to release/3.3.x branch
2. Bump pin for triton-xpu
3. Remove ROCm xfail tests
4. Add aarch64 triton build:
	* Depends on: https://github.com/pytorch/pytorch/pull/148768
	* Fixes: https://github.com/pytorch/pytorch/issues/130558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148705
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/EikanWang
2025-03-11 15:12:21 +00:00
c983e1124c Revert "[WIP] Initial implementation of Grouped Gemm API (#148531)"
This reverts commit ff29791ed8f815bdbca1a5606de046380baca69d.

Reverted https://github.com/pytorch/pytorch/pull/148531 on behalf of https://github.com/janeyx99 due to Sorry but this broke ROCm jobs on trunk ([comment](https://github.com/pytorch/pytorch/pull/148531#issuecomment-2714577498))
2025-03-11 14:40:58 +00:00
f1787ee0f7 [dynamo] Remove L scoping for recompilation messages (#148917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148917
Approved by: https://github.com/williamwen42
2025-03-11 14:26:26 +00:00
992838e702 [dynamo][guards] Do not ID_MATCH on numpy tensors (#148923)
Might help with https://github.com/pytorch/pytorch/issues/148535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148923
Approved by: https://github.com/jansel
2025-03-11 14:20:26 +00:00
ee21ccc816 Skip ao_sparsity TestComposability for missing FBGEMM (#144146)
Those tests (from test_ao_sparsity) require FBGEMM which may not be available. So add the skip decorator.

Fixes #87364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144146
Approved by: https://github.com/jerryzh168, https://github.com/jcaip
2025-03-11 13:02:18 +00:00
da4bb72a71 Backout D70075331 (#148824)
Summary:
The AOTI lowering for model 699109736 and other new models worked before D70075331, but failed afterwards with the error "RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul with transpose_mat1 1 transpose_mat2 0 m 4096 n 10 k 7936 mat1_ld 7936 mat2_ld 7936 result_ld 4096 abcType 2 computeType 68 scaleType 0"

So we revert D70075331 as a workaround now.

Test Plan: The model could be lowered and published successfully. e.g. 702869739_16

Differential Revision: D70823254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148824
Approved by: https://github.com/eqy
2025-03-11 12:51:17 +00:00
9ad64ce795 [triton 3.3] Forward-fix mm template selection logic (#148924)
Follow-up from https://github.com/pytorch/pytorch/pull/148662.

The logic from https://github.com/pytorch/pytorch/pull/148662 is incorrect; what we want is "choose the second template 'AMD-specific template' only if we're on hip AND triton version < 3.3" - negating it, the code should be "choose the first template if we're NOT on hip OR triton version >= 3.3".

Tested locally to verify that it fixes the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148924
Approved by: https://github.com/drisspg, https://github.com/atalman, https://github.com/eellison
2025-03-11 09:05:44 +00:00
2bcc3acb90 Update low prec codegen for div/mod (#142350)
Div/mod in fp16/bf16 requires a downcast to preserve its inputs' dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142350
Approved by: https://github.com/blaine-rister
2025-03-11 08:02:30 +00:00
41e4728f74 update types on dynamo configs (#146873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146873
Approved by: https://github.com/williamwen42
2025-03-11 05:33:48 +00:00
1fcc4bc109 Don't look at TESTING_ONLY in fuzzer (#146870)
Lots of configs aren't meant to be set because they're testing only

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146870
Approved by: https://github.com/masnesral
2025-03-11 05:32:25 +00:00
bed92a8523 [Windows][Inductor UT] Fix for tempfile.NamedTemporaryFile(delete=True) not working on Windows. (#148632)
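A hedged sketch of the usual Windows-safe pattern behind this kind of fix (illustrative; the actual change is in the Inductor test code): NamedTemporaryFile(delete=True) keeps the file open, and on Windows the file cannot be reopened by name while it is held open, so code typically creates it with delete=False and cleans up manually.

```py
import os
import tempfile

f = tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False)
try:
    f.write("print('hello')\n")
    f.close()  # close before handing the path to other code so Windows can reopen it
    ...        # use f.name here
finally:
    os.unlink(f.name)
```
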
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148632
Approved by: https://github.com/jansel
2025-03-11 05:05:15 +00:00
ecfbfe1603 [AOTI] Remove aoti_torch_cpu__weight_int4pack_mm_cpu_tensor (#148907)
Summary: shim.h is only meant for generic tensor util shim functions. We should switch to using the auto fallback generation, but it will need some extra care on the op schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148907
Approved by: https://github.com/janeyx99
2025-03-11 04:41:05 +00:00
940b60db97 Use the device interface for detecting Triton availability (#139171)
This allows for each device type to check current devices for Triton compatibility and ensure their Triton backend is present.

This PR replaces the `has_triton()` global method which was previously used for this task, and moves the initial check for each Inductor backend on to their associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
2025-03-11 03:56:11 +00:00
ff29791ed8 [WIP] Initial implementation of Grouped Gemm API (#148531)
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with 2d input being jagged, and the offsets of the jagged input being given by device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays, ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100, I don't know how cutlass grouped gemm example handles blockwise scaling on B100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
2025-03-11 02:41:09 +00:00
621dadd4ca partitioner: when materializing unbacked tensor intermediates, apply hint to symbol, not expr (#144097)
Fixes https://github.com/pytorch/pytorch/issues/144095

open to suggestions: the `hint_int(..., fallback=...)` API feels like a bit of a footgun, because:

(1) we use the same guess for every unbacked symint (both symbols, and compound expressions)
(2) the user may have established some relationship between some unbacked symints that we are not taking into account.

I'm not sure how real of an issue (2) is - is it common to e.g. generate two unbacked symints, and then add a runtime assert that they are unequal?

Instead I did something simpler that's just enough to fix the linked issue: if we have a sympy expression containing an unbacked symbol (e.g. `u0 + 1`), then the partitioner will now fill in the symbol with our guess instead of the expression (plugging in `u0=4096` gets us 4097). This was important for an internal custom op, that had some logic like this:
```
def custom_op(x: [u0], y: [u0 + 1]):
    assert x.shape[0] == y.shape[0] - 1
    ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144097
Approved by: https://github.com/laithsakka
2025-03-11 02:11:57 +00:00
8c45d44abb Skip distributed subprocess test internally as they don't work (#148909)
Follow up from https://github.com/pytorch/pytorch/pull/146098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148909
Approved by: https://github.com/janeyx99
2025-03-11 02:07:45 +00:00
457ff9b7ae [reland][ca] side-effect free inital trace: compiled_args (#148376)
This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc.
Reland https://github.com/pytorch/pytorch/pull/147804, there was a bad import inserted by my linter.

Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
9fddbf3417 Update the comment (#148726)
Differential Revision: D70747931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148726
Approved by: https://github.com/yf225
2025-03-11 01:19:14 +00:00
0fa0a74095 Refactor test/test_torch.py by moving testcase to test_indexing.py (#148875)
Fix `FIXME` in `test_torch.py` by moving test-cases to `test_indexing.py`

```python
# FIXME: move to test indexing
# FIXME: move to indexing test suite
```

- Move tests in `test/test_torch.py` to `test_indexing.py`
- Remove `FIXME` comments

## TestResult

```bash
pytest test/test_torch.py -k TestTorchDeviceType -vv
pytest test/test_indexing.py  -k TestIndexing -vv
```

![image](https://github.com/user-attachments/assets/49a80985-e74a-4da6-a063-476e87e6aa8a)

![image](https://github.com/user-attachments/assets/77afa936-5dba-480c-b293-eb1f7bc74420)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148875
Approved by: https://github.com/soulitzer
2025-03-11 01:01:59 +00:00
c297c09a37 Fix invalid nested int guarding in broadcast_shapes() (#145957)
Fixes #145874

This PR takes the approach of updating the logic determining whether multiple shapes broadcast together to handle nested ints specially.
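
For reference, a minimal sketch of the public API involved, shown with plain ints (nested-int inputs come from nested/jagged tensors and get the special handling described above):

```py
import torch

# Standard broadcasting of ordinary sizes still works as before.
print(torch.broadcast_shapes((2, 1, 4), (3, 1)))  # torch.Size([2, 3, 4])
```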

Possible alternative approach: don't update `broadcast_shapes()` + indicate that e.g. `Ne(j0, 1)` should statically evaluate to False. I briefly tried this but it wasn't straightforward. Is it better?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145957
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-03-11 00:53:13 +00:00
cyy
295f2ed4d1 Fix "invalid application of 'sizeof' to an incomplete type" (#148854)
Fixes builds with C++23 and constexpr std::unique_ptr

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148854
Approved by: https://github.com/Skylion007
2025-03-11 00:40:00 +00:00
cyy
a6e71dbc88 Enable ASAN on inductor CUDA tests (#148749)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148749
Approved by: https://github.com/jansel
2025-03-10 23:53:40 +00:00
b215841ebb [MM] Add sm carveout to lowerings (#148793)
# Summary

See https://github.com/pytorch/pytorch/issues/145115 for more details. I have been using
the following to verify; I still need to figure out how to do proper guarding.

This does do the correct thing if we compile with the SM carveout already set, but since we don't guard on it just yet we don't recompile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148793
Approved by: https://github.com/lw, https://github.com/eellison
2025-03-10 23:49:26 +00:00
492f3fd5cf replace usages of upload_graph in inductor with tlparse (v2) (#148720)
Reland of https://github.com/pytorch/pytorch/pull/148703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148720
Approved by: https://github.com/mengluy0125
2025-03-10 22:47:58 +00:00
5bbca7d328 [ROCm][Windows] Fix OpenMP Flags for clang-cl (#148097)
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) to be provided, and clang-style arguments to be preceded by `-Xclang`; otherwise, the clang-style parameters are ignored as they are interpreted as unrecognized compiler options.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
2025-03-10 22:47:15 +00:00
a95eb0c0a7 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 2149f6c6845d00711ffab648132b7377e8cd3edb.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/ZainRizvi due to Breaking internally, see D70873275. Discussed reverting this with Ke. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2712001270))
2025-03-10 22:38:40 +00:00
12a95390ae [Minimizer] allow overriding of ShapeProp logic by subclasses of _MinimizerBase (#148784)
Summary:
The changes contained in this diff
- allow subclass Minimizer implementations to override the default shape propagation logic with custom logic
- copy over the meta attribute on get_attr graph nodes during the graph splitting step
- for both changes, behavior for existing classes does not change

Test Plan: CI

Differential Revision: D70799942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148784
Approved by: https://github.com/blaine-rister
2025-03-10 22:22:16 +00:00
fcb633fafa Introduce TORCH_ABI_VERSION and a runtime aoti_torch_abi_version C shim ABI (#148892)
Importable https://github.com/pytorch/pytorch/pull/148836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148892
Approved by: https://github.com/albanD
2025-03-10 22:22:10 +00:00
98b3f1db9f [Flex Attention] support num_heads > 1 in block_mask (#148857)
Previously flex decoding errored when the block mask had num_heads > 1, so users had to use num_heads=1 or explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}`.

This PR fixes this issue. When not using grouped query attention (GQA, i.e., Hq == Hkv), we support block mask with num_heads = 1 and num_heads = num_query_heads (i.e., Hq). This is the same setting as flex attention kernel.

When using GQA (i.e., Hq != Hkv), we support block mask with num_heads = 1. When num_heads = Hq, we fall back to flex attention kernel so user don't need to explicitly mark `kernel_options={"FORCE_USE_FLEX_ATTENTION": True}` anymore.

Why fallback? In the current flex decoding triton kernel, grouped query heads for the same kv head are handled by the same thread block. Supporting num_heads = Hq with GQA requires supporting different kv num blocks for different query heads in the same thread block, leading to a lot of redundant work. So we should better use the main flex_attention kernel, where each query head is handled by a separate block.
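
A hedged usage sketch of the non-GQA case enabled above, with a block mask that has num_heads = Hq (shapes and mask are illustrative; assumes a CUDA build):

```py
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

B, Hq, S, D = 2, 8, 128, 64
q = torch.randn(B, Hq, S, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, Hq, S, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, Hq, S, D, device="cuda", dtype=torch.float16)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# A block mask with num_heads > 1 no longer requires FORCE_USE_FLEX_ATTENTION.
block_mask = create_block_mask(causal, B, Hq, S, S, device="cuda")
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, block_mask=block_mask)
```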

Fixes #148527
Fixes #147267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148857
Approved by: https://github.com/drisspg
2025-03-10 22:02:50 +00:00
6ef15c7f46 [pytorch] Update flexattention bwd config generation (#148600)
Summary: Currently the `flex_attention` template's backward config generation returns values for every case. This change instead stores intermediate values in `bwd_config`, returned at the end.

Test Plan: CI. Existing tests.

Differential Revision: D70649316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148600
Approved by: https://github.com/Skylion007
2025-03-10 22:00:56 +00:00
8701b302cc setuptools pinning (#148879)
Fixes #148877

---

On 9 March 2025, [setuptools](https://pypi.org/project/setuptools/#history) published a new version and it is causing an issue in `pytorch` with the following error:

```
AttributeError: module 'distutils' has no attribute '_msvccompiler'. Did you mean: 'ccompiler'?
```

Last known working version is [75.8.2](https://pypi.org/project/setuptools/75.8.2/)

Currently it affects the Windows ARM64 nightly build; however, it might soon also affect Windows x64 builds. (The conda version is not updated yet: [setuptools conda](https://anaconda.org/anaconda/setuptools).)

Locally both `Windows ARM64` and `Windows x64` have the same problem with the latest `setuptools` (>75.8.2)

---

This PR is pinning `setuptools` version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148879
Approved by: https://github.com/seemethere
2025-03-10 21:29:32 +00:00
c652772af7 [aarch64] install ninja for docker to build triton on arm (#148768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148768
Approved by: https://github.com/atalman, https://github.com/Skylion007

Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 21:28:53 +00:00
b706044cca [ROCm][Windows] Enable hipblaslt for Windows (#148563)
This PR adds the hipblaslt library as one of the Windows dependencies. `rocBLAS` is added too, since certain symbols aren't detected with `hipblas` alone on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148563
Approved by: https://github.com/jeffdaily
2025-03-10 21:07:16 +00:00
2a1eeaeed8 Remove 12.4 x86 builds and 12.6 sbsa builds from nightly (#148895)
https://github.com/pytorch/pytorch/issues/145570

redo https://github.com/pytorch/pytorch/pull/148625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148895
Approved by: https://github.com/atalman
2025-03-10 20:55:09 +00:00
4a2173d9a0 [cutlass backend][ez] Incorporate AOTI dynamic shape test into main test of MM (#148786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148786
Approved by: https://github.com/jingsh
2025-03-10 20:35:10 +00:00
e9c12e819d Update torch-xpu-ops commit pin (#148881)
Update the torch-xpu-ops commit to [026b2c8c7c92a7b2cec5d26334006e3423251cc6](026b2c8c7c), includes:

- Enable AOT for LNL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148881
Approved by: https://github.com/EikanWang
2025-03-10 20:31:51 +00:00
ed969d1236 [DSD] Fix the shared parameter mismatch for optimizer state_dict when flattening FQNs are used (#148825)
Summary:
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148825
Approved by: https://github.com/fduwjj, https://github.com/mori360
2025-03-10 20:04:36 +00:00
494abeff8a CUDACachingAllocator,c10d: fixes for IPC release performance (#148805)
This has two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL.

1. release the IpcMutex when deleting the `ExpandableSegements` object to avoid synchronizing under the lock
2. release the GIL in WorkNCCL destructor since the shared tensor will be destructed there

Test plan:

Run with torchft + torchtitan

```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096

...

[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61  loss:  7.4825  memory: 79.73GiB(83.89%)  tps: 317  tflops: 16.34  mfu: 1.65%
```

Check py-spy to verify no bottleneck on IPC lock when creating new shared tensors

![20250307_17h50m10s_grim](https://github.com/user-attachments/assets/fa8b359f-e337-4ed5-be22-a42ab2bee03d)
![20250307_17h50m00s_grim](https://github.com/user-attachments/assets/206f869a-f07e-4fbd-9e28-89b3da95ef6e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
2025-03-10 19:47:04 +00:00
2e4874e48d Update RELEASE.md with latest changes to release process and release 2.7 information (#148888)
1. Update for Release 2.7 compatibility matrix
2. Remove mention of builder project, the scripts for release management were migrated to test-infra

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148888
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2025-03-10 19:20:27 +00:00
clr
6b0fd741d1 dynamo: Count number of opcodes processes (#147149)
This gives us a decent proxy for how big of a graph we functionally had to parse.

Note that this is a cumulative counter. If people feel strongly, I can either write into the dynamo_timed datasets with metrics contexts, or clear the counters / write a counter per frame id as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147149
Approved by: https://github.com/jansel
2025-03-10 19:20:09 +00:00
3129faf8be Optimize shard_dim_alltoall to use alltoall_single (#148868)
As titled: previously shard_dim_alltoall used `all_to_all`, which could incur lots of copies if the tensor becomes non-contiguous during splits, and all_to_all itself also incurs copies.

This PR uses alltoall_single instead, so that we minimize tensor copies.
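
A minimal sketch contrasting the two collectives discussed above (assumes an initialized process group whose backend supports all-to-all, e.g. NCCL with CUDA tensors; shapes are illustrative):

```py
import torch
import torch.distributed as dist

world = dist.get_world_size()
x = torch.randn(world * 4, 8, device="cuda")  # contiguous local shard

# all_to_all: list-of-tensors API; each split is handled as a separate tensor,
# which can introduce extra copies, especially for non-contiguous splits.
ins = list(x.chunk(world))
outs = [torch.empty_like(t) for t in ins]
dist.all_to_all(outs, ins)

# all_to_all_single: one contiguous input/output buffer, which is what the
# shard_dim_alltoall path switches to in this PR.
out = torch.empty_like(x)
dist.all_to_all_single(out, x)
```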

tested on all the shard dim change tests and it works properly:
```
pytest test/distributed/tensor/test_redistribute.py -s -k shard_dim_alltoall
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148868
Approved by: https://github.com/tianyu-l
2025-03-10 18:38:12 +00:00
ed7e964f2b codecache.py: use str.format rather than % formatting (#148691)
Additionally, swaps over a fixed length `std::vector` used by `cpp_wrapper` for a `std::array`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148691
Approved by: https://github.com/desertfire
2025-03-10 18:33:58 +00:00
d1f21d8ec3 Enable Direct Use of Arm Compute Library (ACL) in ATen (#148584)
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to utilize ACL directly without oneDNN as an intermediary - e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
2025-03-10 18:29:51 +00:00
00cabd4235 [Inductor][Windows] add env_var switch to turn all Windows inductor UTs. (#148733)
For timeout reasons, we can't turn on all Windows Inductor UTs in CI: https://github.com/pytorch/pytorch/issues/135927
And without the UTs, we can't ensure Windows inductor quality.

The Intel team will do some local testing for Windows inductor, but we still need a switch to turn on the full Windows inductor UTs.

The switch is an environment variable:
```cmd
set TORCHINDUCTOR_WINDOWS_TESTS=1
```
After setting this environment variable, we can turn on all Windows inductor UTs. It will not affect PyTorch CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148733
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2025-03-10 18:25:29 +00:00
4c13a859e5 Workaround no triton float8_e8m0fnu support in inductor (#148722)
Triton doesn't support actual float8_e8m0fnu yet, so we can't currently codegen any arithmetic on them. But we can support bitcasting and view/memory operators and treat them as uint8 for now. Fix for https://github.com/pytorch/pytorch/issues/147873.
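A minimal sketch of the bitcast idea, assuming a build where `torch.float8_e8m0fnu` is available; round-tripping through uint8 involves no arithmetic on the dtype:

```python
import torch

x = torch.randint(0, 255, (16,), dtype=torch.uint8)
e8m0 = x.view(torch.float8_e8m0fnu)  # reinterpret the bits, no data movement
back = e8m0.view(torch.uint8)        # round-trip bitcast
assert torch.equal(x, back)
```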

The one question I'm not sure of is whether we need to explicitly disable triton template fusion, since it would fuse these dtypes in as uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148722
Approved by: https://github.com/vkuzo
ghstack dependencies: #148450
2025-03-10 17:37:39 +00:00
cyy
203dd18c5c Bump Clang-tidy to 19.1.4 (#148648)
Clang-tidy 19 has more powerful clang-analyzer checks to detect subtle bugs. New checks such as misc-use-internal-linkage can help identify potential static variables or functions, thus reducing binary sizes.

Some new checks are disabled temporarily for later enabling. Additional warnings have been fixed or suppressed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148648
Approved by: https://github.com/Skylion007
2025-03-10 17:32:30 +00:00
ebd087e4b5 Revert "[pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)"
This reverts commit f08146b67bab331f7bdc9fa247f526f6e60a7190.

Reverted https://github.com/pytorch/pytorch/pull/113257 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/113257#issuecomment-2711299830))
2025-03-10 17:19:21 +00:00
2ec9aceaeb Revert "Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)"
This reverts commit 3680e666d8ceaa43069555f821d1e8a5de01d5ab.

Reverted https://github.com/pytorch/pytorch/pull/148834 on behalf of https://github.com/janeyx99 due to sorry I don't think I want this PR in before the branch cut, as it'd freeze the API in the file when it should really be in a different header ([comment](https://github.com/pytorch/pytorch/pull/148834#issuecomment-2711162193))
2025-03-10 16:29:40 +00:00
9dbc2527dc Disable some SVE autovec (#148489)
Summary: autovec miscompiles on patterns of the type:
```cpp
for (const auto i : c10::irange())
```
Same issue as described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001 and addressed by https://github.com/pytorch/pytorch/pull/137795 for gcc, but not clang

Test Plan:
buck2 build //caffe2/caffe2/fb/transforms:sigrid_interface

Differential Revision: D70422723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148489
Approved by: https://github.com/malfet
2025-03-10 16:25:00 +00:00
a60b4ed623 [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds
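For reference, a hedged sketch of how such a microbenchmark can be reproduced locally with cProfile (the traced lambda is the one quoted in the stacked PRs below):

```python
import cProfile
import functools
import operator
import pstats

import torch.fx as fx

with cProfile.Profile() as prof:
    fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```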

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148288
2025-03-10 16:06:19 +00:00
8f858e226b [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261
2025-03-10 16:06:19 +00:00
5d4e7d58b4 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-10 16:06:11 +00:00
bf752c36da [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-10 16:06:02 +00:00
bec7bdad47 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-10 16:05:53 +00:00
cyy
b8b1b364c9 Fix invalid format string in libfmt calls (#148855)
Wrap shaderSource inside fmt::runtime because the format string is not a string literal and can't pass libfmt's compile time check in C++23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148855
Approved by: https://github.com/Skylion007
2025-03-10 14:47:52 +00:00
a81751d8b7 [CD] Annotate linux/arm64 cuda wheels with consistent nvidia dependencies (#145021)
This resolves issues installing torch nightly wheels into a `uv sync`-generated `.venv`

The root cause is that the x64 and arm64 cuda nightly wheels have inconsistent metadata. This can be seen comparing `generated-linux-aarch64-binary-manywheel-nightly.yml` and `generated-linux-binary-manywheel-nightly.yml`

`uv` expects consistency:

https://github.com/astral-sh/uv/issues/10693
>Frankly, it's really not ideal that they change their dependencies from wheel to wheel.
>They could still put the dependencies there with the same platform markers they're using in the other wheel though... 🤷‍♀

https://github.com/astral-sh/uv/issues/10119#issuecomment-2559898792
>I think this is something that basically has to be solved by PyTorch. The issue is that the wheels for `2.6.0.dev20241222+cu126` don't have consistent metadata, and it's a fundamental assumption of uv that the metadata for a given version _is_ consistent.

To resolve this, I modified the arm64 nightly build workflow to add two new `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under `manywheel-py3_11-cuda-aarch64-build` and `manywheel-py3_12-cuda-aarch64-build`. These are based on their equivalents in the x64 workflow for the corresponding python versions.

I used the cuda 12.6 dependencies versions for the nvidia packages, to match the `DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cuda12.6-main` being used by these jobs.

(The arm64 workflow file already had several `PYTORCH_EXTRA_INSTALL_REQUIREMENTS` entries, under various cpu wheels. I'm not sure why these are there, but I left them as-is.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145021
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
Co-authored-by: Andrey Talman <atalman@fb.com>
2025-03-10 14:39:39 +00:00
4fdd076907 [CD] Add triton xpu as dependency of torch xpu windows whl (#148755)
Depends on PR #147637 land

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148755
Approved by: https://github.com/atalman
2025-03-10 14:04:30 +00:00
31625b08b8 Add ccode for FloorDiv (#148727)
Summary: Add ccode for FloorDiv

Test Plan: CIs

Differential Revision: D70749021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148727
Approved by: https://github.com/bobrenjc93
2025-03-10 14:00:18 +00:00
2068235c0a Add timm_efficientnet to flaky models after cuda 12.6 update in CI/CD (#148788)
After https://github.com/pytorch/pytorch/pull/148612, this model has become flaky.

Tracking this regression in an issue : https://github.com/pytorch/pytorch/issues/148699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148788
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-03-10 13:40:41 +00:00
68c12ecfe2 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline docs for their exact semantics
- Use the newly added isBuilt for accelerator check to ensure it does not poison fork
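A hedged Python-level illustration of the built-vs-available distinction this relies on; the two existing CUDA queries below show the difference, while the new hooks expose it generically per accelerator:

```python
import torch

# "Built": compile-time support only; safe to query early, does not
# initialize the driver, and therefore cannot poison a later fork.
print(torch.backends.cuda.is_built())

# "Available": runtime probe that may initialize the CUDA context.
print(torch.cuda.is_available())
```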

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-10 13:17:58 +00:00
098494e9cb [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-10 13:14:05 +00:00
59f14d19ae Implement gradient for the residuals of torch.linalg.lstsq (#148526)
Fixes #147543.

I have written some tests in Python using `gradcheck`. Please advise where I should put these tests.
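A hedged sketch of the kind of `gradcheck` test referred to above (double precision, tall full-rank `A` so that residuals are populated):

```python
import torch
from torch.autograd import gradcheck

A = torch.randn(6, 3, dtype=torch.double, requires_grad=True)
B = torch.randn(6, 2, dtype=torch.double, requires_grad=True)

def residuals_only(A, B):
    # residuals are populated when m > n and A is full rank
    return torch.linalg.lstsq(A, B, driver="gelsd").residuals

print(gradcheck(residuals_only, (A, B)))
```
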
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
ea86b8d315 Fix redistribution cost for all-reduce (#148761)
This issue seems to have been introduced in https://github.com/pytorch/pytorch/pull/119897. With the current implementation, it might be more favorable to perform a reduce_scatter followed by an all-gather than simply an all-reduce.

Thanks @lw for the helpful discussions on getting this PR out!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148761
Approved by: https://github.com/Skylion007, https://github.com/lw, https://github.com/tianyu-l, https://github.com/fegin
2025-03-10 12:13:11 +00:00
526524b489 Update slow tests (#148873)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148873
Approved by: https://github.com/pytorchbot
2025-03-10 11:46:30 +00:00
74da76f67c [ROCm][Windows] Fix ROCm/HIP version header (#148560)
On Windows, ROCm libraries do not have a `<rocm-core/rocm_version.h>` header, which causes the compilation to fail. This PR resolves this problem by utilising `<hip/hip_version.h>`  from HIP SDK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148560
Approved by: https://github.com/jeffdaily
2025-03-10 11:28:13 +00:00
00199acdb8 [inductor][triton] Block ptr analysis fix assert on matched index expression (#148446)
If dynamic shapes are enabled, then block analysis may create new precomputed size replacements from the index, which can lead to an assertion failure when the matched index is compared with the original index. For example, the assertion below fails despite the expressions being equivalent (ps2 = 3 * ps0). This can be resolved by updating the original index with the replacements, or simply by removing the replacements when the expressions are tested for equality - the latter option is implemented in this PR.

```
       torch._inductor.exc.InductorError: AssertionError:
E       Invalid match!
E       Index: 3*ps0*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E       Matched expression: ps2*((yindex//3)) + (ModularIndexing(yindex, 1, 3))
E
```

This PR fixes the test below when `config.triton.use_block_ptr=True`:
```
python test/inductor/test_torchinductor_dynamic_shapes.py DynamicShapesCpuTests.test_conv3d_channels_last_dynamic_shapes_cpu
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148446
Approved by: https://github.com/jansel
2025-03-10 05:26:55 +00:00
3680e666d8 Move aoti_torch_cpu__weight_int4pack_mm_cpu_tensor to not be mangled (#148834)
I noticed that this op was likely intended to be in the `extern "C"` portion of the file, but it was not added as such in https://github.com/pytorch/pytorch/pull/145250 which means this function is actually not stable/would get mangled by C++.

Following the thread there I am thinking there are two possible solutions:
(1) Since this op was never stable to begin with, and @Xia-Weiwen already landed the fallback, maybe this op is deletable + should get deleted before the 2.7 branch cut
(2) Or we could just move the op to the right portion of the code. While I like just deleting the op, I am hesitant to do in case there's something I haven't considered, so this PR does option 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148834
Approved by: https://github.com/desertfire
2025-03-10 03:23:48 +00:00
7ae0ce6360 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same. Say `A @ A`. In that case, when writing the stride of node B, we have to figure out whether we want lda or ldb, which point to the same node, and we have no way to differentiate between them.

# Solution
Just use either one, since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well, inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:36 +00:00
b47d81682d [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested; it works.

2. Do we get a C++ compile error?
Yes; potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-10 00:21:24 +00:00
cyy
aac230a511 [MPS] Fix Wreorder-init-list (#148839)
Fixes the following warning:
```
 warning: ISO C++ requires field designators to be specified in declaration order; field 'value' will be initialized after field 'size' [-Wreorder-init-list]
    662 |       return {.value.cf = scalar.to<c10::complex<float>>(), .size = sizeof(int64_t), .type = type};
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148839
Approved by: https://github.com/Skylion007
2025-03-09 23:45:46 +00:00
b95889042c [MPS] Introduce strides unary op (#148468)
By adding following template
```metal
template <typename T, typename F>
kernel void unary_strided(
    device result_of<F, T>* output [[buffer(0)]],
    constant T* input [[buffer(1)]],
    constant long* sizes [[buffer(2)]],
    constant long* input_strides [[buffer(3)]],
    constant long* output_strides [[buffer(4)]],
    constant uint& ndim,
    uint index [[thread_position_in_grid]]) {
  F f;
  int pos[max_ndim];
  pos_from_thread_index(int(index), pos, sizes, ndim);
  const auto input_offs = offset_from_coord(pos, input_strides, ndim);
  const auto output_offs = offset_from_coord(pos, output_strides, ndim);
  output[output_offs] = f(input[input_offs]);
}
```
and instantiating it for all existing unary shaders, which eliminates the need for any intermediate copies.
No extra testing is needed, as those cases are already covered by `test_output_grad_match_corrcoef_cpu_float32` as well as `test_unary_ops_storage_offset_strided`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148468
Approved by: https://github.com/dcci
2025-03-09 22:30:51 +00:00
275a7c5dbb Revert "Add a stable TORCH_LIBRARY to C shim (#148124)"
This reverts commit 327e07ac1dc3351bb5f0ad436760b83590c400aa.

Reverted https://github.com/pytorch/pytorch/pull/148124 on behalf of https://github.com/malfet due to Sorry for reverting your PR, but somehow it caused test failures in newly introduced tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=pull%20%2F%20linux-focal-cuda12.6-py3.10-gcc11-sm89%20%2F%20test%20(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148124#issuecomment-2709057833))
2025-03-09 20:44:56 +00:00
19a39a7a06 Revert "[dynamo] allow global import from collections import deque in user code (#148676)"
This reverts commit 685fb377131cc684633dc5471e77038988db53f6.

Reverted https://github.com/pytorch/pytorch/pull/148676 on behalf of https://github.com/malfet due to Looks like it broke ROCM, see f1444f006c/1(default%2C%201&mergeLF=true ([comment](https://github.com/pytorch/pytorch/pull/148676#issuecomment-2709057326))
2025-03-09 20:42:03 +00:00
f1444f006c [caffe2/torch] Fixup upstream LLVM (major version 21) API changes (#148833)
Latest LLVM introduced two changes related to the `Triple` usage that causes build failures when building pytorch.

## Failure in llvm_codegen.cpp:
Triple is stored in Modules instead of the string: 979c275097

## Failure in llvm_jit.cpp:
Triple argument is removed from LLJITBuilder::... : b18e5b6a36

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148833
Approved by: https://github.com/Skylion007
2025-03-09 18:58:36 +00:00
9a1a2e1516 Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
2025-03-09 17:12:47 +00:00
a8e3d1984a [Inductor UT][XPU] Skip test case test_cat_max_autotune_triton for known issue. (#148734)
The mm triton templates/configs have not been tuned for XPU, and we observe that the epilogue fusion cannot speed things up on XPU because of register spills. So XPU fails on the case `test_cat_max_autotune_triton`, which checks the fusion. We'll remove the skip after #146568 is resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148734
Approved by: https://github.com/jansel
2025-03-09 15:09:43 +00:00
bb9c426024 Typo Errors fixed in multiple files (#148262)
# Fix typo errors across PyTorch codebase

This PR fixes various spelling errors throughout the PyTorch codebase to improve documentation quality and code readability.

## Changes Made

### Documentation Fixes
- Changed "seperate" to "separate" in multiple files:
  - `setup.py`: Build system documentation
  - `torch/_library/triton.py`: AOT compilation comments
  - `torch/csrc/dynamo/compiled_autograd.h`: Node compilation documentation
  - `torch/export/_unlift.py`: Pass population comments
  - `torch/export/exported_program.py`: Decomposition table notes

### Code Comments and Error Messages
- Changed "occured" to "occurred" in:
  - `test/mobile/test_lite_script_module.py`: Exception handling comments
  - `torch/export/_draft_export.py`: Error message text
  - `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebra.cpp`: MAGMA bug comment
  - `torch/csrc/utils/python_numbers.h`: Overflow handling comment
  - `torch/csrc/jit/OVERVIEW.md`: Graph compilation documentation
  - `torch/_dynamo/symbolic_convert.py`: Error explanation

### API Documentation
- Changed "fullfill" to "fulfill" in `torch/distributed/checkpoint/state_dict_loader.py`
- Changed "accross" to "across" in:
  - `torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp`
  - `torch/distributed/distributed_c10d.py`

## Motivation
These changes improve code readability and maintain consistent spelling throughout the codebase. No functional changes were made; this is purely a documentation and comment improvement PR.

## Test Plan
No testing required as these changes only affect comments and documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148262
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-09 12:21:40 +00:00
327e07ac1d Add a stable TORCH_LIBRARY to C shim (#148124)
This PR adds two main parts:
- shim.h stable C APIs into torch::Library APIs
- a higher level API in torch/csrc/stable/library.h that calls into this shim.h + otherwise is self contained

Goal: custom kernel writers should be able to call the apis in the directories above in order to register their library in a way that allows their custom extension to run with a different libtorch version than it was built with.

Subplots resolved:

- Do we want a whole separate StableLibrary or do we want to freeze torch::Library and add `m.stable_impl(cstring, void (*fn)(void **, int64_t, int64_t)` into it
    - Yes, we want a separate StableLibrary. We cannot freeze Library and it is NOT header only.
- Should I use uint64_t as the common denominator instead of void* to support 32-bit architectures better?
    -  Yes, and done
- Should I add a stable `def` and `fragment` when those can be done in python?
    - I think we do want these --- and now they're done
- Where should library_stable_impl.cpp live? -- no longer relevant
- I need some solid test cases to make sure everything's going ok. I've intentionally thrown in a bunch of random dtypes into the signature, but I still haven't tested returning multiple things, returning nothing, complex dtypes, etc.
    - Have since tested all the torch library endpoints. the others can be tested in a followup to separate components that need to be in shim.h vs can be added later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148124
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-03-09 10:07:25 +00:00
685fb37713 [dynamo] allow global import from collections import deque in user code (#148676)
See https://github.com/pytorch/pytorch/pull/148669#discussion_r1983462218 for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148676
Approved by: https://github.com/jansel
2025-03-09 09:35:29 +00:00
6566d67bd3 [dynamo] show stack above dynamo in graph break user tracebacks (#148401)
Also show the line of code relevant to a dynamo-compiled frame, instead of just the first line (this was broken for data-dependent jump graph breaks and for 3.11+).

Also collapses resume frames together (use config.verbose to see full stack trace - for developers).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148401
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-03-09 07:37:38 +00:00
2149f6c684 [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.
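A hedged user-level sketch of the two launch modes that points 1 and 3 refer to (assuming an initialized NCCL process group); this is not PGNCCL internals:

```python
import torch
import torch.distributed as dist

t = torch.ones(4, device="cuda")

# async_op=False: the collective runs directly on the current stream, so no
# trampoline stream, extra event syncs, or tensor stashing are involved.
dist.all_reduce(t, async_op=False)

# async_op=True: the returned work object manages tensor lifetime; the user
# is expected to call wait() before consuming the result.
work = dist.all_reduce(t, async_op=True)
work.wait()
```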

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-09 07:32:23 +00:00
85fe576ee3 [set_linter] allow x in {...} (#148422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148422
Approved by: https://github.com/Skylion007
2025-03-09 06:43:11 +00:00
9cb25f0ea2 Revert "[PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)"
This reverts commit 17dbeb11db7afbab792ad76c24840c1552a0e76d.

Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/janeyx99 due to PR break backward compat test ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2708641172))
2025-03-09 03:01:55 +00:00
17dbeb11db [PGNCCL] Launch kernel on current stream & remove record_stream entirely (#148590)
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](https://github.com/pytorch/pytorch/issues/147168#issuecomment-2660142460).
5. Profiles in async_op=False mode will look different -- collective kernels will show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Cc: @ngimel @awgu @Aidyn-A @skyw @wconstab @leonardo0lyj

Differential Revision: [D70835197](https://our.internmc.facebook.com/intern/diff/D70835197)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148590
Approved by: https://github.com/eqy, https://github.com/Aidyn-A, https://github.com/fduwjj
2025-03-08 20:00:12 +00:00
5245304f1e Update decompositions_for_jvp.py (#148821)
Small typo that caught my eye.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148821
Approved by: https://github.com/Skylion007
2025-03-08 19:08:42 +00:00
148eb735ee Change nvcc arch flags for sm100 (#148774)
### Summary
- Addressing this comment https://github.com/pytorch/pytorch/pull/148274#discussion_r1984944012

### Test plan
- Verified building from source w/ B200s is successful
- Verified B200 tensorcores are still being utilized properly via benchmarking script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148774
Approved by: https://github.com/Skylion007
2025-03-08 19:05:53 +00:00
7ffadff286 c10d/ProcessGroup: cleanup abort and shutdown (#148798)
This adds `abort` and `shutdown` to `Backend` and `ProcessGroup` objects. This simplifies the logic in `distributed_c10d.py` by having a default noop implementation for all PGs.

This will be useful for torchft and upcoming versions of NCCL, which will handle abort correctly. Currently `torchft` would have to call the internal `_abort` method on the PGNCCL object directly, but with this change we can now just call `.abort()` and have it work for any PG implementation.
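A hedged sketch of the simplified calling pattern this enables, assuming the Python bindings expose `abort`/`shutdown` on the process group as described:

```python
import torch.distributed as dist

# assumes the default process group has already been initialized
pg = dist.new_group()
try:
    pass  # issue collectives on pg here
finally:
    pg.abort()     # default no-op for backends without a real abort
    pg.shutdown()  # previously required backend-specific calls (e.g. _abort on PGNCCL)
```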

Test plan:

```
pytest distributed/test_backends.py distributed/test_c10d_common.py distributed/test_c10d_pypg.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148798
Approved by: https://github.com/kwen2501
2025-03-08 18:33:18 +00:00
9841f0ddcf Add support for non functional collectives under FakeTensorMode and fake_pg for memory tracking (#147566)
This PR adds support for non-functional collectives under `FakeTensorMode` and `fake_pg`. It helps eliminate the patching of collectives for memory and runtime estimation.

It also modifies the `ModTracker` to enable the post-backward hook call for modules whose inputs don't require gradients but parameters do.

For the memory tracking, we now enable tracking DTensor dispatcher for custom dispatch functions like `entropy_loss`.
Dispatcher is only enabled for the memory tracking part and disabled as soon as it is done.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147566
Approved by: https://github.com/weifengpy
2025-03-08 18:00:49 +00:00
439782960c Fix typos in SpectralOps.cpp (#148818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148818
Approved by: https://github.com/Skylion007
2025-03-08 17:34:59 +00:00
eqy
849cc058ee [CUDA][TF32] Account for tf32 in test_efficient_conv_bn_eval (#148802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148802
Approved by: https://github.com/Skylion007
2025-03-08 16:17:04 +00:00
c3b05c4a27 [triton 3.3] support both specialize_impl and create_specialize_impl (#148806)
After https://github.com/triton-lang/triton/pull/6099, we sometimes need to do `from triton.runtime.jit import specialize_impl` and sometimes do `triton.runtime.jit.create_specialize_impl()`. This should fix a bunch of the new errors that appeared with the triton 3.3 / pytorch 2.7 integration (e.g. `python test/inductor/test_aot_inductor.py -k test_triton_kernel_equal_to_1_float_arg_dynamic_False_cuda`, failing at https://hud.pytorch.org/pr/pytorch/pytorch/148684#38392501220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148806
Approved by: https://github.com/drisspg
2025-03-08 09:31:52 +00:00
118c9e501a [ONNX] Remove inaccurate test comment (#148813)
Remove the comment that says the JIT trace strategy doesn't support dynamic shapes as a dict, because it does support it (which is what the test is testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148813
Approved by: https://github.com/cyyever, https://github.com/titaiwangms
2025-03-08 08:55:56 +00:00
3745da18f4 [AOTI] Switch to local cpp compile for fbcode (#148592)
Summary: As titled; otherwise we cannot find lamdhip64.

Test Plan: https://www.internalfb.com/phabricator/paste/view/P1747104431

Differential Revision: D70637798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148592
Approved by: https://github.com/hl475
2025-03-08 08:38:26 +00:00
666508eb17 [aot cache][ca] remove restriction on caching ca's aot inference graph (#148491)
but we still can't cache CA's AOT inference graph yet: the CA functional ops aren't serializable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148491
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148381
2025-03-08 06:08:26 +00:00
c16cd25cf5 [ca] remove compiled_autograd_tracing (#148381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148381
Approved by: https://github.com/jansel
2025-03-08 06:08:26 +00:00
5f1c79ba2b [CD] Enable triton xpu windows build (#147637)
Depends on #147727, which introduce triton xpu windows support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147637
Approved by: https://github.com/atalman
2025-03-08 05:28:46 +00:00
cyy
f7c0c230b0 Fix compile errors (#148758)
Fix
```
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:91:16: error: invalid application of 'sizeof' to an incomplete type 'torch::jit::AliasDb::WriteRegistry'
     91 |         static_assert(sizeof(_Tp)>0,
        |                       ^~~~~~~~~~~
  /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/14.2.1/../../../../include/c++/14.2.1/bits/unique_ptr.h:399:4: note: in instantiation of member function 'std::default_delete<torch::jit::AliasDb::WriteRegistry>::operator()' requested here
    399 |           get_deleter()(std::move(__ptr));
        |           ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:10: note: in instantiation of member function 'std::unique_ptr<torch::jit::AliasDb::WriteRegistry>::~unique_ptr' requested here
    200 | AliasDb::~AliasDb() = default;
        |          ^
  ../torch/csrc/jit/ir/alias_analysis.cpp:200:23: note: in defaulted destructor for 'torch::jit::AliasDb' first required here
    200 | AliasDb::~AliasDb() = default;
        |                       ^
  ../torch/csrc/jit/ir/alias_analysis.h:298:10: note: forward declaration of 'torch::jit::AliasDb::WriteRegistry'
    298 |   struct WriteRegistry;
        |          ^
  1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148758
Approved by: https://github.com/Skylion007
2025-03-08 04:56:42 +00:00
75179fd6e6 [Codemod][AddExplicitStrictExportArg] caffe2/test/inductor (#148781)
Differential Revision: D70575053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148781
Approved by: https://github.com/SherlockNoMad
2025-03-08 04:43:32 +00:00
8f71d4563e Fix rms_norm in fp16/bf16 (#147203)
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the weight_opt input is not done in half precision, the current code path does the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid the intermediate down-cast, and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncation.
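A hedged reference sketch of the proposed casting chain (all math in fp32, a single down-cast at the end); the function below is illustrative, not the ATen implementation:

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    xf = x.float()                                                  # fp16 -> fp32
    out = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps)  # stays in fp32
    out = out * weight.float()                                      # weight multiply in fp32
    return out.to(x.dtype)                                          # single fp32 -> fp16 down-cast

x = torch.randn(8, 64, dtype=torch.float16)
w = torch.ones(64, dtype=torch.float16)
print(torch.allclose(rms_norm_ref(x, w),
                     torch.nn.functional.rms_norm(x, (64,), w, eps=1e-6),
                     atol=1e-3))
```
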
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203
Approved by: https://github.com/eqy
2025-03-08 04:43:18 +00:00
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
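A hedged sketch of the manual wrapping described above; the artifact path and input shape are hypothetical, and the flag name follows the PR (it may still change):

```python
import torch

compiled = torch._inductor.package.load_package(
    "model.pt2", run_single_threaded=True  # hypothetical pre-exported artifact
)
static_inp = torch.randn(1, 3, 224, 224, device="cuda")

# warm up on a side stream, as the CUDA graphs docs suggest
s = torch.cuda.Stream()
with torch.cuda.stream(s):
    compiled(static_inp)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = compiled(static_inp)

static_inp.copy_(torch.randn(1, 3, 224, 224, device="cuda"))
g.replay()  # static_out now holds the result for the new input
```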

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
9f170d9d13 [Triton 3.3] Remove ROCm specific mm gemm template (#148662)
Fixes: https://github.com/pytorch/pytorch/issues/147121, since triton 3.3.x fixes the problem.

This needs to be handled in a non-BC-breaking way, so we will conditionalize this change on the triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148662
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2025-03-08 01:24:40 +00:00
a89e7c2da9 [Upstream] Wrap log_2_e in tl.constexpr for new 3.3 bump (#148785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148785
Approved by: https://github.com/davidberard98
2025-03-08 01:09:28 +00:00
179b7a0abc Do not crash when compiling quantized LORA models (#148435)
Fixes #148072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148435
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel
2025-03-08 00:02:08 +00:00
24085db082 Don't clear feedback_saver_fns after cache clear (#148723)
Summary:
Since feedback_saver_fns are used for logging, it doesn't make sense to clear them; clearing them resulted in weird behavior in user code where disabling caches caused logging code to break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148723
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-03-07 23:43:59 +00:00
d96c85558a [ONNX] Use torch export to get dynamic shapes for JIT convert strategy (#148627)
Use torch.export to get dynamic shapes for the JIT-converted graph. I just realized we can retrace a converted JIT graph with `torch.export` and produce dynamic shapes using it.

- **Prior:** The exporter will produce a **static graph silently** even when dynamic_shapes are provided.
- **Proposed:** When `dynamic_shapes` is provided and the strategy is able to handle it, the export will succeed.
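A hedged sketch of the proposed behavior: exporting a scripted module through the new exporter with `dynamic_shapes` (the module and dim name are made up):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

scripted = torch.jit.script(M())  # handled by the JIT conversion strategy
onnx_program = torch.onnx.export(
    scripted,
    (torch.randn(2, 3),),
    dynamic_shapes={"x": {0: torch.export.Dim("batch")}},
    dynamo=True,
)
```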

## Why are we still keeping the JIT strategy?

It is useful when users want to convert JIT modules or `.pt` files into ONNX via the new path. It is also sometimes useful when there are JIT-scripted modules inside the nn.Module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148627
Approved by: https://github.com/titaiwangms
2025-03-07 23:41:50 +00:00
26f8d81037 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which will help with quantization of models via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-07 23:35:47 +00:00
187d5c0eb1 [logging] Log cudagraphify timings to dynamo_timed (#143220)
Summary: this adds some new dynamo_timed calls in cudagraph_trees, primarily with the aim to add cudagraph-related timing to scuba. Things to note:
* Uses the changes in https://github.com/pytorch/pytorch/pull/141919 to log "runtime" entries
* The logging for chromium/tlparse/scuba relies on us providing a compile_id since it's not available in the environment. A lot of the changes here are just passing around the compile_id
* I believe the spirit of the scuba logging is to capture the overheads of `torch.compile`. Therefore, I'm not adding _every_ dynamo_timed to scuba. For example, "run_eager" is the first real execution of the inductor graph -- it's not cudagraph overhead, per se. Watch out for the two instances of `dynamo_compile_runtime_column_us="runtime_cudagraphify_time_us"`. Those are the spots I believe are _extra_ overhead we'd contribute to torch.compile.

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only dcgan`:
* tlparse: https://fburl.com/21yrdn8h
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/wt90wnjz

`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
* tlparse: https://fburl.com/r9mp7uiv
* scuba: https://fburl.com/scuba/dynamo_compile/sandbox/1nvx94re

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143220
Approved by: https://github.com/eellison
2025-03-07 23:07:13 +00:00
f2dfe2d99c [Triton 3.3] [ROCm] Enabled split_scan support for ROCm builds (#147619)
Fixes issue https://github.com/pytorch/pytorch/issues/133228

Enabled split_scan support for ROCm builds.

This must be handled in a non-BC-breaking way, so this functionality is enabled conditionally on the triton version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147619
Approved by: https://github.com/davidberard98

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: David Berard <davidberard98@gmail.com>
2025-03-07 23:06:21 +00:00
0f852641c2 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit d35a4ddae2345e639001bfee58a0932e96597f2d.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/henrylhtsang due to mistakes when writing the tests ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2707637965))
2025-03-07 22:42:13 +00:00
755965d2e4 [inductor] fix matmul w/ torch.bucketize epilogue (#148769)
See https://github.com/pytorch/pytorch/issues/148764.

Inductor was codegen-ing wrong shapes for bucketize when it was fused as an epilogue: the binary search helper function requested the shape of the input tensor, and Inductor was generating `[XBLOCK]`, when `XBLOCK` doesn't exist.
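A hedged repro sketch in the spirit of the linked issue: a matmul whose output feeds `torch.bucketize`, which inductor may fuse as an epilogue:

```python
import torch

boundaries = torch.linspace(-3, 3, 16, device="cuda")

@torch.compile
def f(a, b):
    return torch.bucketize(a @ b, boundaries)

a = torch.randn(128, 64, device="cuda")
b = torch.randn(64, 128, device="cuda")
print(f(a, b).shape)
```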

As a workaround, this PR removes the `BLOCK_SHAPE` parameter from the helper function (and just uses `values.shape`) so that we don't even have to generate the shape.

This PR also introduces `torch._inductor.config.triton.disallow_failing_autotune_kernels_TESTING_ONLY` to test this behavior. This config is needed to enforce that _all_ autotune kernel candidates pass - otherwise, the fused-bucketize exception just gets caught and an `inf` latency is assigned to it.

Differential Revision: [D70794563](https://our.internmc.facebook.com/intern/diff/D70794563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148769
Approved by: https://github.com/benjaminglass1, https://github.com/aaronenyeshi
2025-03-07 22:34:13 +00:00
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
7b79e17275 [BE] Move cuda12.6 builds to gcc11 (#148740)
I.e. `s/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9/pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11/`

This accidentally fixes undefined symbol reference errors, namely
```
/usr/bin/ld: /var/lib/jenkins/cpp-build/caffe2/build/lib/libtorch_cuda.so: undefined reference to `std::__throw_bad_array_new_length()'
```
This happens because `libmagma.a`, which was built with gcc-11 (after https://github.com/pytorch/pytorch/pull/148135), contains symbols that are defined in `/opt/rh/gcc-toolset-11/root/usr/lib/gcc/x86_64-redhat-linux/11/libstdc++_nonshared.a` but missing from the corresponding library bundled with `g++-9`.

Though I could not figure out what flags one must use to trigger generation of those symbols, see https://godbolt.org/z/E9KfdhzzY or
```
$ echo "int* foo(int x) { return new int[x];}"|g++ -std=c++17 -S -O3 -x c++ -o - -
	.file	""
	.text
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4
	.globl	_Z3fooi
	.type	_Z3fooi, @function
_Z3fooi:
.LFB0:
	.cfi_startproc
	endbr64
	movslq	%edi, %rdi
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	movabsq	$2305843009213693950, %rax
	cmpq	%rax, %rdi
	ja	.L2
	salq	$2, %rdi
	addq	$8, %rsp
	.cfi_def_cfa_offset 8
	jmp	_Znam@PLT
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z3fooi.cold, @function
_Z3fooi.cold:
.LFSB0:
.L2:
	.cfi_def_cfa_offset 16
	call	__cxa_throw_bad_array_new_length@PLT
	.cfi_endproc
```

Fixes https://github.com/pytorch/pytorch/issues/148728 and https://github.com/pytorch/pytorch/issues/148495
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148740
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-03-07 21:21:12 +00:00
08baaa7d63 [Docs][TunableOp] TunableOp documentation update (#148384)
This PR aligns documentation to what is in the README file:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md

and removes the prototype NOTE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148384
Approved by: https://github.com/jeffdaily, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-03-07 21:02:49 +00:00
bb94b65da7 Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 2fb654676f6291f6e27c6bab2761f170516598dd.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2707440106))
2025-03-07 20:58:28 +00:00
d8dc700e25 Delete duplicate entry from docker-builds.yml (#148782)
Regression introduced by merge conflict of https://github.com/pytorch/pytorch/pull/148612

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148782
Approved by: https://github.com/atalman
2025-03-07 20:55:46 +00:00
99da439d10 Revert "Remove Cuda 12.4 from nightly Binaries (#148625)"
This reverts commit 1239176fe717839ca5612ac03a4806051225f381.

Reverted https://github.com/pytorch/pytorch/pull/148625 on behalf of https://github.com/malfet due to Broke lint ([comment](https://github.com/pytorch/pytorch/pull/148625#issuecomment-2707415005))
2025-03-07 20:47:45 +00:00
6602e632cd Suppress build warnings when gcc-11 is used (#148763)
By decorating the header with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wmismatched-new-delete")`
that will suppress following (when building against ancient llvm-9)
```
In file included from /var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_codegen.cpp:24:
/opt/llvm/include/llvm/IR/IRBuilder.h: In member function 'llvm::LoadInst* llvm::IRBuilder<T, Inserter>::CreateLoad(llvm::Type*, llvm::Value*, const llvm::Twine&) [with T = llvm::ConstantFolder; Inserter = llvm::IRBuilderDefaultInserter]':
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: error: 'static void llvm::User::operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
 1581 |     return Insert(new LoadInst(Ty, Ptr), Name);
      |                   ^~~~~~~~~~~~~~~~~~~~~
/opt/llvm/include/llvm/IR/IRBuilder.h:1581:19: note: returned from 'static void* llvm::UnaryInstruction::operator new(size_t)'
```

A reasonable follow-up would probably be to disable NNC testing altogether, as the project has been in maintenance mode for a while now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148763
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/atalman
ghstack dependencies: #148739
2025-03-07 20:43:35 +00:00
d36391307f [ONNX] Handle error in verification interpreter (#148730)
Use a simple try/except to handle ONNX Runtime errors in the verification interpreter when they happen. One example is that ORT will sometimes produce a list of None for some nodes; I am not sure how that happens yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148730
Approved by: https://github.com/titaiwangms
ghstack dependencies: #148706
2025-03-07 20:24:49 +00:00
aebd2e411f [pytree][easy] lock global registry containers properly for thread-safety (#148750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148750
Approved by: https://github.com/StrongerXi
2025-03-07 20:04:52 +00:00
6b44a91a62 use statically_known_true instead of guard_size_oblivious in pattern matcher (#147557)
We shouldn't add guards here. Use statically_known_true instead. Internal xref: https://fb.workplace.com/groups/1075192433118967/?multi_permalinks=1609560723015466&comment_id=1610040026300869&notif_id=1740082892544333&notif_t=work_feedback_reaction_generic&ref=notif
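A hedged sketch of the difference (the divisibility condition below is made up): `statically_known_true` never installs a guard; it simply returns False when the fact cannot be proven at compile time, whereas a guarded check would specialize on it.

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def should_apply_pattern(sym_numel) -> bool:
    # Only fire the rewrite when divisibility is provable without guarding;
    # otherwise fall back rather than adding a guard on sym_numel.
    return statically_known_true(sym_numel % 16 == 0)
```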

Differential Revision: [D69950122](https://our.internmc.facebook.com/intern/diff/D69950122/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147557
Approved by: https://github.com/eellison
2025-03-07 19:17:25 +00:00
b246cd7b82 Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 17302b4bc837af079d2f6480f07ea2c99b93fb4b.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/albanD due to Still fails with cuda build on a non-gpu machine ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2707191770))
2025-03-07 18:59:58 +00:00
1239176fe7 Remove Cuda 12.4 from nightly Binaries (#148625)
https://github.com/pytorch/pytorch/issues/145570

removes cuda 12.4 nightly builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148625
Approved by: https://github.com/atalman
2025-03-07 18:56:04 +00:00
61c4074df7 Add Windows Arm64 Nightly Builds (#139760)
This PR creates 3 new worklflows for Windows Arm64 target. The workflows and outputs can be reviewed at the following links:
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760
Approved by: https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-03-07 18:53:56 +00:00
cyy
e839e4f5bd Fix Wc++98-compat-extra-semi (#148757)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148757
Approved by: https://github.com/Skylion007
2025-03-07 18:49:12 +00:00
0a7ccee1e0 [ROCm][Windows] Disable Composable Kernels and Triton for Windows builds (#147334)
Currently, Composable Kernels and Triton aren't available on Windows. This PR ensures that the files relating to this dependency are not included during the build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147334
Approved by: https://github.com/jeffdaily
2025-03-07 18:40:49 +00:00
eqy
18c6e00c7b [CUDA Graphs][NCCL] Set event queries to happen under thread-local mode in ProcessGroupNCCL.cpp (#148594)
Should mean we don't need to coordinate the watchdog with CUDAGraph captures anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148594
Approved by: https://github.com/kwen2501
2025-03-07 18:39:02 +00:00
9769618d35 [CI] [inductor] Add cu126 inductor jobs and move away cu124 (#148612)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793 into eager and inductor benchmarks to unblock

Seems many inductor yml are added after initial change was prepared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148612
Approved by: https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 18:30:14 +00:00
da923afdc7 [MPS][BE] Align bitshift behavior with CPU (#148719)
By casting the argument to output type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148719
Approved by: https://github.com/Skylion007
ghstack dependencies: #148685, #148686
2025-03-07 18:28:14 +00:00
f84710aef4 [MPS] Fix scalar to tensors bitshifts (#148686)
By introducing a concept of non-commutative binary op and renaming all op templates from `bitwise_foo_tensor` and `bitwise_foo_scalar` to `bitwise_foo_tensor_tensor` and `bitwise_foo_tensor_scalar`

Add regression tests

Please note, that for some undefined values MPS and CPU behaviors are different, for example
```
>>> import torch
>>> 4095 >> torch.arange(12, device="mps", dtype=torch.uint8)
tensor([255, 255, 255, 255, 255, 127,  63,  31,  15,   7,   3,   1],
       device='mps:0', dtype=torch.uint8)
>>> 4095 >> torch.arange(12, device="cpu", dtype=torch.uint8)
tensor([255, 127,  63,  31,  15,   7,   3,   1,   0,   0,   0,   0],
       dtype=torch.uint8)
```
Because on CPU scalar is cast to output dtype before operation is performed, but on MPS this happens after the op is done

Fixes https://github.com/pytorch/pytorch/issues/147889
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148686
Approved by: https://github.com/albanD
ghstack dependencies: #148685
2025-03-07 18:28:14 +00:00
cyy
116c1e42c5 Re-enable tests (#148732)
No UBSAN failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148732
Approved by: https://github.com/Skylion007
2025-03-07 18:11:57 +00:00
8059ead823 [ROCm] Incorporate ROCm triton specific tuning parameters (#148437)
Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above.

A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437
Approved by: https://github.com/eellison, https://github.com/jansel
2025-03-07 18:09:47 +00:00
a3b77d434a Subprocess compile (attempt 2) (#148635)
Add a mode to fx_codegen_and_compile() to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Fixed the test which caused the previous version (#146134) to be reverted:
```
$ PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/inductor/test_compile_subprocess.py CpuTests.test_conv_bn_fuse_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148635
Approved by: https://github.com/jamesjwu
2025-03-07 17:50:14 +00:00
50c9f6d83b [Windows][Inductor][XPU] Unload triton pyd files to be able to remove them on Windows. (#148323)
In `fresh_inductor_cache`, removing pyd files raises a permission error on Windows because they are still in use by the process. So we clear the references to the loaded pyd library objects and unload them from the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148323
Approved by: https://github.com/jansel
ghstack dependencies: #148534, #148538, #147727
2025-03-07 17:19:59 +00:00
d05694807d [XPU][Inductor] Update Intel triton for release 2.7. (#147727)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147727
Approved by: https://github.com/EikanWang, https://github.com/Skylion007
ghstack dependencies: #148534, #148538
2025-03-07 17:19:59 +00:00
136b8165d1 [DCP] Save Plan Caching: Fix the missing all_plans update in the cache. (#148577)
Summary: Save Plan Caching: Fix the missing all_plans update in the cache.

Test Plan:
```
buck2 test //aiplatform/modelstore/experimental/integration_tests/tests/nosan:checkpoint_dist_save_load_test
```

https://www.internalfb.com/intern/testinfra/testrun/17451448626323264

Reviewed By: MeetVadakkanchery

Differential Revision: D70229019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148577
Approved by: https://github.com/MeetVadakkanchery
2025-03-07 17:00:59 +00:00
abcca2fcbb Revert "Fix torch.nn.functional.hardswish gradients corner case (#148049)"
This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3.

Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))
2025-03-07 16:05:56 +00:00
17302b4bc8 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline doc for their exact semantics
- Use the newly added isBuilt for the accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang, https://github.com/jeromean

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-07 15:19:34 +00:00
d54b2b7fa7 [BE] Delete split builds (#148739)
They have been disabled since Oct 2024, so it is perhaps time to remove them from the workflows

See https://github.com/pytorch/pytorch/issues/138750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148739
Approved by: https://github.com/atalman
2025-03-07 15:10:50 +00:00
372ad7b181 Enable FSDP2 on HPU device (#148667)
The motivation of this PR is to enable FSDP2 collectives for HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148667
Approved by: https://github.com/wconstab
2025-03-07 14:33:43 +00:00
81847d08cf [Intel GPU][quant] Refine zero-point memory creation (#148640)
# Motivation
This PR skips zero-point GPU memory creation when zero-point=0, as it would not be used by the oneDNN library. This saves the overhead of 1~3 H2D copies per QLinear/QConv kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148640
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-03-07 13:49:19 +00:00
f80aad62fa Improve Pareto frontier plot for AutoAC (#148678)
This was added in https://github.com/pytorch/pytorch/pull/126320. It's a very nice feature, which can be used to predict memory usage for different budget values.

However, it had some limitations, notably in terms of resolution (it only sampled 21 points across the whole range thus missed many threshold values) and in distributed settings.

Here I fix those by using recursive binary searches to identify all thresholds (up to a resolution of 1e-3, which can be made configurable) and output them in SVG (to be able to discern different points); I also add the rank to the filename and store it in a user-defined directory.
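
A minimal sketch of the recursive-binary-search idea described above, assuming a monotone step function from budget to memory; `memory_at_budget` is a hypothetical stand-in, not the PR's code.
```python
def find_thresholds(f, lo, hi, eps=1e-3, out=None):
    """Collect the points where the step function f changes, down to eps."""
    if out is None:
        out = []
    if f(lo) == f(hi):           # no change over [lo, hi]: nothing to record
        return out
    if hi - lo <= eps:           # resolution reached: record one threshold
        out.append(hi)
        return out
    mid = (lo + hi) / 2.0
    find_thresholds(f, lo, mid, eps, out)
    find_thresholds(f, mid, hi, eps, out)
    return out

# e.g. thresholds = find_thresholds(memory_at_budget, 0.0, 1.0)
```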

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148678
Approved by: https://github.com/Chillee, https://github.com/fmassa
2025-03-07 13:22:29 +00:00
d4d7d813fa Update CURL url for manywheel images (#148343)
It looks like it was moved on the site it was downloaded from.
Switch to official site while updating URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148343
Approved by: https://github.com/dr4gon01, https://github.com/janeyx99, https://github.com/atalman, https://github.com/seemethere
2025-03-07 11:41:12 +00:00
6cf360be04 fix lost input mutations with export_tracepoint (#148709)
Preserving module call signatures in the presence of input mutation causes incorrect results. The root cause turned out to be that export tracepoints would unwrap/rewrap functional args, which loses mutation info on those args.

Differential Revision: [D70734821](https://our.internmc.facebook.com/intern/diff/D70734821/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148709
Approved by: https://github.com/angelayi
2025-03-07 09:36:18 +00:00
bb84a23c22 [ROCm] [TunableOp] Enable logging of BLAS parameters (#147034)
This PR supports a logging feature that is being requested.
```
PYTORCH_TUNABLEOP_BLAS_LOG=1
```
Enables the logging of BLAS parameters with either offline or online (in-situ) tuning.

The BLAS parameters are written to the CSV file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147034
Approved by: https://github.com/jeffdaily
2025-03-07 09:32:59 +00:00
243b47e2ec [Intel GPU] Fix SDPA dummy LSE output to match meta function (#148652)
To fix XPU patched UTs including
```bash
pytest -vs third_party/torch-xpu-ops/test/xpu/test_meta_xpu.py::TestMetaXPU::test_dispatch_symbolic_meta_outplace_nn_functional_scaled_dot_product_attention_xpu_bfloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148652
Approved by: https://github.com/EikanWang
2025-03-07 08:36:18 +00:00
416ea1c71c Code Clean: Remove unnecessary code (#148735)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148735
Approved by: https://github.com/jingsh, https://github.com/cyyever
2025-03-07 08:15:37 +00:00
4075646bd8 Use oneDNN v3.7.1 for Intel GPU (#148403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148403
Approved by: https://github.com/EikanWang

Co-authored-by: majing <jing1.ma@intel.com>
Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-03-07 08:03:49 +00:00
cyy
3d854ea9bd Remove deprecated std::aligned_storage_t (#148660)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148660
Approved by: https://github.com/swolchok
2025-03-07 07:29:42 +00:00
3f069e7679 [mm_logs] enhance the printing for overview info (#148716)
Summary:
Previously, the dynamo counters did not print the counts information automatically.

Explicitly added a log message, printed after lowering, with overview info for inductor aten mms.

it will look like:

the name is in `{aten_op_name}_{m}_{n}_{k}`
```
torch/_inductor/compile_fx.py:832] [0/0] Overview info of inductor aten mms: (aten.addmm_16_6_16: 1), (name: count), xxx
```


Test Plan:
```
TORCH_LOGS="+inductor" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D70739912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148716
Approved by: https://github.com/henrylhtsang
2025-03-07 05:23:49 +00:00
5f392ae560 Throws error when using torch.cuda.MemPool with expandable segments (#148378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148378
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #148374
2025-03-07 05:22:03 +00:00
c0f1557285 [FSDP2][doc] highlight equivalence of set_requires_gradient_sync and no_sync (#148715)
We got asked a few times about FSDP2's equivalent of no_sync. Highlight
set_requires_gradient_sync as the equivalent in the docstring.
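
A hedged usage sketch of the equivalence being highlighted (gradient accumulation without per-micro-batch gradient sync); the model/optimizer/data setup is assumed and not part of this message.
```python
def grad_accum_loop(fsdp_model, optimizer, batches, accum_steps):
    # FSDP2 counterpart of FSDP1's no_sync(): only sync gradients on the
    # last micro-batch of each accumulation window.
    for step, batch in enumerate(batches):
        is_last = (step + 1) % accum_steps == 0
        fsdp_model.set_requires_gradient_sync(is_last)
        fsdp_model(batch).sum().backward()
        if is_last:
            optimizer.step()
            optimizer.zero_grad()
```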

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148715
Approved by: https://github.com/mori360
2025-03-07 04:34:46 +00:00
fe4b88f6aa [HPU] Add hpu to fused kernels supported devices (#148666)
This change adds "hpu" to the list of device types that support fused kernels in the optimizer, ensuring
compatibility with the HPU backend.

Without this change, when `test_all_gather_extension_outer_size_stride` of `pytorch/test/distributed/_composable/fsdp/test_fully_shard_extensions.py` is run on 'hpu' backend, it fails with:

```
RuntimeError: fused=True requires all the params to be floating point Tensors
of supported devices: ['mps', 'cuda', 'xpu', 'cpu', 'privateuseone']
but torch.float32 and hpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148666
Approved by: https://github.com/albanD
2025-03-07 04:28:33 +00:00
33f8ab2f58 [ROCm][TunableOp] Add support for rowwise scaling on scaled GEMM. (#148238)
This PR adds support for rowwise scaling versus tensorwise scaling on scaled GEMM.

There are a few other items included in this PR as well:
- Fixes for offline tuning of scaled GEMM
- Simplification of existing offline UT
- Update existing online UT to also test rowwise versus tensorwise scaled GEMM
- New UT for offline scaled GEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148238
Approved by: https://github.com/jeffdaily
2025-03-07 04:12:48 +00:00
cdb4fd0d29 Update win-vs2022-cuda12.1-py3 -> win-vs2022-cuda12.6-py3 (#148717)
Should have been migrated long ago
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148717
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2025-03-07 03:21:29 +00:00
389b496062 [XPU] Add test/kernel.errors.txt to .gitignore. (#148538)
The Intel GPU user mode driver may generate kernel.errors.txt files in the
current working directory in certain scenarios. These files include diagnostic
information but do not necessarily indicate an issue with the
application. This is a known issue and will be fixed in a newer version of the driver.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148538
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #148534
2025-03-07 03:12:50 +00:00
9c9b05bc4f Expose functions used in custom backend in torch_python dll (#148213)
Fixes #148208. There are solutions for exposing symbols implicitly from inline functions (i.e., inline function A calls non-inline function B in foo.h; code that includes foo.h has to see the symbol B in the DLL).

Solution 1: tag the entire struct where the inline functions are defined as member functions with TORCH_PYTHON_API --- this PR does this for python_arg_parser.h. An alternative solution exists but would slow down dispatching a lot --- drop the inline keyword and move the implementation to the .cc file.

Solution 2: tag individual functions with TORCH_PYTHON_API. This PR does this for python_tensor.h.

Related discussion about hiding torch_python symbols: https://github.com/pytorch/pytorch/pull/142214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148213
Approved by: https://github.com/malfet
2025-03-07 02:34:37 +00:00
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

A PyTorch Module buffer will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced.

In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided delta weights list.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
00cd6c07b9 [Intel GPU][pt2e] Enable quantized grouped convolution at XPU (#148522)
# Motivation&Details
This PR fixes a bug that previously blocked quantized grouped convolution. The bug was caused by the fact that grouped convolution requires setting the weight scale mask on both the group dimension and the output channel dimension. This PR fixes the wrong mask in the integration and adds grouped conv to the UT.

# UT
` python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv2d_xpu`

# Runtime exemplification
```
onednn_verbose,v1,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src:s8::blocked:acdb::f0 wei:s8::blocked:abcde::f0 bia:f32::blocked:a::f0 dst:f32::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:3:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,g4mb1_ic128oc128_ih4oh2kh3sh1dh0ph0_iw4ow2kw3sw1dw0pw0,0.0529785
```
The verbose log shows that we successfully run into quantized convolution, where the weight is in `abcde` format (grouped conv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148522
Approved by: https://github.com/EikanWang, https://github.com/liangan1, https://github.com/jansel
ghstack dependencies: #148423
2025-03-07 01:57:45 +00:00
127bd5a02d Add sparsity (#148513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148513
Approved by: https://github.com/danielvegamyhre
2025-03-07 01:47:52 +00:00
b4430c3a6d [Intel GPU][pt2e]: Collapse 3D input to 2D for matmul in qlinear_pointwise_binary fusion (#148423)
# Motivation
During the `qlinear_pointwise_binary` lowering pass, dim collapsing only occurs when the post-op is `add`. It is the responsibility of the C++ kernels to handle dimensions for the post-op `sum`.

# Details
This PR explicitly reshapes the input from 3D to 2D in the op `qlinear_pointwise_binary`. Besides, we refactor the implementation of `qlinear_pointwise_binary.tensor` to call `qlinear_pointwise_binary`, removing duplicated code.

# UT testing
`python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlienar_add_xpu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148423
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-03-07 01:47:33 +00:00
c8cd8f68bd [dynamo] Properly account for non-list instances in list comparison (#148470)
As title; this patch also removes an unused `list_compare` method.

Fixes #148179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148470
Approved by: https://github.com/anijain2305
2025-03-07 01:29:30 +00:00
a7fe685be8 Add cpp wrapper skip to cudagraph logs (#148700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148700
Approved by: https://github.com/jbschlosser
2025-03-07 01:02:40 +00:00
e3087f6d76 [ONNX] Improve verify_onnx_program to use VerificationInterpreter (#148706)
I realized we can just extend `verify_onnx_program` to return intermediate values. There is no need for us to expose the VerificationInterpreter to users.

I added a `compare_intermediates` option to `verify_onnx_program`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148706
Approved by: https://github.com/titaiwangms
2025-03-07 00:40:54 +00:00
cyy
50eb4f3990 Enable UBSAN test (#147511)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147511
Approved by: https://github.com/colesbury
2025-03-07 00:35:32 +00:00
33a285379a [codemod] Remove unused-variable in caffe2/torch/csrc/distributed/c10d/cuda/AsyncMM.cu (#148501)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148501
Approved by: https://github.com/Skylion007
2025-03-07 00:33:39 +00:00
a0bc6d81bb [CI][CUDA] Move away from cuda12.4, Add cuda12.6 eager CI tests (#148602)
https://github.com/pytorch/pytorch/issues/145570

breaking https://github.com/pytorch/pytorch/pull/140793/ into eager and inductor benchmarks to unblock

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148602
Approved by: https://github.com/atalman, https://github.com/malfet

Co-authored-by: atalman <atalman@fb.com>
2025-03-07 00:15:04 +00:00
e2a0296e80 [dtensor] add CuDNN SDPA op support to DTensor (#148537)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` and `_scaled_dot_product_cudnn_attention_backward` to DTensor ops

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148537
Approved by: https://github.com/drisspg, https://github.com/fegin
2025-03-06 23:44:40 +00:00
3960f97832 Documents torch.cuda.MemPool API (#148374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148374
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-03-06 23:18:43 +00:00
ed9c8a5d13 ROCm: Disable torch check for Multiplication of two Float8_e5m2 matrices (#148228)
ROCm supports multiplication of two Float8_e5m2 matrices,
hence the torch check is disabled for ROCm.
Test command (on ROCm hardware supporting fp8):
python test/test_matmul_cuda.py TestFP8MatmulCudaCUDA.test_float8_basics_cuda -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148228
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-03-06 22:12:45 +00:00
e6800bda7f [Test][Linalg][CUDA] Increase niter in test_svd_lowrank_cuda_float64 (#145930)
A recent PR #143049 attempted to increase tolerances to make the test pass. However, we are still seeing errors like:
```
Traceback (most recent call last):
  File "~git/pytorch/test/test_linalg.py", line 2540, in test_svd_lowrank
    run_subtest(None, size, (), device, torch.svd_lowrank, density=density)
  File "~git/pytorch/test/test_linalg.py", line 2505, in run_subtest
    self.assertEqual(A, a, rtol=1e-7, atol=2e-7)
  File "~git/pytorch/torch/testing/_internal/common_utils.py", line 4044, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: Tensor-likes are not close!

Mismatched elements: 90 / 1000000 (0.0%)
Greatest absolute difference: 7.795904016052784e-07 at index (176, 930) (up to 2e-07 allowed)
Greatest relative difference: inf at index (6, 179) (up to 1e-07 allowed)
```
Increasing the `niter` parameter actually decreases the numerical differences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145930
Approved by: https://github.com/ngimel
2025-03-06 22:10:53 +00:00
75d29443e7 [Docs] update bucketize documentaion (#148400)
Fixes #144504

Clarify the documentation for `torch.bucketize` by referencing the existing table. The current version includes a somewhat confusing explanation for the `right` kwarg, whereas the existing table is much clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148400
Approved by: https://github.com/benjaminglass1, https://github.com/eellison, https://github.com/albanD
2025-03-06 22:07:52 +00:00
2fb654676f [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same, say `A @ A`. In that case, when writing the stride of the second input (node B), we have to figure out whether we want lda or ldb; both point to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:26 +00:00
d35a4ddae2 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 22:02:19 +00:00
5a5ac98918 [aarch64] add libcufile for cu126 and cu128 (#148465)
Seeing the following error with the arm cu128 nightly:
```
  File "/usr/local/lib/python3.12/site-packages/torch/__init__.py", line 411, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
```
Related to https://github.com/pytorch/pytorch/pull/148137; we need to copy the dependency for the arm build as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148465
Approved by: https://github.com/atalman, https://github.com/abhilash1910
2025-03-06 21:39:43 +00:00
3d62e81a1e [DCP] fix dcp gather_object/scatter_object_list (#147675)
gather_object/scatter_object_list's dst is `Destination rank on global process group (regardless of group argument)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147675
Approved by: https://github.com/MeetVadakkanchery
2025-03-06 21:20:38 +00:00
1d7fc0c681 [dynamo] Remove dead code path around functools.partial objects (#148683)
This removes the code paths added in #98120, which has then been
superceded by #108846.

More importantly, it makes `EQUALS_MATCH`'s `ok_mutable_types` (added in #134016)
easier to reason about, i.e., no need to worry about `dict` types, which
was only needed for #98120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148683
Approved by: https://github.com/yanboliang
2025-03-06 21:20:04 +00:00
262411e48b [inductor] online softmax (#127011)
Softmax needs to do some preparation work that accesses the input tensor in two passes:
- compute amax of each row
- compute (x - amax).exp.sum for each row

When the row size is large, the cache cannot hold all the active data, and accessing the input in multiple passes increases execution time since the kernel is memory-bandwidth bound.

Online softmax uses a customized reduction to compute max and sum at the same time by accessing the data in one pass. Check this paper for more details ( https://arxiv.org/abs/1805.02867 ).

Also here is an online softmax kernel generated by inductor as a reference: https://gist.github.com/shunting314/67ae4fffd45d4f2753c781780332fa54
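
For illustration, a minimal PyTorch sketch of the one-pass max/sum reduction idea (not the Inductor-generated Triton kernel); the chunk size is arbitrary and the final normalization pass still reads the input once more.
```python
import torch

def online_softmax(x: torch.Tensor, chunk: int = 128) -> torch.Tensor:
    rows, cols = x.shape
    m = torch.full((rows,), float("-inf"))   # running max per row
    s = torch.zeros(rows)                    # running sum of exp per row
    for start in range(0, cols, chunk):      # single pass computes max and sum
        blk = x[:, start:start + chunk]
        new_m = torch.maximum(m, blk.max(dim=1).values)
        # rescale the running sum whenever the running max grows
        s = s * torch.exp(m - new_m) + torch.exp(blk - new_m[:, None]).sum(dim=1)
        m = new_m
    return torch.exp(x - m[:, None]) / s[:, None]
```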

## Microbenchmark

- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=0 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax` : without online softmax
  - eager_ms=6.671296119689941
  - opt_ms=8.06931209564209
- `TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_ONLINE_SOFTMAX=1 DO_PERF_TEST=1 python test/inductor/test_online_softmax.py -k test_softmax`: with online softmax
  - eager_ms=6.634047985076904
  - opt_ms=6.230591773986816

Ideally, online softmax should save about 2ms here. We save about 1.84ms in practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127011
Approved by: https://github.com/jansel
2025-03-06 21:07:18 +00:00
cf9efbdf16 Revert "Enable onednn in pytorch for ppc64le architecture (#143743)"
This reverts commit d4cf0e5af406239881acfeb4f9e4f62373faca8b.

Reverted https://github.com/pytorch/pytorch/pull/143743 on behalf of https://github.com/davidberard98 due to windows build failures look related [GH job link](https://github.com/pytorch/pytorch/actions/runs/13705127978/job/38329845095) [HUD commit link](d4cf0e5af4) ([comment](https://github.com/pytorch/pytorch/pull/143743#issuecomment-2704903253))
2025-03-06 20:47:57 +00:00
1add61c242 Replace `unimplemented` with `unimplemented_v2` in `codegen.py` (#148069)
Fixes #147913

- replace `unimplemented` in `codegen.py`
- remove unused import `unimplemented`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148069
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2025-03-06 20:42:37 +00:00
edd640a95a [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190)
Often makes the code more readable, more efficient, and adds support for infinite iterables.
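
An illustrative comparison (not taken from the PR itself): `chain.from_iterable` flattens lazily, which is what makes generator or infinite inputs work.
```python
import itertools

nested = [[1, 2], [3], [4, 5]]
flat = list(itertools.chain.from_iterable(nested))   # [1, 2, 3, 4, 5]
# vs. the eager equivalents it typically replaces:
# [x for sub in nested for x in sub]  or  list(itertools.chain(*nested))
```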

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148190
Approved by: https://github.com/jansel, https://github.com/malfet
2025-03-06 20:37:06 +00:00
65dbc3b454 [BE][MPS] Remove redundant handle_tensor_scalar_binary_op (#148685)
After https://github.com/pytorch/pytorch/pull/143934 `mtl_setBuffer` can handle scalar tensors correctly, so no need to have a specialized function here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148685
Approved by: https://github.com/dcci
2025-03-06 19:24:46 +00:00
29b28e9d9f Fix torch.nn.functional.hardswish gradients corner case (#148049)
Fixes #147801

## Changes

- Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html)
- Enable cuda for test `test_hardswish_grad_corner`
- Add test case for value=-3

## Test Result

```bash
pytest test/test_nn.py -k test_hardswish
pytest test/test_unary_ufuncs.py -k test_hardswish
pytest test/inductor/test_torchinductor.py -k test_hardswish
```

![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d)
![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8)
![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049
Approved by: https://github.com/soulitzer
2025-03-06 19:04:52 +00:00
f08146b67b [pytree] add APIs to determine a class is a namedtuple or PyStructSequence (#113257)
Changes in this PR:

1. Add `is_structseq` and `is_structseq_class` functions to determine a object or a class is PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to accept subclasses of namedtuple to be namedtuple. Before this PR, only namedtuple class directly created by `collections.namedtuple` or `typing.NamedTuple` were namedtuple classes while their subclasses were not. This PR makes `is_namedtuple` return true for subclasses of namedtuple class.
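
An illustrative sketch of the two kinds of objects involved; the exact import path of the new `is_structseq`/`is_structseq_class` helpers is not shown in this message, so only the object kinds are demonstrated.
```python
import collections
import time

Point = collections.namedtuple("Point", ["x", "y"])   # namedtuple class

class Point3D(Point):     # a namedtuple *subclass*, now also treated as namedtuple
    pass

st = time.localtime()     # time.struct_time is a PyStructSequence instance
print(type(st), isinstance(st, tuple), isinstance(Point3D(1, 2), tuple))
```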

Resolves #75982. New tests are included in this PR.

- #75982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
2025-03-06 18:59:02 +00:00
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
b85ae06bed Update CPU tolerance for f16 triplet margin loss (#147742)
Currently, the `test_torchinductor_opinfo` test for `nn.functional.triplet_margin_loss` fails on AArch64; this PR increases the acceptable ATOL and RTOL for this test when using F16. There is precedent for this, as XPU and CUDA already increase the tolerance. Additionally, the CPU backend increases the tolerance for the `with_distance_loss` variant of `triplet_margin_loss`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147742
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 18:09:43 +00:00
d10bacd4ce [AOTI][dashboard] Skip torchbench models not supported by export (#148359)
Summary: Certain models fail in export because of data-dependent ops. Skip them so that oncall can better track the AOTInductor dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148359
Approved by: https://github.com/angelayi, https://github.com/ysiraichi
2025-03-06 18:08:17 +00:00
d91a634edf [c10d] Make getDefaultBackend more fault tolerant (#148596)
This is a forward fix for #135338.
It hits an error like this:
```
"distributed_c10d.py", line 2156, in destroy_process_group
    if type(pg) == ProcessGroup and pg._has_hooks():
RuntimeError: Could not find the default backend type 0 for Process Group with name undefined.
```

When users call `init_process_group` without specifying a backend, the default backend is not set (or is set to `undefined`), hence the error above, which is triggered by the `_has_hooks()` call.

The fix wraps `getDefaultBackend` with a try-catch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148596
Approved by: https://github.com/LucasLLC, https://github.com/fduwjj
2025-03-06 18:07:43 +00:00
d4cf0e5af4 Enable onednn in pytorch for ppc64le architecture (#143743)
This PR enables oneDNN for the PowerPC architecture, which will help with quantizing models via oneDNN on PowerPC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143743
Approved by: https://github.com/malfet, https://github.com/albanD
2025-03-06 18:00:55 +00:00
097b0d372a [pytree] fix previously failed dynamo tests (#148669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148669
Approved by: https://github.com/zou3519
2025-03-06 17:59:29 +00:00
28b68b46bc Revert "[cutlass backend] fix assertion that prevent self multiplication (#148233)"
This reverts commit 4aeca28137dcee74b5fcd0c0636d0ee1f113d5fb.

Reverted https://github.com/pytorch/pytorch/pull/148233 on behalf of https://github.com/henrylhtsang due to mistake in PR  ([comment](https://github.com/pytorch/pytorch/pull/148233#issuecomment-2704534995))
2025-03-06 17:45:49 +00:00
3cde4c3069 [BE] Remove onlyCPU decorator from test_local_scalar_dense (#148559)
Follow-up to https://github.com/pytorch/pytorch/pull/145717; not sure why the author thinks those tests should be limited to one architecture.
And fixed similar crashes for CUDA and MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148559
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/seemethere
2025-03-06 17:43:02 +00:00
841451af9f Revert "[Inductor] Avoid tensor slice overflow for large step (#147433)"
This reverts commit 1d7397a2d04a4d636559f41511a20f7dadbe5777.

Reverted https://github.com/pytorch/pytorch/pull/147433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/147433#issuecomment-2704506627))
2025-03-06 17:33:08 +00:00
679e7d257e [mm_logs] follow up to add count info based on shape for inductor aten.mms (#148623)
Summary:
as title.

When `TORCH_LOGS="+inductor"` is enabled, you can get logs at the end such as:

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('benchmarking.TritonBenchmarker.benchmark_gpu', 2), **(('aten_addmm', (16, 6, 16)), 1)**, ('extern_calls', 1), ('async_compile_cache_miss', 1)]
graph_break []

Test Plan: follow up to add proper logging test.

Differential Revision: D70665104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148623
Approved by: https://github.com/henrylhtsang
2025-03-06 16:20:04 +00:00
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
5fb0f45d3b [triton 3.3] test_triton_kernel_constants fix (#148626)
Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505

TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead.

This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass).
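
A hedged before/after sketch of the pattern described above (the names are illustrative, not the actual test code):
```python
import triton.language as tl

# old style: rely on the annotation alone, which newer Triton deprecates here
OLD_CONST: tl.constexpr = 3.14

# new style: wrap the value explicitly so it is still treated as a constexpr
NEW_CONST = tl.constexpr(3.14)
```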

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626
Approved by: https://github.com/FindHao
2025-03-06 14:18:21 +00:00
d5184901c4 Make torch.serialization.skip_data work with torch.load (#148018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148018
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787, #147788
2025-03-06 12:04:46 +00:00
be0ceee1c3 Make record/storage alignment in torch.save configurable (#147788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147788
Approved by: https://github.com/albanD
ghstack dependencies: #147786, #147787
2025-03-06 12:04:46 +00:00
209977e6e5 Add information about checkpoint offset to untyped storages when torch.load under FakeTensorMode (#147787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147787
Approved by: https://github.com/albanD
ghstack dependencies: #147786
2025-03-06 12:04:39 +00:00
bdcc1b579b Allow torch.load under FakeTensorMode to load FakeTensors with correct devices (for plain Tensors) (#147786)
This only fixes _rebuild_tensor_v2 and _rebuild_tensor_v3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147786
Approved by: https://github.com/albanD
2025-03-06 12:04:32 +00:00
79aa17489c [dynamo] ctx_manager.py: replace unimplemented with unimplemented_v2 (#148570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148570
Approved by: https://github.com/williamwen42
ghstack dependencies: #148454
2025-03-06 07:46:31 +00:00
e7bc1d1791 [ONNX] Update saved exported program in debugging report if the exporting passes run_decomposition() (#148617)
Prior to this PR, if the export ran run_decomposition(), the report still showed the exported_program before decomposition, which made it harder for users to check the exported program that is actually used to translate to the ONNX graph.

The following example is what we see before this PR:

```
# PyTorch ONNX Conversion Report

```
 Obtain model graph with `torch.export.export(..., strict=False)`
 Obtain model graph with `torch.export.export(..., strict=True)`
 Obtain model graph with `torch.jit.trace`
 Decompose operators for ONNX compatibility
 Translate the graph into ONNX
 Run `onnx.checker` on the ONNX model
 Execute the model with ONNX Runtime
 Validate model output accuracy
```

## Error messages

```pytb

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 707, in _translate_fx_graph
    _handle_call_function_node_with_lowering(

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 486, in _handle_call_function_node_with_lowering
    raise _errors.DispatchError(

torch.onnx._internal.exporter._errors.DispatchError: No ONNX function found for <OpOverload(op='aten.slice', overload='Tensor')>. Failure message: No decompositions registered for the complex-valued input

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1371, in export
    onnx_program = _exported_program_to_onnx_program(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 1007, in _exported_program_to_onnx_program
    values = _translate_fx_graph(
             ^^^^^^^^^^^^^^^^^^^^

  File "/home/titaiwang/pytorch/torch/onnx/_internal/exporter/_core.py", line 733, in _translate_fx_graph
    raise _errors.ConversionError(

torch.onnx._internal.exporter._errors.ConversionError: Error when translating node %slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%_to_copy, 0, 0, 9223372036854775807), kwargs = {}). See the stack trace for more information.

```

## Exported program

```python
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x: "f32[3, 4]"):
             # File: /home/titaiwang/pytorch/test_slice_complex.py:6 in forward, code: x_complex = x.to(torch.complex64)
            to: "c64[3, 4]" = torch.ops.aten.to.dtype(x, torch.complex64);  x = None

             # File: /home/titaiwang/pytorch/test_slice_complex.py:8 in forward, code: return x_complex[:, :2]
            slice_1: "c64[3, 4]" = torch.ops.aten.slice.Tensor(to, 0, 0, 9223372036854775807);  to = None
            slice_2: "c64[3, 2]" = torch.ops.aten.slice.Tensor(slice_1, 1, 0, 2);  slice_1 = None
            return (slice_2,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='slice_2'), target=None)])
Range constraints: {}

```

## Analysis

PyTorch ONNX Conversion Analysis

## Model Information

The model has 0 parameters and 0 buffers (non-trainable parameters).
Number of parameters per dtype:
```python
defaultdict(<class 'int'>, {})
```
Number of buffers per dtype:
```python
defaultdict(<class 'int'>, {})
```

Inputs:
- `x`: `TensorMetadata(shape=torch.Size([3, 4]), dtype=torch.float32, requires_grad=False, stride=(4, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})`

Outputs:
- `slice_2`: `TensorMetadata(shape=torch.Size([3, 2]), dtype=torch.complex64, requires_grad=False, stride=(4, 1), memory_format=None, is_quantized=False, qparams={})`

The FX graph has 5 nodes in total. Number of FX nodes per op:
- `placeholder`: 1
- `call_function`: 3
- `output`: 1

Of the call_function nodes, the counts of operators used are:

- `aten.slice.Tensor`: 2
- `aten.to.dtype`: 1

## ONNX Conversion Information

The model contains operators the dispatcher could not find registered ONNX decompositions for. This may be due to missing implementations, decompositions not registered correctly, or a bug in the dispatcher.

Errors grouped by operator:

- `aten.to.dtype`:     No decompositions registered for the real-valued input. Example node: `%to : [num_users=1] = call_function[target=torch.ops.aten.to.dtype](args = (%x, torch.complex64), kwargs = {})`. All nodes: `[to]`
- `aten.slice.Tensor`:     No decompositions registered for the complex-valued input. Example node: `%slice_1 : [num_users=1] = call_function[target=torch.ops.aten.slice.Tensor](args = (%to, 0, 0, 9223372036854775807), kwargs = {})`. All nodes: `[slice_1, slice_2]`

## Decomposition comparison

Ops exist only in the ExportedProgram before decomposition: `['aten.to.dtype']`

Ops exist only in the ExportedProgram after decomposition: `['aten._to_copy.default']`

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148617
Approved by: https://github.com/justinchuby
2025-03-06 07:03:45 +00:00
ae6bb58483 Revert "[cutlass backend] Forward fix for less aligned gemm shapes (#148521)"
This reverts commit ad49cfc9f0a8a4d8881b3734edd8c33a087c8b97.

Reverted https://github.com/pytorch/pytorch/pull/148521 on behalf of https://github.com/davidberard98 due to broke lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/13690720601/job/38283359447) [HUD commit link](ad49cfc9f0) ([comment](https://github.com/pytorch/pytorch/pull/148521#issuecomment-2702980028))
2025-03-06 06:59:39 +00:00
4dc956a1d8 [Inductor][Triton] Fix test_autotune_inplace_kernel to work with newer Triton version (#148595)
For new Triton version 3.3, constexpr are included as part of the signature. Update failing test to reflect this change, additional context in https://github.com/pytorch/pytorch/pull/145051.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148595
Approved by: https://github.com/davidberard98
2025-03-06 05:37:08 +00:00
1fac47702e [Break XPU][Inductor UT] Generalize device-bias code introduced by #146866. (#148534)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148534
Approved by: https://github.com/nandesuka
2025-03-06 04:39:50 +00:00
f057206fca [ONNX] Support complex comparison when verify=True (#148619)
Previously, the comparison of complex numbers was not supported when `verify=True`.

NOTE: This PR can be extended to support more complex comparison cases if there are other places in onnx codebase needed to be changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148619
Approved by: https://github.com/justinchuby
2025-03-06 04:38:43 +00:00
8b65d522e1 refactor delayed compile to use code context (#148530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148530
Approved by: https://github.com/williamwen42
ghstack dependencies: #148509
2025-03-06 04:02:30 +00:00
ad49cfc9f0 [cutlass backend] Forward fix for less aligned gemm shapes (#148521)
Differential Revision: [D70600093](https://our.internmc.facebook.com/intern/diff/D70600093/)

1. Check if config name filtering still works.
Tested, it works

2. do we get C++ compile error
Yes, potentially we need to filter them out manually.

Here we get this.
```
static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0));
```
We need to move some assertions to gemm_template.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148521
Approved by: https://github.com/ColinPeppler
2025-03-06 03:42:55 +00:00
02e1580e39 [MPS] fix crash for mse loss with 0 numel inputs (#148608)
Fixes #148589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148608
Approved by: https://github.com/malfet
2025-03-06 03:32:34 +00:00
8728d4b815 Clear triton kernels after parent make_launcher (#148604)
Before, we were clearing the cache only after inductor compile. But inductor may not **always** compile, i.e. on AOTAutogradCache hit.

So instead, we should clear it when the future is consumed. This is a more robust fix for the issue in D69476856

Differential Revision: [D70646281](https://our.internmc.facebook.com/intern/diff/D70646281/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148604
Approved by: https://github.com/masnesral
2025-03-06 03:28:38 +00:00
cyy
1433bc1455 Remove CAFFE2_USE_EXCEPTION_PTR (#147247)
The check is for older compilers and is now always true.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147247
Approved by: https://github.com/janeyx99
2025-03-06 02:56:23 +00:00
43e1284c96 Fix empty matrix handling of addmv in inductor (#143792)
This is a resubmission of my previous PR that I accidentally deleted; apologies in advance for any inconvenience caused. Below are the details of this PR.

Fix an issue where torch.addmv behaves inconsistently between torch.compile mode and eager mode. Here is the code to reproduce it:

```
import torch
import numpy as np

@torch.compile
def test_optimized(input, mat, vec):
    return torch.addmv(input, mat, vec)

def test(input, mat, vec):
    return torch.addmv(input, mat, vec)

input = torch.tensor([2], dtype=torch.int32)
mat = torch.tensor(np.random.randn(0, 0), dtype=torch.int32)
vec = torch.tensor([])
origin_out = test(input, mat, vec)
optimized_out = test_optimized(input, mat, vec)
print(origin_out)  # tensor([2.])
print(optimized_out)  # tensor([])
```

According to the equation (https://pytorch.org/docs/stable/generated/torch.addmv.html), when the matrix and vector are empty, returning `[2.]` seems more reasonable to me.

Following the CPU implementation of this API: e97b97af56/aten/src/ATen/native/Blas.cpp (L62)

I add an additional branch to handle the empty matrix case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143792
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-06 02:09:27 +00:00
38b3375a81 [MTIA] Use "ieee" instead of "tf32" for MTIA's default precision in FlexAttention (#148565)
Summary: MTIA supports ieee but not tf32, so we set the default precision of MTIA to ieee similar to how it's done for AMD.

Test Plan: CI

Reviewed By: mortzur

Differential Revision: D70072064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148565
Approved by: https://github.com/mortzur
2025-03-06 02:07:18 +00:00
32715a2311 [inductor][ck] add kBatch_sweep to config.rocm (#148223)
Summary:
# Why

Enable tests and users to specify a set of kBatches to try rather than relying on our hand-written heuristic.

# What

Add rocm.kBatch_sweep as a list of kBatches to try out. These generate a product of CK instances, one per kBatch for each existing op, though they are often filtered out if they are likely to fail at runtime.
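
A hedged usage sketch, assuming the option lands under `torch._inductor.config.rocm` as the title suggests; the values are illustrative.
```python
import torch._inductor.config as inductor_config

# Generate CK gemm instances for each of these kBatch values (illustrative).
inductor_config.rocm.kBatch_sweep = [1, 2, 4, 8]
```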

Test Plan: n/a

Reviewed By: chenyang78

Differential Revision: D70226055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148223
Approved by: https://github.com/ColinPeppler
2025-03-06 01:14:33 +00:00
63fbc738dc [Easy/Profiler] Add last entry to truncated values (#148576)
Summary: Since the ranks of a PG are usually in a consecutive range it is useful to print the last values when truncating metadata

Test Plan:
Manually changed truncate length to 2 and ran 4 gpu graph to get the following trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu003.rva5.facebook.com/rank-1.Mar_05_09_48_21.1280355.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D70637461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148576
Approved by: https://github.com/davidberard98
2025-03-06 01:14:15 +00:00
23441492f6 [scan] Refactoring of input checking and dynamo invocation (#142125)
This PR does a refactoring of the way dynamo is invoked and how the input shapes are checked for scan and for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142125
Approved by: https://github.com/ydwu4
2025-03-06 01:06:54 +00:00
6cc3e69103 [inductor] use eager stride for custom op if no tags (#148367)
Fix https://github.com/pytorch/pytorch/issues/148356

This is some sort of short term fix to recover the default behavior to apply layout constraint for custom ops when there are no tags.

A longer term attempt to make sure Inductor always gets correct eager strides is here: https://github.com/pytorch/pytorch/pull/148104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148367
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-03-06 00:58:00 +00:00
703176e538 [ROCm] Fix sort for non-standard bool (#147459)
When converting from uint8 to bool using `view` op, we get a bool that has 0 for false and a non-zero value for true. However, these kinds of bool have undefined behavior. We only read the last bit as 0 or 1 to convert to false or true.

In this fix, we convert bools to uint8, which will convert false to 0 and non-zero value to 1. Essentially, converting non-standard bool to a standard bool and fixing the sort op for non-standard bool.
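
An illustrative repro of the "non-standard" bool described above; the resulting values are intentionally not asserted here, since their interpretation is exactly the undefined behavior being fixed.
```python
import torch

t = torch.tensor([0, 2, 255], dtype=torch.uint8)
b = t.view(torch.bool)   # bitwise reinterpretation: the bytes stay 0, 2, 255
# Per the fix description, sorting such bools now routes through uint8 so
# that any non-zero byte is treated uniformly as True.
sorted_b, _ = torch.sort(b)
```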

Fixes #139972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147459
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-03-06 00:23:02 +00:00
690fc2c876 Add aot_eager_then_compile stance (#148509)
Sometimes the `eager_then_compile` stance isn't enough: some models are so close to the memory limit that going to eager will OOM because we don't get the memory reductions from activation checkpointing. This PR introduces `aot_eager_then_compile`, which avoids the expensive inductor compile but still does aot_eager to get the memory-reduction benefits in the first invocation.
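
A hedged usage sketch based on this description; `set_stance` availability and the exact behavior of the new stance depend on the torch version.
```python
import torch

torch.compiler.set_stance("aot_eager_then_compile")

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

# Per the description, the first invocation runs via aot_eager (cheaper,
# with memory reductions), and later invocations use the full compile.
f(torch.randn(8))
```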

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148509
Approved by: https://github.com/williamwen42
2025-03-05 23:23:45 +00:00
d6d670ab4d [AOTI] build CPU CPP kernels at O3, and all other code at O1 (#148587)
In the future, we may also want to add LTO linking to further optimize the results (while still hopefully netting compile time benefits).

Differential Revision: [D70641543](https://our.internmc.facebook.com/intern/diff/D70641543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148587
Approved by: https://github.com/desertfire
2025-03-05 22:47:46 +00:00
897fd9b514 Revert "Subprocess compile (#146134)"
This reverts commit 07f876e9602ec6881df2360ab4817e129b563b7c.

Reverted https://github.com/pytorch/pytorch/pull/146134 on behalf of https://github.com/malfet due to looks like it broke slow jobs, see e1dee4ccb3/3 ([comment](https://github.com/pytorch/pytorch/pull/146134#issuecomment-2702239123))
2025-03-05 22:41:19 +00:00
e1dee4ccb3 [ONNX] Assert capture strategy in tests (#148348)
Previously, the strategy used for obtaining the exported program was not asserted. This leads to silent errors if torch.export breaks something and a fallback strategy is used. This change adds a _capture_strategy field to ONNXProgram and enables unit tests to assert the strategy used, to prevent fallbacks from happening silently.

Fixes #147674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148348
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 22:31:54 +00:00
5ccd659c0e Fix decomp for linspace (#147997)
In python decompositions, we shouldn't do any non-functional operations for functional operators. This should go away once we start decomposing before functionalization.

Differential Revision: [D70265200](https://our.internmc.facebook.com/intern/diff/D70265200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147997
Approved by: https://github.com/zou3519
2025-03-05 22:10:08 +00:00
9e755a1c03 [ROCm] add gfx12 to nightly wheels (#148562)
Adds gfx1200 and gfx1201 to PYTORCH_ROCM_ARCH for wheels and libtorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148562
Approved by: https://github.com/jeffdaily
2025-03-05 21:56:22 +00:00
2a639ce1d7 Add new hf storage class to torch.distributed package (#148361)
Summary:
title - Add a new hf storage class to the torch.distributed package so that it can be imported by customers.
The HF storage reader/writer were added as DCP storage components so that DCP load and save can directly interact with the Hugging Face format and storage.

Test Plan: ensure signals pass

Differential Revision: D70495399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148361
Approved by: https://github.com/MeetVadakkanchery
2025-03-05 21:52:06 +00:00
10354e146f Re-enable test_torchinductor:test_buffer_batch_norm (#148573)
Summary: Per https://github.com/pytorch/pytorch/issues/128198 seems like this is working now
Fixes #128198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148573
Approved by: https://github.com/StrongerXi
2025-03-05 21:51:24 +00:00
87bd3471ff [c10d] Move record param for init to the right place (#148571)
The place where we record the init log does not look correct. We move it to the beginning of comm init.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148571
Approved by: https://github.com/kwen2501
2025-03-05 21:43:30 +00:00
ad9a10aff0 [dynamo] Make nonstrict_trace work with some pytree.register_constant-ed instances (#148007)
As title, this enables `nonstrict_trace`-ed function to take in object
whose type has been `pytree.register_constant`-ed, as long as the object
existed outside the `torch.compile` region. This also forces Dynamo to
emit a `EQUALS_MATCH` guard on the object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148007
Approved by: https://github.com/zou3519
ghstack dependencies: #148385
2025-03-05 21:28:26 +00:00
a10f577ee0 [dynamo] Account for function id reuse in relevant Dynamo decorators (#148385)
This fixes a recent series of flaky failure from `nonstrict_trace` unit
tests: #148166, #148056, #148055, #148054, #148034, #148033, #148032, #148031.

For now we don't need to worry about the other decorators because they
are either meant for builtin/numpy functions (which should never
deallocate in practice), or used for polyfills which keeps the function
object in `get_torch_obj_rule_map()`.

Fixes #147777.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148385
Approved by: https://github.com/zou3519
2025-03-05 21:28:26 +00:00
4aeca28137 [cutlass backend] fix assertion that prevent self multiplication (#148233)
# Problem:
In a matmul, sometimes some of the nodes are the same, say `A @ A`. In that case, when writing the stride of the second input (node B), we have to figure out whether we want lda or ldb; both point to the same node, and we have no way to differentiate which one.

# Solution
Just use whichever. Since they are the same.

# Question
What if we compile with `A @ A`, and then pass in `A @ B`? Well inductor guards will raise an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148233
Approved by: https://github.com/ColinPeppler
2025-03-05 21:26:22 +00:00
ed9624ee60 [export] Fix AttrProxy slicing (#148507)
Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1159599265750221/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148507
Approved by: https://github.com/zhxchen17
2025-03-05 21:03:15 +00:00
dd6ec8706e [BE] Relax sympy dependency to 1.13.3 or newer (#148575)
Fixes https://github.com/pytorch/pytorch/issues/145225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148575
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-03-05 20:51:16 +00:00
9efa9c73f6 [Dynamo] Replace unimplemented with unimplemented_v2 for variables/distributed (#148500)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148500
Approved by: https://github.com/williamwen42
2025-03-05 20:41:43 +00:00
98458e5c81 Add a docstring to build.sh (#144566)
Add a little blurb to explain what build.sh is doing.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-03-05 15:26:37 -05:00
c6a05df174 [ONNX] Use onnxscript apis for 2.7 (#148453)
Use onnxscript apis for 2.7.

Remove reference to `torchlib_opset()` and `torchlib_opset_version()` which were removed in the onnxscript 2.7 apis. These apis were removed because torchlib in onnxscript will always stay on opset 18. Future opset version bumps will happen in pytorch core after the migration of torchlib.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148453
Approved by: https://github.com/titaiwangms, https://github.com/shubhambhokare1
2025-03-05 20:10:00 +00:00
c9edd37ffb Revert "[dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)"
This reverts commit 9eef457c0241f87097a2ca7625f9961e31f3adcd.

Reverted https://github.com/pytorch/pytorch/pull/148377 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/13683650448/job/38261818684) [HUD commit link](9eef457c02) probably landrace ([comment](https://github.com/pytorch/pytorch/pull/148377#issuecomment-2701903810))
2025-03-05 19:45:16 +00:00
c5d92edd5a [dynamo] WeakRefVar reconstruct (#148083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148083
Approved by: https://github.com/anijain2305
2025-03-05 19:34:17 +00:00
50e827b3df [ONNX] Create VerificationInterpreter (#148396)
An fx interpreter for comparing ONNX values with pytorch ones.

```py
import torch
from torch.onnx._internal.exporter._verification import VerificationInterpreter

class Model(torch.nn.Module):
    def forward(self, query, key, value):
        res = torch.nn.functional.scaled_dot_product_attention(
            query, key, value
        )
        rest = res.transpose(0, 1)
        return rest.view(8, 32, 128 * 64)

model = Model()

query = torch.rand(32, 8, 128, 64, dtype=torch.float16)
key = torch.rand(32, 8, 128, 64, dtype=torch.float16)
value = torch.rand(32, 8, 128, 64, dtype=torch.float16)

onnx_program = torch.onnx.export(model, (query, key, value), dynamo=True)
interpreter = VerificationInterpreter(onnx_program)
interpreter.run(query, key, value)
for info in interpreter.verification_infos:
    print(info)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148396
Approved by: https://github.com/titaiwangms
2025-03-05 19:18:52 +00:00
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
9eef457c02 [dtensor] add aten._scaled_dot_product_cudnn_attention.default op support (#148377)
### Summary
This PR adds `_scaled_dot_product_cudnn_attention` to DTensor ops and tests it with a unit test. This should allow Context Parallel and Tensor Parallel to use cuDNN SDPA.

### Test
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148377
Approved by: https://github.com/drisspg
2025-03-05 19:09:52 +00:00
9dd46a9233 Deprecate sm70 for cuda 12.8 binary (#147607)
Follow-up for https://github.com/pytorch/pytorch/pull/146265/files, dropping sm_70 as well, since "Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release."

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147607
Approved by: https://github.com/atalman
2025-03-05 18:54:17 +00:00
3f4311d589 [CD] Upgrade xpu runtime pypi packages version and enable windows kineto again (#148319)
Fixes https://github.com/pytorch/pytorch/issues/145155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148319
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-03-05 18:39:55 +00:00
9db9593bba Add some more meta kernels (#147862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862
Approved by: https://github.com/zou3519
2025-03-05 18:33:00 +00:00
e555c4d8ae Fix bug in AOTI lowering (#148364)
Fixes: https://github.com/pytorch/pytorch/issues/148370

Differential Revision: [D70514480](https://our.internmc.facebook.com/intern/diff/D70514480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148364
Approved by: https://github.com/desertfire
2025-03-05 18:27:15 +00:00
38479e495e Add note to get start xpu (#148168)
Installing PyTorch from binaries will automatically install the runtime packages of Intel® Deep Learning Essentials. In this case, if we activate oneAPI in a standalone installation of Intel® Deep Learning Essentials, there will be an environment issue. Therefore, add a note to remind users to avoid this situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148168
Approved by: https://github.com/janeyx99

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-03-05 18:11:14 +00:00
c65ee728f0 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.
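
For illustration, a minimal sketch of the allocation path these stats cover; the `torch.cuda.host_memory_stats()` accessor named below is a hypothetical/assumed binding and is guarded accordingly:

```py
import torch

# Exercise the CachingHostAllocator: pinned (page-locked) host allocations
# are what the new statistics track.
for _ in range(4):
    host = torch.empty(1 << 20, pin_memory=True)   # pinned host buffer
    dev = torch.empty(1 << 20, device="cuda")
    dev.copy_(host, non_blocking=True)             # async H2D copy through pinned memory

torch.cuda.synchronize()

# Hypothetical accessor -- this PR adds the stats on the C++ side; the exact
# Python binding (if any) is not shown in the commit message, so guard it.
if hasattr(torch.cuda, "host_memory_stats"):
    print(torch.cuda.host_memory_stats())
```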

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-03-05 16:13:19 +00:00
70c5edb697 [ROCm] fix CK compile for gfx1200 (#148496)
gfx1200 causes the CK-based GEMM to fail to compile because CK is choosing an incorrect FP8 interpretation.  CK assumes FP8 interpretation is static and chosen prior to compilation.  This PR is a work-around that makes the selection dynamic during hipclang compilation passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148496
Approved by: https://github.com/jeffdaily
2025-03-05 16:11:03 +00:00
864b75dd50 [MPS] Fix unary_kernel_strided logic (#148512)
Fixes bug introduced by https://github.com/pytorch/pytorch/pull/148350
Before this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[  0.0000,   1.4142,   2.0000,   2.4495],
        [ 80.0000,  82.0000,  84.0000,  86.0000],
        [ 96.0000,  98.0000, 100.0000, 102.0000],
        [112.0000, 114.0000, 116.0000, 118.0000]], device='mps:0')
```
After this change
```
% python3 -c "import torch; x, y = torch.arange(128.0, device='mps').reshape(2, 8, 8).unbind(0); print(torch.sqrt(x[::2, ::2], out=y[::2, ::2]))"
tensor([[0.0000, 1.4142, 2.0000, 2.4495],
        [4.0000, 4.2426, 4.4721, 4.6904],
        [5.6569, 5.8310, 6.0000, 6.1644],
        [6.9282, 7.0711, 7.2111, 7.3485]], device='mps:0')
```
One cannot avoid copies merely because input and output tensors have the same strides; one also needs to make sure that they are dense in storage (a transposed tensor would be dense, but, say, selecting every other column would not be).

Add regression test to prevent those from happening again

Also, no need to check that sizes match, luckily it is checked by the structured op (and `out` for unary ops does not support broadcasting, I just checked)

Revived the needs_copy logic, though it will become irrelevant after https://github.com/pytorch/pytorch/pull/148468 is landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148512
Approved by: https://github.com/janeyx99
2025-03-05 15:57:54 +00:00
8274da9312 [c10d][PGNCCL] Fix capturability of isend and irecv (#148462)
This PR fixes an issue of inability to capture `isend`/`irecv` ops in `async` mode.

<details>
<summary>The repro code</summary>

```Python
import os
import torch
import torch.distributed as dist

USE_ASYNC = True

def test_func(x, rank):
    if rank == 0:
        x += 1
        # Send the tensor to process 1
        if USE_ASYNC:
            a = dist.isend(tensor=x, dst=1)
        else:
            dist.send(tensor=x, dst=1)
    else:
        # Receive tensor from process 0
        if USE_ASYNC:
            a = dist.irecv(tensor=x, src=0)
        else:
            dist.recv(tensor=x, src=0)
    if USE_ASYNC:
        a.wait()
    return x + 2

def run(rank):
    torch.cuda.set_device(rank)
    x = torch.ones(1, device='cuda')
    with torch.cuda.stream(torch.cuda.Stream()):
        for i in range(11):
            x.copy_(torch.ones(1, device='cuda'))
            y = test_func(x, rank)
            print(f"Rank{rank} has data {y} in warmup")
    torch.cuda.synchronize()
    graph = torch.cuda.CUDAGraph()

    x.copy_(torch.ones(1, device='cuda'))
    with torch.cuda.graph(graph):
        y = test_func(x, rank)

    for i in range(1):
        x.copy_(torch.ones(1, device='cuda'))
        graph.replay()
    print(f"Rank{rank} has data {y} after graph replay")

def main():
    rank = int(os.environ['RANK'])
    local_rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    run(local_rank)

if __name__ == "__main__":
    main()
```
</details>

Fails with an error stating that the work handle is None:
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/repro.py", line 54, in <module>
[rank1]:     main()
[rank1]:   File "/workspace/repro.py", line 51, in main
[rank1]:     run(local_rank)
[rank1]:   File "/workspace/repro.py", line 38, in run
[rank1]:     y = test_func(x, rank)
[rank1]:         ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/workspace/repro.py", line 22, in test_func
[rank1]:     a.wait()
[rank1]:     ^^^^^^
[rank1]: AttributeError: 'NoneType' object has no attribute 'wait'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148462
Approved by: https://github.com/kwen2501
2025-03-05 15:49:53 +00:00
19a6cf35f6 add input shape check for _local_scalar_dense (#145717)
Fix https://github.com/pytorch/pytorch/issues/145066.
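
For reference, a minimal sketch of the kind of input the new check guards against (`Tensor.item()` lowers to `_local_scalar_dense`):

```py
import torch

x = torch.tensor([1.0, 2.0])
try:
    x.item()  # _local_scalar_dense requires a single-element tensor
except RuntimeError as e:
    print(e)
```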

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145717
Approved by: https://github.com/malfet
2025-03-05 15:24:08 +00:00
96afa8a2bb [TEST][SPARSE] Simplify branching in test_cusparselt_backend (#148318)
With the introduction of new CUDA versions, the branching has become more complicated. This PR simplifies the branching in `test_cusparselt_backend` to avoid checking each and every CUDA version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148318
Approved by: https://github.com/jcaip
2025-03-05 10:17:00 +00:00
0ef2e938d0 [ROCm] [TunableOp] Track top solutions during tuning process (#147243)
For each set of GEMM parameters that is evaluated by TunableOp, keep track of the top 5 solutions. Print the top 5 solutions when `PYTORCH_TUNABLEOP_VERBOSE=2`.
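
A rough sketch of how the verbose output is enabled, assuming a ROCm build with TunableOp support; the matrix sizes are illustrative:

```py
import os

# TunableOp reads these environment variables at initialization,
# so set them before the first GEMM runs.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_VERBOSE"] = "2"  # level 2 now also prints the top solutions

import torch

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
torch.mm(a, b)  # tuning (and the top-solution log) happens on first use
```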

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147243
Approved by: https://github.com/jeffdaily
2025-03-05 09:35:02 +00:00
6c3492b491 [ROCm] Enable mi300-specific workflows to be triggered on PRs (#147904)
This change will be needed to be able to trigger the MI300-specific CI workflows on PRs by using a PR label.

* inductor-rocm-mi300.yml uses the existing `ciflow/inductor-rocm` label so that any PR manually labeled as such will trigger `inductor` config runs on both MI200 and MI300.
* rocm-mi300.yml uses a separate `ciflow/rocm-mi300` label, since we don't want to over-trigger `default` config runs on MI300 runners due to limited capacity, and [`ciflow/rocm` label is automatically applied](79438512a0/torchci/lib/bot/autoLabelBot.ts (L24)) on many PRs.
* inductor-perf-test-nightly-rocm.yml uses a separate `ciflow/inductor-perf-test-nightly-rocm` label, so that we can manually trigger a round of perf testing on MI300 runners to test the perf impact of a major inductor-related change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147904
Approved by: https://github.com/huydhn
2025-03-05 06:00:37 +00:00
2295efa1b3 Fix only logging ir_post_fusion with torch_compile_debug enabled (#148499)
Because we were invoking the logs through `V.debug`, they were not emitted if TORCH_COMPILE_DEBUG was not set. This is because there is some magic in the debug [getattr](d789c22712/torch/_inductor/debug.py (L468-L480)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148499
Approved by: https://github.com/shunting314
2025-03-05 05:35:09 +00:00
fb1b7ec173 Remove deprecated method and attribute in LRScheduler (#147301)
Following the [#99270 suggestion](https://github.com/pytorch/pytorch/issues/99270#issuecomment-1511656408), remove the deprecated method `LRScheduler.print_lr`
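
For anyone migrating, a small sketch of the replacement pattern (query `get_last_lr()` instead of calling the removed `print_lr()`):

```py
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)

for _ in range(3):
    opt.step()
    sched.step()
    # print_lr() is removed; read the current learning rates instead
    print(sched.get_last_lr())
```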

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147301
Approved by: https://github.com/janeyx99
2025-03-05 05:30:19 +00:00
df7e43e5d4 [AOTI] Fix aot_inductor_package test errors (#148279)
Summary: Fix fbcode test failures introduced by https://github.com/pytorch/pytorch/pull/147975. Make sure script.ld is copied to the build-time directory.

Differential Revision: D70454149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148279
Approved by: https://github.com/zoranzhao
2025-03-05 05:22:48 +00:00
b020d166f2 Stage 1 of deprecating silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-03-05 05:15:59 +00:00
913356fb41 Fix recent regression in evaluate_expr that affects cache lookups (#147836)
PR https://github.com/pytorch/pytorch/pull/146939/ added an argument to evaluate_expr for the purpose of logging.
This caused a regression that we initially thought was due to calling id on the symnode.

I dug deeper and found that, although the added argument does not affect the results of evaluate_expr, it messes up cache lookups.
I refactored the code to avoid using expr_sym_node_id in the cache lookup; I also introduced evaluate_sym_node and simplified the calls to evaluate_expr.
#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147836
Approved by: https://github.com/oulgen
2025-03-05 04:11:41 +00:00
ed8ec0cb98 [cutlass backend][BE] Fix two small things in cutlass backend standalone debugger (#148493)
Differential Revision: [D70583777](https://our.internmc.facebook.com/intern/diff/D70583777/)

Two really small things:
* The bits in BlockFillRandomUniform would round floats to ints
* When bias exists, the order of args is C, A, B, D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148493
Approved by: https://github.com/chenyang78
2025-03-05 04:01:36 +00:00
e0ea593974 [CD] Upgrade Windows xpu support package to 2025.0.1 for binary compression (#148313)
The binary compression feature can reduce the size of the Torch XPU Windows wheel packages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148313
Approved by: https://github.com/atalman
2025-03-05 03:00:27 +00:00
1673bc7610 [mm_logs][ez] dump tuned mm info at lowering stage (#148363)
Summary:
As titled. It would be beneficial for judging e2e perf improvements.

An easy first step is to dump mm info at the lowering stage.

e.g.

```
fbsource/fbcode/caffe2/torch/_inductor/kernel/mm.py:525] [0/0] Tuned aten.addmm: m=16, n=6, k=16, layout=FixedLayout('cuda:0', torch.float32, size=[16, 6], stride=[6, 1])
```

Next step:

Dump overview info at `post_grad_graph` stage such as
overall count of `aten.mm` in the graph & visualize to a table structure.

Test Plan: by looking very hard in aot inductor bmm and mm UTs.

Differential Revision: D70507880

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148363
Approved by: https://github.com/henrylhtsang
2025-03-05 02:21:27 +00:00
edc3ca577e [Profiler] Add profiler activity for HPU devices (#148182)
Fixes #148181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148182
Approved by: https://github.com/sraikund16
2025-03-05 01:37:48 +00:00
3985ce0b88 [dynamo] rename test_graph_break_messages -> test_error_messages (#148220)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148220
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #148205
2025-03-05 01:16:53 +00:00
b28cbe5db3 [dynamo] remove internal stack trace for fullgraph=True graph breaks (#148205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148205
Approved by: https://github.com/zou3519
2025-03-05 01:16:53 +00:00
2927a64357 [inductor][cpu] Fix error with FlexibleLayout weights in BMM (#148188)
Fixes #148074

When node A is reshaped (is a `ReinterpretView`) and node B has a `FlexibleLayout`, then the layout of node B *may* be changed during the `kernel.select(options["W"], 0, self.b_index)` call, which could cause the assertion in `kernel.select` to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148188
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-03-05 01:05:05 +00:00
713a504a82 [dynamo][guards] Fix mem leak caused by refcount increment (#148480)
Should help [internalfb.com/sevmanager/view/491701](https://www.internalfb.com/sevmanager/view/491701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148480
Approved by: https://github.com/xmfan, https://github.com/StrongerXi, https://github.com/williamwen42, https://github.com/zou3519
2025-03-05 01:04:08 +00:00
b5873292c6 Add overload names to profiler trace (#143114)
Currently, recorded profiler events for aten ops do not store overload names. It would be useful to know which overloads are actually called to analyse performance.
For example, consider the following dispatch trace which occurs if there is a fallthrough kernel registered for aten::add:
```
             [call] op=[aten::add.Tensor], key=[AutogradCPU]
               [redispatch] op=[aten::add.Tensor], key=[Undefined]
                 [call] op=[aten::empty.memory_format], key=[BackendSelect]
                   [redispatch] op=[aten::empty.memory_format], key=[CPU]
                 [call] op=[aten::add.out], key=[CPU]
```

In this case, aten::add.out is a child of aten::add.Tensor, however the current profiler trace provides no way to differentiate aten op calls.

See the added unit test for a more detailed example.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143114
Approved by: https://github.com/sraikund16
2025-03-05 01:00:29 +00:00
cf5e3f3cea Add cutlass kernel for rowwise scaled mm on sm100 (#148421)
### Important
- Previous PR in stack https://github.com/pytorch/pytorch/pull/148274
- Despite the changes between sm90 vs sm100 being fairly minimal, I created a separate kernel since we'll be making various arch specific perf optimizations to the sm100 kernel next.
- This kernel has not been optimized yet. However, initial perf testing shows numbers which indicate the tensor cores are being utilized as expected (not just CUDA cores).

### Summary of changes
- This PR adds a new cutlass kernel for rowwise GEMM on sm100.
- sm100 kernel is based on sm90 kernel, with the following changes:
  - Use new arch tag `cutlass::arch::Sm100`
  - Do not use [large tile](4eb0c45297/aten/src/ATen/native/cuda/RowwiseScaledMM.cu (L203)) schedule in CollectiveMainLoop or CollectiveEpilogue (causes build errors)
- SM90 vs SM100 kernel diff: https://www.diffchecker.com/ZCAPaFAg/

### Next steps
- Arch specific performance optimization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148421
Approved by: https://github.com/drisspg
2025-03-05 00:46:01 +00:00
a907b6abae [compiled_autograd] workaround windows compilation issue (#148454)
torch.compile doesn't work on windows so we can ifdef-away the problem.
I do not know what the root cause actually is. Most notably, the pytorch
windows build is fine, but some third-party projects that use pytorch headers
on windows (e.g. torchaudio) have issues.

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148454
Approved by: https://github.com/atalman, https://github.com/xmfan
2025-03-05 00:18:20 +00:00
e02a2ca07a Fix dist.init_process_group on windows (#148266)
Fix https://github.com/pytorch/pytorch/issues/139990

We don't build libuv on windows so anything that creates `TCPStore` which includes `init_process_group()` will fail, which is a bad experience. We should just default to `USE_LIBUV=0` for windows. There were a decent amount of hits for this [error on google ](https://www.google.com/search?q=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&sca_esv=921f59ac5f8bd98a&sxsrf=AHTn8zpG3PxdKoomFHkclOc451rBhoc3jw%3A1740854890873&source=hp&ei=albDZ5GHM-uIptQP4NTikQw&iflsig=ACkRmUkAAAAAZ8Nkei9H-aB2IBCk3pUOK3yFl5xBLZUt&ved=0ahUKEwiR5P7qxemLAxVrhIkEHWCqOMIQ4dUDCBg&uact=5&oq=use_libuv+was+requested+but+PyTorch+was+build+without+libuv+support&gs_lp=Egdnd3Mtd2l6IkN1c2VfbGlidXYgd2FzIHJlcXVlc3RlZCBidXQgUHlUb3JjaCB3YXMgYnVpbGQgd2l0aG91dCBsaWJ1diBzdXBwb3J0SABQAFgAcAB4AJABAJgBAKABAKoBALgBA8gBAPgBAvgBAZgCAKACAJgDAJIHAKAHAA&sclient=gws-wiz) and https://github.com/pytorch/pytorch/issues/139579, so I figured we should add a more helpful message as well.

We don't have CI for windows and our support is just best effort, so I just tested these changes on my windows machine.
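
For reference, a minimal single-process sketch of the environment-variable workaround this PR makes unnecessary (backend, address and port are illustrative):

```py
import os

# Disable the libuv TCPStore backend before rendezvous on Windows builds
# that were compiled without libuv support.
os.environ["USE_LIBUV"] = "0"

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)
dist.destroy_process_group()
```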

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148266
Approved by: https://github.com/d4l3k
2025-03-05 00:07:56 +00:00
84b58bd63e Enable FSDP tests on XPU device (#147518)
**Motivation:**

Enable FSDP tests on XPU device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518
Approved by: https://github.com/weifengpy
2025-03-04 23:49:37 +00:00
c98c3af421 Add a couple config options to compiler bisector (#148450)
These are commonly a source of bugs/divergence (through bad interactions, etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148450
Approved by: https://github.com/shunting314
2025-03-04 23:23:21 +00:00
0c0a4baddd [MPS] unary kernels - avoid copying tensors if they have same stride (#148350)
I was a bit concerned when I saw in #148272 that the Metal unary kernel had 0.02x the performance of what we had with MPS Graphs for sqrt on non-contiguous tensors. This change makes it so that copying is only done if the input and output tensors do not have the same strides. So if an out tensor is not provided, we don't copy (don't call contiguous) at all and dispatch the kernel as is. After making this change, the script listed at the end of the above PR has the same execution time as the non-transposed one.

Times for reference (on a transposed tensor where the matrix is NxN):

| N     | time_old           | time_new           |
|-------|--------------------|--------------------|
| 100   | 0.0002241021       | 0.0001548659       |
| 1000  | 0.0005934822       | 0.0002150342       |
| 10000 | 0.3242016407       | 0.0045755033       |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148350
Approved by: https://github.com/janeyx99
2025-03-04 23:20:26 +00:00
ade4af8c95 [MPS][BE] Fix c10:🤘:sinc implementation (#148471)
Restrict the scalar implementation to `is_scalar_floating_point_v` types, but perform all internal computations in full 32-bit floats. Make the complex implementation a template for `is_complex_v` types.
This makes the eager kernel implementation for both real and complex types a trivial call to the template.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148471
Approved by: https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448, #148449
2025-03-04 23:14:03 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
d789c22712 Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469)
The GitHub-provided ubuntu-20.04 GHA runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101), so upgrade workflows using them to the latest runner, 24.04.

They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115
```
[do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885)
This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101
```

Should we be using ubuntu-latest instead?

I attempted to upgrade actionlint to 1.7.7, but locally in test-infra it seems to add a lot of new checks, and on test-infra's CI I seem to have uploaded the wrong executable or something, so it failed. I'll try again later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-04 22:29:04 +00:00
5f47b7e268 [ROCm][TunableOp] Unit test for offline tuning of GEMM with bias (#148371)
One more unit test for the offline version of TunableOp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148371
Approved by: https://github.com/jeffdaily
2025-03-04 22:24:27 +00:00
842ffea445 [MPS][BE] Towards strided unary ops support (#148449)
Add generic functor kernels and rewrite all existing implementations as functors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148449
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399, #148448
2025-03-04 22:22:39 +00:00
70d0e1b96a Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 22:09:50 +00:00
c677f3251f [export] don't use unbacked_renamings in export (#147574)
Plan: avoid the use of unbacked renamings, and introduce a pass run in `_produce_aten_artifact` that recomputes unbacked bindings. Decided to do this because we don't serialize unbacked renamings (or any ShapeEnv state), so this used to compose poorly with de/serialization. This hopefully establishes the invariant that the unbacked binding keys are always in sync with the example values (i.e. same indices, and removed if the symbol is replaced / specialized).

For de/serialization, we don't store unbacked bindings and just rerun the pass.

Involved a refactor of compute_unbacked_bindings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147574
Approved by: https://github.com/avikchaudhuri
2025-03-04 21:43:49 +00:00
84961a0c17 ci: Add workflow dispatch for commit hash update (#148486)
Maybe this should also be split into its own workflow instead of piggybacking off of nightly?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148486
Approved by: https://github.com/clee2000
ghstack dependencies: #148466, #148472
2025-03-04 21:26:23 +00:00
d290186ed3 ci: Add triton to update hash workflow (#148472)
Adds triton to our auto-update workflows so that PRs can be
automatically made and the triton team can follow up to fix any issues
that may arise.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148472
Approved by: https://github.com/Camyll, https://github.com/atalman
ghstack dependencies: #148466
2025-03-04 21:26:23 +00:00
9be8f74156 ci: Consolidate commit hash updates into a matrix (#148466)
Consolidates all of our commit hash update jobs into a single matrix to
make it easier to add more jobs later on.

Side note: How do I even test if this works?

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148466
Approved by: https://github.com/Camyll, https://github.com/clee2000, https://github.com/atalman
2025-03-04 21:26:13 +00:00
d1abde11ec [dynamo] Support passing arguments to DeviceMesh.get_group (#147741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147741
Approved by: https://github.com/StrongerXi
2025-03-04 21:19:47 +00:00
f30776c37a [BE] Upgrade to mypy 1.14 (#145966)
Upgrade mypy version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966
Approved by: https://github.com/Skylion007
2025-03-04 20:58:26 +00:00
60205b0eb2 [export] Fix logging so that it doesn't result in max recursion error (#148231)
Test Plan:
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:pt2i_readiness_main -- --model_id=487493491 --test_suite ads_all --mode test_full_model

Produces https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmp2wsjQH/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D70416613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148231
Approved by: https://github.com/yiming0416
2025-03-04 20:47:25 +00:00
e4c558be1d [scan] Corrections for scan (#146110)
This PR resolves some minor issues with the scan HOP and unifies the handling of the additional_inputs in the same way as for associative_scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146110
Approved by: https://github.com/ydwu4
2025-03-04 20:29:08 +00:00
439395c0ae [MPS] add slogdet and logdet implementations to mps (#148287)
Low-hanging fruit: all the ops these need are already implemented, so just adding them to native functions adds the functionality on MPS. The next op to add should probably be lu_solve, given how many ops need it for the gradient calculation.
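
A short sketch of the newly enabled ops, assuming a machine with MPS available:

```py
import torch

if torch.backends.mps.is_available():
    x = torch.randn(3, 3, device="mps")
    sign, logabsdet = torch.linalg.slogdet(x)
    print(sign.item(), logabsdet.item())
    # logdet may be nan for matrices with a negative determinant
    print(torch.logdet(x).item())
```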

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148287
Approved by: https://github.com/malfet
2025-03-04 19:49:23 +00:00
92beda54c8 Revert "[fx] Move map_aggregate to C++ (#148243)"
This reverts commit edaff88f69f069d517b72ea23fd5eb04702eb0b5.

Reverted https://github.com/pytorch/pytorch/pull/148243 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
17d003fe75 Revert "[fx] Move Node._update_args_kwargs to C++ (#148260)"
This reverts commit 0135f57f4aaeaba8d720f551eab6dca6fcede8cd.

Reverted https://github.com/pytorch/pytorch/pull/148260 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
97b9e68bc6 Revert "[fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)"
This reverts commit 29c2de9ae16f1673f3f44363243294d403e53d37.

Reverted https://github.com/pytorch/pytorch/pull/148261 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
6fb18ff685 Revert "Better log message to update pr_time_benchmarks/expected_results.csv (#148303)"
This reverts commit a3d69e6e1a530ae2b91cd549ea26aac51ffc7566.

Reverted https://github.com/pytorch/pytorch/pull/148303 on behalf of https://github.com/jovianjaison due to breaking internal builds [T216910920] ([comment](https://github.com/pytorch/pytorch/pull/148243#issuecomment-2698724058))
2025-03-04 19:40:21 +00:00
63778cb8a0 Revert "[Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)"
This reverts commit e3e45d90d8578083da8b51a3b1d911e9a4523e5b.

Reverted https://github.com/pytorch/pytorch/pull/147019 on behalf of https://github.com/clee2000 due to broke inductor test inductor/test_max_autotune.py::TestMaxAutotune::test_cat_max_autotune_extern [GH job link](https://github.com/pytorch/pytorch/actions/runs/13653495421/job/38171259603) [HUD commit link](e3e45d90d8) on inductor workflow and rocm workflow ([comment](https://github.com/pytorch/pytorch/pull/147019#issuecomment-2698677222))
2025-03-04 19:20:15 +00:00
9d196edb7d Revert "Bump onnxscript to 0.2.2 in CI (#148388)"
This reverts commit 7ab6749ec7db32e0b3cdfd19db087f15dd0bebe2.

Reverted https://github.com/pytorch/pytorch/pull/148388 on behalf of https://github.com/clee2000 due to broke libtorch debug build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/13646179239/job/38152039312) [HUD commit link](7ab6749ec7) ([comment](https://github.com/pytorch/pytorch/pull/148388#issuecomment-2698665495))
2025-03-04 19:16:34 +00:00
c219c5ca38 Fix code descriptions in the test package. (#148145)
Some parameter and function descriptions were incorrect; make them correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148145
Approved by: https://github.com/janeyx99
2025-03-04 19:14:41 +00:00
e8900fbe4f [MPS] Add some useful utils (#148448)
Like `is_complex_v`, `is_scalar_integral_v`, `result_of`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148448
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148398, #148399
2025-03-04 19:09:17 +00:00
f859722f70 [dtensor] refactor sharding prop to handle cross mesh computation (#147869)
As titled, this PR moves the same-mesh check from the sharding-propagation level to the individual-operator level.

This allows more flexibility for each individual operator to check whether it can run on the same mesh or not. For example, before this PR, if a user had two DTensor params that live on different DeviceMeshes and wanted to run a `foreach` operator on them individually, it would error out with a cross-mesh error. But for foreach computation there can be DTensors that live on different meshes, as long as the meshes match in a "zipped" way.

This should also fix https://github.com/pytorch/pytorch/issues/134212

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147869
Approved by: https://github.com/tianyu-l
2025-03-04 18:30:44 +00:00
eea54a55f6 ci: Switch manywheel build.sh to just use dev (#148310)
To avoid annoying error messages like:

> fatal: no tag exactly matches 'a6520c85bd85875b09f2c68e51622699d7d07595'

These were popping up when GITHUB_REF is not set so let's just assume
that if someone is building without directly setting GITHUB_REF then
they're probably doing a dev build.

Signed-off-by: Eli Uriegas <github@terriblecode.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148310
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-03-04 18:27:44 +00:00
611b0e9bc4 Revert "[fx] Optimizations for node name generation (#148288)"
This reverts commit 5eb0337cfd5e7c2cdf4a2d4829609e391467270f.

Reverted https://github.com/pytorch/pytorch/pull/148288 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
ed9055c303 Revert "[fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)"
This reverts commit 8531d247ba411993f9a10686d70514f6945f9960.

Reverted https://github.com/pytorch/pytorch/pull/148292 on behalf of https://github.com/clee2000 due to something in this stack broke some dynamo and higher order ops tests like higher_order_ops/test_invoke_subgraph.py::TestInvokeSubgraphCompile::test_dedupe [GH job link](https://github.com/pytorch/pytorch/actions/runs/13645082540/job/38149882002) [HUD commit link](8531d247ba).   dynamo/test_graph_deduplication did run on the PR but the higher_order_ops one didn't, probably combo of landrace and bad TD ([comment](https://github.com/pytorch/pytorch/pull/148288#issuecomment-2698365172))
2025-03-04 17:10:12 +00:00
67937be673 [BE] Move sinc kernels to the same OP family (#148399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148399
Approved by: https://github.com/dcci
ghstack dependencies: #148398
2025-03-04 15:49:20 +00:00
7fcbaff206 [BE] Remove stale arg for complex ops (#148398)
No need to pass DTYPE0 and DTYPE1 if only one DTYPE is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148398
Approved by: https://github.com/dcci
2025-03-04 14:35:43 +00:00
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR is to upgrade submodule oneDNN to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
f339e41a38 [inductor][triton] Fix average pool nd for int64 dtype (#146061)
The eager mode implementation of average pool nd returns an integer tensor if the input is also an integer tensor. This should also be preserved in inductor.
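
A minimal sketch of the dtype-preservation behavior being fixed (sizes are illustrative):

```py
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.int64).reshape(1, 1, 4, 4)

eager = F.avg_pool2d(x, kernel_size=2)
compiled = torch.compile(lambda t: F.avg_pool2d(t, kernel_size=2))(x)

# the compiled result should now keep the integer dtype, matching eager
print(eager.dtype, compiled.dtype)
```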

Fixes pytest -k test_comprehensive_nn_functional_avg_pool2d_cpu_int64 error: Triton compilation failed: triton_poi_fused_avg_pool2d_0

See WIP https://github.com/pytorch/pytorch/pull/145865#issuecomment-26200289890 to potentially enable such tests as they aren't enabled yet.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146061
Approved by: https://github.com/eellison
2025-03-04 13:53:50 +00:00
fdee60769a [DCP] Introduce process based async checkpointing (#147039)
Summary:
### Context
Background checkpoint upload thread interfering with trainer thread:

In the [async save API](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L239-L248), the background thread spends a considerable amount of time on CPU-bound tasks (pickling/unpickling several metadata objects, a.k.a. SavePlans) on rank0 during the collective operation; this kind of asymmetric computation heavily contends for the GIL with the trainer thread, causing GPU utilization to suffer significantly for the E2E checkpoint duration.

### Solution:
Introduce async save via a checkpoint daemon process. This daemon process will be created once (during the first save attempt) and can serve async checkpoint requests for the remainder of training lifetime.
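
A rough single-rank sketch of the async save entry point this daemon serves; how the process-based mode is selected is not shown here, and the backend, address and checkpoint path are illustrative:

```py
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29501",
                        rank=0, world_size=1)
model = torch.nn.Linear(8, 8)

future = dcp.async_save({"model": model.state_dict()},
                        checkpoint_id="/tmp/ckpt_step_100")
# ... training would continue here while the checkpoint is written ...
future.result()  # block only when the checkpoint must be durable
dist.destroy_process_group()
```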

Test Plan: Added E2E UTs for process based async save.

Differential Revision: D69272583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147039
Approved by: https://github.com/saumishr
2025-03-04 13:33:28 +00:00
16d07988fc add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338)
1. My company is using privateuseone to connect a new hardware device and requires the use of the `batch_isend_irecv` function (see the sketch below). However, `batch_isend_irecv` is currently only open to CUDA, so I added a `supports_coalescing` property in `c10d::Backend` to determine whether a backend supports coalescing.
2. If `pg._has_hooks` returns True, we don't need to determine whether the current device is CUDA, so privateuseone can also support `pg._wait_for_pending_works`
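
A hedged sketch of the coalesced P2P pattern in question; it assumes a 2-rank process group is already initialized on a backend whose `supports_coalescing` is true:

```py
import torch
import torch.distributed as dist

rank = dist.get_rank()
peer = 1 - rank

send_buf = torch.full((4,), float(rank))
recv_buf = torch.empty(4)

ops = [
    dist.P2POp(dist.isend, send_buf, peer),
    dist.P2POp(dist.irecv, recv_buf, peer),
]
reqs = dist.batch_isend_irecv(ops)  # coalesced send/recv pair
for req in reqs:
    req.wait()
print(f"rank {rank} received {recv_buf}")
```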

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-03-04 12:37:06 +00:00
e3e45d90d8 [Inductor] Record Triton’s Base32 Cache Key in .best_config for Debugging (#147019)
Modified TorchInductor's autotuning flow so that each `best_config` JSON file also includes the Triton "base32" (or base64) cache key.

**Motivation**

Debugging & analysis: with this change, we can quickly identify which compiled binary and IRs belong to a given best config.
The impact is minimal since it is only an extra field in .best_config. It can help with advanced performance tuning or kernel-level debugging.

Also, since Triton already stores the cubin/hsaco in its cache, developers/researchers can avoid setting `store_cubin = True`, since they can get the cubin/hsaco from the Triton cache and, with the code provided in this PR, easily match the best_config with the right Triton cache directory for the "best" kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147019
Approved by: https://github.com/davidberard98
2025-03-04 12:16:38 +00:00
f1cce0951b Create unique test report files for distributed tests (#148325)
The distributed tests are executed once for each backend and for each init method.
`$TEST_REPORT_SOURCE_OVERRIDE` is used such that test results from different backends are stored in different files.
The same needs to be done for the init method.

Move the setting of the variable into `test_distributed` and incorporate the init method into the name.

Useful for e.g. https://github.com/pytorch/pytorch/issues/126523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148325
Approved by: https://github.com/clee2000
2025-03-04 10:45:33 +00:00
0b0d28accd Optimize param prepend class reference torch.nn.Module (#148304)
Fixes #147696

## Changes

Change `prepend` description  `torch.nn.modules.Module` to `torch.nn.Module`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/054f54b7-9487-4505-a926-3e17a84bd2f9)

### After

![image](https://github.com/user-attachments/assets/1d2a5708-62d1-428e-b136-bcaa35e5e6da)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148304
Approved by: https://github.com/Skylion007
2025-03-04 08:46:14 +00:00
da2688f624 Introduce delayed compile via eager_then_compile stance (#147983)
Recently I've been experimenting with introducing new APIs to delay compile as a way to reduce compile times while improving the ergonomics of using dynamic shapes. The high level idea is to run the first invocation of compile in eager, save the example inputs, and on the second invocation we can derive the dynamism in the inputs so that we don't need to waste our time doing a compile with static shapes (which is the status quo today with automatic dynamic).

Another benefit of this is most users no longer need to annotate their inputs with mark_dynamic and mark_unbaked calls since we can derive the dynamism on the very first call. Additionally we get dynamic ints out of the box in this new regime.

This PR implements this idea through the set_stance APIs. In particular it introduces a new `eager_then_compile` stance.
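
A small sketch of how the new stance is used (shapes are illustrative):

```py
import torch

@torch.compile
def f(x):
    return x * x + 1

# First call runs eagerly and records the example inputs; the second call
# compiles using the dynamism observed across the two invocations.
torch.compiler.set_stance("eager_then_compile")
f(torch.randn(4))
f(torch.randn(8))
```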

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147983
Approved by: https://github.com/williamwen42
2025-03-04 07:46:31 +00:00
e0f0db0105 updates to benchmarks (#144831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144831
Approved by: https://github.com/danielvegamyhre
2025-03-04 06:21:12 +00:00
ac99fc7e57 Updates to build rowwise scaled mm kernel on SM10.0a (#148274)
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.

**NOTE**: performance optimization will be done in separate follow up PRs

## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA toolkit 12.8
    - e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
    - e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below

**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.

## Test plan
- `python test/distributed/tensor/test_matrix_ops.py  -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
2025-03-04 05:23:41 +00:00
7ab6749ec7 Bump onnxscript to 0.2.2 in CI (#148388)
Unblock https://github.com/pytorch/pytorch/pull/148140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148388
Approved by: https://github.com/malfet
2025-03-04 04:21:58 +00:00
d54cab78e1 [codemod] Fix missing field initializer in caffe2/torch/lib/libshm/manager.cpp +1 (#148393)
Summary:
The LLVM warning `-Wmissing-field-initializers` has found one or more structs in this diff's files which were missing field initializers.

This can be unintended such as:
```
my_struct s1 = {0}; // Initializes *only* the first field to zero; others to default values
my_struct s2 = {}; // Initializes *all* fields to default values (often zero)
```
or it may be because only some of the members of a struct are initialized, perhaps because the items were added to the struct but not every instance of it was updated.

To fix the problem, I've either used `{}` to initialize all fields to default or added appropriate default initializations to the missing fields.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: dtolnay

Differential Revision: D70472663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148393
Approved by: https://github.com/Skylion007
2025-03-04 04:20:04 +00:00
70410f93f2 doc/xpu: align description of SyclExtension with CPP/CUDA (#147988)
This commit just aligns the description of the `py_limited_api` feature in SyclExtension with CPP/CUDA. We missed this change when doing SyclExtension due to parallel work on the changes. For CPP/CUDA the change was done in 515e55e6927ad5f57ec222d7779712630341acf3.

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147988
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-03-04 04:17:36 +00:00
cyy
ec2805ada8 Remove outdated CUDA version check (#148142)
Since Torch requires CUDA>=11, some checks can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148142
Approved by: https://github.com/janeyx99, https://github.com/eqy
2025-03-04 03:33:44 +00:00
cyy
98bf2f1170 Use Python 3.9 typing (#148157)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148157
Approved by: https://github.com/janeyx99
2025-03-04 03:09:55 +00:00
cyy
b7832f0339 Enable ASAN in CUDA tests (#147812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147812
Approved by: https://github.com/janeyx99
2025-03-04 02:50:39 +00:00
8531d247ba [fx] Optimize TracerBase.create_arg and Graph._gen_python_code (#148292)
Before: 19502951 function calls (18702776 primitive calls) in 8.533 seconds
After: 16402551 function calls (15602452 primitive calls) in 7.701 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148292
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303, #148288
2025-03-04 02:42:23 +00:00
5eb0337cfd [fx] Optimizations for node name generation (#148288)
Before:
![image](https://github.com/user-attachments/assets/3a9ed22b-ae33-41ec-a0db-01f4f3ca2ffe)

After:
![image](https://github.com/user-attachments/assets/44c6e578-c63e-4a43-b3e0-d11d4bdbb6db)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148288
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260, #148261, #148303
2025-03-04 02:42:23 +00:00
a3d69e6e1a Better log message to update pr_time_benchmarks/expected_results.csv (#148303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148303
Approved by: https://github.com/Skylion007
ghstack dependencies: #148243, #148260, #148261
2025-03-04 02:42:23 +00:00
17518007b2 [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 12.759539298713207 |  2.7271360370796174  |         NA          |
|        triton         | 10.573655366897583 |  1.8661278090439737  | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 |  0.5315794269554317  | -14.698873781600327 |
|  cutlass_lvl_default  | 13.09632882475853  |  0.5520401500398293  | 2.6395116481931873  |
|   cutlass_lvl_1111    | 11.05172373354435  |  0.569593315012753   | -13.384617776451302 |
|   cutlass_lvl_2222    | 11.371277272701263 |  133.58984916994814  | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.472318813204765 |  1.5445372510002926  |         NA          |
|        triton         | 10.568295605480671 |  16.583424195996486  | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  |  5.830657540936954   | -27.764770809729562 |
|  cutlass_lvl_default  | 12.742593884468079 |  28.994930602959357  | -11.951954286402668 |
|   cutlass_lvl_1111    | 11.522261425852776 |  79.85037935699802   | -20.38413764531163  |
|   cutlass_lvl_2222    | 10.993581265211105 |  132.86601971101481  | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.700622126460075 |  2.225986961973831   |         NA          |
|        triton         | 29.17378954589367  |  38.571991189033724  |  -4.97329524553989  |
| triton_persistent_tma | 29.642896726727486 |   7.2848734309664    | -3.4452897904663744 |
|  cutlass_lvl_default  | 29.514770954847336 |  29.819900761009194  | -3.8626291243482167 |
|   cutlass_lvl_1111    | 29.411429539322853 |  23.82907024596352   |  -4.19923929172139  |
|   cutlass_lvl_2222    | 29.57325428724289  |  134.31008586101234  | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 30.858177691698074 |  1.181898436974734   |         NA         |
|        triton         | 28.630023822188377 |  39.24473957403097   | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 |  5.275042273919098   | -7.181929126210897 |
|  cutlass_lvl_default  | 29.16003204882145  |  29.934022572939284  | -5.503065216107967 |
|   cutlass_lvl_1111    | 28.79570797085762  |  23.948012012057006  | -6.683705504085324 |
|   cutlass_lvl_2222    | 29.02756631374359  |  136.25560767308343  | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1456.143856048584  |  1.020197194069624   |         NA         |
|        triton         | 1708.2737684249878 |  5.766509635956027   | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  |  7.455113030038774   | 1.3969195302177155 |
|  cutlass_lvl_default  | 1583.3594799041748 |  50.408804678940214  | 8.736473620182366  |
|   cutlass_lvl_1111    | 1636.4418268203735 |  82.82403108896688   | 12.381879030898025 |
|   cutlass_lvl_2222    | 1507.5665712356567 |  260.03901409788523  | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1382.230520248413  |  1.2586536260787398  |         NA         |
|        triton         | 1646.9683647155762 |  5.442052865982987   | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 |  6.515797697938979   | 3.016069871556595  |
|  cutlass_lvl_default  | 1500.9030103683472 |  51.36402789200656   |  8.58557877152115  |
|   cutlass_lvl_1111    | 1446.9740390777588 |  30.65435610699933   | 4.683988515729638  |
|   cutlass_lvl_2222    | 1419.661521911621  |  205.1948991640238   | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```

Differential Revision: D70147589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78
2025-03-04 01:45:36 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the OneDNN upstreaming plan, as stated in #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203). SDPA is supported via the overridable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for the OneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608
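
A minimal usage sketch of the newly enabled path, assuming an XPU-enabled build with a visible XPU device (shapes and dtype here are illustrative):

```python
import torch
import torch.nn.functional as F

# Hedged sketch: run SDPA on an XPU device; requires an XPU-enabled build.
q = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="xpu", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```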

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
b17f5223a4 Generate AOTI input check by default (#148005)
Summary:
Generate the AOTI size and stride input checks by default, but only run them if the `AOT_INDUCTOR_DEBUG_COMPILE` env variable is set (to avoid slowing down performance).

Example output:

```cpp
            bool _check_aoti_runtime_check_inputs_env() {
                const static char* env_var_value = getenv("AOTI_RUNTIME_CHECK_INPUTS");
                const static bool result = env_var_value != nullptr && env_var_value[0] != '\0';
                return result;
            }

            AOTI_NOINLINE static void __check_inputs_outputs(
                AtenTensorHandle* input_handles,
                AtenTensorHandle* output_handles) {
                if (!_check_aoti_runtime_check_inputs_env()){
                    return;
                }
                // rest of the check
            }

```
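
A hedged usage sketch based on the generated guard above (the generated code reads `AOTI_RUNTIME_CHECK_INPUTS`, once per process):

```python
import os

# Set before the AOTI model's first run: any non-empty value enables the
# generated size/stride input checks; leaving it unset keeps them skipped.
os.environ["AOTI_RUNTIME_CHECK_INPUTS"] = "1"
# ... load and run the AOTI-compiled model as usual ...
```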

Test Plan: CI

Differential Revision: D70260490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148005
Approved by: https://github.com/hl475, https://github.com/desertfire, https://github.com/jingsh
2025-03-04 00:55:14 +00:00
0bd2caac55 Docker release - pin buildkit to v0.19.0 (#148372)
Fix nightly build failure during arm64 docker build (since 02.21.2025): https://github.com/pytorch/pytorch/actions/runs/13452177170/job/37588508155#step:12:851

Error:
```
#10 73.62 Segmentation fault (core dumped)
#10 73.67 qemu: uncaught target signal 11 (Segmentation fault) - core dumped
#10 73.85 Segmentation fault (core dumped)
#10 73.85 dpkg: error processing package libc-bin (--configure):
#10 73.85  installed libc-bin package post-installation script subprocess returned error exit status 139
```
Looks like we are hitting: https://github.com/moby/buildkit/issues/5783

Update setup-qemu and buildkit actions to v3 and buildkit to v0.19.0

Please note: the CUDA 12.8 error is not related to this failure in the nightly cpu arm64 build. It looks like we are trying to install the release torch when running on a PR; the CUDA 12.8 build is not released yet, hence the failure. A follow-up will make sure we use the nightly torch when running on PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148372
Approved by: https://github.com/seemethere
2025-03-03 23:55:30 +00:00
d43c6f0033 [invoke_subgraph] Run joint passes on the hop graphs (#139325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139325
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
ghstack dependencies: #147559
2025-03-03 23:38:14 +00:00
216a108aaf [ROCm] Add rocm-mi300 and inductor-rocm-mi300 to upload-test-stats.yml (#148365)
We currently run MI300X machines on rocm-mi300 and inductor-rocm-mi300 but we don't have artifacts for the results:
e.g.
6e10471966 (rocm-mi300)
![image](https://github.com/user-attachments/assets/f5588072-b818-4f54-a348-0e6ac7e96829)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148365
Approved by: https://github.com/jeffdaily
2025-03-03 23:22:56 +00:00
586d8df651 Fix condition for CONVERT_NON_VECTORIZED_INIT invocation (#148362)
Yet another regression caused by https://github.com/pytorch/pytorch/pull/146596 that breaks builds if PyTorch is compiled for Android or using NVIDIA GraceHopper systems

Not sure why the author was trying to change the condition to begin with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148362
Approved by: https://github.com/izaitsevfb
ghstack dependencies: #148354
2025-03-03 23:13:37 +00:00
5887a2d8de [BE] Use C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED (#148354)
Instead of `#pragma GCC diagnostic ignored "-Wignored-qualifiers"`
Also limit the scope to just `Vectorized::map`, which has to be declared that way due to sleef function signature definitions that return `const __m256` for AVX2 methods.

Also delete `#pragma GCC diagnostic pop` from vec256_half and vec256_bfloat16, as it results in an unbalanced-pop warning for the push defined in vec256_16bit_float, which will be included only once:
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:232:27: warning: pragma diagnostic pop could not pop, no matching push [-Wunknown-pragmas]
  232 | #pragma GCC diagnostic pop
      |                           ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148354
Approved by: https://github.com/izaitsevfb
2025-03-03 23:00:47 +00:00
d0b23e661d [cutlass backend] Add main tests for mm, addmm and bmm - step 1 (#148229)
This adds very good coverage for normal mm tests {aoti x torch.compile} x {default, dynamic}.

There are some parts that are less tested. For example:
* different layout combo
* shapes that are less aligned

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148229
Approved by: https://github.com/chenyang78
2025-03-03 22:31:46 +00:00
a41413829c Use release notes label for module: distributed_checkpoint (#148352)
module: distributed_checkpoint is redundant with oncall: distributed checkpointing.

@fduwjj let us know that module: distributed_checkpoint is just used for release notes, so let's use the release notes label for the release notes, which the bot will pick up better.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148352
Approved by: https://github.com/fegin
2025-03-03 21:33:28 +00:00
e45040b1d3 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator having its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We bind the process group creator constructs at run-time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration, e.g. APIs like ```is_hccl_available```
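
A minimal sketch of the registration flow described above, assuming a distributed-enabled build; the backend name and constructor here are made-up placeholders, not the real ```hccl``` integration:

```python
import torch.distributed as dist

# Placeholder process-group constructor; a real out-of-tree backend would
# return its ProcessGroup here (signature shown for the non-extended API).
def _create_fake_pg(store, rank, world_size, timeout):
    raise NotImplementedError("placeholder only")

dist.Backend.register_backend("fakeccl", _create_fake_pg, devices=["hpu"])
print("fakeccl" in dist.Backend.backend_list)  # True after registration
```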

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang
2025-03-03 21:32:21 +00:00
52078154f2 Add support for no-op concat with padded output (#146866)
Add support for no-op concat with padded output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146866
Approved by: https://github.com/shunting314
2025-03-03 21:10:46 +00:00
07f876e960 Subprocess compile (#146134)
Add a mode to `fx_codegen_and_compile()` to compile in a separate process. This is to prepare for async compile where we'll compile and run eager in parallel (and also be able to move the compile phase to a remote computer).

Added a test which runs the test_torchinductor tests with subprocess compiling turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146134
Approved by: https://github.com/jamesjwu
2025-03-03 21:10:12 +00:00
8f361c808b [dynamo] run-only recursively on recompile limit exceeded (#148021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148021
Approved by: https://github.com/anijain2305
2025-03-03 21:01:08 +00:00
1bbe57336b Replace unimplemented with unimplemented_v2 for dynamo (#148158)
torch/_dynamo/variables/constant.py

https://github.com/pytorch/pytorch/issues/147913

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148158
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-03-03 21:00:17 +00:00
b162b1600b [Inductor] Hot fix after #148011 (#148270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148270
Approved by: https://github.com/davidberard98
2025-03-03 20:18:21 +00:00
d260d4fc55 HSDP custom hook UTs are multi-threaded - can't set device rank (#148099)
HSDP custom hook UTs are multi-threaded and use a single physical GPU. If we set the rank in each thread, then we reference the same GPU with multiple ranks, which isn't right. Therefore, this removes the rank setting from these UTs. They now pass with 1, 2, and 4 GPUs.

Fixes #147767 and #147769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148099
Approved by: https://github.com/jeffdaily
2025-03-03 19:48:49 +00:00
302c660298 Consistently use load_torchbind_test_lib in tests (#148082)
The same code is repeated multiple times with slightly different implementations.
Use the existing function for brevity and consistency.

In the function, the code from `test_export` is used, which does a single `load_library` with cleaner conditions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148082
Approved by: https://github.com/angelayi
2025-03-03 19:37:28 +00:00
40c2505f16 [logging] Log individual Triton kernel compilation times to dynamo_compile (#147022)
Summary: Gather the compilation time of individual triton kernels and log them to dynamo_compile:
* Time compilation in `_worker_compile_triton` and pass back to the main process and logged from `get_result()`.
* Added a way to track the "top N" (i.e., the N most-expensive compiles) in the metrics_context. I did this because I doubt we really care to capture potentially thousands of kernel compile times; that would be problematic for scuba logging anyway, so let's limit the number we track from the beginning. Arbitrarily chose 25 for now (a standalone sketch of such a tracker follows this list).
* Format the list of compile times as a json string before logging.
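
A standalone sketch of such a bounded top-N tracker (plain Python, not the actual metrics_context API):

```python
import heapq

class TopNCompileTimes:
    """Keep only the N most expensive kernel compiles seen so far."""

    def __init__(self, n=25):
        self.n = n
        self._heap = []  # min-heap of (duration_s, kernel_name)

    def add(self, kernel_name, duration_s):
        entry = (duration_s, kernel_name)
        if len(self._heap) < self.n:
            heapq.heappush(self._heap, entry)
        else:
            # Evict the cheapest tracked compile if the new one is costlier.
            heapq.heappushpop(self._heap, entry)

    def most_expensive_first(self):
        return sorted(self._heap, reverse=True)
```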

Test Plan:
`python benchmarks/dynamo/torchbench.py --performance --training --amp --backend inductor --device cuda --print-compilation-time --repeat 5 --cold-start-latency --only nanogpt`
Scuba: https://fburl.com/scuba/dynamo_compile/sandbox/nc4dzm3r

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147022
Approved by: https://github.com/jamesjwu
2025-03-03 19:32:17 +00:00
aade4fbd55 Expose the rendezvous keepalive arguments (#145228)
Enables support for this:

```python
from torch.distributed.launcher.api import LaunchConfig

config = LaunchConfig(
    ...,
    rdzv_configs={"keep_alive_interval": 1122, "heartbeat_timeout": 321, "keep_alive_max_attempt": 5},
)
```

These arguments are currently hard-coded inside torchrun. The default values are not suitable for jobs with thousands of ranks.

Today, `rdzv_configs` only allows the keys `join_timeout`, `last_call_timeout`, `close_timeout`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145228
Approved by: https://github.com/wconstab
2025-03-03 19:11:56 +00:00
a929e11e4f [dynamic shapes][export] ignore when real-tensor fallback fails (#147779)
Summary: uninspired solution to https://github.com/pytorch/pytorch/issues/147402

Test Plan: test_draft_export

Differential Revision: D70132269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147779
Approved by: https://github.com/bobrenjc93
2025-03-03 19:09:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
1c544a9ddd [Inductor-CPP] If all of the activation scale dims are 1, make it a 0D tensor (#147033)
For int8 dynamically quantized activations & int8 quantized weights, add a workaround for an indexing issue in the epilogue creator that expected an empty index (i.e., a 0D tensor) when the activation scale was sized [1, 1], by converting the scale into a 0D tensor.

The issue was discovered while running LLaMA2 quantized with torchao's `int8_dynamic_activation_int8_weight` quantization on CPU with max-autotune enabled (although this error would've occurred regardless).

The final hidden states tensor that is the activation to the LM head is of shape `[batch_size, sequence_length, hidden_dim]` during decoding. For decoding one token at a time with batch size 1, the sequence length is 1. The activation scale is shaped `[1, 1]` (reshaped from `[1, 1, 1]`). However, the Inductor epilogue creator expects a 0D tensor in this case (my guess is that the corresponding logic in Inductor expects a 0D tensor if a tensor has only one element, even if it's 1D?).
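
A small illustration of the workaround's shape bookkeeping (plain tensors, not the Inductor internals):

```python
import torch

scale = torch.tensor([[0.02]])      # activation scale sized [1, 1]
scale_0d = scale.reshape(())        # single element -> 0D tensor, empty index
print(scale.shape, scale_0d.shape)  # torch.Size([1, 1]) torch.Size([])
```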

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147033
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-03-03 18:32:27 +00:00
57addfcd58 Significantly speed up save_cache_artifacts (#148227)
While using save_cache_artifacts on internal workloads, we have noticed that repeatedly calling this function after every batch is incredibly expensive. This PR significantly speeds up this function call by opting out of pickle and redesigning the serialization algorithm.

Essentially, what we want is to be able to call serialize many times without paying the full cost from scratch each time.
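
A hedged usage sketch, assuming the `torch.compiler` cache-artifact APIs available in this build:

```python
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))  # populate the compile caches

artifacts = torch.compiler.save_cache_artifacts()
if artifacts is not None:
    artifact_bytes, cache_info = artifacts
    # ...ship artifact_bytes elsewhere, then warm the caches there...
    torch.compiler.load_cache_artifacts(artifact_bytes)
```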

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148227
Approved by: https://github.com/jamesjwu
ghstack dependencies: #148226
2025-03-03 17:28:41 +00:00
3ca1a2564d [BE][MPS] Use copysign for imaginary part of sqrt (#148286)
Also, it's tempting to replace `a*a + b*b` with `dot(input[index])`, but for some reason it results in a slightly different output.
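
A hedged scalar sketch of the math the kernel implements, showing where copysign enters (Python, not the actual Metal code):

```python
import cmath
import math

def complex_sqrt(a: float, b: float) -> tuple[float, float]:
    r = math.hypot(a, b)  # |a + bi|
    real = math.sqrt((r + a) / 2.0)
    # copysign keeps the sign of b, including the signed-zero case b == -0.0
    imag = math.copysign(math.sqrt((r - a) / 2.0), b)
    return real, imag

print(complex_sqrt(-4.0, -0.0))         # (0.0, -2.0)
print(cmath.sqrt(complex(-4.0, -0.0)))  # same value via the stdlib branch cut
```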
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148286
Approved by: https://github.com/dcci
ghstack dependencies: #148285
2025-03-03 16:03:54 +00:00
84502baaff [MPS] Fix sqrt and other for torch.chalf (#148285)
Those kernels, instead of being instantiated for half2 (which corresponds to ComplexHalf), were instantiated for short2, which resulted in the following test
```
% python3 -c "import torch; print(torch.rand(6, device='mps', dtype=torch.chalf).sqrt())"
```
failing with
```
RuntimeError: Failed to create function state object for: sqrt_complex_half_half
```

As sqrt is not implemented for CPU, add an explicit test to `test_sqrt`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148285
Approved by: https://github.com/dcci
2025-03-03 16:03:54 +00:00
d57f617844 [Inductor][CPP] Avoid transpose with cpp micro-gemm for FlexAttention (#147069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147069
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #147068
2025-03-03 15:22:11 +00:00
6c089f5da3 ci: move xpu triton build to manylinux 2.28 (#148195)
Follow PR #148129 to remove manylinux builds for triton xpu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148195
Approved by: https://github.com/seemethere
2025-03-03 12:31:08 +00:00
165e33531c [Inductor][CPP] Fix the vec codegen for tanh (#148254)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/148241. The previous vectorized code generation for `tanh` used a decomposed implementation, leading to numerical differences that were further amplified by `atan2`. For example, in the given test case, after `tanh` the eager output at `[0,0,11,47]` was `-5.820766091346741e-10`, while the compiled output was `1.4319084584712982e-08`, resulting in different `atan2` outputs of `-2.3561` and `0.7853`. This issue is fixed by switching to the Sleef implementation.
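
A hedged repro sketch in the spirit of the added test (not the exact test code): compare eager vs. compiled for the tanh-then-atan2 pattern and inspect the worst-case difference.

```python
import torch

def fn(a, b):
    # Small tanh outputs near zero can flip sign under a decomposed tanh,
    # and atan2 then amplifies that into a completely different angle.
    return torch.atan2(torch.tanh(a), b)

x = torch.randn(1, 1, 16, 64)
y = torch.randn(1, 1, 16, 64)
eager = fn(x, y)
compiled = torch.compile(fn)(x, y)
print((eager - compiled).abs().max())  # should be ~0 with the Sleef-based codegen
```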

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_tanh_atan2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148254
Approved by: https://github.com/malfet, https://github.com/jgong5
2025-03-03 11:46:57 +00:00
118a165ac5 [Inductor][CPP] Add transposed B matrix support for CppMicroGemmFP32Vec (#147068)
* Add transposed B support for CppMicroGemmFP32Vec.
* Add support for cases where N is not divisible by `block_n`.
Expand CppMicroGemmFP32Vec to generate gemm kernel that supports transposed B and N of arbitrary size.
This is the basis for https://github.com/pytorch/pytorch/pull/147069 to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147068
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-03 11:08:23 +00:00
6a3a1f96ce Enable XPU for Inductor MM Triton Kernel Benchmark (#148237)
#147620 enabled `force_shape_pad` for the triton kernel benchmark. Intel GPU supports this scenario, so we need to enable the case in this PR; otherwise, there would be a test case regression for Intel GPU now that #147620 has landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148237
Approved by: https://github.com/jansel
2025-03-03 10:09:06 +00:00
b3bb73e11c Separate transpose from memory load/store and add load size support for convert_to_int32 (#147067)
Separate transpose from memory load/store and add load size support for convert_to_int32 to facilitate the expansion for CppMicroGemmFP32Vec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147067
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 02:56:16 +00:00
ab81ca5053 [Inductor][CPU] Add GEMM templates for _weight_int4pack_mm_for_cpu with AVX512 (#146756)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds GEMM templates for `torch.ops.aten_weight_int4pack_mm_for_cpu`. The micro kernel used for the templates is based on AVX512 and it's a copy of the ATen implementation of `torch.ops.aten_weight_int4pack_mm_for_cpu` with minor changes.

Due to better blocking and loop schedule, the GEMM template based implementation outperforms the ATen implementation in all cases we tested.

**Test plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_avx512
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146756
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-03 00:56:29 +00:00
608377d341 Revert "[import][inductor] Simplify grid handling (#147583)"
This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e.

Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))
2025-03-03 00:49:32 +00:00
29c2de9ae1 [fx] Move Node._prepend/Node._remove_from_list to C++ (#148261)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```
after:
```
20003454 function calls (19203257 primitive calls) in 8.936 seconds
```
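
A runnable version of the microbenchmark referenced above (hedged sketch; tracing ~100k nodes takes a few seconds):

```python
import cProfile
import functools
import operator

import torch.fx as fx

def big_graph(x):
    # Builds ~100k add nodes when traced symbolically.
    return functools.reduce(operator.add, [x, *range(100_000)])

cProfile.runctx("fx.symbolic_trace(big_graph)", globals(), {}, sort="cumtime")
```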

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148261
Approved by: https://github.com/oulgen
ghstack dependencies: #148243, #148260
2025-03-02 22:42:31 +00:00
0135f57f4a [fx] Move Node._update_args_kwargs to C++ (#148260)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```
after:
```
24303536 function calls (23503339 primitive calls) in 10.726 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148260
Approved by: https://github.com/oulgen
ghstack dependencies: #148243
2025-03-02 22:42:31 +00:00
edaff88f69 [fx] Move map_aggregate to C++ (#148243)
Microbenchmarking `fx.symbolic_trace(lambda x: functools.reduce(operator.add, [x, *range(100000)]))`, before:
```
30603618 function calls (29403419 primitive calls) in 13.744 seconds
```
after:
```
25203549 function calls (24403352 primitive calls) in 12.090 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148243
Approved by: https://github.com/oulgen
2025-03-02 22:42:31 +00:00
94afb165d9 Revert "[c10d] Add hccl distributed backend to c10d data structures (#146478)"
This reverts commit dae3fbfe9720e83e7e81d41430fb5067221bbed7.

Reverted https://github.com/pytorch/pytorch/pull/146478 on behalf of https://github.com/malfet due to This seems to break ROCM tests, see dae3fbfe97 ([comment](https://github.com/pytorch/pytorch/pull/146478#issuecomment-2692913573))
2025-03-02 21:22:04 +00:00
1106eb0212 [BE] Fix extra semicolon warning (#148284)
Introduced by https://github.com/pytorch/pytorch/pull/146596

I.e. while building locally my log was littered with
```
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL2d.cpp:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:15:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_half.h:228:42: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(Half, fp16);
      |                                          ^
2 warnings generated.
[230/1017] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/LossNLL.cpp.o
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/LossNLL.cpp:9:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/cpu/utils.h:5:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec.h:7:
In file included from /Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256.h:14:
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/cpu/vec/vec256/vec256_bfloat16.h:228:46: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
  228 | LOAD_FP32_NON_VECTORIZED_INIT(BFloat16, bf16);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148284
Approved by: https://github.com/Skylion007
2025-03-02 19:06:46 +00:00
6d70b42810 [BE][Ez]: Update fmt submodule to 11.1.4 (#148264)
This minor release is mostly bugfixes, ABI fixes, and compiler support fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148264
Approved by: https://github.com/jansel, https://github.com/cyyever
2025-03-02 19:00:00 +00:00
95d81d21a6 [MPS] Speedup interpolation (#148277)
First of all, the perf claims made in https://github.com/pytorch/pytorch/pull/145581 and https://github.com/pytorch/pytorch/pull/148154 are too good to be true (due to a bug in the benchmark script that did not call `torch.mps.synchronize` at the end), but the results are still slightly better than MPS, probably due to the launch overhead.

And while measuring performance correctly, I've noticed that a lot of time is spent on 64-bit integral division of thread_index to get spatial coordinates. Simply downcasting the divisor to a 32-bit integer (which is also the thread index) speeds it up almost 2x for bilinear and bicubic, as can be demonstrated by running the following script
```python
import torch
import time
import subprocess
import itertools

def benchmark(device, dtype, mode="bilinear", antialias=False, sf=.5):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)

    # define kwargs
    kwargs = {"antialias": antialias, "mode": mode, "scale_factor": sf}

    # Skip for unimplemented flavors
    if antialias and mode == "bicubic" and device == "mps":
       return None, "Skip"
    elif antialias and dtype != torch.float32:
       if device == "cpu":
           return None, "Skip"
       outputs_match = None
    else:
        # Check output
        y = torch.nn.functional.interpolate(x, **kwargs)
        z = torch.nn.functional.interpolate(x.cpu(), **kwargs)
        outputs_match = torch.allclose(y.cpu(), z)
        if not outputs_match:
           atol = (y.cpu() - z).abs().max()
           rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
           print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, **kwargs)
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
for mode,antialias in itertools.product(["bilinear", "bicubic"], [False, True]):
    outputs_match_list = []
    average_time_list = []
    for device in ["mps", "cpu"]:
      for dtype in [torch.float32, torch.float16, torch.bfloat16]:
          outputs_match, average_time = benchmark(device, dtype, mode=mode, antialias=antialias)
          outputs_match_list.append(str(outputs_match))
          average_time_list.append(average_time)

    print(f"\nBenchmarking Results (collected on {brand_string}) for {mode} interpolation {'with antialias' if antialias else ''}:")
    print("-"*40)
    print("Device            :                MPS        |               CPU")
    print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
    print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
    print(f"Average Time (us) :", "  |".join(average_time_list))
```

Before
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  292.0  | 264.7  | 267.9  | 289.1  | 230.9  | 309.1
atol=1.430511474609375e-06 rtol=0.11363636702299118

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  698.3  | 684.2  | 683.8  | 851.0  |Skip  |Skip
atol=2.086162567138672e-06 rtol=0.019750799983739853

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

After
```
Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  119.9  |  98.9  |  98.6  | 289.8  | 231.9  | 308.5
atol=1.430511474609375e-06 rtol=0.05681818351149559

Benchmarking Results (collected on Apple M4 Pro) for bilinear interpolation with antialias:
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  False |  False |  True  |  None |  None
Average Time (us) :  541.9  | 531.1  | 531.0  | 846.8  |Skip  |Skip
atol=2.0265579223632812e-06 rtol=0.008604463189840317

Benchmarking Results (collected on Apple M4 Pro) for bicubic interpolation :
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   False |  True  |  True  |  True  |  True  |  True
Average Time (us) :  314.3  | 301.0  | 298.8  | 681.5  | 616.7  | 833.7
```

TODO:
 - Figure out if this ops make more sense as 3D jobs with n and c channels dispatch as one more dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148277
Approved by: https://github.com/Skylion007
2025-03-02 17:13:52 +00:00
cyy
9aa897b992 Remove unnecessary tensor clone (#148159)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148159
Approved by: https://github.com/Skylion007
2025-03-02 16:21:39 +00:00
1d7397a2d0 [Inductor] Avoid tensor slice overflow for large step (#147433)
Fixes #147071

Currently, if the step is a value very close to INT64_MAX, the calculation of the slice output length will overflow. This PR fixes that problem and thus fixes #147071.
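
A hedged repro sketch of the pattern from the issue (a huge constant step in a slice):

```python
import torch

def fn(x):
    return x[:: (2**63 - 2)]  # step close to INT64_MAX

x = torch.arange(8)
print(fn(x))                 # eager: tensor([0])
print(torch.compile(fn)(x))  # matches eager once the overflow is fixed
```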

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147433
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-03-02 16:07:15 +00:00
9c506aa8a6 [aotinductor] add option to disable runtime assertions (#146462)
A recent user experience is like this:
* User runs AOTI lowering, it's successful.
* They take AOTI model and run it with some sample inputs. Everything runs well
* Then they boot up a serving test that loads the AOTI model and runs it with a set of sample requests.
* They see that some of the requests fail. The logs show them this:
* AOTInductorModel run failed with input spec: [1, 32]:c10::BFloat16, [2]:long ...
  * Error: u45 >= 2
* To the untrained eye, "AOTInductorModel run failed" is all they see. But, the true reason is Error: u45 >= 2

However, the assertion isn't always correct.
* In fact, u45 can actually be 0.
* So, why did AOTI say u45 ≥ 2? It's a two-piece combo:
* With 0/1 Specialization, the ShapeEnv creates symbolic shapes (e.g. s0) with a default value-range of [2, inf]
* In the graph, Dynamo traces torch.mul(A, B) where A is [s0, ...] and B is [u45, ...]. So, Dynamo learns Eq(s0, u45).
* Therefore, u45 also has a range of [2, inf]. Hence, the incorrect runtime assertion.

So, the motivation for this PR is to add an option to disable these runtime assertions if you run into a situation like this. However, another way to avoid this is to call `mark_unbacked()` on all the dynamic dims.

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146462
Approved by: https://github.com/desertfire, https://github.com/22quinn
2025-03-02 09:14:58 +00:00
26358fa2d8 Add AppendingByteSerializer class (#148226)
This PR adds a new util class that enables efficient appending of sequential byte data with custom serialization and deserialization.
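
A hedged sketch of the idea (length-prefixed, append-only records), not the actual torch utility's API:

```python
import struct

class AppendingByteSerializer:
    def __init__(self) -> None:
        self._buf = bytearray()

    def append(self, data: bytes) -> None:
        # Length-prefix each record so appends never require re-encoding
        # previously written data.
        self._buf += struct.pack("<I", len(data)) + data

    def to_bytes(self) -> bytes:
        return bytes(self._buf)

    @staticmethod
    def split(blob: bytes) -> list:
        out, i = [], 0
        while i < len(blob):
            (n,) = struct.unpack_from("<I", blob, i)
            out.append(blob[i + 4 : i + 4 + n])
            i += 4 + n
        return out

s = AppendingByteSerializer()
s.append(b"first")
s.append(b"second")
print(AppendingByteSerializer.split(s.to_bytes()))  # [b'first', b'second']
```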
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148226
Approved by: https://github.com/aorenste
2025-03-02 08:20:58 +00:00
b59776d857 [import][inductor] Simplify grid handling (#147583)
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Note the attached diff contains some minor fbcode-only changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-03-02 07:31:07 +00:00
dae3fbfe97 [c10d] Add hccl distributed backend to c10d data structures (#146478)
# MOTIVATION
Intel Gaudi is an out-of-tree PyTorch accelerator having its own device/dispatch key ```hpu```.
With this change we add entries for Gaudi's distributed backend ```hccl``` to the c10d Backend data structures.
This is to ensure that there is no naming conflict in case a new in-tree accelerator is introduced with the same backend name.

The out-of-tree backends are registered by calling fd0cd6a08f/torch/distributed/distributed_c10d.py (L302)

Successful registration adds the backend name to the list:
fd0cd6a08f/torch/distributed/distributed_c10d.py (L265)

We bind the process group creator constructs at run-time, so if there are other distributed backends with the same device name they can safely add the device type to the dictionary

fd0cd6a08f/torch/distributed/distributed_c10d.py (L274)

and add another entry to the dictionary with the same backend name (but a different device name):
fd0cd6a08f/torch/distributed/distributed_c10d.py (L268)

In addition, the out-of-tree devices can utilize the ```backend_list``` to check for successful backend registration, e.g. APIs like ```is_hccl_available```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146478
Approved by: https://github.com/H-Huang, https://github.com/guangyey
2025-03-02 05:13:48 +00:00
6e10471966 [ci] disable cudagraph for tts_angular on dashboard (#148221)
tts_angular with cudagraph is flaky. Its speedup varies from 0.05 to 1.01. This PR disables cudagraph for tts_angular to avoid the noise. Since tts_angular shows ~1x speedup while other torchbench models show ~2x speedup, skipping tts_angular entirely would wrongly bump the overall cudagraph speedup, so this PR only disables cudagraph for tts_angular instead of skipping it.

[Dashboard ](https://github.com/pytorch/pytorch/actions/runs/13597394087)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148221
Approved by: https://github.com/eellison
2025-03-02 03:31:19 +00:00
de7af81f18 [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was that the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post grad pass, the scaled_mm `A tensor` node references the tensor from _before_ the reshape op, but references the `A_scale` node from _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal to reshape the reciprocal output back to the original shape from before the reshape; a plain-tensor sketch follows this section.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
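
A plain-tensor illustration of the shape bookkeeping behind the short-term fix (hedged; not the real Float8Tensor or post-grad graph code):

```python
import torch

A = torch.randn(1, 8192, 2048)       # 3D A tensor referenced by the fused op
A_scale = torch.rand(1, 8192, 1)     # row-wise scale, reshaped with A

A_2d = A.reshape(-1, A.shape[-1])    # (8192, 2048)
A_scale_2d = A_scale.reshape(-1, 1)  # (8192, 1)
recip = A_scale_2d.reciprocal()      # the reciprocal the compiler inserts

# The fix: reshape the reciprocal output back to the pre-reshape scale shape
# so its dims line up with the 3D A tensor again; it is only a view.
recip_3d = recip.reshape(A_scale.shape)
print(recip_3d.shape)                # torch.Size([1, 8192, 1])
```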

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-02 03:25:28 +00:00
ce2f680e00 [fr] Added protection against missing stack frames in fr (#148203)
Summary: We have quite a few failures due to this unprotected access. https://fburl.com/scuba/ai_rca_debug_tracing/qtnb63qf

Test Plan:

Reviewed By: fduwjj

Differential Revision: D70358287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148203
Approved by: https://github.com/fduwjj
2025-03-02 01:03:49 +00:00
19de523de6 [MPS] metal unary kernel for sqrt (#148272)
Issue #148219 highlighted the high dispatch times of ops that run through MPS Graph on smaller tensors. This PR rewrites sqrt with a Metal kernel to mitigate that issue.

## Speedups:

Matrix size means NxN matrix here.
![speedup_sqrt](https://github.com/user-attachments/assets/db0a705b-1a0e-42b4-bd42-4e7960415c81)

Code to generate the timings (requires building torch with and without this change):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [1, 100, 1000, 10_000]
num_runs = 1000
warmup_runs = 3

def run_sqrt(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = torch.sqrt(A)
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.rand((n, n), dtype=torch.float32, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_sqrt(A_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_sqrt(A_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('sqrt_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148272
Approved by: https://github.com/malfet
2025-03-02 00:45:45 +00:00
1a6883759d Fix macro for bit_cast in c10/util/bit_cast.h - one line change (#148265)
Fixes #148263.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148265
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-03-01 20:55:31 +00:00
1919e0de9a Revert "stage 1 of depreate silent fallback of tuning gemm (#147798)"
This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8.

Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))
2025-03-01 20:04:23 +00:00
82603fd7d2 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
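
A toy sketch of the idea (hypothetical helper, not the library's actual API): compare shapes across example inputs and report which dimensions vary.

```python
def detect_dynamism(example_shapes):
    """Map each input name to the set of dims whose size changes across calls."""
    dynamism = {}
    for name, shapes in example_shapes.items():
        first = shapes[0]
        dynamism[name] = {
            dim
            for shape in shapes[1:]
            for dim, (a, b) in enumerate(zip(first, shape))
            if a != b
        }
    return dynamism

shapes = {"x": [(1, 128), (4, 128), (7, 128)], "y": [(16, 16), (16, 16)]}
print(detect_dynamism(shapes))  # {'x': {0}, 'y': set()}
```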

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk, https://github.com/wdvr
2025-03-01 19:57:54 +00:00
5301710b15 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/cuda/detail/CUDAHooks.cpp +4 (#147555)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69945678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-03-01 19:46:13 +00:00
0ff2e6a85a Fix None and equal_to_1 arguments issue in Triton kernel generated by AOTI (#148102)
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
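
A toy illustration of the invariant being restored (plain Python, not the real AOTI wrapper code): the equal_to_1 indices must be derived from the original argument list on every call, so repeated calls do not compound the index shift.

```python
def drop_nones_and_find_ones(args):
    kept = [a for a in args if a is not None]
    equal_to_1 = [i for i, a in enumerate(kept) if a == 1]
    return kept, equal_to_1

args = [10, None, 1, 7]
print(drop_nones_and_find_ones(args))  # ([10, 1, 7], [1])
print(drop_nones_and_find_ones(args))  # same result on the second call
```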

Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```

Differential Revision: D69998314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-03-01 18:38:33 +00:00
2b86309da3 separate f16 vectorized class from bf16 (#146596)
Separating the f16 vectorized class into a different file from the bf16 vectorized class in order to be able to add a new bf16 SVE vectorized class in https://github.com/pytorch/pytorch/pull/143666. This is required as we would need to exclude the current bf16 class in order to use the sve bf16 class but still include the current f16 vectorized class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146596
Approved by: https://github.com/malfet
2025-03-01 18:22:32 +00:00
8e004865dd Revert "introduce dynamism library (#147981)"
This reverts commit 1c1bf410ecdeac8d240e15bf8c33c0f00fab0673.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/malfet due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2692351017))
2025-03-01 18:16:52 +00:00
a983b2b11a Revert "Initial implementation of host memory stats (#147660)"
This reverts commit 945e359fc1afe6c0bb6129ed9607b237fa19cd98.

Reverted https://github.com/pytorch/pytorch/pull/147660 on behalf of https://github.com/mradmila due to There is an issue with ambiguous definition of Stat structure when different C++ tools are used. Backing out for now. ([comment](https://github.com/pytorch/pytorch/pull/147660#issuecomment-2692346379))
2025-03-01 18:05:45 +00:00
d23051f29b [Inductor] Support parallel reduction for GroupNorm (#144020)
Summary:
Support parallel reduction for GroupNorm by optimizing the parallelization heuristics: When the range of the first inner loop is much larger than the range of all outer loops, change the starting depth of parallelization to the first inner loop.
I tested the Inductor benchmark with this PR on CPU. One torchbench model (pytorch_CycleGAN_and_pix2pix) achieved ~45% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) achieved ~2% performance improvement.

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

x = torch.randn(2, 64, 168, 168).to(memory_format=torch.channels_last)
m = GN(2, 64).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    out = compiled_m(x)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(56448L));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        #pragma omp single
        {
            {
                #pragma GCC ivdep
                for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
                    {
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                        {
                            {
                                if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                                {
                                    auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                                    auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                                    auto tmp1 = static_cast<float>(903168.0);
                                    auto tmp2 = tmp0 / tmp1;
                                    auto tmp3 = static_cast<float>(1e-05);
                                    auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                                    auto tmp5 = 1 / std::sqrt(tmp4);
                                    auto tmp7 = at::vec::Vectorized<float>(tmp5);
                                    auto tmp8 = tmp7 * tmp6;
                                    auto tmp10 = decltype(tmp9)(-tmp9);
                                    auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                    auto tmp12 = tmp11 * tmp8;
                                    auto tmp14 = tmp12 + tmp13;
                                    tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                    tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                                }
                            }
                        }
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       float* out_ptr3,
                       float* out_ptr4)
{
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                {
                    Welford<float> tmp_acc0 = Welford<float>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                    Welford<at::vec::Vectorized<float>> tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    Welford<float> tmp_acc0_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        tmp_acc0_arr[i] = Welford<float>();
                    }
                    Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_arr[56];
                    for (int i = 0; i < 56; i++)
                    {
                        masked_tmp_acc0_vec_arr[i] = Welford<at::vec::Vectorized<float>>();
                    }
                    #pragma omp parallel num_threads(56)
                    {
                        int tid = omp_get_thread_num();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(1008L));
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        Welford<float> tmp_acc0_local = Welford<float>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec_local = Welford<at::vec::Vectorized<float>>();
                        #pragma omp for
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(28224L); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(16L))
                            {
                                {
                                    if(C10_LIKELY(x3 >= static_cast<int64_t>(0) && x3 < static_cast<int64_t>(32L)))
                                    {
                                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + 32L*x1 + 64L*x2 + 1806336L*x0), static_cast<int64_t>(16));
                                        tmp_acc0_vec_local = welford_combine(tmp_acc0_vec_local, tmp0, &wrecps0);
                                    }
                                }
                            }
                        }
                        tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
                        tmp_acc0_arr[tid] = tmp_acc0_local;
                        masked_tmp_acc0_vec_arr[tid] = masked_tmp_acc0_vec_local;
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp_acc0_vec_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                    }
                    for (int tid = 0; tid < 56; tid++)
                    {
                        masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, masked_tmp_acc0_vec_arr[tid]);
                    }
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                    tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                    out_ptr0[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.mean);
                    out_ptr1[static_cast<int64_t>(x1 + 2L*x0)] = static_cast<float>(tmp_acc0.m2);
                }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
        {
            #pragma GCC ivdep
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2L); x1+=static_cast<int64_t>(1L))
            {
                for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(32L); x2+=static_cast<int64_t>(16L))
                {
                    {
                        if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(32L)))
                        {
                            auto tmp0 = out_ptr1[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp9 = out_ptr0[static_cast<int64_t>(x1 + 2L*x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x2 + 32L*x1), static_cast<int64_t>(16));
                            auto tmp1 = static_cast<float>(903168.0);
                            auto tmp2 = tmp0 / tmp1;
                            auto tmp3 = static_cast<float>(1e-05);
                            auto tmp4 = decltype(tmp2)(tmp2 + tmp3);
                            auto tmp5 = 1 / std::sqrt(tmp4);
                            auto tmp7 = at::vec::Vectorized<float>(tmp5);
                            auto tmp8 = tmp7 * tmp6;
                            auto tmp10 = decltype(tmp9)(-tmp9);
                            auto tmp11 = at::vec::Vectorized<float>(tmp10);
                            auto tmp12 = tmp11 * tmp8;
                            auto tmp14 = tmp12 + tmp13;
                            tmp8.store(out_ptr2 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                            tmp14.store(out_ptr3 + static_cast<int64_t>(x2 + 32L*x1 + 64L*x0));
                        }
                    }
                }
            }
        }
    }
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2L); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(28224L); x1+=static_cast<int64_t>(1L))
                {
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(64L); x2+=static_cast<int64_t>(16L))
                    {
                        {
                            if(C10_LIKELY(x2 >= static_cast<int64_t>(0) && x2 < static_cast<int64_t>(64L)))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0), static_cast<int64_t>(16));
                                auto tmp1 = at::vec::Vectorized<float>::loadu(out_ptr2 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp3 = at::vec::Vectorized<float>::loadu(out_ptr3 + static_cast<int64_t>(x2 + 64L*x0), static_cast<int64_t>(16));
                                auto tmp2 = tmp0 * tmp1;
                                auto tmp4 = tmp2 + tmp3;
                                tmp4.store(out_ptr4 + static_cast<int64_t>(x2 + 64L*x1 + 1806336L*x0));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144020
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 17:11:50 +00:00
cyy
8bf3920279 Remove unneeded Clang-tidy suppression (#148246)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148246
Approved by: https://github.com/Skylion007
2025-03-01 16:51:54 +00:00
1c1bf410ec introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.
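
A rough sketch of the idea (the function name and return format are assumptions, not the actual library API): compare several sets of example inputs and report which tensor dimensions vary across them.

```python
import torch

def detect_dynamism(*example_inputs):
    """Return, per input name, which dims change across the example calls."""
    dynamism = {}
    for name in example_inputs[0]:
        shapes = [tuple(inputs[name].shape) for inputs in example_inputs]
        dynamism[name] = {
            dim: len({shape[dim] for shape in shapes}) > 1
            for dim in range(len(shapes[0]))
        }
    return dynamism

print(detect_dynamism(
    {"x": torch.randn(2, 8)},
    {"x": torch.randn(4, 8)},
))  # {'x': {0: True, 1: False}}
```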

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 14:57:06 +00:00
3a0c9f7f9d [MPS] Fix SDPA crash (#148239)
If the operation is invoked with a mask twice, it crashes, because the mask-expansion logic was implemented inside the cache-creation block, which is executed only once for all shapes.

Fixes https://github.com/pytorch/pytorch/issues/148194 which is a regression introduced by https://github.com/pytorch/pytorch/pull/147545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148239
Approved by: https://github.com/dcci
2025-03-01 13:06:51 +00:00
735d7b1af6 [EZ][BE] Increase tolerances for interpolate op (#148224)
Not sure why the tolerances were set like that; this logic was added in https://github.com/pytorch/pytorch/pull/104181 without much explanation.
But if I were to guess, it's likely due to the inaccuracy of the bilinear op, which has since been replaced by a shader.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148224
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #148154, #148187, #148211
2025-03-01 13:03:59 +00:00
762724f3d0 [Break XPU][Inductor] Generalize device-bias code and fix test_graph_partition for XPU (#148178)
This PR generalizes the device-bias code introduced by #147038 and aligns the behavior between XPU and CUDA for the add + mm + pointwise pattern (for XPU, from addmm + pointwise to mm + fused_add_pointwise), which fixes the failing test case `test_graph_partition` on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148178
Approved by: https://github.com/benjaminglass1, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #148155
2025-03-01 10:59:55 +00:00
ab78bf5c66 [Break XPU][Inductor UT] Avoid custom op registration conflicts in test_auto_functionalize.py. (#148155)
Fix #148148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148155
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-01 10:59:55 +00:00
2f1b8e0fe2 [DTensor][Test] Add a test to demonstrate current dtensor view behavior if redistribution happens (#148015)
This does not fix the view op issue when redistribution happens. We want to add a test to demonstrate/record the issue, in which the distributed behavior does not match up with single device behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148015
Approved by: https://github.com/XilunWu
2025-03-01 10:24:40 +00:00
191c9bd013 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit b8efebe57d05a87be5b0f304218d2af7bb2bf6c6.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/davidberard98 due to looks like another lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2692042859))
2025-03-01 07:43:58 +00:00
fe3b9e3764 [Inductor] optimize the heuristics of outer loop fusion (#147523)
Summary:
Optimize the heuristics of outer loop fusion: when the range of the first inner loop is much larger than the range of all outer loops, do not fuse the outer loops and fall back to standard codegen.
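
A rough sketch of the shape of this heuristic, assuming a simple ratio threshold (the names and the threshold value are illustrative, not the actual Inductor code):

```python
def should_fuse_outer_loops(outer_ranges, first_inner_range, threshold=32):
    """Skip outer-loop fusion when the first inner loop dominates the work."""
    outer_total = 1
    for r in outer_ranges:
        outer_total *= r
    return first_inner_range <= threshold * outer_total

# Example: a tiny 2x2 outer iteration space over a 28224-element inner loop
# would not be fused and falls back to standard codegen.
print(should_fuse_outer_loops([2, 2], 28224))  # False
```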

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147523
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5
2025-03-01 06:50:04 +00:00
b8efebe57d [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post-grad pass, the scaled_mm `A tensor` node references the tensor _before_ the reshape op, but references the `A_scale` node _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal to reshape the reciprocal output back to the original shape it had before the reshape (a minimal sketch follows this section).
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`
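
A minimal FX-level sketch of the short-term fix (the helper name and the way the reciprocal node is located are assumptions, not the actual post-grad pass):

```python
import torch
import torch.fx as fx

def restore_scale_shape(graph: fx.Graph, reciprocal_node: fx.Node, orig_shape):
    # Insert a reshape right after the reciprocal so downstream users see the
    # scale in its original (pre-flatten) shape again; a reshape is just a view,
    # so this should not affect performance.
    with graph.inserting_after(reciprocal_node):
        reshaped = graph.call_function(
            torch.ops.aten.reshape.default,
            args=(reciprocal_node, list(orig_shape)),
        )
    # Point every existing user at the reshaped node, except the reshape itself.
    reciprocal_node.replace_all_uses_with(
        reshaped, delete_user_cb=lambda n: n is not reshaped
    )
    return reshaped
```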

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-03-01 06:38:39 +00:00
fd16311e7f [inductor][subgraph] Plumbing to get ShapeAsConstantBuffer from subgraph to main graph output (#147559)
I am unable to create a test case that fails without the next PR. The idea is to have a symint which is returned by the inner subgraph and then returned by the forward graph after partitioning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147559
Approved by: https://github.com/eellison
2025-03-01 06:17:11 +00:00
c87097e74a [triton 3.3] Fix inductor/test_profiler.py test (#148230)
test_inductor_profiling_kernel_names_pointwise is checking that the profiler correctly records the input shapes to the kernel. After triton 3.3, we get a different number of args (because the constexpr args are passed in, from the python perspective). This just patches the test to pass in either case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148230
Approved by: https://github.com/drisspg, https://github.com/YUNQIUGUO
2025-03-01 04:27:49 +00:00
9377a32cd1 [Inductor][NFC] Remove unused functions from compile_tasks.py (#147564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147564
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2025-03-01 03:44:43 +00:00
baf1c8fcdc Revert "introduce dynamism library (#147981)"
This reverts commit 6eff6b28e4d09cbf632f79502a8e317bf5b53c34.

Reverted https://github.com/pytorch/pytorch/pull/147981 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/147981#issuecomment-2691906065))
2025-03-01 03:43:01 +00:00
493cd97af5 add skips to test_notifies_oom and test_set_per_process_memory_fraction (#148134)
Tests fail in NVIDIA internal CI since we do not support nvml on Jetson, but nvml is required for OOM reporting to work properly, so we are skipping the failing tests for now.
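
A hedged sketch of the kind of skip being added (the detection helper and test name below are illustrative; the real tests use the suite's own Jetson/nvml detection):

```python
import unittest
import torch

def nvml_available() -> bool:
    try:
        import pynvml  # noqa: F401  # nvml bindings; unavailable on Jetson
        return True
    except ImportError:
        return False

class TestCudaOOMReporting(unittest.TestCase):
    @unittest.skipIf(
        not torch.cuda.is_available() or not nvml_available(),
        "OOM reporting requires CUDA and a working nvml",
    )
    def test_notifies_oom(self):
        ...  # the actual OOM-reporting assertions live in the CUDA test suite
```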

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148134
Approved by: https://github.com/eqy
2025-03-01 02:59:48 +00:00
6eff6b28e4 introduce dynamism library (#147981)
This is the first step in supporting delayed compile. This library takes in example inputs and outputs a dict of dynamism across the inputs. We will use this to detect dynamism across multiple inputs in delayed compile. We will also use this to make shape collections more ergonomic by providing an affordance to generate a shape collection using example inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147981
Approved by: https://github.com/pianpwk
2025-03-01 02:49:16 +00:00
08434df1f2 [MPS] fix empty place holder error for smooth l1 loss (#148133)
Fixes #123171

And parametrizes the tests for it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148133
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-01 02:32:45 +00:00
02c5f21541 [Inductor] fix AOTInductorTestABICompatibleGpu.test_triton_kernel_weird_param_order with new Triton (#148011)
In this case, the parameters have already been filtered [here](201666d77d/torch/_inductor/codegen/cpp_wrapper_gpu.py (L335)) and subsequent filtering is not only unnecessary, it breaks the code, since the positions of the parameters change after filtering. For this test, for example, the second filtering discarded `buf0`.

For example:
```python
(Pdb) triton_meta["signature"]
{'in_ptr0': '*fp32', 'in_ptr1': '*fp32', 'n_elements': 'i32', 'BLOCK_SIZE': 'constexpr', 'out_ptr': '*fp32'}
(Pdb) call_args
['arg0_1', 'arg0_1', '256L', 'buf0']
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148011
Approved by: https://github.com/davidberard98
2025-03-01 01:21:20 +00:00
338ed67a1e [inductor] Implement max_pool2d_with_indices as a reduction for large window sizes (#147876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147876
Approved by: https://github.com/eellison
2025-03-01 01:07:01 +00:00
230a3b0f83 Add cuda 11.8 guard for cufile preload (#148184)
Follow up after https://github.com/pytorch/pytorch/pull/148137
Make sure we don't try to load cufile on CUDA 11.8

Test:
```
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu118'
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148184
Approved by: https://github.com/mikaylagawarecki
2025-03-01 01:01:04 +00:00
2544afaa1a [DeviceMesh] Add some documentation for from_group API and add a 2D test (#146364)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146364
Approved by: https://github.com/fduwjj
2025-03-01 00:57:37 +00:00
5d297f7a34 [MPS][BE] Combine two upsample_kernel_out_template into one (#148211)
- First, stop inverting sizes and strides, i.e. pass them as-is but read them in inverse order in the shader, since the 1st stride of a 4D tensor is the one used for batches, the 2nd for channels, and the 3rd and 4th for spatial coordinates
- Pass `scales` as float2 even in the linear case
The above allows collapsing the two flavors of `upsample_kernel_out_template` into one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148211
Approved by: https://github.com/dcci
ghstack dependencies: #148154, #148187
2025-03-01 00:39:26 +00:00
clr
83fb974b5d scriptfunction: Make sure we have valid __name__ and __qualname__ (#147906)
It's not fully clear why these are not being created, but you can definitely reproduce this in code. `__name__` is fun, since there appears to be no way to explicitly set it at the pybind11 or C++ layer. I've set this in the Python wrapper code (which works correctly), but let me know if people feel strongly and want us to explicitly cast to Python within the C++ functions and set it there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147906
Approved by: https://github.com/jansel
ghstack dependencies: #147894
2025-02-28 23:25:47 +00:00
1ae7cc41ca Define __all__ for torch.utils.tensorboard (#147550)
Fixes the issue:

```python
import torch.utils.tensorboard
torch.utils.tensorboard.FileWriter  # pyright: "FileWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.RecordWriter  # pyright: "RecordWriter" is not exported from module "torch.utils.tensorboard"
torch.utils.tensorboard.SummaryWriter  # pyright: "SummaryWriter" is not exported from module "torch.utils.tensorboard"
```

The [docs page for `torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147550
Approved by: https://github.com/albanD
2025-02-28 23:06:11 +00:00
3a69dee955 [Submodule][FlashAttention] Bump to 2.7.4 (#148147)
# Summary
This makes me happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148147
Approved by: https://github.com/Skylion007
2025-02-28 22:40:02 +00:00
83ec7cdcd4 Fix recompile reason logging (#148200)
for the following test case

```
        @torch.compile(dynamic=False, backend=cnts)
        def fn(x, y, z):
            return x * y * z[0]

        fn(1, torch.randn(1), {0: torch.randn(1)})
        fn(2, torch.randn(2), {0: torch.randn(2)})
        fn(3, torch.randn(3), {0: torch.randn(3)})
        fn(4, torch.randn(4), {0: torch.randn(4)})
        fn(5, torch.randn(5), {0: torch.randn(5)})
```

previously we would log

```
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
0/0: L['x'] == 1
```

but after this change we now log

```
0/0: L['x'] == 1
0/1: L['x'] == 2
0/2: L['x'] == 3
0/3: L['x'] == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148200
Approved by: https://github.com/xmfan
2025-02-28 22:33:37 +00:00
40b3e4a358 [dynamo] expose code execution strategy to python (#148020)
@anijain2305 this can be used to mark a code object to be skipped/run-only (recursively) while tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148020
Approved by: https://github.com/jansel
2025-02-28 21:59:12 +00:00
e74fdbe6d0 [inductor] ignore block ptr advancements for removed buffers (#148087)
Follow up to https://github.com/pytorch/pytorch/pull/147193. Some buffers are removed only when the kernel context is exited so defer the lines instead.

Added `use_block_ptr` as a parameter to test case that fails if run with block ptrs enabled.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148087
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-28 21:31:15 +00:00
d174562487 [MPS][BE][EZ] Aggregate macros (#148187)
Refactor `INSTANTIATE_UPSAMPLE_BILINEAR2D(DTYPE)`, `INSTANTIATE_UPSAMPLE_BICUBIC2D(DTYPE)` and `INSTANTIATE_UPSAMPLE_BILINEAR2DAA(DTYPE)` use common `INSTANTIATE_UPSAMPLE2D`
Then combine multiple invocations into `INSTANTIATE_UPSAMPLE_ALL`

I.e. functionally it's a no-op, but achieves the same with fewer lines of code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148187
Approved by: https://github.com/Skylion007
ghstack dependencies: #148154
2025-02-28 21:30:00 +00:00
4995e058bf [user-triton] handle inline_asm_case (#148043)
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case where "is_pure" is set to True, since it indicates the operation doesn't mutate its input values.

Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel

```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok

----------------------------------------------------------------------
Ran 2 tests in 0.706s

OK
```

Differential Revision: D69878591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
2025-02-28 20:52:51 +00:00
6f91720e1c [inductor][ck] manual kBatch heuristic (#148118)
Summary:
# Why

Leverage kBatch parameter for large splitK examples for CK for better than ATEN performance

# What

replace default kBatch = 1 with a manual heuristic

- if K > 16 * max (M,N)
- leverage k_per_block, and K and number of SMs on the chip
- upper bound to 128, lower bound to 1

This is better than defaulting to 1, cheap to calculate, and shows performance beyond ATEN

This is of course subject to change and improvement
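
A rough sketch of this heuristic in Python (the exact formula, clamping, and parameter values are assumptions based on the description above, not the actual CK backend code):

```python
def pick_kbatch(m: int, n: int, k: int, k_per_block: int, num_sms: int) -> int:
    """Manual split-K heuristic: only kick in for very large K."""
    if k <= 16 * max(m, n):
        return 1  # not a large split-K problem; keep the default
    # Spread K across the chip, bounded to [1, 128].
    kbatch = k // (k_per_block * num_sms)
    return max(1, min(128, kbatch))

# Shape from the test plan below; k_per_block and num_sms are illustrative values.
print(pick_kbatch(2048, 2048, 524288, k_per_block=64, num_sms=304))
```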

Test Plan:
with minor modifications to run torch.mm on the shape `M, N, K = 2048, 2048, 524288`

```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

```
AUTOTUNE mm(2048x524288, 524288x2048)
  rocm_ck_gemm_template_49 10.4972 ms 100.0%
  rocm_ck_gemm_template_8 10.6132 ms 98.9%
  rocm_ck_gemm_template_9 10.6907 ms 98.2%
[...]
  mm 18.9880 ms 55.3%
```

Reviewed By: ColinPeppler

Differential Revision: D70224591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148118
Approved by: https://github.com/ColinPeppler
2025-02-28 20:36:16 +00:00
48c55a66ec [ROCm] Move ROCm unstable MI300 jobs back to stable (#146675)
Fixes #145790
Needs #145504 to be merged first to resolve an artifact uploading issue with MI300 runners.

This PR moves the ROCm unstable MI300 jobs back to stable. The change to unstable was introduced through this [PR](https://github.com/pytorch/pytorch/pull/145790). This was because the MI300s were failing with a [docker daemon](https://github.com/pytorch/pytorch/actions/runs/13015957622/job/36306779536) issue, which has since been resolved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146675
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-28 20:34:27 +00:00
6778084531 [inductor][cutlass] Environment variables for allow/denylist (#148161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148161
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-28 20:33:10 +00:00
5a1954eb93 [Inductor-CPU] Fix broken int8 WoQ GEMM AMX implementation in main (#147895)
#146843 broke int8 WoQ GEMM's (for BF16 activation) AMX ISA implementation in the main branch.
UT: `python test/inductor/test_cpu_select_algorithm.py -v -k woq`

The issue remained undetected because in case of templated kernel compilation failure, the auto-tuning infra marks its runtime as `inf`, and the op against which it was being benchmarked is used, so UTs didn't fail even on machines that support AMX ISA.

`test/inductor/test_cpu_select_algorithm.py` UTs checked the value of the `select_algorithm_autotune` counter, which only counts how many ops were selected for autotuning against their templated codegened counterparts.

@leslie-fang-intel advised using a new counter. I added `counters["inductor"]["cpp_templated_kernel_counter"]`, which is incremented after a codegened kernel's compilation, so it'd help catch breakage scenarios in which a templated kernel could not be codegened due to a compilation failure.
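
A hedged sketch of how a test can use the new counter to catch a silent template-compilation failure (the counter name comes from the description above; the surrounding scaffolding is illustrative):

```python
from torch._dynamo.utils import counters

counters.clear()
# ... compile and run the templated int8 WoQ GEMM under test here ...
assert counters["inductor"]["cpp_templated_kernel_counter"] > 0, (
    "templated kernel was never code-generated, so the template path silently fell back"
)
```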

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147895
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2025-02-28 20:20:45 +00:00
clr
e0e516c554 Don't crash when we call __qualname__ on torch._C.ScriptFunction (#147894)
We've root-caused this to correctly throwing AttributeError on ScriptFunction when missing attributes are accessed. This PR fixes the crashes that are showing up. I'm going to stack a second PR to fix torch._C.ScriptFunction being a very badly behaved Python object (which should also fix this).
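
An illustrative repro of the kind of attribute access involved (whether this exact snippet triggered the crash is an assumption based on the description above):

```python
import torch

@torch.jit.script
def add_one(x: torch.Tensor) -> torch.Tensor:
    return x + 1

# Accessing dunder attributes on the resulting torch._C.ScriptFunction should
# return sensible values instead of crashing.
print(add_one.__name__)
print(add_one.__qualname__)
```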

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147894
Approved by: https://github.com/jansel
2025-02-28 20:15:38 +00:00
297c00264e stage 1 of depreate silent fallback of tuning gemm (#147798)
Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/)

context:
https://github.com/pytorch/pytorch/issues/147479

For the most part, this should not change the behavior.

For int_mm, I also removed
```
    # TODO: Re-enable eager mode implementation once cuBLAS is fixed
    if use_cutlass or use_triton_template(layout, enable_int32=True):
        choices = []
```
because I think it is unwanted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798
Approved by: https://github.com/eellison
2025-02-28 19:51:55 +00:00
ebc3f27bf4 Revert "[async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)"
This reverts commit 6e037ac41c095dfdb37fdd4b36bf8ec2ebf84bf1.

Reverted https://github.com/pytorch/pytorch/pull/148001 on behalf of https://github.com/wdvr due to lint error ([comment](https://github.com/pytorch/pytorch/pull/148001#issuecomment-2691421540))
2025-02-28 19:44:54 +00:00
42aeb5d259 Resolve zip file permission issue when uploading artifacts on ROCm MI300 CI runners (#145504)
E.g.: https://github.com/pytorch/pytorch/actions/runs/13500418791/job/37719437613#step:19:120
```
Beginning upload of artifact content to blob storage
Error: An error has occurred while creating the zip file for upload
Error: EACCES: permission denied, open '/home/runner/_work/pytorch/pytorch/test/test-reports/backends.xeon.test_launch_1.1_22ba1133f3fcd140_.log'
/home/runner/_work/_actions/actions/upload-artifact/v4/dist/upload/index.js:3459
    throw new Error('An error has occurred during zip creation for the artifact');
    ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145504
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-02-28 19:16:28 +00:00
6e037ac41c [async TP] insert reshape node to handle "reshape -> scaled mm -> reshape pattern" in async TP with rowwise scales (#148001)
Fixes https://github.com/pytorch/torchtitan/issues/864

## Summary
While testing torchtitan with float8 training with rowwise scaling + async TP, a [bug](https://github.com/pytorch/torchtitan/issues/864) was discovered. The symptom was the scaling factor dims did not match the dims of the tensor the scales were to be applied to.

My [root cause analysis](https://github.com/pytorch/torchtitan/issues/864#issuecomment-2672465060) determined the reason is that when async TP graph manipulation constructs the `fused_scaled_matmul_reduce_scatter` op, it does not yet handle the "reshape -> scaled mm -> reshape" pattern used in torchao [here](ed361ff5c7/torchao/float8/float8_linear.py (L122-L124)) - specifically when row-wise scales are being used.

## TL;DR of root cause
- When a Float8Tensor is reshaped, the scale is reshaped along with it so the dimensions are aligned.
- In the graph manipulation logic of the micropipeline TP post-grad pass, the scaled_mm `A tensor` node references the tensor _before_ the reshape op, but references the `A_scale` node _after_ the reshape op.

## Example
- Concrete example:
    - `A tensor` is a Float8Tensor with shape (1,8192,2048) and scale of shape (1,8192,1) when a matmul op is called in torchao [here](8706d3f3b0/torchao/float8/float8_linear.py (L70)). Torchao does a reshape -> scaled mm -> reshape [here](ed361ff5c7/torchao/float8/float8_linear.py (L122)). When a Float8Tensor is reshaped, its scale is reshaped along with it [here](8706d3f3b0/torchao/float8/float8_ops.py (L152)). So the first reshape makes the "A tensor" (1,8192,2048) => (8192,2048) and the scale (1,8192,1) => (8192,1).
    - During post grad pass in async TP:
        - `A_node` has shape (1,8192,2048) (tensor from before this [reshape](ed361ff5c7/torchao/float8/float8_linear.py (L122)))
        - `A_scale` has shape (8192,1) (due to reshape op above, which caused the scale to be reshaped from (1,8192,1) => (8192,1)).

## Solution

**Note:** the compiler inserts a `reciprocal` op after the reshape, so we can't simply use the node before the reshape as the `A_scale_node`, otherwise it will affect the numerics.

- Short-term solution: if the specific pattern shown below is detected, insert a reshape node after the reciprocal to reshape the reciprocal output back to the original shape it had before the reshape.
    - reshape is just a view, so there should be no impact on performance
```
Before:
    reshape (a,b,c) to (a*b,c) -> reciprocal

After:
    reshape (a,b,c) to (a*b,c) -> reciprocal -> reshape (a*b,c) to (a,b,c)
```

- Long-term solution: implement a `torch._scaled_matmul` which can support 3D+ `A tensor`

## Test plan
- Added unit test which exercises this new path
- Manually tested with torchtitan with float8 rowwise + async TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148001
Approved by: https://github.com/yifuwang
2025-02-28 18:51:42 +00:00
945e359fc1 Initial implementation of host memory stats (#147660)
This is an initial attempt to provide some statistics for the pinned host memory allocations flowing through CachingHostAllocator. Many times in the past we have had inexplicable slowdowns that would be much easier to diagnose if we had some host memory characteristics.

This change tries very hard not to disrupt the initial design of the allocator, and it uses the existing locking mechanisms, whenever possible, to gather statistics "for free". The only deviation from that is on the "slow path", where we incur CUDA calls anyway, so taking a short lock is not going to hurt performance much, especially in the steady state where most allocations will come from the cache.

As mentioned before, this is the first PR, to introduce the concept and to see if it fits the right paradigm. We can always add more later.

Metrics that would require more involved changes to the code base and locks, like requested memory, have been punted for now. I also tried to reuse the Stat structure used in CUDA caching allocator, in order to maintain symmetry.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147660
Approved by: https://github.com/ngimel
2025-02-28 18:36:44 +00:00
982d7ba3ef [while_loop][inductor] relax the constraint that all inputs must be on the same device (#148019)
Previously, we required all inputs of while_loop to be on the same device. However, there are use cases where we want to keep some of the inputs on CPU while others are on GPU, e.g. keeping a loop_idx on CPU saves copies to the GPU device. This PR relaxes the constraint and only checks that each carry and the input at the same position are on the same device.
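
A minimal sketch of the relaxed check (illustrative only, not the actual Inductor code): instead of requiring a single common device, only require that each carried value and its corresponding input share a device.

```python
import torch

def check_while_loop_devices(carries, inputs):
    for i, (carry, inp) in enumerate(zip(carries, inputs)):
        if carry.device != inp.device:
            raise RuntimeError(
                f"while_loop carry #{i} is on {carry.device} "
                f"but the corresponding input is on {inp.device}"
            )

# Mixed devices are now fine as long as positions match, e.g. a CPU loop index
# alongside GPU tensors (uncomment if a CUDA device is available):
# check_while_loop_devices(
#     [torch.tensor(0), torch.zeros(4, device="cuda")],
#     [torch.tensor(0), torch.zeros(4, device="cuda")],
# )
```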

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-28 18:27:03 +00:00
2d2f60bdda [cond] support mismatched output in inductor (#147567)
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in the wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only does it for cond; we'll have follow-up PRs for the others (e.g. while_loop) as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
2025-02-28 18:26:48 +00:00
d765077004 [cutlass backend] Sort the list of ops for better repro (#148047)
Differential Revision: [D70298051](https://our.internmc.facebook.com/intern/diff/D70298051/)

This only affects anything if `cutlass_max_profiling_configs` is used. I believe cutlass_max_profiling_configs is more of a testing config.

The problem is that when we get the configs from cutlass_library, the ops can come in different orders.

The motivation is to make reproducing small issues easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148047
Approved by: https://github.com/chenyang78, https://github.com/coconutruben
2025-02-28 18:04:10 +00:00
790ec756ee [cutlass backend] Check if len(timings) == len(choices) before skipping precompile (#148050)
Differential Revision: [D70298908](https://our.internmc.facebook.com/intern/diff/D70298908/)

Mostly from @coconutruben observation. Right now, we skip precompilation if we find **some** timings. That sounds like a bug. Most of the time it is fine, since we don't change the number of configs and triton compilation doesn't take too long. But it is devastating for cutlass backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148050
Approved by: https://github.com/coconutruben
2025-02-28 17:58:58 +00:00
e5e31050d3 [MPS] Implement linear1d as shader (#148154)
And get rid of the MPS call, as for some reason the implementation via the MPSGraph API is 100x+ slower than the Metal shader, at least according to the following benchmark
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(3, 5, 65536, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="linear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="linear")
    torch.mps.synchronize()  # wait for queued MPS work to finish before reading the timer
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS        |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16  ")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results after the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    2.5  |   2.1  |   2.2  | 161.4  | 115.0  | 161.1
```
And before the change
```
Benchmarking Results (collected on Apple M2 Pro):
----------------------------------------
Device            :                MPS        |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  354.0  | 336.0  | 332.4  | 145.5  | 114.7  | 148.3
```

Fixes https://github.com/pytorch/pytorch/issues/144245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148154
Approved by: https://github.com/dcci
2025-02-28 16:47:42 +00:00
b5cd4ac950 [torchgen] Add support for schema with namespace (#148038)
Fixes https://github.com/pytorch/executorch/issues/8711

In ExecuTorch when we try to parse the following schema:

```
aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor
```
Repro:

```python
from torchgen.model import FunctionSchema
native_schema = FunctionSchema.parse("aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor")
```
It's failing because `BaseOperatorName` categorizes it as an in-place operator.

I understand we are not supposed to pass the "aten::" namespace into `FunctionSchema.parse()`, but unfortunately ExecuTorch requires this feature to work.

This PR adds a new `namespace` attribute to `BaseOperatorName` and makes sure the rest of the stack works as before if a schema without a namespace is passed in.
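
A hedged usage sketch after this change (the exact attribute path on the parsed schema is an assumption based on the description above):

```python
from torchgen.model import FunctionSchema

schema = FunctionSchema.parse(
    "aten::__lshift__.Scalar(Tensor self, Scalar other) -> Tensor"
)
base = schema.name.name   # OperatorName -> BaseOperatorName
print(base.namespace)     # expected: "aten"
print(base.inplace)       # expected: False (no longer misclassified as in-place)
```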
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148038
Approved by: https://github.com/bdhirsh
2025-02-28 16:41:50 +00:00
e593288859 ci: Remove manylinux builds for triton, except for XPU (#148129)
We're dropping regular old manylinux so let's drop it here too

Relates to #123649

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148129
Approved by: https://github.com/Camyll, https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #148126
2025-02-28 16:23:18 +00:00
4708cfdbd9 Support whitelist of dynamic sources (#147979)
This PR introduces the ability to whitelist sources as dynamic. This is particularly useful for large models with graph breaks, as you can keep the dynamism across graph breaks since source names stay consistent. Additionally you can use this to mark ints as dynamic.

NB: I intentionally didn't complicate the interface by supporting specification of per-dimension dynamism. There is virtue in keeping true to the standard way of representing sources (e.g. L['x']). If we find in practice that we need more fine-grained control, we can explore further affordances at that time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147979
Approved by: https://github.com/Mingming-Ding
2025-02-28 15:43:14 +00:00
0a948f705b [Dynamo] Fix AssertionError when dynamo traces torch.functional.xxx() functions (#148075)
Fixes #147840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148075
Approved by: https://github.com/yanboliang
2025-02-28 15:09:11 +00:00
1db3c58fab Remove manylinux 2014 artifacts (#148135)
1. Switch Magma build to Manylinux 2.28 base
2. Use manylinux 2.28 as default in populate_binary_env.sh
3. Remove manylinux 2014 docker builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148135
Approved by: https://github.com/malfet
2025-02-28 13:43:14 +00:00
1cb4e2df65 [BE][PYFMT] migrate PYFMT for torch._inductor to ruff format (#144550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550
Approved by: https://github.com/jansel
2025-02-28 13:33:19 +00:00
34d726011f [dynamo] update data-dependent branching graph break messages (#147912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147912
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147494, #147872
2025-02-28 12:30:06 +00:00
4106aa33eb [dtensor][fix] fix _scaled_dot_product_flash_attention sharding (#148125)
### Summary
https://github.com/pytorch/pytorch/pull/146372/ changed the op signature of `_scaled_dot_product_flash_attention` and as a consequence DTensor needs to change its sharding defined at 40ad5e01df/torch/distributed/tensor/_ops/_matrix_ops.py (L232)

### Test
`pytest test/distributed/tensor/test_attention.py`

### Follow-up
It's still unclear why the CP unit tests were not run over the original PR which is BC-breaking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148125
Approved by: https://github.com/tianyu-l, https://github.com/fegin
2025-02-28 09:26:43 +00:00
af720cd5a7 [Intel GPU] Decompule Intel GPU oneDNN from other backends (#147926)
# Motivation
Currently, Intel GPU feature development is moving forward rapidly. We (the Intel GPU backend) want independent version control over the oneDNN component so we can quickly adopt the optimizations and bug fixes provided by the oneDNN team.

This PR does not change the behaviors of other backends like Intel CPU, ARM. They can keep using the stable version contained in `third_party/ideep`.

# Detail

At compilation time, we `git clone` oneDNN from `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This is supported by CMake's `ExternalProject_Add` command.
Following is a build log example:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source files are located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`

# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```

Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```

We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang

Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
2025-02-28 07:42:06 +00:00
3a58a04898 Build a storage reader/writer to write checkpoints in HF format (#148089)
Summary: D69984656 caused issues by adding the fsspec dependency to torch distributed when many packages internally didn't have it. In this diff I'm not adding HFStorageReader/Writer to __init__.py so that HFStorage components don't get imported internally and in turn there is no fsspec import that happens. I did the removal from __init__.py in D70286926 to fix the failing tests but the revert was done concurrently. I'll add the classes to __init__.py when I figure out a better way to get fsspec added as a dependency everywhere

Test Plan:
signals pass
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/distributed/checkpoint:test_hf_storage

Differential Revision: D70324090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148089
Approved by: https://github.com/saumishr
2025-02-28 07:38:10 +00:00
995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
4e160d5fd9 [triton 3.3] Fix aoti cpp wrapper remaining 5 issue. (following #148051) (#148117)
Summary:
Fix the following 5 on a100:

- test_foreach_cpp_wrapper_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_gpu_wrapper

- test_enable_dynamic_shapes_cpp_wrapper_cuda_dynamic_shapes_gpu_wrapper

- test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_dynamic_shapes_gpu_wrapper

Test Plan:
oss :

```
TORCHINDUCTOR_COMPILE_THREADS=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCH_LOGS="+inductor, output_code" TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 CPLUS_INCLUDE_PATH=/usr/local/cuda-12.6/include:$CPLUS_INCLUDE_PATH python test/inductor/test_gpu_cpp_wrapper.py -k test_foreach_cpp_wrapper_cuda_gpu_wrapper
```

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148117
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
2025-02-28 06:56:30 +00:00
ea12fc8a9f Revert D70262395 (#148164)
Summary:

This reverts #147804 due to internal revert.

---
This diff reverts D70262395

Reviewed By: RossMcKenzie

Differential Revision: D70318024

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164
Approved by: https://github.com/xmfan
2025-02-28 06:39:48 +00:00
baba7beed2 [dynamo] add context manager debug information to graph breaks (#147872)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147872
Approved by: https://github.com/zou3519
ghstack dependencies: #147494
2025-02-28 06:23:28 +00:00
4caeede799 [dynamo] more better error messages [3/N] (#147494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147494
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-28 06:23:28 +00:00
bc362cc15a Move expanded dim require_exact_stride handling to api from sdpa lowering (#148101)
See issue: https://github.com/pytorch/pytorch/issues/147156#issue-2852362217.

Original tests from https://github.com/pytorch/pytorch/pull/146054 should cover these changes, and I tested that the perf on https://github.com/pytorch/pytorch/issues/145760 remains fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148101
Approved by: https://github.com/zou3519
2025-02-28 06:02:18 +00:00
cyy
b0dfd242fa Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 05:53:19 +00:00
3b4b23ab0b [BE][Ez]: Remove extra copy in dtensor parallel loss (#148096)
Remove an extra copy of the input to `_log_softmax` when there is a dtype and memory format change. Fuse the copies instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148096
Approved by: https://github.com/jansel, https://github.com/wconstab
2025-02-28 05:42:32 +00:00
9b7130b8db Clean temporary directory at exit (#147813)
Issue: A temporary directory is created in [pytorch/torch/distributed/nn/jit/instantiator.py](https://github.com/arthurlw/pytorch/blob/clean-temp-directory-at-exit/torch/distributed/nn/jit/instantiator.py) but is never cleaned up, leading to a ResourceWarning on program exit.

Solution: Registered an `atexit` handler to properly clean up the temporary directory when the program exits.

Fixes #147744

**Line 23 in [0a49f8f](0a49f8fd3d)**
```python
23  atexit.register(_TEMP_DIR.cleanup)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147813
Approved by: https://github.com/H-Huang
2025-02-28 04:12:23 +00:00
760921a7d8 [MPS] Add inductor support for the entr() operator. (#148128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148128
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-28 03:33:22 +00:00
eb9c127341 [dynamo][optimizers] Install ID_GUARDED tensors into the Fx graph (#147824)
Earlier, with the inline flag, we were lifting id-guarded tensors to be inputs to the Fx graph. But this offers no benefit. The main idea behind lifting parameters as inputs was to reuse compilation units across many instances of the nn.Module. However, if we are guarding on the `id`, we are explicitly specializing the compiled artifact to the parameter.

This PR installs the parameters back into the graph. The benefit is removal of all pre-graph bytecode to extract the id-guarded tensors from locals/globals. This increases speedup from 1.67x to 1.75x for an internal model that has large number of optimizer parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147824
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-02-28 03:22:11 +00:00
926b7b5027 Revert "Remove NO_MULTIPROCESSING_SPAWN checks (#146705)"
This reverts commit 40ad5e01dff05c7d64e070fb01683820e678f788.

Reverted https://github.com/pytorch/pytorch/pull/146705 on behalf of https://github.com/cyyever due to Broke lint?, I guess land race with rufff update ([comment](https://github.com/pytorch/pytorch/pull/146705#issuecomment-2689603077))
2025-02-28 03:04:38 +00:00
3ce352e389 [BE][PYFMT] migrate PYFMT for torch._dynamo to ruff format (#144549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144549
Approved by: https://github.com/jansel
2025-02-28 03:03:53 +00:00
edc5bf91d2 [Intel GPU] Add synchronize() in torch.utils.benchmark (#147835)
When following https://pytorch.org/tutorials/recipes/recipes/benchmark.html on XPU, I noticed that the device is not synchronized in the benchmark. This PR fixes that and aligns the behavior with CUDA.
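
The general principle being applied inside `torch.utils.benchmark` can be illustrated with a manual timer (a sketch of the idea, not the benchmark utility's actual code):

```python
import time
import torch

def timed(fn, device: str):
    def sync():
        if device == "cuda":
            torch.cuda.synchronize()
        elif device == "xpu" and hasattr(torch, "xpu"):
            torch.xpu.synchronize()

    sync()  # drain any previously queued work
    t0 = time.perf_counter()
    out = fn()
    sync()  # make sure the asynchronously launched kernels actually finished
    return out, time.perf_counter() - t0
```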
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147835
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-02-28 02:58:17 +00:00
0edb2da4a4 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-28 02:30:04 +00:00
30375cb326 Fix minor typo in python_nccl (#148088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148088
Approved by: https://github.com/Skylion007
2025-02-28 00:47:09 +00:00
481a57bc37 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the forward and backward respectively. In both forward and backward, the rng op gets run with `graphsafe_run_with_rng_state`, which takes in an RNG state and hooks it onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will see the same rng states for the op as were observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```
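
The pairing idea can be illustrated outside the compiler with plain generators (a conceptual sketch only; the real mechanism uses the graph-safe RNG states wired through the graphs above):

```python
import torch

# Initialize a fwd/bwd generator pair with the same value, as described above.
seed = int(torch.randint(0, 2**31 - 1, (1,)).item())
fwd_gen = torch.Generator().manual_seed(seed)
bwd_gen = torch.Generator().manual_seed(seed)

x_fwd = torch.rand(4, 4, generator=fwd_gen)  # value observed in the forward
x_bwd = torch.rand(4, 4, generator=bwd_gen)  # recomputed value in the backward
assert torch.equal(x_fwd, x_bwd)
```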

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order as they called the forward. If a user has state fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked, the bwd rng states would not be equal to the states that were observed in fwd1. I added handling for this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update them when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer to not have an rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position, etc. Not sure. Let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-28 00:47:03 +00:00
c6d1038aaa only print GraphModule during fx.Interpreter errors if valid (#148090)
Came up in https://www.internalfb.com/diff/D69057074?dst_version_fbid=970771615000938&transaction_fbid=1723357345264461 - we need to make sure the GraphModule is valid before calling `print_readable` on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148090
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
ghstack dependencies: #147749
2025-02-28 00:44:27 +00:00
5a14ff8ace Add cufile to list of libraries to preload (#148137)
Fixes: https://github.com/pytorch/pytorch/issues/148120

Test with almalinux/9-base:latest :
```
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.9/site-packages/torch/__init__.py", line 401, in <module>
    from torch._C import *  # noqa: F403
ImportError: libcufile.so.0: cannot open shared object file: No such file or directory
>>> exit()
[root@18b37257e416 /]# vi /usr/local/lib64/python3.9/site-packages/torch/__init__.py
[root@18b37257e416 /]# python3
Python 3.9.19 (main, Sep 11 2024, 00:00:00)
[GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
/usr/local/lib64/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.)
  cpu = _conversion_method_template(device=torch.device("cpu"))
>>> torch.__version__
'2.7.0.dev20250227+cu126'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148137
Approved by: https://github.com/malfet
2025-02-28 00:35:47 +00:00
40ad5e01df Remove NO_MULTIPROCESSING_SPAWN checks (#146705)
py 3.9 has spawn.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146705
Approved by: https://github.com/colesbury
2025-02-28 00:15:32 +00:00
2978771c9d [CI] test upload: better check for if job is rerun disabled tests (#148027)
Some disabled test runs weren't being uploaded as disabled tests because some dynamo tests are set to mark themselves as skipped if they are failing. This makes the script think that there are fewer retries than there actually are and that the job is not a rerun-disabled-tests job. Instead, query the job name to see if it contains "rerun disabled tests", and fall back to counting the number of retries if the query fails.

Alternate options: relax the check for the number of tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148027
Approved by: https://github.com/huydhn
2025-02-28 00:04:33 +00:00
fc78192b1d ci: Only run CI specific things when in CI (#148126)
This was blocking me from running things locally, so only run these CI-specific steps when actually in CI.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148126
Approved by: https://github.com/Camyll, https://github.com/malfet, https://github.com/atalman
2025-02-27 23:27:57 +00:00
f4235310e8 [BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978)
`empty_like` includes a `memory_format` arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862
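
As a small illustration (a sketch, not code from the PR) of how the `memory_format` argument avoids an extra copy to fix up the layout afterwards:

```python
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

# Allocating with the source's memory format up front means no follow-up
# .contiguous(memory_format=...) / copy is needed to restore the layout.
out = torch.empty_like(x, memory_format=torch.preserve_format)
assert out.is_contiguous(memory_format=torch.channels_last)
```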

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978
Approved by: https://github.com/jansel
2025-02-27 23:16:44 +00:00
915b9c80ab [export] Sync aoti schema to schema.py (#148017)
Summary: Synchronizing internal AOTI schema to OSS schema.py

Test Plan: CI

Differential Revision: D70271151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148017
Approved by: https://github.com/yiming0416
2025-02-27 21:46:11 +00:00
871b3909fc ci: Remove manylinux 2014 remnants (#148028)
These are the only remaining references I could find to manylinux2014.
We should probably look to remove such remnants a bit quicker, since they
made it difficult to know which Dockerfiles were important in
.ci/docker/manywheel/

> [!TIP]
> I checked if we were using these by running
> `rg 2014 .github/`

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148028
Approved by: https://github.com/wdvr, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:37:00 +00:00
10ffd94216 Reference the commit explicitly (#148026)
Reference the commit tested by CI explicitly, and fail the merge if the PR was updated.

Tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148026
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/atalman
2025-02-27 21:06:34 +00:00
783d83c5d8 [PT2] Port fuse_split_getitem_squeeze to PT2 pre_grad passes (#148059)
Summary: put it as an add_pass option

Reviewed By: frank-wei

Differential Revision: D68909559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148059
Approved by: https://github.com/frank-wei
2025-02-27 21:03:51 +00:00
d48eb58d1d [BE][CI] bump ruff to 0.9.8 (#145606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145606
Approved by: https://github.com/malfet
ghstack dependencies: #144546
2025-02-27 21:01:10 +00:00
644d84d594 Revert "optimize the decomposition of aten.native_group_norm (#144733)"
This reverts commit b533bb4b133c36767270bd8a24f11d5c37f8dd5c.

Reverted https://github.com/pytorch/pytorch/pull/144733 on behalf of https://github.com/desertfire due to Cause TIMM pass rate regression on H100, see https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2020%20Feb%202025%2020%3A53%3A55%20GMT&stopTime=Thu%2C%2027%20Feb%202025%2020%3A53%3A55%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=main&lCommit=4216478250e08e950fdd090fc23a1b270c520cc4&rBranch=main&rCommit=4986f0f52eb871cdb91b8124ee162cfe622b8688 ([comment](https://github.com/pytorch/pytorch/pull/144733#issuecomment-2689092714))
2025-02-27 20:57:25 +00:00
1845e7d1f5 Use nightly-wheel-upload env for triton wheel publishing (#148108)
Required for publishing triton builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148108
Approved by: https://github.com/malfet
2025-02-27 20:47:40 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
f0d00421cf [inductor][ck] kBatch filtering with gen_ops (#148004)
Summary:
# Why

Not all choices of kBatch are valid; invalid ones lead to a runtime error (when CK checks the validity of the args)

c9bcfd755e/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v3_multi_d.hpp (L1020)

# What

- move kBatch inside the gen_ops to have more control over it, and be able to filter it
- expand filtering based on the cpp logic
- refactor the padding checks to be more readable

Test Plan:
```
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

with

kBatch = 128: some filtering
kBatch = 1: no filtering
kBatch = 1738: all options filtered out

Reviewed By: henrylhtsang

Differential Revision: D70211442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148004
Approved by: https://github.com/ColinPeppler, https://github.com/tenpercent
2025-02-27 20:13:58 +00:00
ce805a5ba5 [BE/metal] Rename REGISTER_I0_I1 to REGISTER_SPECIAL. (#148036)
Now that it's used for other ops as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148036
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-27 17:56:26 +00:00
9a1f720a72 Validate inputs to _nested_view_from_buffer to prevent overflows (#147356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147356
Approved by: https://github.com/albanD, https://github.com/jbschlosser
ghstack dependencies: #147352, #147354
2025-02-27 15:48:58 +00:00
536bce5a04 Make Tensor.set_ validate storage_offset when sizes/strides are unchanged (#147354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147354
Approved by: https://github.com/albanD
ghstack dependencies: #147352
2025-02-27 15:48:58 +00:00
e64441915f Fix overflow in checkInBoundsForStorage (#147352)
Use `computeStorageNbytes` (which checks for overflows) to include the computation re the storage_offset

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147352
Approved by: https://github.com/albanD
2025-02-27 15:48:50 +00:00
6ccbff1450 [Inductor] Fix inductor/test_kernel_benchmark.py for new Triton; do not duplicate parameters in _dump_launch_params (#147746)
The problem is that the new Triton uses the following code branch, which does not filter the call parameters, which may already be in the launcher's cfg.kwargs. This is generally expected behavior, so I just stopped adding arguments from `launcher.config.kwargs`: cde12207a0/torch/_inductor/runtime/triton_heuristics.py (L1099)

Issue example (from https://github.com/intel/intel-xpu-backend-for-triton/issues/3499):

```bash
Failed when when running cleaned triton Command '['/home/xinanlin/xinanlin/miniforge3/bin/python', '/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3b
dmtky5n4j4jrd5k5pu.py.cleaned']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 103, in <module>
    compiled_module_main('None', benchmark_compiled_module)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/wrapper_benchmark.py", line 435, in compiled_module_main
    wall_time_ms = benchmark_compiled_module_fn(times=times, repeat=repeat) * 1000
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 98, in benchmark_compiled_module
    return print_performance(fn, times=times, repeat=repeat)
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in print_performance
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 451, in <listcomp>
    [timed(model, example_inputs, times, device) for _ in range(repeat)]
  File "/home/xinanlin/xinanlin/pytorch/torch/_inductor/utils.py", line 434, in timed
    result = model(*example_inputs)
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 97, in <lambda>
    fn = lambda: call([arg0_1, arg1_1])
  File "/tmp/torchinductor_xinanlin/4g/c4gp5j3t44nmaxvl7ndgcptyur6sij4k3bdmtky5n4j4jrd5k5pu.py.cleaned", line 86, in call
    triton_poi_fused_add_0[grid(1)](arg0_1, arg1_1, buf0, 1, 1, XBLOCK=1, num_warps=1, num_stages=1)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 336, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/home/xinanlin/xinanlin/miniforge3/lib/python3.10/site-packages/triton/runtime/jit.py", line 531, in run
    bound_args, specialization, options = binder(*args, **kwargs)
TypeError: dynamic_func() got multiple values for argument 'XBLOCK'
```

Reproduce:
`python test/inductor/test_kernel_benchmark.py -k test_remove_inductor_deps`

Triton: c4a79a1960
Pytorch: bea72180ed75f522ce4fe5e723bc2112e0874732

@davidberard98 @etaf please take a look
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147746
Approved by: https://github.com/jansel
2025-02-27 14:40:22 +00:00
2c35af4def [Intel GPU] Avoid including CPU oneDNN header files for Intel GPU (#147969)
XPU builds oneDNN in another folder. The XPU oneDNN header files are in the XPU-specific folder - `${__XPU_MKLDNN_BUILD_DIR}`.
f522d899fb/cmake/Modules/FindMKLDNN.cmake (L73)

 So, `${PROJECT_SOURCE_DIR}/third_party/ideep/mkl-dnn/include` is useless for XPU. `XPU_MKLDNN_INCLUDE` is good enough. Meanwhile, it may mess up the included files if the version of XPU oneDNN differs from other backends.

* __->__ #147969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147969
Approved by: https://github.com/ZhiweiYan-96, https://github.com/liangan1, https://github.com/atalman
2025-02-27 14:22:17 +00:00
71ee17baa1 Smoke Test skip cuda.gds on windows (#148060)
Follow-up after: https://github.com/pytorch/pytorch/pull/147120
Cufile was enabled only on Linux: https://pypi.org/project/nvidia-cufile-cu12/#files
Fixes validation workflow failures: https://github.com/pytorch/test-infra/actions/runs/13558218752/job/37896578837

```
 File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 105, in __init__
    raise RuntimeError("GdsFile is not supported on this platform.")
RuntimeError: GdsFile is not supported on this platform.
Exception ignored in: <function GdsFile.__del__ at 0x000001772B5003A0>
Traceback (most recent call last):
  File "C:\Jenkins\Miniconda3\envs\conda-env-13558218752\lib\site-packages\torch\cuda\gds.py", line 113, in __del__
    if self.handle is not None:
AttributeError: 'GdsFile' object has no attribute 'handle'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148060
Approved by: https://github.com/mikaylagawarecki
2025-02-27 14:00:49 +00:00
7ae0e0b2ea [aotd] Log torch._functorch.config in tlparse (#147883)
Adding torch._functorch.config to tlparse for better debuggability.
E.g. https://github.com/pytorch/pytorch/pull/147638 happened only with `torch._functorch.config.view_replay_for_aliased_outputs=False`, which is True by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147883
Approved by: https://github.com/bdhirsh, https://github.com/jamesjwu
2025-02-27 11:22:45 +00:00
c5bf9aaf1c Log graph breaks (#146537)
Graph breaks currently aren't logged to dynamo_compile and pt2_compile_events. We want to log them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146537
Approved by: https://github.com/c00w
2025-02-27 11:06:33 +00:00
0489a349e7 Skip the logging if the pass cannot be pickled (#148053)
Summary:
Skip the logging for vllm for now; we can add some pickle logic later.

The log is only for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148053
Approved by: https://github.com/chenyang78
2025-02-27 10:54:34 +00:00
26f19539ad [triton 3.3] cpp_wrapper: add a global_scratch arg (#148051)
Following triton # 4916, the generated cubin expects a global_scratch argument to support on-device TMA. We believe this is the source of many of the "invalid argument" failures on AOTI/cpp_wrapper tests. AFAIK, we don't use on-device TMA in Inductor as of now, so it should be safe to use a nullptr for the scratch space.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148051
Approved by: https://github.com/YUNQIUGUO
2025-02-27 10:13:57 +00:00
91e7c7945c [Intel GPU] Avoid unnecessary copy when the dst of Matmul is non-contiguous (#144759)
We should not always call contiguous on the dst of matmul. We have already removed the copy of the matmul input in https://github.com/pytorch/pytorch/pull/143784

I also fixed an accuracy issue by using the oneDNN sum post-op instead of binary add in the in-place case, to avoid a UT failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144759
Approved by: https://github.com/EikanWang
2025-02-27 08:04:34 +00:00
8ee84aa703 [ONNX] Fix missed None type support in dyamic shapes string cases (#148025)
In `_any_str_or_dim_in_dynamic_shapes`, we strictly guard the `dynamic_shapes` to make sure the flattened shapes are valid. But the code missed that None could also appear in the shapes.

NOTE: Found in benchmarking with Olive.
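
A hedged illustration of the kind of spec this check has to accept (the model and names here are made up, not taken from the PR): a `dynamic_shapes` entry mixing string dim names with `None` for static dims.

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# dim 0 is dynamic (named "batch"), dim 1 is left static via None; the
# validation should tolerate the None entries rather than reject the spec.
onnx_program = torch.onnx.export(
    M(),
    (torch.randn(2, 3, 4),),
    dynamic_shapes={"x": {0: "batch", 1: None}},
    dynamo=True,
)
```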

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148025
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-02-27 07:57:47 +00:00
fd43c36aa9 [ca] side-effect free initial trace: RAII PyCompilerInterface (#147891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147891
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796, #147804
2025-02-27 07:17:30 +00:00
9017becf1d Add unique kernel name support for user defined triton kernel (#147587)
Summary:
Add unique_user_kernel_names which mimics what unique_kernel_names do, but for user defined Triton kernels.
This does rewrite the copied kernel src, and modifies non-Inductor generated code, so we split it out from unique_kernel_names, where we have more control over all namings and generations.

Test Plan: Only used for debug purpose

Differential Revision: D69966608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147587
Approved by: https://github.com/desertfire
2025-02-27 06:00:50 +00:00
c622796cde Revert "Build a storage reader/writer to write checkpoints in HF format (#147622)"
This reverts commit 6a658d983e84f7bcb8e67328b00661ec49db78c5.

Reverted https://github.com/pytorch/pytorch/pull/147622 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147622#issuecomment-2686932514))
2025-02-27 05:14:28 +00:00
21bd5fe203 Update torch-xpu-ops commit pin (#147968)
Update the torch-xpu-ops commit to [86aaaf8a9dd6932c088b7afcac0c0856b23d341a](86aaaf8a9d), includes:

- Bugfix (PT2E/BatchNorm)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147968
Approved by: https://github.com/Skylion007
2025-02-27 05:01:12 +00:00
b6fe28ff02 [Inductor] Graph Partition (#147038)
This PR implements inductor graph partition. Previously, 1 dynamo graph was mapped to 1 inductor graph, which was further mapped to 1 call function. In this PR, we allow 1 dynamo graph to map to multiple inductor graphs and multiple `graph_partition` functions in the generated code. This allows applying different further optimizations to different `graph_partition`s.

Design Doc: [link](https://docs.google.com/document/d/1qPgOfy25l7SIYnrQrvU-TO1mdHMslCwv_SLmeXID6tM/edit?usp=sharing)
Example: [Generated code before and after this diff](https://www.internalfb.com/intern/diffing/?paste_number=1737334601)

In the follow-up PR, we will extend the work to cudagraph, which allows applying cudagraph to parts of the generated code (#125864).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147038
Approved by: https://github.com/eellison
2025-02-27 04:50:43 +00:00
e0b93082f1 Remove HuggingFace reader and writer from __init__.py (#148030)
Summary: This causes HFStorageReader/Writer to be imported, which imports fsspec; dependent targets don't have fsspec, which causes failing builds.

Differential Revision: D70286926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148030
Approved by: https://github.com/hl475
2025-02-27 04:50:14 +00:00
8cb8722979 [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation are no longer needed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-27 03:37:33 +00:00
17358ce778 Revert "Support torch.compile rng selective activation checkpointing with cudagraph (#146878)"
This reverts commit ad0c879e2203145f6d56df0b95af36822220ab8f.

Reverted https://github.com/pytorch/pytorch/pull/146878 on behalf of https://github.com/wdvr due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/146878#issuecomment-2686767956))
2025-02-27 03:36:16 +00:00
9d3636283b [Inductor] Use generic GPU device in test_preserves_strides (#148006)
#147861 added a new test tagged for the generic GPU but used the cuda GPU type for creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install; presumably it will also work for the current pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006
Approved by: https://github.com/eellison, https://github.com/etaf
2025-02-27 02:52:51 +00:00
07b7b3ed4e torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-27 02:44:39 +00:00
84c89a4527 [cutlass backend] cache_clear algorithm select cache on fresh inductor cache (#147590)
Differential Revision: [D69959917](https://our.internmc.facebook.com/intern/diff/D69959917/)

AlgorithmSelectorCache is a cache. The expectation is that when we force-disable caches + clear inductor caches, it would be cleared. However, that is not the case.

The reason why this is a problem can be seen by following this repro:
What we will see is
```
SingleProcess AUTOTUNE benchmarking takes 6.2202 seconds and 46.0568 seconds precompiling for 36 choices
SingleProcess AUTOTUNE benchmarking takes 492.3141 seconds and 0.0010 seconds precompiling for 36 choices
```

The root cause is that precompiling is skipped (because it is cached), but autotuning isn't skipped since we force-disable the cache.

repro:
```
import logging
import os

os.environ["TORCH_LOGS"] = "+output_code,+benchmarking,+inductor"

import torch

import torch._inductor.config
from torch._inductor.utils import clear_inductor_caches

torch._inductor.config.max_autotune = True
torch._inductor.config.force_disable_caches = True
torch._inductor.config.autotune_num_choices_displayed = None
torch._inductor.config.max_autotune_gemm_backends = "CUTLASS"
torch._inductor.config.autotune_fallback_to_aten = False
torch._inductor.config.cuda.cutlass_instantiation_level = "0001"

def main():
    M, N, K = 2048, 2048, 2048
    dtype = torch.bfloat16
    A = torch.randn(M, K, device="cuda", dtype=dtype)
    B = torch.randn(K, N, device="cuda", dtype=dtype)
    for _ in range(2):
        torch._dynamo.reset()
        clear_inductor_caches()
        compiled_model = torch.compile(torch.mm, fullgraph=True)
        _ = compiled_model(A, B)

    print("done")

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147590
Approved by: https://github.com/eellison, https://github.com/chenyang78
2025-02-27 02:30:49 +00:00
97ebccaa91 Add _fft_r2c as core ATen (#147998)
As titled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147998
Approved by: https://github.com/tugsbayasgalan
2025-02-27 02:29:59 +00:00
ad0c879e22 Support torch.compile rng selective activation checkpointing with cudagraph (#146878)
TODO:
- [x]  Add handling for when forward is invoked multiple times without invoking backward, so that the fwd/backward states are out of sync
- [x] Update rng state initialization to take from correct device
- [x]  Tests
- [x] handling of retain_graph
- [x] respect fallback random

Fix for https://github.com/pytorch/pytorch/issues/130123.

Updates the aot_eager and cudagraph compilation of `run_and_save_rng_state` to use the new mechanism added by https://github.com/pytorch/pytorch/pull/114068 for CUDAGraph safe rng states.

We have a pair of rng states for the fwd and backward respectively. In both forward and backward the rng op will get run with `graphsafe_run_with_rng_state` which takes in RNG state and it hooks onto the current RNG generator before running the operator. The rng states for fwd/backward are initialized with the same value. We ensure that for any given run of the forward, the corresponding backward run will have the same rng states for the op as was observed in the forward.

```
 ===== Forward graph 1 =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", fwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = fwd_rng_state_0);  fwd_rng_state_0 = None
        ...

 ===== Backward graph 1 =====
    def forward(self, primals_1: "f32[4, 4][4, 1]cuda:0", primals_2: "f32[4, 4][4, 1]cuda:0", tangents_1: "f32[4, 4][4, 1]cuda:0", bwd_rng_state_0):
        sin: "f32[4, 4][4, 1]cuda:0" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        graphsafe_run_with_rng_state = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype = torch.float32, device = device(type='cuda', index=0), pin_memory = False, rng_state = bwd_rng_state_0);  bwd_rng_state_0 = None
```

There is some extra complication when a user either calls backward with retain_graph, or calls the backward in a different order than they called the forward. If a user has states fwd_rng_state0, bwd_rng_state0 and calls:
- fwd0: fwd_rng_state0 -> fwd_rng_state1
- fwd1: fwd_rng_state1 -> fwd_rng_state2
- bwd1
- bwd0

Then naively, when bwd1 is invoked, the bwd rng states would not equal the states that were observed in fwd1. I added handling for this in the aot runtime wrappers to detect pending backward invocations and the current position of the bwd rng states, and to update them when necessary.

Other notes:

Because nodes which appear later in the forward appear earlier in the backward, we need a separate rng state for each operator. If we reused the rng across ops, the forward and backward would be run with different rng states. I.e., not applied in the same order.

Questions for reviewers:

This does change numerics, because the rng of the op is now taken from the input rng state instead of whatever the rng would be midway through running the graph. Technically, we only need this for cuda graph. But I'd prefer not to have an rng divergence just for cudagraph. I am making it respect `fallback_random`.

Edit: decided to apply to non cudagraphs as well, so long as fallback_random is not set

I'm initializing the rng states by cloning the current state. If you had something like 5 different rands in the model with the same shape, they'd all get the same value. This doesn't seem great. I could use some other initialization scheme, like taking the seed from graph position, etc. Not sure. Let me know your thoughts.

Edit: updated to be taken from randint()

Update: initializing rng states from torch.randint..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146878
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2025-02-27 02:08:29 +00:00
784902983e Remove +PTX from cuda 12.6 builds (#148000)
Similar to: https://github.com/pytorch/pytorch/pull/141142

Ahead of the release 2.7
I see following validation failure: https://github.com/pytorch/test-infra/actions/runs/13552433445/job/37879041739?pr=6339
```
RuntimeError: Binary size of torch-2.7.0.dev20250226+cu126-cp310-cp310-manylinux_2_28_x86_64.whl 1076.45 MB exceeds the threshold 750 MB
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148000
Approved by: https://github.com/clee2000, https://github.com/ngimel, https://github.com/tinglvv
2025-02-27 02:02:11 +00:00
20ce67cd06 Udpate hw requirement for FP64 on "Getting Started on Intel GPU" (#147802)
Fixes #147731

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147802
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:54:19 +00:00
9ca871f32b Remove binaries/benchmark_args.h (#147920)
It's not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147920
Approved by: https://github.com/Skylion007
2025-02-27 01:16:28 +00:00
ea5d40db73 Address source code building command for Intel GPU support (#143476)
As the title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143476
Approved by: https://github.com/EikanWang, https://github.com/malfet

Co-authored-by: Xu Han <xu.han@outlook.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-27 01:07:40 +00:00
f104ef1248 [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode (#147975)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: D70141808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147975
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2025-02-27 00:35:12 +00:00
f98cd84b04 cpp_wrapper: use largeTensorTest for test memory checks (#146991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146991
Approved by: https://github.com/desertfire
2025-02-27 00:30:21 +00:00
723f3a9eab torch.utils._content_store: fix error in hash_storage on XPU (#147785)
See https://github.com/pytorch/pytorch/actions/runs/13508573465/job/37745227468 for an example error. This is triggering after the merge of #147541, which enabled Dynamo compilation on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147785
Approved by: https://github.com/jansel
2025-02-26 23:57:59 +00:00
915eb012e1 Revert "[dynamo] add sourceless builder for types.MethodType (#147880)"
This reverts commit 08f4c1a2332921e57c782c80a66b2adc9cdc0575.

Reverted https://github.com/pytorch/pytorch/pull/147880 on behalf of https://github.com/wdvr due to failing trunk tests ([comment](https://github.com/pytorch/pytorch/pull/147880#issuecomment-2686436432))
2025-02-26 23:29:58 +00:00
84e60eece8 [ROCm] [TunableOp] Unit tests for scaled GEMM and GEMM with bias (#147890)
Two more unit tests for TunableOp:
- Scaled GEMM
- GEMM with bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147890
Approved by: https://github.com/jeffdaily
2025-02-26 22:41:24 +00:00
b13ad1a193 [ROCm][TunableOp] Remove extra transpose characters in hipBLASLt signature. (#147900)
Clean up the TunableOp hipBLASLt signature by removing extra transpose characters.

Tested manually; no new regressions found.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147900
Approved by: https://github.com/jeffdaily
2025-02-26 22:28:00 +00:00
7e7d05bf85 Revert "[do not merge yet] update grammar (#147996)"
This reverts commit 6e129a697f86425d0682ed30ffc9b3f8abe00e9e.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686291282))
2025-02-26 22:01:12 +00:00
6e129a697f [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:52:58 +00:00
dc7556f1bd Revert "[do not merge yet] update grammar (#147996)"
This reverts commit a1ee2c3a08c3bf3d83c4e9f352ea179c107edb13.

Reverted https://github.com/pytorch/pytorch/pull/147996 on behalf of https://github.com/seemethere due to Need to revert ([comment](https://github.com/pytorch/pytorch/pull/147996#issuecomment-2686266052))
2025-02-26 21:43:06 +00:00
a1ee2c3a08 [do not merge yet] update grammar (#147996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147996
Approved by: https://github.com/seemethere
2025-02-26 21:39:08 +00:00
201666d77d [cutlass backend] turn autotuning logs off by default + rename log to autotuning log (#147922)
things we did:
* turn off autotuning logs by default
* rename autotuning logs from log to autotuning_log, so people are aware that it is a special artifact log.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147922
Approved by: https://github.com/eellison
2025-02-26 21:02:04 +00:00
976ff5cf01 Add cmake hints to USE_SYSTEM_NVTX for nvtx3 include dir (#147418)
per title

sometimes, it's hard for cmake to find NVTX3 without the cuda include path hint
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147418
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2025-02-26 20:52:28 +00:00
6a658d983e Build a storage reader/writer to write checkpoints in HF format (#147622)
Title - we want to write checkpoints in HF format with DCP; this diff allows that for the non-distributed use case.
Copy of [D68444967](https://www.internalfb.com/diff/D68444967) (https://github.com/pytorch/pytorch/pull/146352). That diff got reverted because of lint errors. The lint errors were due to importing uninstalled libraries. This was on purpose, because we don't want to install safetensors and huggingface; this new diff explicitly ignores that lint so that we don't hit the error.
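
A hedged usage sketch of what this enables (the writer class name follows the `HFStorageWriter` naming mentioned elsewhere in this log; the import path and constructor arguments are assumptions, not a verified API):

```python
import torch
import torch.distributed.checkpoint as dcp
# Assumed location of the new writer; adjust to wherever the PR actually places it.
from torch.distributed.checkpoint.hf_storage import HFStorageWriter

state_dict = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}

# Non-distributed use case: write the state dict as HuggingFace-style
# safetensors files instead of DCP's native format.
dcp.save(state_dict, storage_writer=HFStorageWriter(path="./checkpoint_hf"))
```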

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147622
Approved by: https://github.com/saumishr
2025-02-26 20:47:54 +00:00
7c71ab1d40 [scan] User-facing reverse flag handling (#147886)
This PR removes the reverse flag from the backend implementation and resolves it via `torch.flip` in the frontend.
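
Conceptually (a sketch only, not the actual scan frontend code, and using `cumsum` as a stand-in for a real combine function), resolving `reverse=True` at the frontend looks like:

```python
import torch

def reverse_scan(xs, dim):
    # Flip along the scan dimension, run the ordinary forward scan,
    # then flip the result back - no reverse handling needed in the backend.
    flipped = torch.flip(xs, dims=(dim,))
    scanned = torch.cumsum(flipped, dim=dim)  # stand-in for the scan's combine_fn
    return torch.flip(scanned, dims=(dim,))

x = torch.arange(1.0, 5.0)
print(reverse_scan(x, dim=0))  # tensor([10., 9., 7., 4.]) - suffix sums
```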

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147886
Approved by: https://github.com/ydwu4
2025-02-26 20:04:57 +00:00
683e083e8d [MPS] Add support for entr() in eager. (#147948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147948
Approved by: https://github.com/malfet
2025-02-26 19:55:02 +00:00
eb08ada5d3 [dynamo] Support reads to global/captured tensors in nonstrict_trace-ed function (#147572)
As title. Without this patch we get the following error:

Tweaking the `allow_non_fake_inputs` flag on tensor mode doesn't quite
work for AOTAutograd, which also needs to fake-tensor-propagate the
`nonstrict_trace`-ed function, but that's _after_ Dynamo has handled the
`nonstrict_trace` processing and put the `flat_apply(...)` node into the graph.

So we can't easily just temporarily enable the `allow_non_fake_inputs`
flag on current fake mode, when AOTAutograd processes a `flat_apply`
node from Dynamo's `nonstrict_trace` handling. And after discussing
with zou3519, I decided to add a global `FakeTensorTLS` that contains a
`allow_non_fake_inputs_override` flag, and patch the `nonstrict_trace`-ed
function to temporarily tweak this flag during its execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147572
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950, #147571
2025-02-26 19:47:39 +00:00
73e963459e [dynamo] Support nonstrict_trace on class method (#147571)
As title, also see
1. new test `test_nonstrict_trace_on_method` for example.
2. newly added comments for why we need special treatment on methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147571
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367, #146950
2025-02-26 19:47:39 +00:00
7e0ef2c844 [dynamo] Use the new get_unique_name_wrt helper when applicable (#146950)
This patch removes some duplicated name generation logic in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146950
Approved by: https://github.com/zou3519
ghstack dependencies: #146714, #146367
2025-02-26 19:47:39 +00:00
f46f0e465c [dynamo] Initial support for nonstrict_trace (#146367)
## Context
> **Note:** `mark_traceable` got renamed to `nonstrict_trace` after
> offline discussion. The reasons are (1) it aligns with `torch.export`'s
> `nonstrict` notion, and (2) it's more definitive in behavior suggestion.

1. [Overall Design](https://docs.google.com/document/d/1O-dR2ZQaJQVt_v67AVcDCw2yJLtqgkZFwoXK0buEWRg/edit?tab=t.0)
2. [Dynamo graph representation with `torch._higher_order_ops.flat_apply`](https://docs.google.com/document/d/1YHl5nPTJvYeCPE5TO9uA18DPWNgUYGE4gCn6bFvXcBM/edit?tab=t.0#heading=h.xtw3hhbro4gn)

## Summary
This patch adds a `torch._dynamo.nonstrict_trace` decorator, which
currently is an enhanced version of `torch._dynamo.allow_in_graph` (see
docstring for their differences). Specifically, this patch focuses on
the UI and functionality prototyping/plumbing.

The main enhancement is supporting more input types, and the
implementation challenge lies in reconstructing the input objects from
Dynamo `VariableTracker` (while accounting for buffered side-effects and
guards).  This patch takes a middle-ground (simple implementation with a
bit of user labor), as sketched below, by
1. asking the user to provide pytree registration for non-proxy-able
   input types,
2. letting Dynamo trace through `pytree_flatten` (which accounts for
   buffered side-effects and guards automatically),
3. and passing in the TreeSpec as a graph attribute constant into
   `torch._higher_order_ops.flat_apply` (which unflattens the inputs and
   invokes the underlying function).
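
A hedged usage sketch of that flow (the registration helper and exact decorator semantics are assumptions based on this summary, not a verbatim API reference):

```python
import torch
import torch.utils._pytree as pytree
from dataclasses import dataclass

@dataclass
class Config:
    scale: float
    shift: float

# (1) user-provided pytree registration for the non-proxy-able input type
pytree.register_pytree_node(
    Config,
    lambda c: ([c.scale, c.shift], None),        # flatten
    lambda children, _ctx: Config(*children),    # unflatten
)

@torch._dynamo.nonstrict_trace
def apply_config(x, cfg: Config):
    return x * cfg.scale + cfg.shift

@torch.compile(fullgraph=True)
def f(x, cfg):
    # (2)/(3) Dynamo traces pytree_flatten and emits a flat_apply node that
    # carries the TreeSpec as a constant graph attribute.
    return apply_config(x.sin(), cfg)

print(f(torch.randn(3), Config(scale=2.0, shift=1.0)))
```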

## Next Steps
In subsequent patches, we will try to support the following:
- annotating on class method
- reads to global tensors
- inputs that contains `pytree.register_constant`-ed instances.
- function as input
- more output types (e.g., any pytree-registered type)
- `torch.nn.Module` as inputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146367
Approved by: https://github.com/zou3519
ghstack dependencies: #146714
2025-02-26 19:47:39 +00:00
bab84f0bd9 [hop] Support more output types for flat_apply (#146714)
This patch enables `flat_apply` to support certain non-Tensor output
types like containers and graphable types. This will in turn enable the
upcoming `mark_traceable` to support more output types.

The patch also exposes a `func_to_graphable` rather than having the
users calling the lower level `pytree.flatten(ConstantFunction(...))`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146714
Approved by: https://github.com/zou3519
2025-02-26 19:47:39 +00:00
8594856651 [aotd] Alias of intermediate unwrap TensorAlias (#147638)
A bug was reported by an internal user.

AOTD classifies outputs that are aliases of intermediates of the graph into different categories:

...
- output is an alias of an intermediate whose base is already an output
- output is an alias of an intermediate whose base is not an output

If we look at the fn:
```
def fn(x):
    ix = x + 1
    a = ix.transpose(0, 1)
    return a.detach(), a
```

output 0: a detach view of alias `a`, where `a` is already an output
output 1: an alias of intermediate `ix`; an additional output `ix` will then be added internally

Output 0's base is TensorAlias(a) in this case, but it could be a Tensor.
Adding runtime unwrapping solves this problem.

Alternatively, we could track the base of a.detach() all the way to ix; in that case the base would always be a Tensor, not a TensorAlias.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147638
Approved by: https://github.com/bdhirsh
2025-02-26 19:42:21 +00:00
30db64bf51 [PT2] Support add/remove passes in pre_grad (#146064)
Summary:
Support the same functionality with acc_tracer disabled: add a new config for pre_grad add/remove_passes; at the front end it still uses the same interface.

Some minor updates in the pre_grad passes make sure the passes run in the desired order; after the added passes, passes like remove_noops are still run at the end.

Test Plan: add new UT, please see stacked diff for add pass tests (TODO: update diff link)

Differential Revision: D68909278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146064
Approved by: https://github.com/frank-wei
2025-02-26 18:46:43 +00:00
00732c3f7e [MPS] Implemented masked_fill_scalar as shader (#147369)
- Move `pos_from_thread_index` and `offset_from_pos` from `UnfoldBackward.metal` into the `c10/metal/indexing.h` header
- The initial idea was to implement `StridedTensor` and `ConstStridedTensor` and use them to make the masked_fill kernel something as simple as the following loop
```metal
ConstStridedTensor<bool> mask(mask_data, sizes, mask_strides, ndim);
if (mask[thread_index]) {
  StridedTensor<T> input(input_data, sizes, input_strides, ndim);
  input[thread_index] = val;
}
```
But though it looks elegant and works correctly, performance-wise it's much slower than the existing MPS shader (see table below), as int64 divisions on the M2 GPU are really slow

- Solved the performance issue by implementing 3 flavors of the same shader: `dense`, used when both input and mask are dense tensors of the same size; `broadcast`, used when `mask`'s leading dimensions are expandable into the input tensor; and `strided`, a general-purpose fallback that still computes positions in the tensors only once. As a result, perf is even better than the existing MPS shader for dense and broadcastable tensors.

Performance measured on an M2 Pro through different iterations of the same shader:

| dtype | MPS | int64-idx | int64-inlined | 32-bit strided | 32-bit broadcasted |
| ------ | ------ | ----- | ---- | --- | ---- |
| float32 | 2.8 msec | 41.6 msec | 26.9 msec | 5 msec | 2.4 msec |
| float16 | 1.86 msec | 38.2 msec | 26.6 msec | 4.6 msec | 1.9 msec |
| bfloat16 | 1.86 msec | 38.3 msec | 26.6 msec | 4.6 msec | 1.9 msec |

And benchmark script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_mask_fill(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"x.masked_fill(y, -17.0); torch.mps.synchronize()",
        setup=f"x,y = torch.rand(1, 20, {n}, {n}, dtype={dtype}, device='mps'), torch.ones({n}, {n}, device='mps').triu().bool()",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_mask_fill(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.masked_fill_() {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```
Fixes https://github.com/pytorch/pytorch/issues/143477
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147369
Approved by: https://github.com/dcci
ghstack dependencies: #147977
2025-02-26 18:39:15 +00:00
ebf6b9839c [MPS] faster integer batched matmul (#147877)
Followup to #147526
Tiled matmul for bmm as well.

## Speed ups:
![speedups_bmm](https://github.com/user-attachments/assets/02501145-7d64-4bbe-9dcc-994f004b4829)

Script to record times:
```python
import torch
import numpy as np
import time
import csv

batch_sizes = [1, 2, 4, 8]
matrix_sizes = [256, 512, 1024, 2048]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'B': [],
    'mean_time': [],
    'std_time': []
}

for b in batch_sizes:
    for n in matrix_sizes:
        print(f"\nBenchmarking N={n} and B={b}")

        try:
            A_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")
            B_mps = torch.randint(low=-100, high=100, size=(b, n, n), dtype=torch.int8, device="mps")

            for _ in range(warmup_runs):
                _, _ = run_int_mm(A_mps, B_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_int_mm(A_mps, B_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['B'].append(b)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}: {e}")
            continue

with open('int_bmm_benchmark_times_new.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['B'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147877
Approved by: https://github.com/Skylion007
2025-02-26 18:37:13 +00:00
cfb293ee02 [inductor] Add logs for precompile and autotuning (#147923)
Differential Revision: D70222645

I want to add more logs around precompile, especially around the reason why sometimes it gets fast returned. See https://github.com/pytorch/pytorch/pull/147590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147923
Approved by: https://github.com/Skylion007
2025-02-26 18:26:07 +00:00
0ea5d1067b ROCm: Remove static specifier for allow_tf32 variable. (#147186)
Since the env variable HIPBLASLT_ALLOW_TF32 can change, remove static type for allow_tf32 variable so that it captures the current value of env variable HIPBLASLT_ALLOW_TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147186
Approved by: https://github.com/jeffdaily, https://github.com/naromero77amd
2025-02-26 18:24:02 +00:00
4e4191854b [logs][qol] Print log options alphabetically (#147888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147888
Approved by: https://github.com/jansel
2025-02-26 18:15:39 +00:00
fb566c5aea Fix auto_functionalize x inference_mode (#147925)
Fixes #147924

We were using the wrong FunctionalTensorMode to construct
FunctionalTensors. FunctionalTensors modify the FunctionalTensorMode on
construction, so that led to the wrong FunctionalTensorMode being
modified. This PR threads the FunctionalTensorMode through correctly.
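
A repro-style sketch of the combination being fixed (the op, shapes, and behavior shown are assumptions for illustration, not taken from the PR's test): a mutating custom op, which torch.compile handles via auto_functionalize, called while inference_mode is active.

```python
import torch

# A made-up mutating custom op; declaring mutates_args is what routes it
# through auto_functionalize under torch.compile.
@torch.library.custom_op("mylib::inplace_scale", mutates_args=("x",))
def inplace_scale(x: torch.Tensor, s: float) -> None:
    x.mul_(s)

@torch.compile(fullgraph=True)
def f(x):
    inplace_scale(x, 2.0)
    return x + 1

with torch.inference_mode():
    print(f(torch.ones(3)))
```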

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147925
Approved by: https://github.com/bdhirsh
2025-02-26 18:05:30 +00:00
678435c443 [FlexAttention] Fix IMA bug (#147918)
# Summary
Fixes: https://github.com/pytorch/pytorch/issues/147268

I got this right for the backwards and somehow forgot to do the flip in the forward; not sure how this wasn't found earlier.

Testing IMAs is tough in pytest, so I didn't add a test, but I verified on the reproducer

```py
❯ sanitize python flex/maurice_ima.py --setting 0
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(1.0078, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.7994, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 1
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(2.8297, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(1.9714, device='cuda:0')
========= ERROR SUMMARY: 0 errors
❯ sanitize python flex/maurice_ima.py --setting 2
========= COMPUTE-SANITIZER
pool: torch.Size([64, 8, 784, 64]) tensor(3.2232, device='cuda:0')
Feat shape torch.Size([64, 8, 784, 64])
Feat strides (401408, 50176, 64, 1)
Feat is contig: True
attn: torch.Size([64, 8, 784, 64]) tensor(2.2095, device='cuda:0')
========= ERROR SUMMARY: 0 errors
````

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147918
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
2025-02-26 17:59:05 +00:00
3f7e242c86 [CI] Checkout with more processes (#147652)
The default action doesn't use more processes, possibly because most GitHub-provided runners only have 2 CPUs, but we have more than that, so we might as well use them.

Generally cuts maybe 1 min off of checkout time?

Changed checkout from pytorch/pytorch@main to pytorch/pytorch@my branch to test on 249a936998e66cc0d6ad8664e0e93ec1b9432a8b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147652
Approved by: https://github.com/ZainRizvi
2025-02-26 17:51:28 +00:00
ef61c290e1 [DTensor][random] defer DTensor RNG state sync until first random op call or manual_seed call; support more flexible OffsetBasedRNGTracker init (#147025)
Resolves https://github.com/pytorch/pytorch/issues/146767.

May also resolve https://github.com/pytorch/pytorch/issues/147584.

### Summary
This PR removes the RNG tracker init from the `distribute_tensor` call for the following reasons:

1. if the user does not use random ops on DTensor, there's no need to init DTensor RNG which currently requires CUDA device to be present.
2. this complies with the 0-communication semantic of `src_data_rank=None` shard distribution.

Besides, `OffsetBasedRNGTracker` only accepts a `DeviceMesh` argument to its constructor method.

### Consequence

DTensor RNG initialization is delayed until the first DTensor random op call or a `torch.distributed.tensor.random.manual_seed` call.
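
A sketch of the new behavior (assumes a CUDA build and an initialized process group, e.g. launched via torchrun; the exact point of lazy initialization is paraphrased from this summary):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor
import torch.distributed.tensor.random as dt_random

mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# No RNG tracker / CUDA RNG sync is required just to shard an existing tensor.
x = distribute_tensor(torch.zeros(8, 8), mesh, [Shard(0)])

# The tracker is initialized lazily, here at the explicit manual_seed call
# (or at the first random DTensor op).
dt_random.manual_seed(42, mesh)
```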

### Test
`pytest test/distributed/tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`
`pytest test/distributed/tensor/parallel/test_tp_style.py`

Differential Revision: [D70201856](https://our.internmc.facebook.com/intern/diff/D70201856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147025
Approved by: https://github.com/kwen2501
2025-02-26 17:33:22 +00:00
5ef94ca816 [BE] Do not copy arguments in variadic template (#147977)
By adding missing  `std::forward<Args>(args)...` and declaring template as passing args by reference

Noticed while working on creating `mtl_setBytes` specification that takes `MPSScalar` as argument
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147977
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-26 17:20:16 +00:00
ba9ed856e0 [FlexAttention] Improve error msg for embedding < 16 (#147765)
flex_attention uses tl.dot, which [does not support embedding < 16](https://github.com/triton-lang/triton/issues/2266) for its input shapes. This PR adds an explicit error message for users who are prototyping with small tensors.
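
For context, a minimal reproducer of the case this message targets (a sketch; it assumes a CUDA device, and the exact error raised is not quoted from the PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# head dim (last dimension) of 8 is below tl.dot's minimum of 16
q = k = v = torch.randn(1, 1, 128, 8, device="cuda")
out = torch.compile(flex_attention)(q, k, v)  # should now raise a clear error up front
```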

Fixes #147701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147765
Approved by: https://github.com/drisspg
2025-02-26 17:06:35 +00:00
ac926f81cc [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-26 16:56:17 +00:00
fd1220e386 [ca] side-effect free inital trace: compiled_args (#147804)
Const methods to prevent accidental mutation; changes are mainly in the Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-26 16:37:27 +00:00
5e3069dde8 [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-26 16:37:27 +00:00
0a2da008f8 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order w.r.t. other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.
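
For reference, this is the kind of pack/unpack hook pair being discussed, shown with the public `saved_tensors_hooks` API in plain eager mode (a standard offloading example assuming a CUDA device; the change in this PR is about compiled autograd being able to capture the unpack side):

```python
import torch

def pack_to_cpu(t):
    # Offload the saved activation to host memory at pack time.
    return t.to("cpu")

def unpack_from_cpu(t):
    # Bring it back to the GPU only when the backward actually needs it.
    return t.to("cuda")

x = torch.randn(4, 4, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x.sin() * x).sum()
y.backward()
```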

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as when using the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking of saved activations into the AOT backward's execution.

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-26 16:37:17 +00:00
08f4c1a233 [dynamo] add sourceless builder for types.MethodType (#147880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147880
Approved by: https://github.com/jansel
2025-02-26 15:43:47 +00:00
edaf9ddeb5 Add basic Gaudi support to benchmarks/dynamo (#145920)
This PR adds basic Gaudi support to benchmarks/dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145920
Approved by: https://github.com/eellison
2025-02-26 14:50:22 +00:00
be830c8b1c [Inductor][CPP] fix store mode atomic add (#147961)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147848 and https://github.com/pytorch/pytorch/issues/146390. While addressing these issues, 2 problems were encountered:

- In `CppVecKernel`, when the number of threads is 1 and the mode is `atomic_add`, `store` did not `load/add` before storing. This has been fixed in this PR.

- In `CppTile2DKernel`, `store` did not support `atomic_add` mode. Support for this has been added in this PR.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_nn_fold
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147961
Approved by: https://github.com/malfet
2025-02-26 14:04:34 +00:00
f522d899fb Add MSVC version condition to "Fix for MSVC problem on Windows Arm64 (#136765)" (#145076)
This PR adds MSVC version guards around the if block presented on f7e36d8d6f9706ee9b9653538c4c8d2ba375a181. This commit was to provide a workaround for the problem reported here: https://developercommunity.visualstudio.com/t/MSVC-loop-unrolling-problem-194033813-/10720692 .
The issue is fixed now and only appears between versions 19.36 and 19.42.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145076
Approved by: https://github.com/malfet, https://github.com/alinpahontu2912

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-02-26 12:08:24 +00:00
60d94ea22b Add option to limit number of SMs used by matmul kernels (#147966)
Resubmission of #144974 which was reverted for unrelated reasons.

Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147966
Approved by: https://github.com/danthe3rd
2025-02-26 12:01:12 +00:00
7ffae2c028 Split test_transformers.py (#147441)
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
2025-02-26 11:54:24 +00:00
cf6d1e6824 [dynamo] add generic graph break hints (#147429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147429
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #147385
2025-02-26 09:20:28 +00:00
3fd68e4e2f [dynamo] make some more graph break messages readable in English [2/N] (#147385)
This is for "for some large number Z, make sure the error messages are readable English." - beginning to audit all `unimplemented` sites and making sure that all messages are at least English-readable. Hints may not necessarily be provided.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147385
Approved by: https://github.com/jansel
2025-02-26 09:20:28 +00:00
7a06bfdd1c [inductor][ck] kBatch parametrized (#147885)
Summary:
# Why

Enable us to set the kBatch parameter, rather than bake it in

Especially for larger splitK scenarios, this can yield very good performance (up to 1.5x vs hipblaslt from initial tests)

## Why like this

The obvious question should be: why not add this to the op itself, and maybe even into the template/kernel. That would simplify the code.

The choice to have it as a "runtime" param that we fix is to be able to reuse the compiled CK `.so` libraries, since multiple choices of kBatch can now be used with the exact same `.so` (the shared library does not depend on kBatch, but takes it as a parameter).

# What

- copy cutlass approach for swizzle to have a "runtime" arg that we pass in but is really choice dependent
- pipe through everything from template and kernel
- hard-code it to be kBatch=1 for now (same as before, just now settable)

This is part of a series of Diffs, where next we need to figure out
1. how to filter out ops + kBatch that don't work
2. set this better for splitK scenarios (hand written heuristic)

Test Plan:
(with minor modifications)

```
# show it working with AOTI
buck2 run mode/opt-amd-gpu //scripts/henrylhtsang/repros:aot
```

```
# show it working with inductor only
buck2 run -c fbcode.re_gpu_tests=False mode/opt-amd-gpu  fbcode//deeplearning/aot_inductor/benchmark/sampling:test_gemm_autotune_benchmark_AMD_block_0
```

Differential Revision: D70200008

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147885
Approved by: https://github.com/ColinPeppler
2025-02-26 07:28:19 +00:00
a84db75e1b Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit 12b9674cb603438639298d6c9757ea93e18a7289.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - similar to previous, see below ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2684134336))
2025-02-26 07:17:24 +00:00
4216478250 Fix the benchmark config name from H100 benchmark (#147947)
When using the wrong benchmark configs, the benchmark jobs will be skipped.  The name should have the `_cuda_h100` suffix as used in the test matrix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147947
Approved by: https://github.com/wdvr
2025-02-26 06:40:07 +00:00
4ec6c1d1ec Fix test_halide.py report invocation to re-run failed tests (#147640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147640
Approved by: https://github.com/jansel
2025-02-26 06:32:22 +00:00
acca9b9cb0 Revert "[AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)"
This reverts commit 0b9da1ae0ad30ef228f132354b875bcaec214ace.

Reverted https://github.com/pytorch/pytorch/pull/147803 on behalf of https://github.com/wdvr due to breaking internal tests, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147803#issuecomment-2683938121))
2025-02-26 05:32:17 +00:00
12b9674cb6 torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```
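
A hedged illustration of the call shape (my sketch, not from the PR): the per-32-element block-scale layout and the e8m0 construction below are assumptions, and this requires a CUDA capability 10.0+ device.
```python
import torch

M, K, N = 128, 128, 128
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# _scaled_mm expects the second operand column-major: build (N, K) and transpose.
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# e8m0 value 127 encodes 2**0 == 1.0; one scale per 32-element block along K
# (assumed layout).
scale_a = torch.full((M, K // 32), 127, device="cuda", dtype=torch.uint8).view(torch.float8_e8m0fnu)
scale_b = torch.full((N, K // 32), 127, device="cuda", dtype=torch.uint8).view(torch.float8_e8m0fnu)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # (M, N)
```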

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-26 05:21:26 +00:00
9ed40af917 [BE][EZ] Delete MacOS-12.3 xfail list (#147905)
As PyTorch requires at least MacOS-13 (and Metal-3) to work, delete any pre-MacOS-13 checks from the test script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147905
Approved by: https://github.com/dcci
ghstack dependencies: #147892
2025-02-26 05:08:09 +00:00
a2399c9b44 [BE] Switch index_variable to torch.testing.make_tensor (#147892)
As it was a long-time todo and actually unblocks using this function for MPS devices (which do not support double).
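
For reference, a small sketch of the replacement helper (standard `torch.testing.make_tensor` usage, not code from this PR):
```python
import torch
from torch.testing import make_tensor

# Both index and value tensors can be produced without relying on float64
# support, which matters for MPS devices.
values = make_tensor((3, 4), dtype=torch.float32, device="cpu", low=-1, high=1)
index = make_tensor((5,), dtype=torch.int64, device="cpu", low=0, high=2)
print(values[index].shape)  # torch.Size([5, 4])
```
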
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147892
Approved by: https://github.com/dcci
2025-02-26 05:08:09 +00:00
c839fa4dd2 [Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861)
Resubmit of https://github.com/pytorch/pytorch/pull/145448. It lost its changes on a rebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861
Approved by: https://github.com/zou3519
2025-02-26 05:05:06 +00:00
ba25e26baa [ROCm] Use IPT=8 for block radix sort (#147657)
Improve performance for shapes that use block radix sort by decreasing the item_per_thread to 8.
This will increase the thread block size leading to higher occupancy.

Co-author: @amd-sushetty

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147657
Approved by: https://github.com/jeffdaily
2025-02-26 04:22:16 +00:00
f211818bc0 [c10d] Restrict use condition of NCCL mem pool (#147764)
Add a check to see whether the CUDA driver supports multicast, as is done in Symmetric Memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147764
Approved by: https://github.com/syed-ahmed, https://github.com/yifuwang
2025-02-26 03:40:00 +00:00
d3fc583ff0 [cutlass backend] force_disable_caches for test_number_mm_precompiles (#147901)
Summary: Test is flaky right now.

Differential Revision: D70209511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147901
Approved by: https://github.com/ColinPeppler
2025-02-26 03:22:49 +00:00
9ad0ad6497 [MPS] Introduce a shader for entr(). (#147914)
To be used in eager/inductor in order to implement the missing operation.
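
A quick usage sketch (the op is reached through the regular `torch.special.entr` entry point; falling back to CPU when MPS is unavailable is my addition):
```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.tensor([0.0, 0.25, 0.5, 1.0], device=device)
# entr(x) = -x * log(x) for x > 0, and 0 at x == 0
print(torch.special.entr(x))
```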

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147914
Approved by: https://github.com/malfet
2025-02-26 02:54:44 +00:00
805f7d97f7 [Inductor][Optimus] Fix a corner case in split cat aten pass (#147784)
Summary: We need to further check the inputs of the cat to make sure all of them come from the same split node.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/c875cbdd-5374-46cf-811c-45f91cf6ba3e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10977524161964655
Network: Up: 64KiB  Down: 27KiB  (reSessionID-2e5915cb-4894-48f6-ab1c-3981adb42dab)
Executing actions. Remaining     0/3                                                                         1.5s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:52.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

before
aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e

tlparse:
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/aps-recgpt_ig_emb_pt2_comment_out-30c4d5127e/attempt_0/version_0/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

after
aps-recgpt_ig_emb_pt2_comment_out-c03f74e353

Differential Revision: D70132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147784
Approved by: https://github.com/Microve
2025-02-26 02:19:48 +00:00
b533bb4b13 optimize the decomposition of aten.native_group_norm (#144733)
Summary:
Optimize the decomposition of aten.native_group_norm. Reduce unnecessary repeated operations by changing the order of operations for `mean`, `rstd`, `weight`, `bias` and `input`, which can improve performance when `flattened_inner_size` is large.

The original decomposition:
1. compute `mean` and `rstd`,
2. out = (x - mean) * rstd, computed over the range [N, C, *],
3. out = out * weight + bias, computed over the range [N, C, *].

The new decomposition:
1. compute `mean` and `rstd`,
2. new_weight = rstd * weight, new_bias = -mean * rstd * weight + bias, computed over the range [N, C],
3. out = x * new_weight + new_bias, computed over the range [N, C, *].

I tested the Inductor performance benchmark with this PR on both CPU and A100. On CPU, two torchbench models (functorch_dp_cifar10 and opacus_cifar10) have about 25% performance improvement, and two diffusion models (Stable Diffusion and Latent Consistency Model (LCM)) have about 2% performance improvement. On A100, no performance gains or regressions were seen.
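
A quick numerical sanity check of the algebraic rewrite (a standalone sketch using per-channel statistics for simplicity, not the actual decomposition code):
```python
import torch

N, C, H, W = 2, 4, 8, 8
x = torch.randn(N, C, H, W)
weight, bias = torch.randn(C), torch.randn(C)

mean = x.mean(dim=(2, 3), keepdim=True)
rstd = (x.var(dim=(2, 3), unbiased=False, keepdim=True) + 1e-5).rsqrt()
w, b = weight.view(1, C, 1, 1), bias.view(1, C, 1, 1)

old = (x - mean) * rstd * w + b   # normalize then affine, two passes over [N, C, *]
new_w = rstd * w                  # small per-(N, C) computation
new_b = b - mean * rstd * w
new = x * new_w + new_b           # single multiply-add over [N, C, *]

torch.testing.assert_close(old, new)
```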

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144733
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-26 01:42:46 +00:00
12112fd198 Fix bug in FSDP wrapped module with zero argument (#147771)
Fixes https://github.com/pytorch/pytorch/issues/147531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147771
Approved by: https://github.com/awgu
2025-02-26 01:40:53 +00:00
8de6fe8c0b [docs] fix numpy docs reference (#147697)
Fix a link to numpy documentation that has moved and now 404's

I"ve checked other numpy doc links that point to docs.scipy.org (which then redirects to numpy.org) and they do work, so I am fixing just this 404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147697
Approved by: https://github.com/soulitzer
2025-02-26 01:30:03 +00:00
90e3a3d86d Revert "[ca] trace saved variable unpacking (#147242)"
This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca.

Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))
2025-02-26 00:40:16 +00:00
4d614baa30 Revert "[ca] side-effect free initial trace: GraphTask (#147796)"
This reverts commit 5758743f3c92f9cd9b61bc435602f13dd19c13d7.

Reverted https://github.com/pytorch/pytorch/pull/147796 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147796#issuecomment-2683599896))
2025-02-26 00:36:08 +00:00
143f0f0006 Revert "[ca] side-effect free inital trace: compiled_args (#147804)"
This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67.

Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740))
2025-02-26 00:31:40 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

And nm-ing the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
276dfe8150 [dynamo][cpp-guards] Disable dict-tag optim if the guard_manager has child accessors (#147694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147694
Approved by: https://github.com/isuruf
2025-02-26 00:02:08 +00:00
8e7e5ba182 Add sparse tensors constructed via legacy constructor to _sparse_tensors_to_validate (#147759)
This is a redo of https://github.com/pytorch/pytorch/pull/147408 which added validation at the end of the legacy constructor calls.

The reason I didn't land that was that in `legacy_load`, the constructor would be called before the storages of indices/values are set, so the tensor would not actually be validated.

Technically, torch.sparse.{Foo}Tensor should not even be called by our rebuild process: afaict https://github.com/pytorch/pytorch/pull/27062 was the first PR that added support for sparse tensor serialization, and it already uses `_rebuild_sparse_tensor` (which would add the rebuilt tensor to the list to validate). However, torch.sparse.FooTensor is allowlisted.

This PR adds tensors constructed as such to the list to validate at the end of torch.load.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147759
Approved by: https://github.com/albanD
2025-02-25 23:51:12 +00:00
c82c1411c6 Revert "torch._scaled_mm with MXFP8 (#147548)"
This reverts commit e34c15a05b027b9da0962c971d448138fcf94926.

Reverted https://github.com/pytorch/pytorch/pull/147548 on behalf of https://github.com/wdvr due to failing internal build - discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147548#issuecomment-2683517851))
2025-02-25 23:28:15 +00:00
0633f63f0d [cutlass backend] try fix standlone runner test (#147811)
Differential Revision: [D70147859](https://our.internmc.facebook.com/intern/diff/D70147859/)

Trying to fix this test one last time, especially when mixed mm is getting removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147811
Approved by: https://github.com/chenyang78
2025-02-25 23:27:02 +00:00
05bc8fe62e Revert "follow up to #147548, fix regression on MI300 (#147878)"
This reverts commit cc444e75d540daff127f0210b7f8965a5c2b8d2a.

Reverted https://github.com/pytorch/pytorch/pull/147878 on behalf of https://github.com/wdvr due to temporary reverting to revert an older one in the stack ([comment](https://github.com/pytorch/pytorch/pull/147878#issuecomment-2683515567))
2025-02-25 23:25:59 +00:00
2df9a8d72d [Inductor][Tests] Update get_divisible_by_16 function in test_torchinductor.py to work correctly with new Triton (#147865)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147865
Approved by: https://github.com/davidberard98
2025-02-25 23:14:13 +00:00
1e894d2635 Revert "Add option to limit number of SMs used by matmul kernels (#144974)"
This reverts commit af2d63637ed025789679a17c241e6bb466508a1d.

Reverted https://github.com/pytorch/pytorch/pull/144974 on behalf of https://github.com/wdvr due to reverting in order to revert #147548 that causes a merge conflict ([comment](https://github.com/pytorch/pytorch/pull/144974#issuecomment-2683461733))
2025-02-25 22:46:38 +00:00
cc444e75d5 follow up to #147548, fix regression on MI300 (#147878)
Removing curly braces seemed superficial but broke MI300 rowwise matmul.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147878
Approved by: https://github.com/drisspg
2025-02-25 22:16:28 +00:00
a821d69d92 Fix register constant to be usable in export (#147533)
Differential Revision: [D69939737](https://our.internmc.facebook.com/intern/diff/D69939737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147533
Approved by: https://github.com/zou3519
2025-02-25 21:10:47 +00:00
0d31c621a3 Revert "[inductor][triton] Ignore block ptr advances for removed buffers (#147193)"
This reverts commit 17766b7aad0d9931bb6b3485fcf3d4c7532c3557.

Reverted https://github.com/pytorch/pytorch/pull/147193 on behalf of https://github.com/wdvr due to failing tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147193#issuecomment-2683286358))
2025-02-25 21:04:04 +00:00
6eb3d1e762 [DCP] Cache save plans in default planner (#147343)
Summary:
This PR caches the save plans to significantly reduce the collective cost for successive checkpoint save attempts. Here is the high level approach:
- Create the local plan and cache it.
- In the next iteration, compare the local plan with the cached plan metadata. If nothing changed, do not send that local plan in the collective.
- The global-plan step will only create the global plan with the new delta plans, and empty plans for the cached ones.
- The finish-plan step will check for empty plans. If a plan is empty, it will grab the cached plan; if not, it will use the new plan provided.

Test Plan: UTs

Differential Revision: D69224491

## How to enable the caching:
DefaultSavePlanner introduces `enable_plan_caching`, which is set to False by default for now.
https://github.com/pytorch/pytorch/pull/147343/files#diff-579bbb7b82572753afa91085fbf954f7c7613ff8376da9b26153d5cc3a3c4ee8R77
Set this to True to enable the caching; we should see a significant speed-up in subsequent checkpoint save attempts, especially for larger-scale jobs. Reference issue: https://github.com/pytorch/pytorch/issues/123695
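
A minimal single-process sketch of turning it on (standard DCP save call; the model and checkpoint path are placeholders, and the flag name is taken from the description above):
```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner

model = torch.nn.Linear(8, 8)  # placeholder model

planner = DefaultSavePlanner(enable_plan_caching=True)
dcp.save(
    {"model": model.state_dict()},
    checkpoint_id="/tmp/dcp_ckpt",
    planner=planner,
)
```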

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147343
Approved by: https://github.com/MeetVadakkanchery
2025-02-25 20:59:25 +00:00
8d921eb97f export method (#147573)
The `export` API takes a `nn.Module` and traces its `forward` method. However, sometimes it is useful to export different methods of a `nn.Module`, either as a one-off for debugging or as a set of methods that are called in some sequence outside `export` (e.g., `encode` / `decode`). When multiple methods of the same module instance are exported, they should share the state of the common module instance.

This PR adds a couple of utils in `torch._export.utils` for this workflow.

The `wrap_method` util wraps a method as a `nn.Module` that can then be exported. See included test. We recommend using the same module instance to export multiple methods on that instance, in which case they are guaranteed to share state. On serde, this state sharing is lost, so we provide another util, `sync_state`, to re-sync the state.

These utils are meant to be eventually replaced by API-level changes, but for now this can unblock users who need this workflow. In particular, in the future we can accept one or multiple method entrypoints, with their own args / kwargs / dynamic shape specifications, which can create a variant of `ExportedProgram` with multiple graphs that share state; then we can automatically ensure that the state sharing is preserved through serde.
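
A rough sketch of the intended workflow (the exact signature and import location of `wrap_method` are my assumptions, not confirmed by this message):
```python
import torch
from torch._export.utils import wrap_method  # assumed import location

class Codec(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def encode(self, x):
        return self.proj(x)

    def decode(self, z):
        return self.proj(z).relu()

m = Codec()
x = torch.randn(2, 4)

# Export two methods of the same instance so they share m's state at trace
# time (assumed call pattern).
ep_encode = torch.export.export(wrap_method(m.encode), (x,))
ep_decode = torch.export.export(wrap_method(m.decode), (x,))
```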

Differential Revision: [D69960801](https://our.internmc.facebook.com/intern/diff/D69960801/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147573
Approved by: https://github.com/tugsbayasgalan
2025-02-25 20:58:54 +00:00
687fe64667 Fix crash in -[PTMCoreMLCompiler _compileModel:atPath:] (#147809)
Summary:
We could hit one of those exceptions:
https://github.com/apple/coremltools/blob/main/modelpackage/src/ModelPackage.cpp#L205-L225

And it would make this code path crash.

Test Plan: build.

Differential Revision: D70122378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147809
Approved by: https://github.com/mcr229
2025-02-25 20:56:16 +00:00
ec768d8dc0 [ca] side-effect free inital trace: compiled_args (#147804)
Make methods const to prevent accidental mutation. Changes are mainly in Error nodes and PyNode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242, #147796
2025-02-25 20:38:51 +00:00
5758743f3c [ca] side-effect free initial trace: GraphTask (#147796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147796
Approved by: https://github.com/jansel
ghstack dependencies: #147242
2025-02-25 20:38:51 +00:00
68ddca9449 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.
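
For context, the eager-side API these hooks come from (a standard pack/unpack offload pair, assuming a CUDA device; not specific to compiled autograd):
```python
import torch

def pack_to_cpu(t):
    return t.to("cpu", non_blocking=True)

def unpack_to_gpu(t):
    return t.to("cuda", non_blocking=True)

x = torch.randn(64, 64, device="cuda", requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    y = (x.sin() * x.cos()).sum()
y.backward()  # unpack hooks fire here; under CA they can now be traced and run lazily
```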

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-25 20:38:51 +00:00
adf0f4ffd2 [custom op] fix inductor cpp codegen when returning a list of single tensor (#147649)
For a custom op that returns a list of a single tensor with unbacked symint shape:
```python

@torch.library.custom_op(
    "aoti_custom_ops::fn_ret_list_of_single_tensor", mutates_args={}
)
def fn_ret_list_of_single_tensor(x: torch.Tensor) -> list[torch.Tensor]:
    s = x.sum().to(torch.int64)
    return [torch.randn(s.item())]

@fn_ret_list_of_single_tensor.register_fake
def _(x):
    ctx = torch._custom_op.impl.get_ctx()
    i0 = ctx.new_dynamic_size()
    return [torch.randn(i0)]
```

Before the fix, we have the following error:
```
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp& std::get(const std::variant<_Types ...>&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
In file included from /data/users/yidi/pytorch/torch/include/c10/util/Exception.h:14,
                 from /data/users/yidi/pytorch/torch/include/c10/core/ScalarType.h:5,
                 from /data/users/yidi/pytorch/torch/include/ATen/AccumulateType.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/native/Math.h:3,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec_base.h:31,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec512/vec512.h:8,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/vec.h:4,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional_base.h:6,
                 from /data/users/yidi/pytorch/torch/include/ATen/cpu/vec/functional.h:3,
                 from /tmp/tmp5iikarn2/3b/c3bi5gk6mslf6u4iaqafhxm64z6u65e3eain4xlary5blqnvv6xx.h:39,
                 from /tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:366:
/usr/include/c++/11/variant:1145:27: note: candidate: ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
 1145 |     constexpr const _Tp&& get(const variant<_Types...>&& __v)
      |                           ^~~
/usr/include/c++/11/variant:1145:27: note:   template argument deduction/substitution failed:
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: error: type/value mismatch at argument 1 in template parameter list for ‘template<class _Tp, class ... _Types> constexpr const _Tp&& std::get(const std::variant<_Types ...>&&)’
  456 |     auto u0 = std::get<0>(buf1).size(0);
      |               ~~~~~~~~~~~^~~~~~
/tmp/tmp5iikarn2/cci3ruqb7zdwtl457zo4itspq3sjnqiayhcshp5uaak7ktksckix/cggzqlwf4bmu6tjqodhoto3hhkhgharhwtvw2uxsasqrdipnazrv.cpp:456:26: note:   expected a type, got ‘0’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147649
Approved by: https://github.com/angelayi
ghstack dependencies: #147130
2025-02-25 20:28:41 +00:00
824474cb35 [cond] support output sizes mismatch in front end (#147130)
This PR finishes https://github.com/pytorch/pytorch/pull/137615 by addressing the TODOs and comments left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147130
Approved by: https://github.com/zou3519
2025-02-25 20:28:41 +00:00
de80b6f0d3 Updated test_cuda.py to rerun tests (#147040)
Initially, test_cuda::TestCudaMallocAsync::test_clock_speed and test_cuda::TestCudaMallocAsync::test_power_draw were skipped in this [commit](d4871750d9).

Pulled the ROCm nightly image and verified these two tests run fine locally. Filed this PR to re-enable them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147040
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-25 19:58:42 +00:00
361b6c97cd cpp_wrapper: Fixup output code indentation (#147215)
Closes #142165.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147215
Approved by: https://github.com/desertfire
ghstack dependencies: #146109, #146424
2025-02-25 19:50:37 +00:00
7c515b2da4 cpp_wrapper: fix test_torchinductor* tests (#146424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146424
Approved by: https://github.com/desertfire
ghstack dependencies: #146109
2025-02-25 19:50:37 +00:00
46d1422afd cpp_wrapper: fix inductor triton tests (#146109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109
Approved by: https://github.com/desertfire
2025-02-25 19:50:37 +00:00
9740d69e78 [logging] Add toplevel dynamo_compile / tlparse logging for AOTI (#147760)
Summary:
This adds the proper context managers in `compile_fx_aot` such that we get:
1) A toplevel chromium event (i.e., tlparse)
2) A single `dynamo_compile` log entry

Test Plan:
Before:
* Scuba (we only log the dynamo event): https://fburl.com/scuba/dynamo_compile/sandbox/gaqowzrd
* Perfetto trace: https://fburl.com/vol7r6w1

After:
* Scuba (we log the dynamo _and_ compile_fx_aot event): https://fburl.com/scuba/dynamo_compile/sandbox/cx2we8w8
* Perfetto trace (click on the toplevel event to see the additional metadata): https://fburl.com/sziy40r9

Differential Revision: D70113859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147760
Approved by: https://github.com/desertfire
2025-02-25 19:41:39 +00:00
14b9f7f7bc Remove link to search survey (#147751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147751
Approved by: https://github.com/malfet
2025-02-25 19:26:59 +00:00
17766b7aad [inductor][triton] Ignore block ptr advances for removed buffers (#147193)
Block ptr advancements should also be deferred, conditional on the associated buffer not being removed. For example, if `FusedSchedulerNode(op0-op1)` has a store in `SchedulerNode` `op0` that is read in `op1`, the store and associated block ptr that would be created for `op0` in isolation are no longer needed.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147193
Approved by: https://github.com/jansel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 19:14:55 +00:00
ea6938a1f7 Add XuehaiPan to CODEOWNERS for C++ PyTree utilities (#137408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137408
Approved by: https://github.com/zou3519
2025-02-25 18:48:32 +00:00
580f1183b4 Enable ruff rule S324 (#147665)
Fixes #147627

- Add `S324` in `pyproject.toml `
- Running check and clean warnings

```bash
lintrunner --take RUFF --all-files
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-25 18:27:34 +00:00
6061664266 Enabled force_shape_pad for triton tests in test_kernel_benchmark (#147620)
During ROCm runs these tests naturally show that the padding path would be slower for our archs, so pad_mm chooses to opt out of padding, thus failing those tests.

The reasoning for this is that, per my understanding, those tests don't check IF the operation should be padded in the first place, but HOW it is padded and whether it is done correctly. Moreover, the tests shouldn't really be hardware-dependent or gated on such a condition.

Similar PR for reference: https://github.com/pytorch/pytorch/pull/141768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147620
Approved by: https://github.com/jeffdaily, https://github.com/chenyang78, https://github.com/shunting314
2025-02-25 18:06:48 +00:00
651e6aacf9 [ROCm] Remove benign warning about missing amdgpu.ids (#147791)
Fixes #144203.

We build a custom libdrm when preparing our docker image.  We attempt to locate the amdgpu.ids file relative to the python binary, but this is not possible for venv installs of pytorch when the python binary is a symlink.  Not finding amdgpu.ids causes `torch.cuda.get_device_name()` to return "AMD Radeon Graphics" as a generic name instead of something specific such as "AMD Instinct MI250X / MI250".  The libdrm warning is noisy, so we are removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147791
Approved by: https://github.com/jeffdaily
2025-02-25 17:17:25 +00:00
e5a13410cd Fix the tiny doc descriptions (#147319)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147319
Approved by: https://github.com/zou3519
2025-02-25 17:10:16 +00:00
346bbefa63 [BE] Parameterize TestSDPA in test_mps.py (#147856)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147856
Approved by: https://github.com/Skylion007
2025-02-25 16:07:24 +00:00
810d2a3dbd [ARM] Fix bug in _ref_test_helper in test_ops and fix failing test on Aarch64 (#146597)
We have a failing unit test on Aarch64

```
Exception: Caused by reference input at index 34: SampleInput(input=Tensor[size=(5, 5, 4), device="cpu", dtype=torch.complex64, contiguous=False], args=(), kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=34 python test/test_ops.py TestCommonCPU.test_python_ref__refs_square_cpu_complex64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

After debugging it, I found that the `ex` variable is not being reset to None on each loop iteration inside _ref_test_helper. Fixing this highlighted another expectedFailure to re-enable, `nn.functional.hinge_embedding_loss`, which was incorrectly being skipped due to the same problem.

4a545eb85d/test/test_ops.py (L546)
The `ex` variable is not reset after this line for the next loop iteration.
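
An illustrative sketch of the bug pattern (not the actual test_ops.py code):
```python
def run(sample):
    if sample < 0:
        raise ValueError("bad sample")

def check_all(samples):
    failures = []
    for sample in samples:
        ex = None              # the fix: reset `ex` on every iteration
        try:
            run(sample)
        except Exception as e:
            ex = e
        if ex is not None:     # without the reset above, a stale `ex` from an
            failures.append((sample, type(ex).__name__))  # earlier iteration leaks into later ones
    return failures

print(check_all([1, -2, 3]))   # only -2 should be reported
```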

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146597
Approved by: https://github.com/digantdesai
2025-02-25 14:15:10 +00:00
a695aae89b [MPS] fix attention for >4d tensors (#147545)
Fixes #147443

and adds tests for >4d tensors
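
A small sketch of the >4-D case (falling back to CPU when MPS is unavailable is my addition):
```python
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
# 5-D query/key/value: an extra leading batch dim on top of (batch, heads, seq, dim)
q = torch.randn(2, 3, 4, 8, 16, device=device)
k = torch.randn(2, 3, 4, 8, 16, device=device)
v = torch.randn(2, 3, 4, 8, 16, device=device)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 3, 4, 8, 16])
```
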
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147545
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:55:28 +00:00
0b9da1ae0a [AOTI][refactor] Consolidate CppBuilder.build and CppBuilder.build_fbcode_cpu_re (#147803)
Summary: Let CppBuilder handle all the cpp build logic

Differential Revision: [D70146185](https://our.internmc.facebook.com/intern/diff/D70146185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147803
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806, #147807
2025-02-25 13:33:12 +00:00
cc1c9826d4 [AOTI][refactor] Fix a typo (#147807)
Summary: defination -> definition

Differential Revision: [D70146182](https://our.internmc.facebook.com/intern/diff/D70146182)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147807
Approved by: https://github.com/malfet
ghstack dependencies: #147805, #147806
2025-02-25 13:33:12 +00:00
7ed0670e21 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147806)
Summary: Consolidate cpp compilation action to CppBuilder. Reland https://github.com/pytorch/pytorch/pull/147680

Differential Revision: [D70146183](https://our.internmc.facebook.com/intern/diff/D70146183)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147806
Approved by: https://github.com/malfet
ghstack dependencies: #147805
2025-02-25 13:33:03 +00:00
2680e835c8 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147805)
Summary: The option really means to compile a cpp file using its basename instead of its full path. Reland https://github.com/pytorch/pytorch/pull/147679.

Differential Revision: [D70146184](https://our.internmc.facebook.com/intern/diff/D70146184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147805
Approved by: https://github.com/malfet
2025-02-25 13:32:54 +00:00
7e37fb0a4c [MPS] faster integer matmul for mps (#147526)
There is a naive matmul kernel written for MPS which is used when the input types are integer (and in some other cases on older macOS versions). The old kernel is naive, with global memory accesses that really tank performance, especially when the matrix is sufficiently large.

This PR optimizes it (even though there might be more optimizations using simdgroup matrices, which I'll cover in a follow-up since writing that kernel will take more time).

## Performance comparison on M1 Pro:
![performance_comparison](https://github.com/user-attachments/assets/6ea8de5a-8231-4c5b-8dc9-caa79ea6879a)

You can get these numbers by running this script with the old kernel compiled and then with the new kernel compiled (make sure to change the CSV file where each output is written):
```python
import torch
import numpy as np
import time
import csv

matrix_sizes = [32, 128, 512, 1024, 2048, 4096]
num_runs = 10
warmup_runs = 3

def run_int_mm(A, B):
    torch.mps.synchronize()
    start = time.perf_counter()
    c = A @ B
    torch.mps.synchronize()
    end = time.perf_counter()
    return c, end - start

results = {
    'N': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    print(f"\nBenchmarking N={n}")

    try:
        A_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")
        B_mps = torch.randint(low=-100, high=100, size=(n, n), dtype=torch.int8, device="mps")

        for _ in range(warmup_runs):
            _, _ = run_int_mm(A_mps, B_mps)

        times = []
        for _ in range(num_runs):
            _, t = run_int_mm(A_mps, B_mps)
            times.append(t)

        mean_time = np.mean(times)
        std_time = np.std(times)

        results['N'].append(n)
        results['mean_time'].append(mean_time)
        results['std_time'].append(std_time)

        print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

    except RuntimeError as e:
        print(f"Error for N={n}: {e}")
        continue

with open('int_mm_benchmark_times_old.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147526
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-25 13:15:18 +00:00
b63c601614 Update merge rules for oneDNN part (#147615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147615
Approved by: https://github.com/atalman
2025-02-25 11:26:59 +00:00
af2d63637e Add option to limit number of SMs used by matmul kernels (#144974)
Newer matmul kernels, e.g. those targeting Hopper GPUs, sometimes use a "persistent" schedule which consists of launching as many CUDA blocks as there are SMs on the GPU, with each such block then working on multiple output tiles in a row. This eliminates the overhead of starting and finishing each tile, effectively doing cross-tile pipelining. In previous generations these latencies could be hidden by having multiple CUDA blocks per SM but, with blocks becoming larger, only one can run at a time per SM, and thus this needs to be taken care of in software.

Persistent kernels become an issue when other kernels are running concurrently. The classical example is an NCCL communication kernel running in the background. In such cases the matmul expects to be able to use all the SMs but is prevented from doing so because some of them are busy. This can lead to its blocks being scheduled as two separate waves on the available SMs. This "wave quantization" can double the latency of the matmul kernels.

While we wait for smarter solutions, such as automatic load balancing among the blocks, an easy way to unblock ourselves is to tell the matmuls to only use a subset of the GPU's SMs. For this, I am introducing a global `sm_carveout` flag which can be used to specify how many SMs should be left available for other kernels.

For now I only change the cuBLAS kernels and the scaled-mm CUTLASS kernel. More kernels can be opted-in later.

I tested this change manually, by using the Kineto profiler to look up the grid size of a scaled-mm kernel with different values of `sm_carveout`, and making sure it changed. Suggestions are welcome for a more automated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144974
Approved by: https://github.com/eqy, https://github.com/albanD
2025-02-25 10:19:19 +00:00
94969d0a40 [inductor][user triton] Handle scf.yield more accurately (#147762)
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.

**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)

```python
        @triton.jit
        def branch_with_multiple_yield_args(
            in_ptr0,
            in_ptr1,
            out_ptr,
            conditional_ptr,
            n_elements,
            BLOCK_SIZE: "tl.constexpr",
        ):
            pid = tl.program_id(axis=0)
            block_start = pid * BLOCK_SIZE
            offsets = block_start + tl.arange(0, BLOCK_SIZE)
            mask = offsets < n_elements
            conditional = tl.load(conditional_ptr)
            if conditional:
                in0 = in_ptr0 + 1
                in1 = in_ptr1 + 1
                out = out_ptr + 1
            else:
                in0 = in_ptr0
                in1 = in_ptr1
                out = out_ptr
            x = tl.load(in0 + offsets, mask=mask)
            y = tl.load(in1 + offsets, mask=mask)
            tl.store(out + offsets, x + y, mask=mask)
```

The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters.  When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.

The if branch gets converted to TTIR like this:

```mlir
    %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
      ...
      scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
    } else {
      scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
    } loc(#loc7)
```

and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).

**This PR**: we duplicate the `scf.yield` to add one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.

Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-02-25 08:41:00 +00:00
7bd2e3bca1 Update torch-xpu-ops commit pin (#147743)
Update the torch-xpu-ops commit to [306a0ffb6e0cae27c5bd9a3b9cd378048c8e00e7](306a0ffb6e), includes:

- Bugfix (LayerNorm/Nonzeros)
- Update AOT target

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147743
Approved by: https://github.com/EikanWang
2025-02-25 08:06:35 +00:00
866dc45d3c [Inductor][ROCm][CK] Unhardcoded kernel shapes for ck_conv_template codegen (#147504)
## [Inductor][ROCm][CK] Parameterize `ck_conv_template` Codegen

### Description
Previously, ROCm CK kernel codegen templates were hardcoded with fixed values for convolution parameters:

- `index_t GroupCount`
- `index_t NBatch`
- `index_t NOutChannels`
- `index_t NInChannels`
- `vector<index_t> FilterSize`
- `vector<index_t> InputSize`
- `vector<index_t> ConvolutionStrides`
- `vector<index_t> Dilations`
- `vector<index_t> LeftPads`
- `vector<index_t> RightPads`

This PR updates `ck_conv_template` to accept these parameters dynamically from Inductor. By doing so, we reduce the number of generated templates, improving flexibility and maintainability.

### Testing
- Verified correctness by running relevant test cases, i.e `test/inductor/test_ck_backend.py`
- Ensured generated kernels reflect the updated parameterization, i.e generated templates in `/tmp/torchinductor_root/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147504
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/tenpercent

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
2025-02-25 07:48:07 +00:00
d73b927662 [DSD] Fixes issue when there is a PG without parameters (#147730)
Fixes https://github.com/pytorch/pytorch/issues/143828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147730
Approved by: https://github.com/mori360
2025-02-25 07:25:38 +00:00
fb73b0c7c5 Revert "use copy2d in h2d/d2h copy when possible (#146256)"
This reverts commit 0bc036a9e98d2cc92ff9dd367342b1f2efcc15f0.

Reverted https://github.com/pytorch/pytorch/pull/146256 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146256#issuecomment-2680868627))
2025-02-25 07:06:38 +00:00
bb7e8fbd66 [CacheBench] Add hf_T5 llama moco to cachebench (#147783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147783
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781, #147782
2025-02-25 04:34:45 +00:00
895564d6b6 [CacheBench] Add huggingface (#147782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147782
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780, #147781
2025-02-25 04:34:45 +00:00
c4fb6ae55d [CacheBench] Separate dynamic into its own option (#147781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147781
Approved by: https://github.com/huydhn
ghstack dependencies: #147688, #147780
2025-02-25 04:34:34 +00:00
60d4cbfc06 [CacheBench] Add repeat option so that we can have more accurate cache results (#147780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147780
Approved by: https://github.com/huydhn
ghstack dependencies: #147688
2025-02-25 04:34:25 +00:00
ab3b814af3 [CacheBench] Add ciflow/trunk test (#147688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147688
Approved by: https://github.com/huydhn
2025-02-25 04:34:16 +00:00
4b7604ec10 Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight-only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There are a couple of reasons for the speedup:

- Prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- Similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to, instead of the deferred epilogue tuning in the current path.

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, but not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare to a cublas baseline, so I found that it makes things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). This should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-25 04:29:54 +00:00
890213f65f Revert "[AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)"
This reverts commit 0b52d801d2297ad6c38e631eedfd4dead9360e1b.

Reverted https://github.com/pytorch/pytorch/pull/147679 on behalf of https://github.com/desertfire due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147679#issuecomment-2680389225))
2025-02-25 04:11:13 +00:00
9b06b30468 Revert "[AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)"
This reverts commit 22fae0d948ac14c72b510fafc2283072d744dff9.

Reverted https://github.com/pytorch/pytorch/pull/147680 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147680#issuecomment-2680383986))
2025-02-25 04:06:40 +00:00
9478c90e2b [Quant] flip: throw runtime error for QUInt4x2 and QUInt2x4 input (#147430)
Fixes #147208

**Summary**
The `flip` op causes memory corruption for `torch.quint4x2` and `torch.quint2x4` inputs. This is because the TensorIterator-based implementation does not support multiple elements per byte, and `torch.quint4x2` and `torch.quint2x4` are deprecated in PyTorch. So we add a check here to throw a runtime error if the input dtype is `torch.quint4x2` or `torch.quint2x4`.

**Test plan**
```
pytest -s test/test_shape_ops.py -k test_flip_unsupported_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147430
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2025-02-25 03:47:40 +00:00
20295c017e Fix import of getArtifactLogger for ir_pre_fusion and ir_post_fusion (#147560)
Fixes #147002

There was an issue with the previous PR https://github.com/pytorch/pytorch/pull/147248 that didn't show up in CI,
where a logging import was not complete in torch/_inductor/debug.py before importing it.
This only happened if someone directly imported the file without doing any other imports before.

Also set to off_by_default by request to reduce log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147560
Approved by: https://github.com/Skylion007
2025-02-25 03:36:08 +00:00
e34c15a05b torch._scaled_mm with MXFP8 (#147548)
# summary

Add blockwise MXFP8 support to `torch._scaled_mm` on CUDA capability 10.0 and higher devices.  If the scales for A and B are of dtype `torch.float8_e8m0fnu`, we dispatch to the blockwise kernel from cuBLAS.

This is a skeleton PR where we test basic functionality (numerics of various simple matrices, as well as one end to end quantization + gemm).

- Scales are flipped based on transpose_result
- Handles boundary conditions

Note that MXFP4 is not added in this PR - we can tackle that in a future PR.

This PR was created by taking https://github.com/pytorch/pytorch/pull/145562, switching e8m0 to in-core dtype, removing fp4 for now, and adding test cases.

# test plan

```
pytest test/test_matmul_cuda.py -k blockwise_mxfp8 -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147548
Approved by: https://github.com/drisspg

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-02-25 03:32:22 +00:00
8f728e28dd Enable ASAN in CUDA tests (#147512)
It should work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147512
Approved by: https://github.com/soulitzer
2025-02-25 02:58:39 +00:00
b0fa92042b Fix torch.mean out dtype check (#147188)
**For CPU**:
Type promotion is supported for torch.mean

**For Meta**:
Not supported for torch.mean

Related issue:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147188
Approved by: https://github.com/albanD
2025-02-25 02:50:03 +00:00
33ff96b3f9 cpp_builder: unbreak clang++ detection (#147775)
Fixes an issue where `_is_gcc` would match on `clang++` due to the string ending with `g++`.
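
An illustrative sketch of the detection pitfall and one possible fix (not the actual cpp_builder code):
```python
import re

def is_gcc_buggy(cpp_compiler: str) -> bool:
    # matches "clang++" too, because the string ends with "g++"
    return cpp_compiler.endswith(("gcc", "g++"))

def is_gcc_fixed(cpp_compiler: str) -> bool:
    # require gcc/g++ to start the basename (optionally with a version suffix)
    return bool(re.search(r"(^|/)(gcc|g\+\+)(-\d+(\.\d+)*)?$", cpp_compiler))

for cc in ["g++", "/usr/bin/g++-12", "clang++", "gcc"]:
    print(cc, is_gcc_buggy(cc), is_gcc_fixed(cc))
```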

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147775
Approved by: https://github.com/desertfire
2025-02-25 02:33:01 +00:00
dacdc9782b [Inductor] Add input value checking to randint meta function (#147191)
Fixes #147070

Adding value checking for the range to the meta function, similar to which in the CUDA/CPU aten op.

Test with
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-25 02:18:16 +00:00
c644f4c5fe [Inductor] Fix the decompositions of torch isin (#147519)
**Summary**
Fixed two decomposition issues in `torch.isin`:

- Issue 1: As reported in [#147329](https://github.com/pytorch/pytorch/issues/147329), the current decomposition does not support cases where test_element is a scalar. This is now implemented by referring to the ead970c8d0/aten/src/ATen/native/TensorCompare.cpp (L1004-L1008)

- Issue 2: Found while enabling a unit test with `elements = 1` and `test_elements = torch.tensor([1, 2, 3, 4])`, where Inductor produced different results compared to eager mode. This issue is fixed by referring to ead970c8d0/aten/src/ATen/native/cpu/TensorCompareKernel.cpp (L329-L338)

**Test Plan**
```
python test/inductor/test_torchinductor.py -k test_isin_tensor_scalar
```
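
A small sketch of the scalar cases described above (plain usage, not the PR's unit test):
```python
import torch

elements = torch.tensor([1, 2, 3, 4])

# scalar test_elements (issue 1)
fn = torch.compile(lambda e: torch.isin(e, 2))
print(fn(elements))                        # tensor([False,  True, False, False])

# scalar elements (issue 2)
fn2 = torch.compile(lambda t: torch.isin(1, t))
print(fn2(torch.tensor([1, 2, 3, 4])))     # tensor(True)
```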

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147519
Approved by: https://github.com/jgong5, https://github.com/FFFrog, https://github.com/peterbell10
2025-02-25 01:49:44 +00:00
2c8cd41c1f Delete unused conda-aws-upload environment (#147792)
As this environment only contains keys for Anaconda uploads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147792
Approved by: https://github.com/atalman
2025-02-25 01:42:52 +00:00
43074680b5 [ROCm] Add support for gfx1102 arch to wheel builds. (#147761)
[gfx1102 is not officially supported](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) but most ROCm libs have gfx1102 code objects available since ROCm 5.5.  Now that we're using `--offload-compress` we can fit another gfx target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147761
Approved by: https://github.com/jeffdaily
2025-02-25 01:35:52 +00:00
97557b9833 [Inductor] Update set_driver_to_gpu code to avoid backend re-initialization with new Triton (#147621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147621
Approved by: https://github.com/jansel
2025-02-25 00:04:54 +00:00
55bf3ff3a5 [Docs] Add OpDTypes.any_common_cpu_cuda_one (#147605)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147605
Approved by: https://github.com/soulitzer
2025-02-24 23:23:43 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
81dccd706b [ROCm] OCP FP8 Support for new GPUs (#146632)
TL;DR: Follow-up to / built on top of https://github.com/pytorch/pytorch/pull/144476; add OCP FP8 support for gfx950.
Refer to https://github.com/pytorch/ao/pull/1677.

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-24 22:47:52 +00:00
a71d8b7246 Fix ReferenceError: weakly-referenced object no longer exists in cycle detector (#146922)
Summary: weakref.proxy objects will throw errors when they are dead. We just do not bother visualizing them. They are weak, so they aren't relevant to cycles anyway.
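
A minimal sketch (not the cycle detector's actual code) of why dead weak proxies need to be skipped:

```python
import weakref

class Node:
    def __init__(self):
        self.name = "node"

obj = Node()
proxy = weakref.proxy(obj)
del obj  # the referent is collected; the proxy is now "dead"

try:
    _ = proxy.name  # any use of a dead proxy raises ReferenceError
except ReferenceError:
    pass  # so the visualizer can simply skip weak proxies instead
```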

Differential Revision: D69270429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146922
Approved by: https://github.com/tianfengfrank, https://github.com/Chillee
2025-02-24 22:27:39 +00:00
22fae0d948 [AOTI][refactor] Replace run_command_and_check with CppBuilder.build (#147680)
Consolidate cpp compilation action to CppBuilder

Differential Revision: [D69723632](https://our.internmc.facebook.com/intern/diff/D69723632/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147680
Approved by: https://github.com/yushangdi, https://github.com/angelayi
ghstack dependencies: #147679
2025-02-24 21:45:15 +00:00
0b52d801d2 [AOTI][refactor] Rename use_absolute_path to use_relative_path (#147679)
The option really means to compile a cpp file using its basename instead of its full path.

Differential Revision: [D69722709](https://our.internmc.facebook.com/intern/diff/D69722709/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147679
Approved by: https://github.com/angelayi
2025-02-24 21:44:33 +00:00
96acb56626 [ROCm] Optimize the stride one indexing backwards kernel (#146420)
This patch makes several changes to the stride 1 backwards indexing kernel as follows:
- enables the computation across the `sorted_indices` array to happen in parallel by all the lanes in the warp, which means the accesses to `sorted_indices` are now fully coalesced.
- the duplicate counting now happens in parallel: each lane in the warp counts the duplicates of a different `idx`.
- enables skipping during the duplicate count: this optimization ensures that for a large number of duplicates we can skip 32 values at a time to speed up the count.
- for a low number of duplicates, i.e. fewer than `warp-size` duplicates, just perform the tail reduction, which avoids the wasteful parallel reduction across the warp for this case (it would only add zero values).
- for a high number of duplicates, i.e. when we have more than `warp-size` duplicates, we still use the full warp of lanes to compute the reduced value with as much parallelism as possible. This is done by making sure that all lanes stick around and cooperatively execute the reduction in case there is a single `idx` with a large number of duplicates (i.e. a duplicate spike). For this to happen we use shared memory to pass the duplicate count computed in parallel in the first part of the kernel to the cooperative reduction part of the kernel.

Benefits on examples extracted from workloads show a 3.6x to 10x speed-up.

co-author: Hashem Hashemi <Hashem.Hashemi@amd.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146420
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-24 21:19:06 +00:00
89b9c12de8 remove prints from partitioner (#147749)
See c57894cd74..22d8f9a657 (r1968015955)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147749
Approved by: https://github.com/Skylion007, https://github.com/laithsakka
2025-02-24 21:03:45 +00:00
8eb400ef66 [BE] TCPStore: use typed errors for assertions (#147647)
This is a follow up to #147465 that changes most TORCH_CHECK calls in TCPStore and TCPStoreLibUvBackend  to use typed exceptions instead of generic `TORCH_CHECK` calls which end up as RuntimeErrors in Python.

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147647
Approved by: https://github.com/fduwjj
2025-02-24 20:58:10 +00:00
19fd21fe7e [Inductor] Hot fix after #146917 (#147639)
This pull request reverts the changes to `torch/_inductor/ir.py` file that were added in #146917.

Where I tested, only the changes from `torch/_inductor/codegen/cpp_wrapper_gpu.py` were present; it turns out that the changes in the `torch/_inductor/ir.py` file are not really needed. It's my fault: I didn't sync the environments (between several machines) correctly.

@davidberard98 @YUNQIUGUO maybe that's why the tests on CUDA didn't pass?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147639
Approved by: https://github.com/etaf, https://github.com/davidberard98
2025-02-24 20:34:48 +00:00
754fb834db [BE][CI] bump ruff to 0.9.0: string quote styles (#144569)
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting

- Change the outer quotes to double quotes for nested f-strings

```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```

- Change the inner quotes to double quotes for triple f-strings

```diff
  string = """
-     {', '.join(args)}
+     {", ".join(args)}
  """
```

- Join implicitly concatenated strings

```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00
52f6d4aa30 [BE][CI][Easy] bump ruff to 0.9.0: long statements in docstrings (#146509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146509
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-24 19:56:08 +00:00
9605c5063b [ROCm][TunableOp] Speed-up matmul_small_brute_force_tunableop unit test (#147659)
This PR has a UT speed-up and some refactoring of tests.

A previous PR https://github.com/pytorch/pytorch/pull/142422 fixed this matmul_small_brute_force_tunableop for the FP16 data type by adding TunableOp numerical checks. It had the unfortunate side effect that it increased the execution time for the FP32 and FP64 data types by a significant margin. This PR *reduces* the execution time by 20+ minutes.

We also move a hipBLASLt version check to a different tunableop UT for simplicity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147659
Approved by: https://github.com/jeffdaily
2025-02-24 19:44:38 +00:00
69c4f6ff13 [Minor] Fix minor mistake in docstring of replace_pattern (#147611)
Fixes #147610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147611
Approved by: https://github.com/soulitzer
2025-02-24 19:33:44 +00:00
b9b1fd9b93 [Intel GPU] qlinear.pointwise with mixed dtype support (#136753)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qlinear_pointwise` op. With this PR, we allow `qlinear` kernels to output a BF16 tensor rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_int8_mixed_bf16_xpu \
    -k test_qlinear_relu_int8_mixed_bf16_xpu \
    -k test_qlinear_add_int8_mixed_bf16_xpu
```

# Runtime exemplification
```bash
#qlinear+bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32,,4x4:4x4,0.0698242
# qlinear_add + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:-0.677141+sum:0.0132773,,4x4:4x4,0.0419922
# qlinear_add_relu + bf16 output
onednn_verbose,primitive,exec,gpu:0,matmul,ocl:gemm_with_po:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_bf16::blocked:ab::f0_mask2 dst_bf16::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.533096+sum:0.00416481+eltwise_relu,,4x4:4x4,0.0759277
```
As shown in the oneDNN verbose output, the attribute `dst_bf16::blocked:ab::f0` demonstrates that we can successfully output a bf16 tensor from int8 gemm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136753
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337, #135465

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
075b91bef1 [Intel GPU] qconv.pointwise with mixed dtype XPU support (#135465)
# Motivation
This PR aims to add mixed data type (AMP) support for the `qconv_pointwise` op. With this PR, we allow `qconv` kernels to output a BF16 tensor rather than FP32/INT8.

# UT verification
```bash
DNNL_VERBOSE=1 python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qconv2d_int8_mixed_bf16_xpu \
    -k test_qconv2d_relu_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardtanh_int8_mixed_bf16_xpu \
    -k test_qconv2d_hardswish_int8_mixed_bf16_xpu \
    -k test_qconv2d_silu_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_int8_mixed_bf16_xpu \
    -k test_qconv2d_add_relu_int8_mixed_bf16_xpu
```

# Runtime verification
```bash
#qconv + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0539551
# qconv_silu + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_swish:1,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0588379
# qconv_hardswish + bf16
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_undef::undef::: dst_bf16::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_hardswish:0.166667:0.5,alg:convolution_direct,mb1_ic128oc128_ih6oh4kh3sh1dh0ph0_iw6ow4kw3sw1dw0pw0,0.0568848
```
The `dst_bf16::blocked:acdb::f0` attribute in the oneDNN verbose output demonstrates that the output tensor is computed as bf16 successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135465
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189, #135337

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-24 19:27:50 +00:00
ffa19b9024 [ROCm][Windows] Fix unrecognized constexpr std::memcpy for HIP-clang (#147316)
Since memcpy is not defined as a constexpr function in MSVC's 2019/2022 STL implementation, the HIP clang compiler on Windows cannot evaluate the following memcpy as one that can be resolved at compile time. To resolve this, `__builtin_memcpy` is used instead, which doesn't have this limitation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147316
Approved by: https://github.com/jeffdaily
2025-02-24 18:28:59 +00:00
900a774781 Revert "[ROCm] Update periodic.yml to use 2GPU runners (#146839)"
This reverts commit b6273d7f4ba4fbb126eb96816287641ca1e4efc6.

Reverted https://github.com/pytorch/pytorch/pull/146839 on behalf of https://github.com/jithunnair-amd due to This change is not needed anymore since our 4-GPU runners are back online and stable so far ([comment](https://github.com/pytorch/pytorch/pull/146839#issuecomment-2679145448))
2025-02-24 17:17:58 +00:00
cde12207a0 [Intel GPU] Add SDPA implementation on XPU with OneDNN (#147612)
Add XPU implementation of OneDNN based SDPA operator. Will be integrated and enabled later.

Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147612
Approved by: https://github.com/EikanWang
2025-02-24 16:12:04 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR is to upgrade submodule oneDNN to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA, implemented fp16 and bf16 gemm kernel in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 conv performance, an INT8 kernel page fault, a deconvolution precision issue on complex128 and fp64, and a gemm correctness issue in float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is same as baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
80d3afc698 [inductor] Improve type annotations in _inductor/pattern_matcher.py (#146626)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146626
Approved by: https://github.com/Skylion007
2025-02-24 14:30:35 +00:00
d0f08dc3eb Update slow tests (#147728)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147728
Approved by: https://github.com/pytorchbot
2025-02-24 11:48:19 +00:00
cba14212e6 [FX] micro-optimization map_aggregate(immutable_dict) (#147691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147691
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #147699, #144640
2025-02-24 09:14:08 +00:00
a50af71fb6 [FX] Refactor immutable collections implementation (#144640)
Get rid of dynamic class creation via `type(name, bases, ...)`. Convert it to classic static class definition for better readability and static analysis support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144640
Approved by: https://github.com/jansel
ghstack dependencies: #147699
2025-02-24 09:14:08 +00:00
dc9a03d30c [Window] Fix invalid file path on windows. (#147708)
This PR aims to fix the invalid path for windows: `C:\\Users\\sdp\\AppData\\Local\\Temp\\tmp0wugz2qm\\dynamo\\code_state___main__.TestFxGraphCache.test_cache_hot_load_pgo:None:.pkl.lock`
Windows does not allow chars `\ / : * ? " < > |` in a path.

This PR also replaces `os.rename` with `os.replace` in torch/_dynamo/pgo.py, because `os.replace` allows the target file to already exist on Windows, while `os.rename` does not.
| Function                        | `os.rename()`     | `os.replace()`   |
|---------------------------------|-------------------|------------------|
| Rename a file                   | Yes               | Yes              |
| Move a file                     | Yes               | Yes              |
| Overwrite an existing file      | Error on Windows  | Will overwrite   |
| Overwrite an existing directory | Error on Windows  | Error on Windows |
| Move across disks               | May fail          | May fail         |
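
A small illustrative sketch of the `os.replace` behavior this relies on (the file names are hypothetical):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "code_state.new.pkl")
    dst = os.path.join(d, "code_state.pkl")
    for p in (src, dst):
        with open(p, "w") as f:
            f.write(p)

    # os.rename(src, dst) would raise FileExistsError here on Windows,
    # because dst already exists; os.replace overwrites it on all platforms.
    os.replace(src, dst)
    assert os.path.exists(dst) and not os.path.exists(src)
```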

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147708
Approved by: https://github.com/jansel
2025-02-24 08:31:11 +00:00
5b6ad682bc Revert "[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)"
This reverts commit 85ea67983421acc30ccc76f7a159042e75c6ea08.

Reverted https://github.com/pytorch/pytorch/pull/147254 on behalf of https://github.com/jeanschmidt due to introduced reds on main ([comment](https://github.com/pytorch/pytorch/pull/147254#issuecomment-2677700862))
2025-02-24 08:20:16 +00:00
8d618f3da7 [AOTI][XPU] Suppress multi-line comment warning for XPU. (#147710)
This PR aims to suppress the multi-line comment warning in the SYCL headers when building Inductor cpp_wrapper.
```
/intel/oneapi/compiler/2025.0/include/sycl/detail/builtins/builtins.hpp:235:1: warning: multi-line comment [-Wcomment]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147710
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-02-24 07:28:59 +00:00
cee03b7746 [Inductor] Update should_decompose_mm condition for CPU (#147673)
Summary:
Previously, for cpu we decompose addmm if
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 16
```
We have a new case where `mat2.shape[2] = 304`, and benchmarks show that it is beneficial to decompose, so update the condition to
```
check_device(mat1, mat2, device="cpu")
        and mat1.shape[0] == 1
        and mat2.shape[0] <= 64
        and mat2.shape[1] <= 512
```

Differential Revision: D70033166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147673
Approved by: https://github.com/houseroad
2025-02-24 05:51:50 +00:00
8b65dbad13 [MPS/Inductor] Add support for xlog1py. (#147709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147709
Approved by: https://github.com/jansel
2025-02-24 05:28:52 +00:00
baccadb2f1 xpu: torch.xpu.get_arch_list() to return [] if xpu not compiled (#147431)
Initially discussed here: https://github.com/pytorch/pytorch/pull/132945#discussion_r1957366131

Previously `torch.xpu.get_arch_list()` got relaxed to work even if an XPU device is not available. However, we overlooked the case when PyTorch is not compiled with XPU support. In such a case the function throws an exception. This commit adjusts this behavior and makes the function return `[]` even if PyTorch is not compiled with XPU support.
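
A sketch of the adjusted behavior, assuming a PyTorch build compiled without XPU support:

```python
import torch

# Previously this could raise when PyTorch was not compiled with XPU support;
# after this change it simply returns an empty list.
arches = torch.xpu.get_arch_list()
assert arches == []
```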

CC: @EikanWang @fengyuan14 @guangyey @malfet @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147431
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-02-24 01:35:54 +00:00
7c52ef2424 Add XPU to is_compile_supported to support roi_align op in torchvision (#147541)
Part of the required fix for https://github.com/intel/torch-xpu-ops/issues/1264.

To support `roi_align`, torchvision uses `is_compile_supported` in `torch/_dynamo/utils.py` to compile a non-deterministic version of the op for backwards passes. This PR adds XPU device to the supported compile devices.

The `is_compile_supported()` util function has extremely limited usage, only being used in `torchvision.ops.roi_align` and `torch.utils._content_store.has_storage()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147541
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: lei,zhenyuan <zhenyuan.lei@intel.com>
2025-02-24 01:32:36 +00:00
4e934ee5a7 [MPS] Add eager support for xlog1py. (#147687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147687
Approved by: https://github.com/malfet
2025-02-24 01:23:59 +00:00
eqy
718cf68aee [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-23 22:01:39 +00:00
b5d7aefa57 [BE] add missing overload annotations for tree_map_only (#147699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147699
Approved by: https://github.com/Skylion007
2025-02-23 20:21:07 +00:00
f47573f70d Add super().setUp() to some test cases (#147651)
I saw that their disabled issues were getting spammed with comments, meaning that they were still running in CI despite having a disable issue. Since these test cases were missing the super().setUp() call, I added it so they check whether a disable issue exists for them.
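
A minimal sketch of the pattern being added, assuming a test class deriving from PyTorch's internal `TestCase` (the class, attribute, and test names are hypothetical):

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class MyTest(TestCase):
    def setUp(self):
        # Without this call, the disabled-test check in the base class is skipped,
        # so the test keeps running in CI even when a disable issue exists.
        super().setUp()
        self.payload = [1, 2, 3]

    def test_something(self):
        self.assertEqual(len(self.payload), 3)

if __name__ == "__main__":
    run_tests()
```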

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147651
Approved by: https://github.com/huydhn
2025-02-23 18:21:17 +00:00
f03e7f3801 [MPS] Workaround rng bug for 5D tensors (#147667)
For some reason MPSGraph returns repeated values if the tensor dimension is
larger than 4, which can be clearly seen by running the following
```swift
import Metal
import MetalPerformanceShadersGraph

func randMPS(device: MTLDevice, obuf: MTLBuffer, nelem: Int, ndim: Int = 5) {
  let graph = MPSGraph()
  var dims = Array(repeating: 1, count: ndim)
  dims[0] = nelem
  let shape = dims.map { NSNumber(value: $0) }
  let randNode = graph.randomUniformTensor(withShape: shape, seed: 42, name: nil)
  let mpsOutputBuffer = MPSGraphTensorData(obuf, shape: shape, dataType: .float32)
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  graph.run(with: queue, feeds: [:], targetOperations: nil, resultsDictionary: [randNode: mpsOutputBuffer])
}

func printBuf(_ prefix: String, buf: MTLBuffer, nelem: Int) {
  let buf_data = buf.contents().assumingMemoryBound(to: Float.self)
  print(prefix)
  for i in 0..<nelem {
      print(buf_data[i], terminator: i != nelem - 1 ? " " : "\n")
  }
}

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
print("Using device \(device.name)")

let nelem = 2
guard let buf = device.makeBuffer(length:nelem * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 4)
printBuf("4D uniform", buf: buf, nelem: nelem)

randMPS(device: device, obuf: buf, nelem: nelem, ndim: 5)
printBuf("5D uniform", buf: buf, nelem: nelem)
```

Work around this by flattening the tensor if it's contiguous

Fixes https://github.com/pytorch/pytorch/issues/147624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147667
Approved by: https://github.com/dcci
2025-02-23 16:52:01 +00:00
3e2d9d079e Revert "[ROCm] OCP FP8 Support for new GPUs (#146632)"
This reverts commit f95ab46797e1f3e8cc48ce2f45e4f6985132fb19.

Reverted https://github.com/pytorch/pytorch/pull/146632 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, I'll find someone to help merge this PR back to main ([comment](https://github.com/pytorch/pytorch/pull/146632#issuecomment-2676823614))
2025-02-23 12:04:50 +00:00
d0adff761e Propagate AttributeError to user code in user_defined.py (#146497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146497
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146496
2025-02-23 01:18:28 +00:00
8c761ac7e3 Handle is/is not (#146496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146496
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-23 01:18:28 +00:00
b084635735 [MPS/inductor] Adjust more tests that depends on non-divisible input sizes (#147681)
Also adjust a comment while I'm at it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147681
Approved by: https://github.com/jansel
2025-02-23 00:33:26 +00:00
6a5e3917a7 [MPS] Add inductor support for spherical_bessel_j0. (#147650)
Counterpart to my previous patch that added support for the op in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147650
Approved by: https://github.com/jansel
2025-02-23 00:32:36 +00:00
f9c117f859 [mps/inductor] XFAIL adaptive_avg_pool_with_output_size_0. (#147676)
Non-divisible input sizes are not implemented on MPS device yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147676
Approved by: https://github.com/malfet
2025-02-22 20:17:33 +00:00
db15cb0988 [Submodule] [Cutlass] Update to 3.8.0 tag (#147655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147655
Approved by: https://github.com/henrylhtsang, https://github.com/eqy
2025-02-22 20:05:31 +00:00
85ea679834 [TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)
[TorchRec][PT2] disable contextlib in PT2 train pipeline (#147254)

Summary:

# context
* more details in the [post](https://fb.workplace.com/groups/1075192433118967/permalink/1587079018596970/)
* disable contextlib with PT2

Test Plan:
* run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+dynamo,+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_ultra_mini training.pipeline_type=pt2 data_loader.dataset.table_ds=[2024-12-02] 2>&1 | tee -a output.log
```
* old tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYYAS3o/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
* new tlparse
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpUJhCGZ/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Reviewed By: Microve

Differential Revision: D68480678
2025-02-22 18:57:55 +01:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
bea72180ed Revert "[ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)"
This reverts commit e7bf490c430ac5a70ccb7ab8e954d3386fd29413.

Reverted https://github.com/pytorch/pytorch/pull/144572 on behalf of https://github.com/jeanschmidt due to Broke internal signals, D69994027, I'll find someone to help get this change merged ([comment](https://github.com/pytorch/pytorch/pull/144572#issuecomment-2676314308))
2025-02-22 17:19:38 +00:00
3409cbd177 Revert "Delete Mixed MM Special Casing (#147151)"
This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52.

Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))
2025-02-22 17:14:32 +00:00
72b4f35cb5 [CI] Reduce the AOT target list to reduce build time (#147601)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147601
Approved by: https://github.com/atalman
2025-02-22 14:43:26 +00:00
3cc3d7e08f Also support non-contiguous activation for torch._weight_int8pack_mm on CPU (#147588)
### Problem
Non-contiguous activation for `torch._weight_int8pack_mm` is unsupported on CPU.
So, with int8 WoQ with BF16 activation in torchao, for batch size 2 and above, an assertion is hit because non-contiguous A is unsupported. Such an issue was encountered with LLaMA models.
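
For illustration, a hypothetical sketch of how such a non-contiguous activation arises (shapes are made up):

```python
import torch

# (batch, seq, hidden) activations from a decoder layer
hidden = torch.randn(2, 16, 64, dtype=torch.bfloat16)

# Taking the last token per sequence (as an LM head would) yields a view that
# is not contiguous overall but is still contiguous along the last dimension.
act = hidden[:, -1, :]
assert not act.is_contiguous()
assert act.stride(-1) == 1
```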

### Solution
Also support non-contiguous activation for `torch._weight_int8pack_mm`, so long as it's contiguous on the last dimension & remove the assertion that requires contiguous activation.

### Alternative solutions considered
Could modify LLaMA model in transformers library to call `contiguous` after obtaining the final hidden state, just before computing logits with the LM head. However, [it](https://github.com/huggingface/transformers/pull/36078) might cause some regression for other users of that code.

Another aspect to this issue: is latency always lower if we make an activation tensor contiguous before linear or `torch._weight_int8pack_mm` is called on CPU? We would need some data points to analyze this, although I think the performance should be good enough with this patch, since the first cache lines of the rows of A are explicitly prefetched in the existing code (and it also avoids the copy that a `contiguous` call would do).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147588
Approved by: https://github.com/mingfeima, https://github.com/leslie-fang-intel, https://github.com/malfet
2025-02-22 08:29:07 +00:00
e1bf892d90 [DDP] Temporarily disable comm mem (#147663)
For fear that it incurs slightly more memory usage and causes some applications at a tight memory margin to OOM
(because the comm mem pool is a separate pool from the regular pool?).

Differential Revision: [D70026681](https://our.internmc.facebook.com/intern/diff/D70026681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147663
Approved by: https://github.com/d4l3k
2025-02-22 05:55:43 +00:00
086d146f6f Update ruff linter for PEP585 (#147540)
This turns on PEP 585 enforcement in RUFF (see the sketch after the list below).

- Updates the target python version
- Stops ignoring UP006 warnings (PEP585)
- Fixes a few issues which crept into the tree in the last day
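
For illustration, a minimal sketch of what the UP006 / PEP 585 style change looks like (the function is hypothetical):

```python
# Before: typing generics, now flagged by ruff's UP006 rule
from typing import Dict, List

def bucketize_old(xs: List[int]) -> Dict[int, List[int]]: ...

# After: PEP 585 builtin generics
def bucketize_new(xs: list[int]) -> dict[int, list[int]]: ...
```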

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147540
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-22 04:45:17 +00:00
77d2780657 Enable strobelight profiling specific compile frame ids using COMPILE_STROBELIGHT_FRAME_FILTER (#147549)
running python test/strobelight/examples/compile_time_profile_example.py
```
strobelight_compile_time_profiler, line 123, 2025-02-20 14:08:08,409, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 159, 2025-02-20 14:08:08,409, INFO: Unique sample tag for this run is: 2025-02-20-14:08:081656673devgpu005.nha1.facebook.com
strobelight_compile_time_profiler, line 160, 2025-02-20 14:08:09,124, INFO: URL to access the strobelight profile at the end of the run: https://fburl.com/scuba/pyperf_experimental/on_demand/9felqj0i

strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:12,436, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:15,553, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 205, 2025-02-20 14:08:16,170, INFO: profiling frame 0/0 is skipped due to frame_id_filter 1/.*
strobelight_compile_time_profiler, line 214, 2025-02-20 14:08:16,877, INFO: profiling frame 1/0
strobelight_function_profiler, line 247, 2025-02-20 14:08:19,416, INFO: strobelight run id is: 4015948658689996
strobelight_function_profiler, line 249, 2025-02-20 14:08:21,546, INFO: strobelight profiling running
strobelight_function_profiler, line 289, 2025-02-20 14:08:25,964, INFO: work function took 4.417063233006047 seconds
strobelight_function_profiler, line 230, 2025-02-20 14:08:28,310, INFO: strobelight profiling stopped
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Total samples: 119
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/73h2f7ur
strobelight_function_profiler, line 221, 2025-02-20 14:08:44,308, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/zs06fi9e
strobelight_compile_time_profiler, line 167, 2025-02-20 14:08:44,308, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147549
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #147547
2025-02-22 03:44:53 +00:00
fc095a885c move _strobelight/example to avoid graph breaks (#147547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147547
Approved by: https://github.com/bobrenjc93
2025-02-22 03:44:53 +00:00
fecd3f7ecb [ROCm] change is_hip_clang() to always return True (#147646)
hipify is replacing kernel launches (`<<< >>>`) with the hipLaunchKernelGGL() macro; this is a regression caused by /opt/rocm/hip/.hipinfo no longer existing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147646
Approved by: https://github.com/jeffdaily, https://github.com/petrex
2025-02-22 03:26:55 +00:00
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixed all the inductor UT failures for XPU backend on Windows we found in local machine(Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
2d433cf1ad [Inductor UT][Windows][XPU] Enable Inductor UT on XPU Windows. (#147347)
This PR removes the restrictions on general cases for XPU on Windows, allowing us to run Inductor UT on Windows.
Additionally, this series of PRs has also fixed all XPU Inductor UT issues on Windows. However, due to resource constraints, we have not yet set up a Windows CI pipeline online.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147347
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-02-22 02:53:16 +00:00
84fcf1bb11 constexpr all the things in irange.h (#147633)
I got complaints while irangeifying some files in ExecuTorch
that irange could not be used in a constexpr function. This made the
complaints go away.

I added a constexpr function in irange_test that used to fail to build
with `error: variable of non-literal type 'iterator' (aka
'integer_iterator<int, true>') cannot be defined in a constexpr
function before C++23` and now builds fine.

Differential Revision: [D69959614](https://our.internmc.facebook.com/intern/diff/D69959614/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147633
Approved by: https://github.com/albanD
2025-02-22 01:51:51 +00:00
6e0b09728a [export] Remove report from draft-export output (#147558)
Summary: This matches the export API. To print the report, people can just do `print(ep._report)`. This information is also displayed in the terminal after the draft_export call.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D69689154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147558
Approved by: https://github.com/pianpwk
2025-02-22 00:54:29 +00:00
1c334893dc [CacheBench] Refactor code to prepare for mode benchmarks (#147641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147641
Approved by: https://github.com/huydhn
2025-02-22 00:20:54 +00:00
5d26b7108f [PP] Remove extra code and docs BE (#147636)
current docs:
<img width="746" alt="image" src="https://github.com/user-attachments/assets/4c4088fc-ee97-4a82-be28-e33eb35e76f5" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147636
Approved by: https://github.com/awgu
2025-02-22 00:10:31 +00:00
f95ab46797 [ROCm] OCP FP8 Support for new GPUs (#146632)
TLDR: Follow up/ Build on top of https://github.com/pytorch/pytorch/pull/144476. add OCP FP8 support for gfx950
refer to https://github.com/pytorch/ao/pull/1677

This pull request includes several changes to improve compatibility and support for new GPU architectures and data types, particularly for ROCm. The key updates involve adding support for new ROCm versions and GPU architectures, updating data type handling, and removing outdated checks.

### Improvements to GPU Architecture and ROCm Version Support:
* [`aten/src/ATen/Context.cpp`](diffhunk://#diff-33de472d304acbe57d693c8567370c638068bedc1aa0ce8e9dc115dad05a7810L323-R326): Added support for new GPU architectures `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks.
* [`aten/src/ATen/native/cuda/Blas.cpp`](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199): Updated architecture support in multiple functions to include `gfx1200`, `gfx1201`, and `gfx950` based on ROCm version checks. [[1]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL196-R199) [[2]](diffhunk://#diff-e8a569efee1e650172f120a0fdcda024fe3e4703a4ee3336425c8f685af6b3abL865-R876)

### Updates to Data Type Handling:
* [`aten/src/ATen/cuda/CUDADataType.h`](diffhunk://#diff-9188bb13b1a49f459141f5f9b875593d1c5ce2beb5ad711fdbaf5bc7089ec015L81-L98): Enhanced data type conversion to include new float8 types for both CUDA and ROCm environments.
* [`aten/src/ATen/cuda/tunable/GemmHipblaslt.h`](diffhunk://#diff-bfa1a3b5d4bef1892bf50338775f3b0fd8cd31fc1868148f3968b98aefb68e3fL29-R80): Updated `HipDataTypeFor` template to handle new float8 types and added hard-coded enum values for ROCm versions prior to 6.3.

### Removal of Outdated Checks:
* [`cmake/public/LoadHIP.cmake`](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197): Removed the check for `HIP_NEW_TYPE_ENUMS` as it is no longer necessary with the updated ROCm versions. [[1]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L169-L197) [[2]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5L211-R182)

These changes ensure better compatibility and performance on newer hardware and software environments, particularly for users leveraging ROCm and CUDA for deep learning and scientific computing tasks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146632
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 23:44:08 +00:00
b1a81a4a65 Don't use '-e' when installing Triton (#147228)
Currently the install_triton.sh script uses "pip install -e ." to install Triton.
Using -e is sometimes appropriate for development work but is less appropriate for delivery.
To make matters worse, the behavior of -e seems to vary depending on the version of pip involved.

This PR removes the -e and installs Triton normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147228
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-02-21 23:00:12 +00:00
995b125cdd [CI] Build sm89 with more procs experiment (#147487)
Add a build that uses 4 out of the 8 processes available on a linux.2xlarge/c5.2xlarge. Currently it's set to 2 because it would OOM, but I'm curious how often people's builds OOM. I can't test this on my own because of caching, so it has to run on pull requests.

This might result in a failing job on many people's PRs and I'm not sure how to get around it. I named it stable to make it automatically get sorted into the stable group for Dr. CI, but it'll still show up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147487
Approved by: https://github.com/huydhn
2025-02-21 22:07:00 +00:00
7c8c82cd64 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 22:05:00 +00:00
698f6f9fae specify only some dimensions in shapes collection (#147534)
Differential Revision: [D69936316](https://our.internmc.facebook.com/intern/diff/D69936316/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147534
Approved by: https://github.com/bobrenjc93
2025-02-21 22:02:42 +00:00
2fb9416e6f [inductor][cpu] Move VNNI weight packing into AMX GEMM kernel for contiguous BMM weights (#146843)
Currently, the bfloat16 microkernel that uses AMX vectorization requires that the weights are in an interleaved VNNI format. For GEMM code, this hasn't been an issue because GEMM currently only supports constant weights, so the VNNI weight packing is done during compile-time and saved as a constant tensor to the graph. But for BMM ops where weights are not required to be constant, current code does an expensive reshape/VNNI packing for all BMM weights.

This PR removes the need for the reshape/packing for non-constant inputs by moving VNNI packing inside the AMX microkernel. A new `K * block_n` buffer is used to store the temporary packed weights. Weight packing involves interleaving 2 rows of weights.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146843
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-21 21:46:00 +00:00
d91be786cb [cutlass backend] clear_on_fresh_inductor_cache when generatings cutlass ops (#147586)
Differential Revision: [D69966732](https://our.internmc.facebook.com/intern/diff/D69966732/)

This is needed if we want to generate cutlass ops with different instantiation level in one session.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147586
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-21 21:28:41 +00:00
ef6b16ea9d Revert "[trymerge] Post initial starting merge comment on stacked PRs (#147028)"
This reverts commit 0295aabf6071c7da62325e6a29e04ed09a3e34ef.

Reverted https://github.com/pytorch/pytorch/pull/147028 on behalf of https://github.com/clee2000 due to I think this broke merge for non ghstack prs ([comment](https://github.com/pytorch/pytorch/pull/147028#issuecomment-2675532017))
2025-02-21 21:02:19 +00:00
05e6f15966 Revert "[Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)"
This reverts commit e758d8b4d1632ea765bf8bc8e87b6039ae708b9f.

Reverted https://github.com/pytorch/pytorch/pull/147395 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D69890757 - servicelab_benchmark_pyper_local_runner, @eellison please help the author get this change landed ([comment](https://github.com/pytorch/pytorch/pull/147395#issuecomment-2675521966))
2025-02-21 20:56:40 +00:00
6eb795c9e8 [associative_scan] compile backend change to "eager" (#146973)
This PR fixes some issues with torch export discussed here: https://github.com/pytorch/pytorch/pull/140043#discussion_r1941932960

However, this backend change does still not resolve the failure for specific shapes mentioned here: https://github.com/pytorch/pytorch/issues/137943#issuecomment-2649564994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146973
Approved by: https://github.com/ydwu4
2025-02-21 20:21:41 +00:00
5ed1e23e3a Fix type stubs for SymmetricMemory (#146310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146310
Approved by: https://github.com/yifuwang
2025-02-21 19:59:43 +00:00
fd8ae1aa04 [ROCm] gfx940 and gfx941 cleanup (#147394)
Removing gfx architectures not supported by ROCm.

NOTE: For users wanting to build PyTorch for gfx archs that are *not* supported by the official wheels on download.pytorch.org, you can build PyTorch from source for your desired gfx arch [using the PYTORCH_ROCM_ARCH env var](https://github.com/pytorch/pytorch/blob/main/README.md#amd-rocm-support).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147394
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-02-21 19:42:12 +00:00
c0ee62573a [Easy][optim] Add LBFGS params optional desc (#147579)
[LBFGS docs](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html#torch.optim.LBFGS) missing `optional` description for params in compare with other optimizer docs, like [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html)

## Test Result

### Before

![image](https://github.com/user-attachments/assets/34877490-16b4-4c68-bf6c-405bae563352)

### After

![image](https://github.com/user-attachments/assets/7fba94c8-7091-47b8-bdf1-ca7d779a027f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147579
Approved by: https://github.com/janeyx99
2025-02-21 19:38:10 +00:00
b5c3bb6185 Add continuous run for cachebench (#147546)
This PR adds a continuous run for cache bench.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147546
Approved by: https://github.com/huydhn
ghstack dependencies: #147537
2025-02-21 19:02:17 +00:00
76ce194b8e For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148)
This PR also fixes BMM, which was silently failing for a while.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148
Approved by: https://github.com/eellison
2025-02-21 18:41:52 +00:00
0295aabf60 [trymerge] Post initial starting merge comment on stacked PRs (#147028)
Post a small comment stating if a PR is being merged as part of a stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147028
Approved by: https://github.com/ZainRizvi
2025-02-21 18:05:05 +00:00
2190ca7f47 Use __qualname__ in add_safe_globals and update Unpickling error raised for Unsupported GLOBAL (#146815)
- Fixes #146814

Change
```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__name__
```
to

```python
for f in _marked_safe_globals_set:
    module, name = f.__module__, f.__qualname__
```
for avoiding same key string overwrite.

A test is also added.
```
python test/test_serialization.py TestSerialization.test_serialization_nested_class
```

- Fixes #146886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146815
Approved by: https://github.com/mikaylagawarecki
2025-02-21 18:04:59 +00:00
f4e4cfcb91 [caffe2] Ignore compiler option when building using clang (#147556)
Summary:
Skip adding unrecognized option optimize("-fno-tree-loop-vectorize") when building using clang

This piece of code began to be compiled after armv9a was set as the default compilation profile.

Test Plan: buck2 run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12 lego/scripts:lego_cli -- run-locally --model_entity_id ${MODEL} --config_version ${CONFIG_VERSION} --disable_generate_new_checkpoint --checkpoint_version 0 --publish_context OFFLINE_PUBLISH --lego_pipeline aiplatform.modelstore.model_generation.lego.lego_pipeline_builder.gmpp_lego_pipeline --gmpp_config '{"gmpp_pipeline_descriptor": "aiplatform.modelstore.model_generation.v1.ads_pipelines.aimp_pyper_pipeline.model_generation_pipeline", "worker_process_number":12, "worker_thread_per_process_number": 6, "use_work_assignment": true}' 2>&1 | tee aimp_697790515.log

Reviewed By: andrewjcg

Differential Revision: D69947027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147556
Approved by: https://github.com/janeyx99
2025-02-21 17:46:04 +00:00
a0c7d96028 [Easy] Add Delimeter To Show Where Allocation Addr Begins (#147461)
Summary: When we print the addr we append an "s" or a "b" to the beginning of the addr. Since the addr is in hex, a user might be confused and think the "b" is part of the address. Added an apostrophe to clear this up.

Test Plan: CI

Differential Revision: D69828538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147461
Approved by: https://github.com/zdevito
2025-02-21 17:19:53 +00:00
784f64bb05 [inductor] triton support port-#5512, update cpp wrapper for gpu (#146917)
In short, this pull request enhances `constexprs` expression filtering.

Note: I tested the changes on xpu backend.

Part of https://github.com/pytorch/pytorch/issues/144103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146917
Approved by: https://github.com/EikanWang, https://github.com/etaf, https://github.com/davidberard98, https://github.com/YUNQIUGUO
2025-02-21 17:10:53 +00:00
6a6de0e09d better error message (#147532)
Differential Revision: [D69939736](https://our.internmc.facebook.com/intern/diff/D69939736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147532
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2025-02-21 17:08:47 +00:00
a8ce4d1846 Add cachebench (#147537)
This PR adds a new benchmark called cachebench in order to measure/demonstrate the prowess of PT2 caching.
```
python benchmarks/dynamo/cachebench.py --output="result.json"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147537
Approved by: https://github.com/jamesjwu
2025-02-21 17:06:45 +00:00
af1072ffb6 [Intel GPU] Enable BUILD_GRAPH for xpu_mkldnn (#147608)
For preparation of OneDNN based XPU SDPA enabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147608
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-02-21 16:12:30 +00:00
d6bb1d7f0a Delete Mixed MM Special Casing (#147151)
Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight-only mm in the new path compared to int8 mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7), I got a 1.244x geomean speedup on Huggingface linear shapes with bias. There are a couple of reasons for the speedup (a sketch of the mixed-mm pattern follows the list below):

- prologue fusion is often unprofitable, even for int8 mm. Because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not.
- similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path.
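
For illustration, a hypothetical sketch of the int8 weight-only ("mixed") mm pattern referenced above, which now compiles through the regular mm template with prologue fusion rather than a dedicated mixed-mm path (names and shapes are made up):

```python
import torch

@torch.compile
def int8_weight_only_mm(a, w_int8, scale):
    # dequantize the int8 weight on the fly, then matmul in the activation dtype
    return a @ (w_int8.to(a.dtype) * scale)

a = torch.randn(8, 64, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (64, 32), dtype=torch.int8)
scale = torch.randn(32, dtype=torch.bfloat16)
out = int8_weight_only_mm(a, w, scale)
```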

It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication.

The one kernel that is a little special and would not fall out of prologue fusion is the uint4x2_mixed_mm kernel. It's still possible to generate it with prologue fusion, though not currently exactly as the current [impl](bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)). But the current impl does not compare against a cublas baseline, so I found that it makes things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on H100). This should be fine to delete.

Future optimizations could include:

- cutlass prologue path
- making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2025-02-21 16:02:40 +00:00
36c461af95 Support SymmetricMemory's signaling kernels on sm60 and sm70 (#146308)
By leveraging libcudacxx's utilities: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives/atomic_ref.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146308
Approved by: https://github.com/yifuwang
2025-02-21 15:29:02 +00:00
7ce4974e50 Fix PEP585 update (#147536)
Summary: D69920347 causes a pyre failure due to changing a base object from typing.Iterable to abc.Iterable.  For now revert that change until it can be dealt with on its own.

Test Plan:
failures from D69920347 pass locally
unit tests pass

Reviewed By: oulgen

Differential Revision: D69936518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147536
Approved by: https://github.com/jeanschmidt
2025-02-21 14:37:03 +00:00
654f2666d9 Increase memory for linux binary builds (#147542)
Recently I detected that some Linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not detect any issues in the runner logs, the available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from a `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`.

This change depends on https://github.com/pytorch/test-infra/pull/6316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-21 14:15:40 +00:00
51748a5d1a Update OpenBLAS to 0.3.29 (#144857)
* Improvements for GEMM to GEMV kernels
 * Improvements for SVE kernels for SGEMV and DGEMV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144857
Approved by: https://github.com/malfet
2025-02-21 10:07:06 +00:00
71d2827eeb Code Refactoring for getting start and stride from global ranks (#147230)
Summary: Refactor the code for getting the start and stride from global ranks; this function can be used by different collective backends.

Differential Revision: D69555405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147230
Approved by: https://github.com/kwen2501
2025-02-21 10:02:50 +00:00
e7bf490c43 [ROCm] Implemented dropout usage for RNN with MIOpen backend (#144572)
This PR fixes https://github.com/pytorch/pytorch/issues/107183 for ROCm.

Implemented the use of the new RNN descriptor for the MIOpen backend that takes the dropout rate into account via a dropout descriptor. This fixes the associated test_RNN_dropout_state test.
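A minimal usage sketch (mine, assuming a ROCm build with a GPU) of the path this touches; with this change the MIOpen descriptor honors the inter-layer dropout probability:

```python
import torch

# Two stacked layers so the dropout between layers actually applies.
rnn = torch.nn.LSTM(input_size=16, hidden_size=32, num_layers=2, dropout=0.5).to("cuda")
x = torch.randn(10, 4, 16, device="cuda")  # (seq_len, batch, input_size)
out, _ = rnn(x)
```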

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144572
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-21 10:01:27 +00:00
cffe7183f1 [cutlass backend] Fix standalone runner test after swizzle became a runtime parameter (#147554)
Differential Revision: [D69945114](https://our.internmc.facebook.com/intern/diff/D69945114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147554
Approved by: https://github.com/mlazos
2025-02-21 09:27:44 +00:00
cyy
b61a556427 Turn onnx functions into static (#147598)
To avoid exposing ONNX symbols.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147598
Approved by: https://github.com/justinchuby
2025-02-21 07:40:28 +00:00
3395da7f7c Revert "Build a storage reader/writer to write checkpoints in HF format (#146352)"
This reverts commit c615b8c174c80936d365d19d8b8f4d9ad9a195f3.

Reverted https://github.com/pytorch/pytorch/pull/146352 on behalf of https://github.com/jeanschmidt due to Author ignored linting errors ([comment](https://github.com/pytorch/pytorch/pull/146352#issuecomment-2673789271))
2025-02-21 07:30:52 +00:00
e5da9df421 Revert "Increase memory for linux binary builds (#147542)"
This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830.

Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))
2025-02-21 07:14:57 +00:00
4986f0f52e [PT2]: allow empty dict to pass type check (#147167) (#147480)
Summary:

Seeing errors like the following when testing sigmoid for inline_cvr and perevent_cvr models.
```
terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'Dict[int, Tuple[Tensor, Tensor, Tensor]]' for argument 'event_based_features' but instead found type 'Dict[Any, Any]'.
```
Let empty dict pass type check.

Test Plan:
```
MODEL_ENTITY_ID=691508446
SNAPSHOT_ID=0
OTHER_MODEL_ENTITY_ID=649645886
OTHER_SNAPSHOT_ID=0
MODULE=local

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- \
    --loadMode=BenchmarkAB \
    --inputNetFile=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${suffix} \
    --otherNetFile=/data/users/${USER}/models/${OTHER_MODEL_ENTITY_ID}/${OTHER_SNAPSHOT_ID}/${OTHER_MODEL_ENTITY_ID}_${OTHER_SNAPSHOT_ID}${suffix} \
    --moduleName=${module} \
    --submodToDevice "" \
    --benchmarkDontRebatchSamples=true \
    --sampleInputFilePath=/data/users/${USER}/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/archive_.predictor.disagg.gpu.local/data/sample_inputs/local.pt
```

Reviewed By: yjhao

Differential Revision: D69871393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147480
Approved by: https://github.com/henryoier, https://github.com/jeanschmidt
2025-02-21 07:00:46 +00:00
c74b59fc1f [ROCm][TunableOp] resolve the rocBLAS version dynamically (#147363)
Dynamically gets rocBLAS version instead of relying on some preprocessing-time definitions which may be stale.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147363
Approved by: https://github.com/pruthvistony, https://github.com/naromero77amd, https://github.com/jeffdaily
2025-02-21 06:50:21 +00:00
86ae672b6a Use has_triton_package in _inductor.runtime.hints (#147442)
Use existing method for triton check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147442
Approved by: https://github.com/Skylion007
2025-02-21 05:52:00 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
a2c3a2c5c4 Support serialization for uintx/intx in weights_only (#147500)
Summary:
Fixing the issue reported by huggingface

Test Plan:
python test/test_serialization.py -k test_serialization_uintx_intx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147500
Approved by: https://github.com/mikaylagawarecki
2025-02-21 04:38:44 +00:00
c615b8c174 Build a storage reader/writer to write checkpoints in HF format (#146352)
Summary: Title - we want to write checkpoints in HF format with DCP, this diff allows this for the non-distributed use case.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_hf_torchtune_storage

N6476188 --> able to save and load tensor in hf format

Differential Revision: D68444967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146352
Approved by: https://github.com/saumishr
2025-02-21 03:31:21 +00:00
fe100c3c5b Add libtorch nightly build for CUDA 12.8 (#146265)
Try removing sm50 and sm60 to shrink binary size, and resolve the ld --relink error

"Architecture support for Maxwell, Pascal, and Volta is considered feature-complete and will be frozen in an upcoming release." from 12.8 release note.

Also updating the runner for cuda 12.8 test to g4dn (T4, sm75) due to the drop in sm50/60 support.

https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146265
Approved by: https://github.com/atalman
2025-02-21 03:04:06 +00:00
ba214ab56c TCPStore: soft fail bind when agent store active (#147465)
This makes it easier to roll out `TORCHELASTIC_USE_AGENT_STORE` by opportunistically swallowing bind errors when the agent store is enabled and the port matches `MASTER_PORT`.

This should be very safe: if the store is somehow not up while the env vars are set, the TCPStore client connections will fail to connect, so we end up with a slightly different error message but identical success/failure behavior.

This also pybinds `c10d::SocketError` into Python so we can assert on the error type in tests.
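For context, a minimal TCPStore sketch (not from this PR); with `TORCHELASTIC_USE_AGENT_STORE=1` and a matching `MASTER_PORT`, a failed bind on that port is now swallowed and the client connection path is used instead:

```python
import torch.distributed as dist

# Assumes port 29500 is either free or already served by the agent store.
store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True)
store.set("key", "value")
print(store.get("key"))  # b'value'
```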

https://docs.google.com/document/d/1CzOn_N53AiFxWGgbyMWSnd2elCJd4lZ-ajPg2lzcxoM/edit?tab=t.0#heading=h.2j2f5dimrdau

Test plan:

```
pytest test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147465
Approved by: https://github.com/fduwjj
2025-02-21 03:02:26 +00:00
8a5265cb37 [Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337)
# Motivation
This PR intends to enable the quantized fusion `qlinear+add` on the Intel GPU backend.

At the backend level, we register the ops via the schemas `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")`, which are the ones already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. For the pattern matching, we largely reuse the existing code from x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_add_xpu
```

# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose output is collected from the UT. From the attribute `attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu` we can see that the post-op add and ReLU are successfully fused into the GEMM computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-21 02:09:28 +00:00
8b818ab58f Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353)
Fixes #147303.
Use the float data type for the Half sum in the fallback implementation of batchnorm backward on CPU, as the representable range of Half is small.
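A repro-style sketch (mine, not from the issue) of the affected path, i.e. Half batchnorm backward on CPU, where the reductions now accumulate in float:

```python
import torch

bn = torch.nn.BatchNorm2d(4).to(torch.half)
x = torch.randn(8, 4, 32, 32, dtype=torch.half, requires_grad=True)
bn(x).sum().backward()  # the backward sums now accumulate in float32
```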

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353
Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch
2025-02-21 01:33:33 +00:00
ac88a6c00d [fx] demote node prepend to self log from warning to debug (#147538)
FIXES https://github.com/pytorch/pytorch/issues/147175

This is harmless; it's not clear why it was a user warning. Writing graph-reordering passes is more concise when we ignore this warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147538
Approved by: https://github.com/yanboliang
2025-02-21 01:32:34 +00:00
4b35139a46 [ROCm][TunableOp] Fix TunableOp warmup environment variable. (#147412)
This PR corrects the behavior of the TunableOp warmup variables:
```
PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS
PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS
```

See the updated comments which describe how the environment variables are intended to work. Previously, if you only set one of the two environment variables the warmup iters would always be zero.
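A hedged illustration of setting both knobs together (the enable flag is the standard TunableOp one and is included here as an assumption, as is the GPU requirement); previously, setting only one of the two left the warmup iteration count at zero:

```python
import os

os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"                  # assumed enable flag
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS"] = "100"
os.environ["PYTORCH_TUNABLEOP_MAX_WARMUP_ITERATIONS"] = "30"

import torch  # import after setting the env vars so TunableOp picks them up

a = torch.randn(512, 512, device="cuda")  # assumes a ROCm build with a GPU
b = torch.randn(512, 512, device="cuda")
c = a @ b  # the GEMM goes through TunableOp warmup and tuning
```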

Manually tested the four possible combinations to make sure things still behave as intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147412
Approved by: https://github.com/jeffdaily
2025-02-21 00:29:58 +00:00
fdb1305ace reland "[sigmoid] Test OSS model runner with test_export.py" (#147535)
Summary: There are ~260 tests covering all the corner cases of export in test_export.py. Utilize them to test sigmoid in the OSS setting.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r _sigmoid

Differential Revision: D69937387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147535
Approved by: https://github.com/yiming0416
2025-02-20 23:45:13 +00:00
87e6e2924e Increase memory for linux binary builds (#147542)
Recently I noticed that some linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating the runner logs, available disk space, network usage, and CPU load, I could not find any issues. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-20 23:02:45 +00:00
be0df96b50 Fix c++ implementation of strip_function_call (#147436)
#143063 was missing handling for a couple of UCS cases and had some bugs in the way it dealt with errors.

- Fix all the UCS handling (and make some of the common code more common)
- Make sure all the error paths return `nullptr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147436
Approved by: https://github.com/jansel
2025-02-20 20:41:21 +00:00
af31640391 [cutlass backend] enable mixed mm test (cutlass2x) for H100 (#147474)
I am okay with not landing this as well. The motivation is to make developing on H100 smoother.

The reason the current test works on A100 but not H100 is an alignment issue, which was caused by arch-specific filtering logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147474
Approved by: https://github.com/alexsamardzic, https://github.com/ColinPeppler
2025-02-20 20:28:44 +00:00
d068141c3b [cutlass backend] add subproc tests (#147173)
I want to separate subproc autotuning from the main tests. And I observed that for addmm, it can work without subproc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147173
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #147169
2025-02-20 20:07:42 +00:00
2565951f8a [cutlass backend] remove triton from most tests and add an integration test (#147169)
Removing aten and triton from the list of backends for the tests that have it. Instead, add a small integration test to make sure autotuning works fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147169
Approved by: https://github.com/ColinPeppler
2025-02-20 20:07:42 +00:00
fb1f7f6a09 [codemod] Fix unused-value issue in caffe2/aten/src/ATen/native/miopen/Conv_miopen.cpp +1 (#147496)
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.

For questions/comments, contact r-barnes.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Differential Revision: D69755123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147496
Approved by: https://github.com/Skylion007
2025-02-20 19:00:38 +00:00
6971b77510 [CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935)
Summary: Adds wait_event and record_event endpoints to CPU stream in order to facilitate device-agnostic code. Both methods are noops.

Test Plan: CI

Differential Revision: D68833927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935
Approved by: https://github.com/Skylion007
2025-02-20 18:50:55 +00:00
863ac20659 [CI] Do not overwrite return code of test file when fails for rerun disabled tests (#147484)
Do not overwrite the return code of a single file when it fails. This allows the log to be printed to stdout and to the GHA logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147484
Approved by: https://github.com/ZainRizvi
2025-02-20 17:51:58 +00:00
83bb921a5a [ROCm] Update meta_registration for efficient attention (#146979)
Fixes a series of failing and skipped unit tests.

For NVIDIA hardware, the logsumexp last dimension is required to be a multiple of 32. This is not the case for ROCm.

A related issue: https://github.com/pytorch/pytorch/issues/146848

The unit tests in question:
```bash
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_6_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_6_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979
Approved by: https://github.com/shunting314
2025-02-20 15:05:13 +00:00
382fbcc1e4 add the torch.float8_e8m0fnu dtype to PyTorch (#147466)
Summary:

Continuing the work from https://github.com/pytorch/pytorch/pull/146427

Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format.  Example of basic functionality:

```python
import torch

# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu)  # RNE rounding
x2 = x1.to(torch.float32)  # 2 ** exponent

# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)

# printing
print(x0)
```

Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works

For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)

Test Plan:

```
pytest test/quantization/core/experimental/test_float8.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
2025-02-20 13:55:42 +00:00
574371d828 Add current cuda device index to FXGraphCache key (#147464)
This PR intends to fix the cache related issues from https://github.com/pytorch/pytorch/issues/147405.
It does *not* handle the dynamo recompile case in process, because it does not introduce any extra guards. For FXGraphCache and AOTAutogradCache, we simply have to have the device context in the cache key.

Note that for any function that accepts tensor inputs, the device context is naturally already included in the cache key by the metadata of example inputs. However, for functions that return constants or have no arguments, the device context still needs to be in the cache key.
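A sketch of such a function (my own example, not from the PR): it takes no tensor inputs, so nothing in the example-input metadata pins the device, and the cache key must carry the current CUDA device index instead:

```python
import torch  # requires a CUDA build with at least one GPU

@torch.compile
def make_ones():
    # No tensor arguments: the output device comes from the ambient CUDA device.
    return torch.ones(8, device="cuda")

with torch.cuda.device(0):
    a = make_ones()
# On a multi-GPU machine, compiling under torch.cuda.device(1) must not
# silently reuse the device-0 cache entry.
```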

A more robust fix for this would be to have inductor generate device guards that are dynamic, instead of specialized. This would also help us share more cache artifacts.

I've added unit tests for FXGraphCache and AOTAutogradCache, both of which would fail without this change.

Differential Revision: [D69875939](https://our.internmc.facebook.com/intern/diff/D69875939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147464
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2025-02-20 12:38:21 +00:00
ead970c8d0 Revert "Add cifllow/riscv64 label"
This reverts commit 5116b27792d37c38039459c922a466581e219fc2.
(I've pushed to the wrong branch by accident)
2025-02-20 11:55:52 +01:00
5116b27792 Add cifllow/riscv64 label 2025-02-20 11:09:06 +01:00
6beba8dcce Optimize graph.py typing (#147099)
Optimize `graph.py` methods type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147099
Approved by: https://github.com/cyyever, https://github.com/aorenste
2025-02-20 09:32:30 +00:00
f9b8121350 Make Inductor scheduler aware of _scaled_mm (#146992)
This is used, for example, to estimate runtime when doing comms overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146992
Approved by: https://github.com/drisspg, https://github.com/eellison, https://github.com/shunting314
2025-02-20 09:02:31 +00:00
9da250aada type fully_shard so that the return value can be chained with typing enabled (#147489)
This allows for

```
fsdped = fully_shard(model)
fsdped.set_xyz()
```

same applies if `model` is actually a list of modules

Differential Revision: [D69888119](https://our.internmc.facebook.com/intern/diff/D69888119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147489
Approved by: https://github.com/Skylion007
ghstack dependencies: #147488
2025-02-20 08:43:16 +00:00
6a72aaadae Fix torch.max optional args dim, keepdim description (#147177)
[`torch.max`](https://pytorch.org/docs/stable/generated/torch.max.html#torch.max)'s optional args `dim` and `keepdim` are not described in the documentation, even though users may want to pass them.

```python
>>> import torch
>>> a = torch.randn(3,1,3)
>>> a.max()
tensor(1.9145)
>>> a.max(dim=1)
torch.return_types.max(
values=tensor([[ 1.1436, -0.0728,  1.3312],
        [-0.4049,  0.1792, -1.2247],
        [ 0.8767, -0.7888,  1.9145]]),
indices=tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]))

```
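And a quick added illustration of `keepdim`, which keeps the reduced dimension with size 1 (same `a` as above):

```python
>>> values, indices = a.max(dim=1, keepdim=True)
>>> values.shape
torch.Size([3, 1, 3])
>>> a.max(dim=1).values.shape
torch.Size([3, 3])
```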

## Changes

- Add `optional` description for `dim`, `keepdim`
- Add example of using `dim`, `keepdim`

## Test Result

### Before

![image](https://github.com/user-attachments/assets/3391bc45-b636-4e64-9406-04d80af0c087)

### After

![image](https://github.com/user-attachments/assets/1d70e282-409c-4573-b276-b8219fd6ef0a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147177
Approved by: https://github.com/colesbury
2025-02-20 08:18:09 +00:00
452315c84f Fix RuntimeError: value cannot be converted to type int64_t without overflow (#147492)
The exact call is coming from here:

78a94c9114/torch/_inductor/memory.py (L161)

I have no idea why this error is being thrown and what mode/modes might be failing for this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147492
Approved by: https://github.com/eellison
2025-02-20 08:00:26 +00:00
a000c7e6d2 Add hint message for pack_padded_sequence (#146747)
Fixes #144207

Add truncate hint message in docs [torch.nn.utils.rnn.pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html)
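A minimal usage sketch (mine) of the call the hint applies to; when the largest entry of `lengths` is shorter than the padded time dimension, the trailing padded steps are simply dropped:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

seq = torch.randn(5, 3, 2)          # (T=5, batch=3, features=2)
lengths = torch.tensor([4, 3, 1])   # max length < T, so step 5 is truncated away
packed = pack_padded_sequence(seq, lengths, enforce_sorted=True)
print(packed.batch_sizes)           # tensor([3, 2, 2, 1])
```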

## Test Result

![image](https://github.com/user-attachments/assets/46258f36-f6c7-4f11-9213-8513e52a9001)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146747
Approved by: https://github.com/mikaylagawarecki
2025-02-20 06:27:07 +00:00
db4ce78d46 PEP585: More UP006 fixes (#146392)
This should be the final PR before we can enable RUFF UP006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146392
Approved by: https://github.com/justinchuby, https://github.com/albanD, https://github.com/Skylion007
2025-02-20 06:18:13 +00:00
76ad19a549 [dynamo][codegen] Implement CSE for pre-graph graph-arg bytecode reconstruction (#147425)
This reduces fixed overhead seen in a few internal models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147425
Approved by: https://github.com/jansel, https://github.com/StrongerXi
2025-02-20 05:42:52 +00:00
8f6b9403c1 [audio hash update] update the pinned audio hash (#147423)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147423
Approved by: https://github.com/pytorchbot
2025-02-20 05:39:46 +00:00
77aa602871 [torchbind] Differentiate ScriptModule and ScriptObject with qualified name (#147399)
Summary:
This PR adds an _is_script_object method to differentiate ScriptModule and ScriptObject; the former inherits from ScriptObject in C++, so both pass the isinstance(obj, torch.ScriptObject) check.

The qualified name of a ScriptObject (i.e. a custom class) starts with "__torch__.torch.classes"; this has been a widely used assumption for dealing with custom classes across our code base.

Test Plan: Add new test.

Differential Revision: D69685316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147399
Approved by: https://github.com/yushangdi
2025-02-20 04:57:57 +00:00
7185ca8348 [Cutlass] Add test verifying number of precompiles (#147477)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147477
Approved by: https://github.com/henrylhtsang
2025-02-20 04:47:57 +00:00
5f5b44f6bf [ROCm] Update inductor-periodic.yml to use the correct label (#147473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147473
Approved by: https://github.com/jeffdaily
2025-02-20 04:44:18 +00:00
0d56b7e665 Support size oblivious max equation (#147344)
Addresses https://github.com/pytorch/pytorch/issues/125914 by detecting when we have a sym_max between {0, 1} and a summation of size-like unbacked symints.

The basic idea is max(1, u0 + u1) can be simplified to u0 + u1 if both u0 and u1 are size-like since their value ranges are [2, inf].
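An illustrative sketch (hypothetical model, not from the PR) of the shape of code this helps: both counts become unbacked symints under compile/export, they become size-like once used as a size, and the `max(1, ...)` can then be discharged:

```python
import torch

class M(torch.nn.Module):
    def forward(self, a, b):
        u0 = int((a > 0).sum())   # unbacked symint when traced
        u1 = int((b > 0).sum())
        n = max(1, u0 + u1)       # traced as sym_max(1, u0 + u1)
        return a.new_zeros(n)     # using n as a size marks u0/u1 as size-like

out = M()(torch.randn(8), torch.randn(8))  # eager sanity check
```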

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147344
Approved by: https://github.com/angelayi
2025-02-20 04:33:19 +00:00
0b0da81021 Support static method of torchbind attributes in torch.compile with inductor backend (#146927)
As title.

Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.

Also, this diff only covers *static* methods of torchbind *attributes*. Some cases that are not supported/tested:
- dynamic torchbind objects
- torchbind objects as an input to the module.

Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.

Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):

```python
async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
2025-02-20 03:33:19 +00:00
de1cb0f351 capture the return value in the contract typing (#147488)
----

* the existing typing makes the return type `Optional[nn.Module]`
* this doesn't seem to be what the decorator actually does as it does
  not alter the original return type
* This PR aims to fix the typing

Differential Revision: [D69888120](https://our.internmc.facebook.com/intern/diff/D69888120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147488
Approved by: https://github.com/Skylion007
2025-02-20 03:32:34 +00:00
fea718f062 [BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730)
Our three main users are OK with this, with two of them (foreach_map, invoke_quant) preferring it this way.

I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.

Test Plan
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
2025-02-20 02:30:36 +00:00
f79b352f5a [Intel GPU] qconv_pointwise.binary XPU support (#135189)
# Motivation
This PR intends to enable the quantized fusions `qconv+add` and `qconv+add+relu` on the Intel GPU backend.

At the backend level, we register the op via the schema `TORCH_SELECTIVE_NAME("onednn::qconv2d_pointwise.binary")`, which is the one already defined in `x86InductorQuantizer`.

At the Inductor level, we make a small modification in `torch/_inductor/fx_passes/quantization.py` to allow the signed int8 data type (s8) during op lowering. For the pattern matching, we largely reuse the existing code from x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
   -k test_qconv2d_add_xpu \
   -k test_qconv2d_add_relu_xpu 2>&1
```

# Runtime exemplification
The following is the oneDNN verbose output collected from the UT:
```bash
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_s8::blocked:acdb::f0 wei_s8::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_s8::blocked:acdb::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:1:f32 attr-zero-points:src0:0:s32+dst:0:s32 attr-post-ops:eltwise_linear:1:0.337704+sum:0.0241217+eltwise_relu,alg:convolution_direct,mb1_ic3oc6_ih8oh6kh3sh1dh0ph0_iw8ow6kw3sw1dw0pw0,0.151123
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135189
Approved by: https://github.com/liangan1, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/jerryzh168
ghstack dependencies: #133307

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-20 02:02:54 +00:00
93316cfe94 Move ir_pre_fusion.txt and ir_post_fusion.txt to TORCH_LOGS (#147248)
Fixes #147002

Moves ir_{pre, post}_fusion.txt to be controlled by TORCH_LOGS instead of TORCH_COMPILE_DEBUG.
Updated tests of these logs as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147248
Approved by: https://github.com/eellison
2025-02-20 00:26:17 +00:00
16e202a38e [dynamo] improved graph break messages for some common graph break sites [1/N] (#146525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146525
Approved by: https://github.com/jansel
2025-02-20 00:08:13 +00:00
1e94c7aaa4 [draft_export] only clear pending unbacked symbols for overwritten kernels (#147427)
This was wrong, we were doing this in all cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147427
Approved by: https://github.com/angelayi
2025-02-20 00:07:54 +00:00
3986c3e4a6 [reland][cutlass backend] Do not change dtype of GEMM template for cutlass 3x (#147434)
Reland of https://github.com/pytorch/pytorch/pull/146877
incorporate forward fix (didn't land): https://github.com/pytorch/pytorch/pull/147185

Summary:
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error"s. This is probably due to using the right template.

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69825865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147434
Approved by: https://github.com/ColinPeppler
2025-02-20 00:07:07 +00:00
a88d7d4268 [util] fetch logical count cpu (#147413)
To match the vCPU count reported by AWS:

after (96), before (48)
Instance Ref: https://instances.vantage.sh/aws/ec2/g4dn.metal
before: https://hud.pytorch.org/utilization/13377376406/37360984234/1
after: https://hud.pytorch.org/utilization/13401543806/37435031356/1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147413
Approved by: https://github.com/clee2000
2025-02-19 23:44:54 +00:00
004d65aeb0 Add type hints to cuda kernel (#147471)
Missed this in a previous PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147471
Approved by: https://github.com/eellison
2025-02-19 23:35:10 +00:00
48203bec63 [BE] remove sysconfig.get_config_var("LIBDIR") from cuda lib paths (#147409)
Summary: I think the path is not needed anymore. It was added in https://github.com/pytorch/pytorch/pull/126408, but  it has been a while since then. See if CI complains.

Differential Revision: D69573185

See also  https://github.com/pytorch/pytorch/pull/147158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147409
Approved by: https://github.com/chenyang78
2025-02-19 23:04:22 +00:00
f63db6255f Re-land exclude upsample_bilinear2d.vec and nearest2d.vec from default export decomposition table (#147153)
Note: This is a re-land of https://github.com/pytorch/pytorch/pull/141791, which I reverted due to breaking some Meta-internal tests - an internal ET delegate did not handle the non-decomposed upsample_nearest2d, and it was not caught in CI. I've resolved that issue and should be ready to safely re-land.

Summary:
As upsample_bilinear2d.vec and upsample_nearest2d.vec are core ATen ops, they should not be decomposed by default in the export path. Because the operators have CompositeImplicitAutograd dispatch, their decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.

In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining two ops, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed), given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and recompose it, but this is no longer necessary.

Test Plan:
Added a new test (`test_default_decomposition_core_cia_ops`) in test_export.py to verify that upsample_bilinear2d.vec (and in the future, other core-tagged CIA ops) are not decomposed by default. Also, I manually validated end to end with ExecuTorch that the op is not decomposed in to_edge (see N6238522).

```
buck test //caffe2/test:test_export -- test_default_decomposition_core_cia_ops
```

Differential Revision: D69625112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147153
Approved by: https://github.com/manuelcandales
2025-02-19 23:03:29 +00:00
fb55bac3de [fr][fix] Split MatchState and dynamic info for fr analysis downstream (#147439)
The original MatchState type was declared as a Python Enum. Although we did make it callable, we consume it right away. There are downstream cases where we need it to be a Python class, which a Python Enum does not support. So we did a small refactoring to keep both the enum state and the dynamic info (culprit) for the fr analysis script.

Differential Revision: [D69830994](https://our.internmc.facebook.com/intern/diff/D69830994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147439
Approved by: https://github.com/fegin
2025-02-19 22:09:16 +00:00
41ae15faa3 [ONNX] Add scaffolding for onnx decomp and logic for op tests (#147392)
Create scaffold for onnx op test data and common logic. This PR creates the scaffolding for new onnx decomp functions described in https://github.com/pytorch/pytorch/issues/139301. It adds two ops: abs and add, and enables the related tests.

https://github.com/pytorch/pytorch/issues/139301
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147392
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147396
2025-02-19 21:55:12 +00:00
24738768a8 more dist ops in non strict (#147417)
Summary: Previously we added support for `all_reduce` to non strict. This PR extends this support to other non-functional collectives that are remapped in Dynamo: `all_gather`, `all_gather_into_tensor`, `all_to_all_single`, `reduce_scatter_tensor`.

Test Plan: added unit tests

Differential Revision: D69813991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147417
Approved by: https://github.com/angelayi
2025-02-19 21:29:16 +00:00
394676759d ci: Add h100 nightly perf testing (#146868)
This infrastructure has been up for a while so add a workflow to actually run things on it.

> [!IMPORTANT]
> We only have **14** linux.aws.h100 runners, so it might be beneficial for us to pare this list down.
> Will leave it up to the compiler team to comment on this PR on which tests are actually important vs. what is not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146868
Approved by: https://github.com/eellison, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-02-19 21:13:17 +00:00
8bea08e5bc [BE] Fix tensor stub (#147384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman
2025-02-19 19:47:03 +00:00
e758d8b4d1 [Inductor][Triton] Rework casting logic to avoid illegal bitcast (#147395)
Triton introduced checks for bitcasts where the casted value does not fit into the casted type (e.g. https://github.com/triton-lang/triton/pull/5926, though in this instance I think the issue is related to the type for the broadcast). Some routines in Inductor now perform illegal bitcasts. I reworked the compare and swap w/ index routine used in sort to remove the illegal bitcast (~~I left the bitcast for now, but I think it could probably be removed assuming the reshape does not change the type~~). The explicit cast is correct, and I don't think there are performance issues, but because the cast on the sum is not a bitcast I suppose there could be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147395
Approved by: https://github.com/eellison
2025-02-19 19:45:01 +00:00
279c7f262e [ONNX] Refactor dispatcher and registry (#147396)
This PR sets up the registry to accept onnx decomp functions to be moved into PyTorch (https://github.com/pytorch/pytorch/issues/139301).

The ops from ONNX Script are currently appended to the registry. When the ops are moved into PyTorch, the moved ops take precedence because they appear first in the registry list.

After the migration, the hooks for loading ops from ONNX Script will be removed.

1. Use a private field `_pt_onnx_signature` to store function signatures to avoid conflicts
2. Update the registry to record the signature in OnnxDecompMeta and update the dispatcher to leverage the data structure
3. Update registry to prepare for onnx op registration, and update the onnx_impl decorator to support a no_compile option

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147396
Approved by: https://github.com/titaiwangms
2025-02-19 19:38:28 +00:00
4f3c070b25 [inductor] GraphLowering code movement (#147335)
moved these methods under __init__ to be more idiomatic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147335
Approved by: https://github.com/eellison
ghstack dependencies: #147331
2025-02-19 19:32:30 +00:00
5a3a50c791 Update Arm Compute Library (ACL) to v25.02 (#147454)
Among other things, this version of ACL fixes the redundant-declaration warning that we're blocked on in #145942, #146620, and #147337, and introduces better scheduling heuristics for GEMMs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147454
Approved by: https://github.com/malfet
2025-02-19 18:51:08 +00:00
9fee408daa [caffe2] disable warning for unused arguments (#147411)
Summary: Disable warnings on unused command line arguments for ukernels_asm.

Test Plan:
On top of D69602077:
```
$ buck2 build --flagfile fbsource//xplat/mode/arstudio/auto.py fbsource//xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack:ukernels_asmAppleMac
```

Differential Revision: D69807977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147411
Approved by: https://github.com/kimishpatel
2025-02-19 17:54:31 +00:00
5220d402b5 [ROCm] TopK optimizations for AMD GPUs (#146387)
TopK on ROCm performs better on the test suite with the default config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146387
Approved by: https://github.com/malfet, https://github.com/ngimel
2025-02-19 17:10:59 +00:00
e6c86952c6 Add CUDA 12.8 windows nightly build (#147037)
https://github.com/pytorch/pytorch/issues/145570

The Windows AMI was deployed to prod today; prepping the Windows CUDA 12.8 build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147037
Approved by: https://github.com/atalman
2025-02-19 16:59:32 +00:00
8cbf7d0d6e [Inductor UT][XPU] Skip fft_c2c case since it's not implemented on XPU. (#147351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147351
Approved by: https://github.com/jansel
2025-02-19 16:03:03 +00:00
ed83b0b70b [ddp] decouple python reducer from compilation mode (#147123)
The current implementation reads as: we will only actually use the "python_reducer" config if the DDP forward is compiled. Otherwise, we silently fall back to the C++ reducer with no DDPOptimizer.
I'm changing this behavior to always use the Python reducer if the config is specified.
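For reference, a hedged sketch of opting in; the config path below is the existing dynamo `optimize_ddp` knob and should be treated as an assumption, while the "python_reducer" value comes from this description:

```python
import torch

# Assumed knob: select the Python reducer instead of the C++ reducer + DDPOptimizer.
torch._dynamo.config.optimize_ddp = "python_reducer"
```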

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147123
Approved by: https://github.com/fegin
2025-02-19 15:51:40 +00:00
303ad1916f [FlexAttention] Fix weird generate stride call in flex decode (#147435)
# Summary
Seems like we had a redundant tuple unpack, which doesn't appear to be supported in newer Triton.

Fixes https://github.com/pytorch/pytorch/issues/147373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147435
Approved by: https://github.com/BoyuanFeng
2025-02-19 12:12:27 +00:00
77dbd28535 [Cutlass] Restore search space for swizzle (#147224)
This restores the previous search space; since swizzle is now a runtime parameter, there shouldn't be extra compile-time overhead from searching it now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147224
Approved by: https://github.com/eellison
ghstack dependencies: #147222, #147223
2025-02-19 09:22:51 +00:00
e9b3ff0570 [Cutlass] Add support for runtime param choices, starting with swizzle (#147223)
This PR adds support for swizzle as a runtime parameter choice. Future runtime parameter choices can be added to the [get_runtime_arg_info](2d40f9fb52/torch/_inductor/codegen/cuda/cuda_template.py (L282)) list method and then possible choices can be [looped over similarly to swizzle](933f921b36/torch/_inductor/codegen/cuda/gemm_template.py (L532)). For precompile, we now filter choices by hash to only compile each distinct kernel source once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147223
Approved by: https://github.com/Chillee, https://github.com/eellison
ghstack dependencies: #147222
2025-02-19 09:22:51 +00:00
81eb2a78ad [Inductor] Add autotuning artifact logging (#147222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147222
Approved by: https://github.com/henrylhtsang, https://github.com/eellison
2025-02-19 09:22:42 +00:00
655b061ef0 [inductor] Freeze runtime asserts after shape prop but before codegen (#147331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147331
Approved by: https://github.com/eellison
2025-02-19 06:29:13 +00:00
454fbd5bbe realize stride symbols in estimate_runtime (#146752)
Unfortunately, I could not create a local repro or a unit test.
Fixes https://github.com/pytorch/pytorch/issues/146686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146752
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-02-19 06:02:49 +00:00
2c3680ce38 [apf] Fix input adapter (#147238)
Summary: Add support for inputs that no longer exist in `input_fields` but are not actually used by the original program. In this case, we just give them a dummy input based on the node's metadata.

Test Plan: Verified for S488841

Differential Revision: D69328093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147238
Approved by: https://github.com/pianpwk
2025-02-19 04:49:58 +00:00
465930ee81 Revert "[ROCm] ROCm-specific gemm tuning parameters" (#147388)
Summary:
This diff reverts D69573225 / https://github.com/pytorch/pytorch/pull/143286

15% cold compile time regression, see https://fb.workplace.com/groups/1075192433118967/permalink/1608559059782299/

Test Plan: NA

Differential Revision: D69790102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147388
Approved by: https://github.com/yanboliang
2025-02-19 04:47:35 +00:00
4ece056791 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common NCCL version for CUDA builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use the legacy ``NCCL_VERSION=v2.21.1-1``
We use a pinned version of NCCL rather than a submodule.
Move the NCCL location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-19 03:52:26 +00:00
bd370c138a fix pt2e block wise quantization unit test (#147406)
Differential Revision: D69806596

https://github.com/pytorch/pytorch/pull/146946 breaks the unit test, because the quant nodes are folded by default now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147406
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-02-19 02:40:27 +00:00
5006932cbc [cutlass backend] forward fix of standalone runner for fbcode (#147158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147158
Approved by: https://github.com/chenyang78
2025-02-19 02:02:10 +00:00
f16d30137c [OSS] Update FileSystem methods to properly handle a string argument (#145751)
Summary: When testing, I tried to pass a string argument to the FileSystem class's methods, which is a valid input, but the cast() that converted the string to a path wasn't working as likely expected, causing all the methods to fail with a string arg. Instead of a cast, a proper constructor should be used.

Test Plan: N6475361 methods don't throw an error with a string arg like they were previously

Differential Revision: D68713937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145751
Approved by: https://github.com/pradeepfn
2025-02-19 01:50:24 +00:00
953f7834cc [ONNX] Pick up missing types in dynamic shapes renaming (#147407)
Found in `_check_dynamic_shapes` that int and None are valid input types for dynamic_shapes.
This PR adds support for these two types and adds tests to keep the ONNX flattening logic in sync with the one in export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147407
Approved by: https://github.com/justinchuby
2025-02-19 01:49:53 +00:00
757d7f28d1 [CD] Increase timeout for windows binary builds (#147390)
Mitigates https://github.com/pytorch/pytorch/issues/147376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147390
Approved by: https://github.com/huydhn, https://github.com/jeanschmidt, https://github.com/malfet
2025-02-19 01:15:04 +00:00
959d79f85f [ONNX] Move and improve error reproduction logic in test (#147391)
https://github.com/pytorch/pytorch/issues/139301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147391
Approved by: https://github.com/titaiwangms
2025-02-19 00:00:11 +00:00
babb2dc2af Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 6f7e67c43c13b5675b4ff60cbaa71e5083a22481.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/wdvr due to failing inductor mkldnn_pattern_matcher_cpu tests ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2667186865))
2025-02-18 23:58:31 +00:00
525ca80f53 add unbacked strict mode (#147333)
fixes #145775

This is the first step in introducing a "strict" mode where we don't silently specialize and don't silently graph break. At a high level, when we do mark_unbacked(..., strict=True), any time we specialize an unbacked symint we will explicitly error and tell the user their unbacked dimension was specialized to a single value.
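A hedged sketch of the intended usage (the `strict=True` keyword is what this PR adds; treat the exact signature as an assumption):

```python
import torch

x = torch.randn(8, 4)
# Treat dim 0 as unbacked; with strict=True, any attempt to specialize that
# dimension raises instead of silently burning a single value into the graph.
torch._dynamo.mark_unbacked(x, 0, strict=True)

@torch.compile(fullgraph=True)
def f(t):
    return t * 2          # fine: does not specialize dim 0
    # return t.view(8, 4) # would specialize dim 0 to 8 and now errors loudly

f(x)
```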

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147333
Approved by: https://github.com/laithsakka
2025-02-18 23:33:55 +00:00
5d547d82e6 Add no_data_dependent_graph_break mode (#147342)
This adds a strict mode, `TORCHDYNAMO_UNBACKED_STRICT`, to prevent graph breaking when we guard on data-dependent values. This is a better UX for those who are actively trying to make their model more dynamic but aren't close enough to full graph to use that flag directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147342
Approved by: https://github.com/laithsakka
2025-02-18 23:33:47 +00:00
bae049b439 Update addr doc (#146482)
Fixes https://github.com/pytorch/pytorch/issues/146399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146482
Approved by: https://github.com/janeyx99
2025-02-18 23:25:38 +00:00
ca397d82a6 [Sigmoid] Fix issues with constant folding and fba_ops (#146948)
Summary:
There are 2 issues:

- `skip_folding_node_fn` isn't considered when propagating constant values. So given a skipped node with constant inputs, it outputs a constant, and its users can then output constant values and be included in the constant graph. However, the skipped node itself is not included when extracting the constant graph. This issue is fixed by checking for skipped nodes when propagating constant values and making a skipped node output an unknown value (not a constant) so that its users cannot output constants.

- The `fba_linear` op can be included in the constant graph, but it is not implemented for CPU, so the constant graph cannot be executed. This issue is fixed by converting `fba_linear` to `aten.addmm`.

- A refactor to allow more fba_ops to be included in the constant graph (via mapping fba_ops to aten ops).

Reviewed By: StellarrZ

Differential Revision: D68716393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146948
Approved by: https://github.com/zhxchen17
2025-02-18 23:17:47 +00:00
c9a15d980f [FSDP2] Simplify shard_placement_fn in test (#146847)
Summary: Found this while checking `shard_placement_fn` for Shampoo shard independent implementation.

Test Plan: OSS CI & tests

Differential Revision: D69412878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146847
Approved by: https://github.com/awgu
2025-02-18 23:01:26 +00:00
c8433c2c6c [BE] correct docs for clock_rate to MHz, fixes #147098 (#147393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147393
Approved by: https://github.com/andrewor14
2025-02-18 22:59:58 +00:00
a21a123fd5 Add fqn_modifier at loading_state_dict and unit test (#146557)
In a fusion model, users might change the state_dict keys via a state_dict hook.
The load_state_dict APIs here don't call model.state_dict(), so the hooks are never invoked to change the keys, causing a mismatch between FQNs and state_dict keys.

This PR suggests that users declare how they change the state_dict key prefix (they can name it; here we call it "fqn_modifiers" by default).
During state_dict loading, we apply the prefix change when computing the FQN, so the keys are processed the same way as they would be through the state_dict hook.

For example:
There's a state_dict_hook:

```
def _state_dict_hook(self, destination, prefix, keep_vars):
    """Remove "embedding" from the original embedding in the state_dict
    name. This keeps the original state dict name for the embedding
    from before fusing with the FusionEmbedding.

    [!Note] This update changes the order of the OrderedDict
    """
    key = prefix + "embedding.weight"
    new_key = prefix + "weight"
    destination[new_key] = destination[key]
    del destination[key]
```

In DSD after this PR, we skip "embedding." before "weight" if we find the "fqn_modifiers" attribute on that module:
```
def fqn_modifiers(self) -> Dict[str, str]:
    return {
        "weight": "embedding",
    }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146557
Approved by: https://github.com/fegin
2025-02-18 22:54:41 +00:00
7622e29a37 Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit eecee5863e698d19458b33df7bfecbda0a04557a.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks Locally building benchmarks ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2667054179))
2025-02-18 22:23:35 +00:00
3f35664ee8 More precise check for shared storage check in inductor/reinplace pass (#147050)
Currently, if two tensors share storage, we have some logic to avoid re-inplacing. Before this PR, two tensors were considered to share storage if they used the same underlying storage, even if they did not overlap. This diff enhances the checks to skip those cases when we can easily tell the tensors do not overlap.
Mitigates https://github.com/pytorch/pytorch/issues/139628 but does not fix the inductor issue in it.
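A small sketch (mine) of the distinction the new check makes: both views below use the same underlying storage, yet they cannot overlap, so re-inplacing no longer needs to be blocked for them:

```python
import torch

base = torch.zeros(8)
a = base[:4]
b = base[4:]
# Same storage...
assert a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr()
# ...but a covers elements [0, 4) and b covers [4, 8): disjoint, no overlap.
```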

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147050
Approved by: https://github.com/zou3519
2025-02-18 21:55:34 +00:00
63e8ad49b8 [dynamo] replace hardcoded eval frame control flags skip_code_recursive_flag/cache_limit_hit_flag (#146355)
This PR and the previous:
- Moves parts of `eval_frame.c` to C++.
- Reduces code duplication in `dynamo__custom_eval_frame` and makes the control flow more clear.
- Enables `convert_frame` to signal to `eval_frame.cpp` in a general manner how to evaluate this frame, recursive frames, and future frames with the same code object (default/compile, skip, run-only). e.g. this will allow us to change skipping/cache limit hit eval_frame behavior directly from convert_frame without requiring changes to C/C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146355
Approved by: https://github.com/jansel
ghstack dependencies: #145603
2025-02-18 21:37:12 +00:00
75db0fd8a0 [dynamo] refactor dynamo__custom_eval_frame to C++, refactor SKIP_CODE[_RECURSIVE] (#145603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145603
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-02-18 21:37:12 +00:00
eb892cd768 [codegen] enable SORT and TUPLE_REDUCTION for AMD Triton (#147340)
Looks like Triton's AMD backend supports multiple inputs already.
Let's enable SORT and TUPLE_REDUCTION for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147340
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/eellison
2025-02-18 21:15:23 +00:00
1b047d5d7a Add link to non_blocking/pinmem tutorial in Tensor.to docstrings (#145651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145651
Approved by: https://github.com/svekars
2025-02-18 20:38:01 +00:00
clr
166419b9c1 dynamo: Don't crash when encountering an object with no __name__ (#147246)
This was triggering on ScriptFunctions. Note that other than badly implemented C functions, this seems to be almost impossible to trigger, so I wrote a smaller unit test rather than a full repro. Let me know if people feel strongly and want a full reproduction.
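
A minimal sketch of the failure mode (generic Python, not the actual Dynamo code): some callables lack a `__name__` attribute, so lookups need a fallback.

```python
# Hypothetical helper showing the defensive pattern: fall back to the type name
# when a callable has no __name__ attribute.
def describe_callable(fn):
    name = getattr(fn, "__name__", None)
    return name if name is not None else type(fn).__name__

class NoName:
    def __call__(self):
        return 42

print(describe_callable(len))       # "len"
print(describe_callable(NoName()))  # "NoName" -- instances have no __name__
```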

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147246
Approved by: https://github.com/anijain2305, https://github.com/jansel, https://github.com/Skylion007
2025-02-18 20:35:49 +00:00
12v
74682e8595 Fix typo (#147330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147330
Approved by: https://github.com/srinivasreddy, https://github.com/Skylion007
2025-02-18 20:20:34 +00:00
d9b3d76b85 Fix linter warnings (#147386)
https://github.com/pytorch/pytorch/pull/145866 accidentally introduced a warning about const casts and also comparison of unsigned long int with signed long int.

This PR fixes both of those warnings.

Tested by running:

```
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUFILE -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail 
-DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

And I got no warnings or errors. Same with `python setup.py develop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147386
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-02-18 20:03:16 +00:00
302f56a1f2 Revert "Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)"
This reverts commit 59b7e52ad8f6146b4364515a7f3e54d6f3edd6da.

Reverted https://github.com/pytorch/pytorch/pull/146845 on behalf of https://github.com/jeanschmidt due to Seems to break a few code dependencies in multiple places ([comment](https://github.com/pytorch/pytorch/pull/146845#issuecomment-2666656834))
2025-02-18 19:01:27 +00:00
57060bebf3 [symbolic shapes] Add replacement for backed symints (#147240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147240
Approved by: https://github.com/pianpwk
ghstack dependencies: #146939
2025-02-18 18:49:51 +00:00
84abeaad5c [export] Log evaluate_expr (#146939)
We want to log each symnode created so that we can do provenance tracking in the tlparse report generated for draft export. To do this, we want to assign a unique id to every symnode, which python's `id` function already does, and then for every expression created, we can find the provenance by tracing back through its arguments' ids. This logging only happens when dtrace_structured is enabled, which is only when running draft export.

An example output is as follows:

<img width="799" alt="image" src="https://github.com/user-attachments/assets/88bb31b4-8c31-43fb-aa88-08b573b9f71d" />

For the increase in the compile_time_instruction_count benchmark, this seems unavoidable because I need to call `id` to get the unique identifier for each symnode. But I believe `id` is an inexpensive operation, so hopefully it should be ok?  I tried doing the following:
* Originally I was passing around `self`, which is a SymNode, which caused the compile time to be ~6.36M
* I changed it to pass around `id(self)` instead, which reduced the compile time to ~6.33M
* Then I changed it to be passed as a positional arg instead of a kwarg, which reduced the compile time to ~6.22M, but this doesn't seem to be a super worthwhile fix?

#suppress-bc-linter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146939
Approved by: https://github.com/oulgen
2025-02-18 18:49:51 +00:00
c6b331f7d9 Deprecate skip_code_recursive_on_cache_limit_hit config flag (#136970)
Fixes one of #136862

Deprecate the `skip_code_recursive_on_cache_limit_hit` flag.

The affected logic is here:
6931c1644a/torch/_dynamo/convert_frame.py (L866-L876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136970
Approved by: https://github.com/williamwen42
2025-02-18 18:48:23 +00:00
6f7e67c43c Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.
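
A hedged usage sketch (shapes, dtypes, and keyword handling are assumptions and may differ across PyTorch versions): FP8 matmul with per-tensor scales on the CPU backend.

```python
# Assumed usage of torch._scaled_mm on CPU; the scale handling is illustrative only.
import torch

M, K, N = 32, 64, 48
a = torch.randn(M, K).to(torch.float8_e4m3fn)
b = torch.randn(N, K).to(torch.float8_e4m3fn).t()  # second operand in column-major layout
scale_a = torch.tensor(1.0)
scale_b = torch.tensor(1.0)

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([32, 48])
```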

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-18 18:44:26 +00:00
dd2a943e14 Fix the AOTI compile failure with ARM CPU for Meta internal (#147204)
Summary: Fix the AOTI compile failure with ARM CPU for Meta internal

Differential Revision: D69642211

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147204
Approved by: https://github.com/houseroad
2025-02-18 17:54:34 +00:00
5d675de754 Update ck (#144799)
Updates the CK version and re-implements kernel generation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144799
Approved by: https://github.com/jianyuh
2025-02-18 17:00:27 +00:00
a00d2b5144 s390x: add cleanup for cancelled docker image builds (#147110)
When a podman image build is cancelled,
a couple of processes are left behind,
and their existence prevents
a proper shutdown of the runner container.

Add a cleanup step at the end of the workflow
using a new option recently introduced in podman:
https://github.com/containers/podman/pull/25102

Example of a job preventing an s390x worker from cleaning up and restarting properly:
https://github.com/pytorch/pytorch/actions/runs/13289159296/job/37105230728
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147110
Approved by: https://github.com/huydhn
2025-02-18 16:26:46 +00:00
6edc419d69 Update torch-xpu-ops commit pin (#147358)
Update the torch-xpu-ops commit to [a14d1eaa834a616705068103dc8129319087e864](a14d1eaa83), includes:

- SparseCSR XPU support
- Refine build system

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147358
Approved by: https://github.com/EikanWang
2025-02-18 16:05:25 +00:00
0c8028e877 [export] Loosen symint input serialization (#147237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147237
Approved by: https://github.com/avikchaudhuri
2025-02-18 13:03:47 +00:00
b10ba0a46c Unify all sympy versions to avoid conflicts within PyTorch (#147197)
As the title stated.

There are some tiny differences between 1.13.1 and 1.13.3:
1.13.1:
2e489cf4b1/sympy/core/numbers.py (L1591)

1.13.3:
b4ce69ad5d/sympy/core/numbers.py (L1591)

**Previous PR:**
https://github.com/pytorch/pytorch/pull/143908

**ISSUE Related:**
https://github.com/pytorch/pytorch/issues/147144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147197
Approved by: https://github.com/malfet
2025-02-18 10:51:43 +00:00
d9cf1debf9 [ROCm][Windows] Fix clang-cl error related to -Wmissing prototypes enabled (#146981)
Some of the Windows files (fused_kernels.cpp, temp_file.h) contain code that fails to compile when this flag is enabled and the build uses clang-cl.

This PR resolves the issue by ensuring that even if we build with clang-cl, those flags are not included on Windows.

Alternatively, if needed, I can fix the files mentioned so they pass under this flag.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146981
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-02-18 07:41:12 +00:00
49e8f9c965 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 22fae4c5f94eb43f71a2eebc1904880740cb1d60.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to third time is the charm ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2664622598))
2025-02-18 05:11:32 +00:00
59a08138c5 [executorch hash update] update the pinned executorch hash (#147345)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147345
Approved by: https://github.com/pytorchbot
2025-02-18 05:08:06 +00:00
6a2bb629ec Update torch-xpu-ops commit pin (#147302)
Update the torch-xpu-ops commit to [b421032c8fed40df5eaee395c2e7f5f8a7bcc815](b421032c8f), includes:

- Correct int4 weight pack implementation
- Enhance build system: only build one shared library for the user

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147302
Approved by: https://github.com/EikanWang
2025-02-18 05:04:15 +00:00
59915b8dec [Intel GPU] qlinear at XPU backend (#133307)
# Motivation
The PR is intended to enable `onednn.qlinear` and `onednn.qlinear_unary` on Intel GPU.

We register the qlinear ops at the C++ backend via `TORCH_LIBRARY_IMPL`; the ops this PR registers are `onednn::qlinear_pointwise`, `onednn::qlinear_pointwise.tensor`, and `onednn::qlinear_prepack`. The prepack op transposes the weight to fit oneDNN's weight requirement and achieve higher performance.

Also, we remove the limitation of the corresponding annotation method in the `XPUInductorQuantizer` (`torch/ao/quantization/quantizer/xpu_inductor_quantizer.py`) to allow GPU linear conversion.

We add the kChar (`torch.int8`) dtype in `torch/_inductor/fx_passes/quantization` and `torch/_inductor/mkldnn_ir.py`, as signed int8 is the default INT8 data type on the GPU side.

We verified the ops through UTs and e2e model testing with models like ResNet18 and ResNet50.

# UT verification
```
 DNNL_VERBOSE=0 TORCH_COMPILE_DEBUG=0 python test/inductor/test_mkldnn_pattern_matcher.py -v  \
     -k test_qlinear_xpu \
     -k test_qlinear_relu_xpu \
     -k test_qlinear_gelu_xpu
```

# Runtime exemplification
Here is the oneDNN verbose collected through running above UTs
```
//pure int8 gemm
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_s8::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32+dst:0:s32,,2x4:4x3,0.187988
// post-relu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_relu,,2x4:4x4,0.115234
// post-gelu fusion
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_gelu_tanh,,2x4:4x4,0.170898

````

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133307
Approved by: https://github.com/liangan1, https://github.com/guangyey, https://github.com/EikanWang, https://github.com/jerryzh168

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-18 04:02:42 +00:00
bb8c4ecc6d Allow XPU device for validating the arguments to sparse compressed tensor factory functions (#147306)
During Sparse tensor conversion, a validity check is performed. We need to allow XPU to pass this check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147306
Approved by: https://github.com/EikanWang, https://github.com/Skylion007, https://github.com/guangyey
2025-02-18 03:55:54 +00:00
71484a2106 [pt2-benchmarks] Compiler reset on every run (#147313)
Internal benchmarks call `run` in a loop. Resetting the compiler gives each run a clean environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147313
Approved by: https://github.com/jansel
2025-02-18 02:09:19 +00:00
708428704e patch for block-wise quantization + pt2e (#146946)
Summary: https://github.com/pytorch/pytorch/pull/144492 was reverted due to duplicate kernel registration. This PR will re-introduce the patch

Differential Revision: D69488779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146946
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2025-02-18 01:15:26 +00:00
59b7e52ad8 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-02-17 22:42:16 +00:00
1393f9a76c [ROCm] Update inductor-perf-test-nightly-rocm.yml to use the correct labels & frequency (#147221)
This workflow takes around 75-80hrs on ROCm, so scaling down the frequency to once per week until we get more CI capacity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147221
Approved by: https://github.com/pruthvistony, https://github.com/huydhn
2025-02-17 19:29:27 +00:00
6c0e7463af Fix test_device_memory_allocated (#147311)
Fixes #147310

The tensor created by `torch.ones` allocates memory but is released immediately, so the following assertion fails.
This PR stores it in a temporary variable to fix this.
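
A hedged sketch of the pattern (device string and sizes are illustrative; the actual test targets the accelerator under test):

```python
# Without a reference, the torch.ones temporary is freed right away and the
# allocator counter drops back before the assertion runs; keeping it in a
# local variable makes the check meaningful.
import torch

device = "cuda"  # illustrative; the fixed test uses the device being tested
before = torch.cuda.memory_allocated(device)

kept = torch.ones(1024, device=device)           # fixed: hold a reference
assert torch.cuda.memory_allocated(device) > before
del kept
```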

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147311
Approved by: https://github.com/guangyey, https://github.com/Skylion007
2025-02-17 19:00:53 +00:00
516133ddb0 Fix arvr macOS buck pytorch builds (#147292)
Summary:
X-link: https://github.com/ctrl-labs/src2/pull/42453

buck arvr macOS builds had a few issues that needed fixing.

Test Plan: build with buck

Differential Revision: D69722372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147292
Approved by: https://github.com/Skylion007
2025-02-17 18:47:24 +00:00
22fae4c5f9 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the various FP8-related UTs to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-17 18:39:10 +00:00
1b29de5c05 Add NEON implementation for 8 bit quantized embedding bag on aarch64 (#147322)
This improves performance by ~5.5x on NeoverseV1 cores using the following benchmarking script:
```
import torch
import torch.nn as nn
import numpy as np
import torch.autograd.profiler as profiler

np.random.seed(0)
torch.manual_seed(0)

class SimpleEmbeddingBagModel(nn.Module):
    def __init__(self, num_embeddings, embedding_dim):
        super(SimpleEmbeddingBagModel, self).__init__()

        weights = torch.from_numpy((np.random.random_sample((num_embeddings, embedding_dim)) + 1).astype(np.float32))
        obs = torch.ao.quantization.PerChannelMinMaxObserver(dtype=torch.quint8, qscheme=torch.per_channel_affine_float_qparams, ch_axis=0)
        obs(weights)
        qparams = obs.calculate_qparams()
        qweight = torch.quantize_per_channel(weights, qparams[0], qparams[1], axis=0, dtype=torch.quint8)

        # Defining the EmbeddingBag layer
        self.qembedding_bag = torch.ao.nn.quantized.EmbeddingBag(num_embeddings, embedding_dim, _weight=qweight,
                                                                 mode='sum', include_last_offset=True, dtype=torch.quint8)

    def forward(self, input, offsets):
        # Forward pass through the EmbeddingBag layer
        result = self.qembedding_bag(input, offsets, per_sample_weights=None)
        return result

num_embeddings = 40000000
embedding_dim = 128

model = SimpleEmbeddingBagModel(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
model.eval()

multi_hot = 100
batch_size = 400

input_tensor = torch.randint(0, num_embeddings, (batch_size * multi_hot,), dtype=torch.long)

offsets = torch.tensor(range(0, batch_size * multi_hot + 1, multi_hot))

with torch.no_grad():
    # warm up
    _ = model(input_tensor, offsets)

    with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
        for i in range(100):
            _ = model(input_tensor, offsets)
    print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=50))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147322
Approved by: https://github.com/malfet
2025-02-17 17:10:47 +00:00
71855a1cad Update slow tests (#147308)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147308
Approved by: https://github.com/pytorchbot
2025-02-17 12:03:40 +00:00
e8b20f6ef3 [MPS][BE] Turn exec_unary_kernel as MetalShaderLibrary method (#147299)
And delete duplicate implementations from SpecialOps and UnaryKernel.
Change the input and output argument order for SpecialOps kernels to match that of UnaryOps

Fixes https://github.com/pytorch/pytorch/issues/146770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147299
Approved by: https://github.com/dcci
ghstack dependencies: #147296, #147297
2025-02-17 08:31:24 +00:00
ae5f7fec82 [Intel GPU] Enable fp64 GEMM (#140677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140677
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/desertfire
2025-02-17 08:15:55 +00:00
2b30e94fc0 [BE] Make exec_unary_kernel take TensorIterator as argument (#147297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147297
Approved by: https://github.com/dcci
ghstack dependencies: #147296
2025-02-17 07:34:35 +00:00
3d251e6512 [BE] Switch all structured funcs to stubs (#147296)
There is no need for a separate foobar_out_mps when registering a dispatch to foobar_stub will do

And this makes `exec_unary_kernel` defined in UnaryKernel.mm and
SpecialOps.mm look very similar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147296
Approved by: https://github.com/dcci
2025-02-17 07:34:34 +00:00
424c1b82e0 [Inductor][CPP] Add the legalize low fp support for index expr (#147298)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/147279. The test case produced a low-precision floating-point value using `ops.index_expr`, but the CPP backend did not handle its legalization. This PR adds support for it.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_low_fp_index_expr_issue_147279
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147298
Approved by: https://github.com/jgong5
2025-02-17 07:11:20 +00:00
359165734b [executorch hash update] update the pinned executorch hash (#147294)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147294
Approved by: https://github.com/pytorchbot
2025-02-17 05:03:05 +00:00
ae351d4d0e [Intel GPU] allow_tf32 for oneDNN backend - XPU part (#137570)
# Motivation
Add the context variable `torch.backends.mkldnn.allow_tf32` to control tf32 computation in convolution kernels on the XPU side. The tf32 data type is beneficial for improving the performance of deep learning workloads during training/inference. This PR uses the [oneDNN API fpmath_mode](https://oneapi-src.github.io/oneDNN/dev_guide_attributes_fpmath_mode.html#the-floating-point-math-mode-attribute) to trigger tf32 acceleration in convolution kernels.
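
A short usage sketch of the new context variable (the attribute name follows this PR's description; device and shapes are illustrative):

```python
# Toggle TF32 math for oneDNN convolutions on XPU around a workload.
import torch

torch.backends.mkldnn.allow_tf32 = True
conv = torch.nn.Conv2d(16, 33, kernel_size=3, stride=2).to("xpu")
x = torch.randn(20, 16, 50, 100, device="xpu")
y = conv(x)                                   # runs with fpmath_mode=tf32
torch.backends.mkldnn.allow_tf32 = False      # back to strict fp32 math
```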

# Validation
* UT to test the context variable
`python test/xpu/test_conv.py -k test_mkldnn_allow_tf32_get_set`

* Runtime exemplification
```
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.649902
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,forward_training,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.151855
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_data,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_undef::undef::: dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.167969
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic33oc33_ih24oh24kh3sh1dh0ph1_iw49ow49kw3sw1dw0pw1,0.26709
onednn_verbose,primitive,exec,gpu:0,convolution,jit:ir,backward_weights,src_f32::blocked:abcd::f0 wei_f32::blocked:abcd::f0 bia_f32::blocked:a::f0 dst_f32::blocked:abcd::f0,attr-scratchpad:user attr-fpmath:tf32,alg:convolution_direct,mb20_ic16oc33_ih50oh24kh3sh2dh0ph0_iw100ow49kw3sw2dw0pw0,0.219971

```
According to the `fpmath:tf32` field in the verbose output, we can see that the current context-setting utilities successfully trigger tf32 computation in conv forward/backward_data/backward_weights kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137570
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2025-02-17 01:46:43 +00:00
198ffbdf11 [MPS] Implement and test round.decimals (#147266)
If inductor can do it, why not eager
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147266
Approved by: https://github.com/Skylion007
ghstack dependencies: #147286
2025-02-16 23:17:13 +00:00
e738f7ba23 [BE]: Enable ruff rule SIM113 (#147290)
Lint rule that tells the user to avoid keeping track of their own counter and to use the builtin enumerate when possible.
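
An example of the pattern SIM113 flags (illustrative, not taken from the PyTorch diff):

```python
items = ["a", "b", "c"]

# Flagged: manually maintained counter.
idx = 0
for item in items:
    print(idx, item)
    idx += 1

# Preferred: let enumerate track the index.
for idx, item in enumerate(items):
    print(idx, item)
```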

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147290
Approved by: https://github.com/jansel
2025-02-16 22:41:16 +00:00
a8fa4bcfd2 [StaticRuntime] Support a new pattern (aten::to with 5 inputs) for ClipRangesToGatherToOffsets (#147189)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%11175 : Tensor, %11176 : Tensor = fb::clip_ranges_gather(%int_66.1, %getitem_1784.1, %347)
%getattr_256.1 : int = prim::dtype(%11175)
%to_298.1 : Tensor = aten::to(%11176, %getattr_256.1, %13, %13, %12)
%lengths_to_offsets_333.1 : Tensor = fb::lengths_to_offsets(%to_298.1, %8)
```

After optimization:
```
%11199 : int = prim::dtype(%int_66.1)
%11200 : Tensor, %11201 : Tensor = fb::clip_ranges_gather_to_offsets(%int_66.1, %getitem_1784.1, %347, %8, %11199)
```

It is similar to https://github.com/pytorch/pytorch/pull/146931, but aten::to has 5 inputs instead of 4.

Differential Revision: D69627793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147189
Approved by: https://github.com/hanyilou123
2025-02-16 22:16:02 +00:00
5c0c99f658 [MPS][BE] Use stubs for floor/ceil/round/trunc (#147286)
To avoid duplicating the logic that those ops are no-ops for integral dtypes
(and in preparation for adding `round_decimals`, which calls round_stub when decimals is 0)

Tested for the corner cases by manually invoking `round`, `trunc`, `floor` and `ceil` for int dtypes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147286
Approved by: https://github.com/Skylion007
2025-02-16 17:22:49 +00:00
d27ecf85db xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for sycl kernels built via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and the (new) `SyclExtension` class APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code, which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing, and cmake may eventually introduce its own file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which pytorch's native aten SYCL kernels are compiled (at the moment `pvc,xe-lpg`). This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the desired devices to compile for.
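
A hedged usage sketch based on the description above (the file names are hypothetical):

```python
# Build a single extension that mixes a regular C++ source with a .sycl source;
# the .sycl file is compiled with icpx and the targets can be narrowed via
# TORCH_XPU_ARCH_LIST.
import os
from torch.utils.cpp_extension import load

os.environ.setdefault("TORCH_XPU_ARCH_LIST", "pvc")  # optional device filter

ext = load(
    name="my_sycl_ext",
    sources=["my_ops.cpp", "my_kernel.sycl"],  # hypothetical file names
    verbose=True,
)
```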

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-16 16:50:59 +00:00
dd5d0ea6bb Revert "xpu: support sycl with torch.utils.cpp_extension APIs (#132945)"
This reverts commit 607379960bc5093a1fe51ff72c3e0fd39ac126ab.

Reverted https://github.com/pytorch/pytorch/pull/132945 on behalf of https://github.com/malfet due to It just broke all the tests, see b16ae97ad0/1 ([comment](https://github.com/pytorch/pytorch/pull/132945#issuecomment-2661498747))
2025-02-16 16:03:42 +00:00
b16ae97ad0 Generalize mixed precision in DDP (#146808)
**Motivation:**

1. Generalize mixed precision in DDP.
2. Enable `SyncBatchNorm` for XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146808
Approved by: https://github.com/guangyey, https://github.com/gujinghui, https://github.com/wconstab
2025-02-16 11:59:40 +00:00
ee38a32c55 [Dynamo] support isinstance(...) check for type tuple (#146984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146984
Approved by: https://github.com/jansel
2025-02-16 10:41:49 +00:00
607379960b xpu: support sycl with torch.utils.cpp_extension APIs (#132945)
This patch adds support for sycl kernels built via `torch.utils.cpp_extension.load`, `torch.utils.cpp_extension.load_inline` and the (new) `SyclExtension` class APIs. Files with the `.sycl` extension are considered to contain sycl kernels and are compiled with `icpx` (the dpc++ sycl compiler from Intel). Files with other extensions, `.cpp`, `.cu`, are handled as before. The API supports building sycl along with other file types into a single extension.

Note that the `.sycl` file extension is a PyTorch convention for files containing sycl code, which I propose to adopt. We did follow up with the compiler team about introducing such a file extension in the compiler, but they are opposed to this. At the same time, discussion around a sycl file extension and adding sycl language support to tools such as cmake is ongoing, and cmake may eventually introduce its own file extension convention for sycl. I hope we can further influence the cmake and compiler communities to adopt the `.sycl` file extension more broadly.

By default, SYCL kernels are compiled for all Intel GPU devices for which pytorch's native aten SYCL kernels are compiled (at the moment `pvc,xe-lpg`). This behavior can be overridden by setting the `TORCH_XPU_ARCH_LIST` environment variable to a comma-separated list of the desired devices to compile for.

Fixes: #132944

CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132945
Approved by: https://github.com/albanD, https://github.com/guangyey
2025-02-16 10:16:09 +00:00
ed3b119c40 Skip unsupported types by MPS in test_torchinductor.py (#147211)
- Skip unsupported dtypes in `test_split_cumsum` (and manually skip int64 for MacOS13)
- Adapt `test_cat` to use `torch.half` instead of `torch.double` on MPS
- Skip `test_adaptive_avg_pool1d_argmax` if avgpool is not implemented for all sizes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147211
Approved by: https://github.com/jansel, https://github.com/Skylion007, https://github.com/dcci
2025-02-16 10:15:53 +00:00
0fb5b224b7 [DCP] Cache save plans: planner helpers and interface updates (#147116)
Summary:
This PR updates the planner interface and introduces the class variables to cache the local and global plans.
Two new helpers are also introduced which will be used to compare if the plans have changed across save attempts and merge the delta plans.

Test Plan: UTs

Differential Revision: D69224488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147116
Approved by: https://github.com/MeetVadakkanchery, https://github.com/huydhn
2025-02-16 07:18:26 +00:00
4bacd13c92 [executorch hash update] update the pinned executorch hash (#147273)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147273
Approved by: https://github.com/pytorchbot
2025-02-16 05:11:33 +00:00
8f20026bcb [Intel GPU] Support SparseCsrXPU codegen (#144722)
Add a new dispatch key, `SparseCsrXPU`, to enable Intel GPU support for the SparseCsr tensor layout.

Similar PR: https://github.com/pytorch/pytorch/pull/139267
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144722
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/albanD

Co-authored-by: Kanya-Mo <kanya.mo@intel.com>
2025-02-16 03:16:12 +00:00
1677a31019 [Inductor] Fix 3D tiling with permute (#147249)
This PR adds a test case and tiny fix for 3D tiling. Before this PR, tiling would crash because one of the candidates lacked a `"y"` dimension. Now, when we're calculating 3D tiling candidates, we assume the y size is 1 if it's missing.

The test case implements a 3D permute using block pointers.

```
@triton.jit
def triton_poi_fused_add_0(in_ptr0, out_ptr0, znumel, ynumel, xnumel, ZBLOCK : tl.constexpr, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    znumel = 51
    ynumel = 51
    xnumel = 51
    zoffset = tl.program_id(2) * ZBLOCK
    zindex = zoffset + tl.arange(0, ZBLOCK)[None, None, :]
    zmask = zindex < znumel
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :, None]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None, None]
    xmask = xindex < xnumel
    x2 = xindex
    y1 = yindex
    z0 = zindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr0, shape=[51, 51, 51], strides=[51, 1, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), boundary_check=[0, 1, 2])
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 + tmp0
    tmp4 = tmp2 + tmp3
    tl.store(tl.make_block_ptr(out_ptr0, shape=[51, 51, 51], strides=[1, 51, 2601], block_shape=[XBLOCK, YBLOCK, ZBLOCK], order=[2, 1, 0], offsets=[xoffset, yoffset, zoffset]), tl.broadcast_to(tmp4, [XBLOCK, YBLOCK, ZBLOCK]).to(tl.float32), boundary_check=[0, 1, 2])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147249
Approved by: https://github.com/jansel
2025-02-15 23:28:36 +00:00
44ee9ca593 [inductor] Add type annotations to _inductor/utils.py (#144108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144108
Approved by: https://github.com/eellison
2025-02-15 23:13:41 +00:00
4ab967c44d all reduce non strict (#147133)
Summary:
Some distributed collectives like `all_reduce` have special handling in Dynamo, where they are mapped to functional collectives. Non-strict was previously blind to such mappings, which means using them would fail to trace. Here we show how intercepting them in non-strict's torch function mode can mimic this remapping logic. More ops to follow.

Side note: a recently added distributed test was in the wrong place, making the expected failures for non-strict not fire because we weren't actually generating those tests to begin with! Now fixed.

Test Plan: moved and updated test

Differential Revision: D69607140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147133
Approved by: https://github.com/tugsbayasgalan
2025-02-15 19:37:08 +00:00
75a4b73816 utils: Update md5 call to be fips compliant (#147252)
Updates the md5 call to be FIPS compliant according to this issue:
* https://github.com/pytorch/pytorch/issues/147236

Not going to add a conditional here because the minimum Python version
that we support is already 3.9.
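
A hedged illustration of the FIPS-friendly call (Python >= 3.9):

```python
# Marking the digest as not used for security lets FIPS-enabled builds accept md5
# for non-cryptographic purposes such as cache keys or fingerprints.
import hashlib

def content_fingerprint(data: bytes) -> str:
    return hashlib.md5(data, usedforsecurity=False).hexdigest()

print(content_fingerprint(b"example payload"))
```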

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147252
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/malfet
2025-02-15 15:19:08 +00:00
6ca5c22e31 Revert "Enable fp16 linear layers in PyTorch via ACL (#144992)"
This reverts commit 5b37249259ad50d9b4b32a78a5b5178a1eb3d110.

Reverted https://github.com/pytorch/pytorch/pull/144992 on behalf of https://github.com/nikhil-arm due to Accuracy Test failures ([comment](https://github.com/pytorch/pytorch/pull/144992#issuecomment-2660902238))
2025-02-15 12:40:59 +00:00
86be5d4421 remove unnecessary xpu availability check when retrieving aot flags (#146966)
As title

Retrieving the XPU AOT flags that the pytorch binary was compiled against is not the same as running the binary itself, so there is no need to check whether an XPU environment is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146966
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/dvrogozh, https://github.com/albanD
2025-02-15 09:15:49 +00:00
9e0b3e9b6c [Inductor] Fix Inplace Buffer inner name conflict (#147199)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/146975. When creating the `InplacedBuffer` inner name, we only counted the number of unique `InplacedBuffer` or `RemovedArg` entries, so names could conflict, as reported in this issue:

```
---- make inplace create, input_name is: buf22; output_name is: buf27; buf.inner_name is: in_out_ptr2
dict_values([
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf6', 'buf11']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26']),
InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf24', 'buf26'])])

---- make inplace create, input_name is: buf0; output_name is: buf3; buf.inner_name is: in_out_ptr2
dict_values([
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
<torch._inductor.codegen.common.RemovedArg object at 0x7fbf75516350>,
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33']),
InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf22', 'buf27', 'buf31', 'buf33'])
])
```

- The first time `in_out_ptr2` is created, there are 2 unique `InplacedBuffer` entries

- The second time `in_out_ptr2` is created, there is 1 `RemovedArg` and 1 unique `InplacedBuffer`

They are 2 different `InplacedBuffer`s, but with the same name `in_out_ptr2`. In this PR, we fix this regression by also counting the number of `RemovedArg` entries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147199
Approved by: https://github.com/jansel
2025-02-15 08:31:06 +00:00
a30f145101 [inductor] Don't leak pointers to cpp_wrapper with lru_cache (#147233)
Putting lru_cache on methods will keep pointers to the `self` objects
alive forever and leak memory.
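
A generic Python illustration of the leak and one common fix (not the Inductor code itself):

```python
import functools

class Leaky:
    @functools.lru_cache(maxsize=None)  # the cache keys on `self`, keeping every instance alive
    def expensive(self, x):
        return x * 2

# Fix: cache a free function that does not capture the instance.
@functools.lru_cache(maxsize=None)
def _expensive(x):
    return x * 2

class NotLeaky:
    def expensive(self, x):
        return _expensive(x)
```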

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147233
Approved by: https://github.com/yanboliang
2025-02-15 08:25:41 +00:00
9dc702875d [dynamo][mappingproxy][inspect] Support existing types.MappingProxyType (#147217)
Fixes https://github.com/pytorch/pytorch/issues/147162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147217
Approved by: https://github.com/williamwen42
2025-02-15 07:59:33 +00:00
cyy
8daa742e8b Remove code for Python < 3.9 (#147181)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147181
Approved by: https://github.com/albanD
2025-02-15 06:43:26 +00:00
9919375cf1 [executorch hash update] update the pinned executorch hash (#147241)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147241
Approved by: https://github.com/pytorchbot
2025-02-15 05:02:22 +00:00
cyy
8f291e8c00 Fix clang-tidy warnings in torch/jit (#146963)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146963
Approved by: https://github.com/davidberard98
2025-02-15 03:36:59 +00:00
4233a77960 update kineto submodule to include fix for windows build (#147195)
Fixes an issue causing windows builds to fail
https://github.com/pytorch/kineto/pull/1039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147195
Approved by: https://github.com/cyyever, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-15 01:53:16 +00:00
c1fcba3648 [Inductor] Fix the lowering of squeeze when input is not contiguous (#146746)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/143498. The issue happens when we lower `select = torch.ops.aten.select.int(cat, 1, 0)`.

For example, when `cat` is contiguous with size [2, 2] and stride [2, 1]:

- for eager, it returns a view with size [2] and stride [2]
- for the Inductor lowering, it returns the wrong stride 1 instead of 2
```
TensorBox(
  ReinterpretView(
    StorageBox(
      ConcatKernel(name='buf10', layout=FixedLayout('cpu', torch.int64, size=[u0, 2], stride=[2, 1]), inputs=[ComputedBuffer(name='buf8', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b856449d0>, ranges=[u0, 1])), ComputedBuffer(name='buf9', layout=NonOwningLayout('cpu', torch.int64, size=[u0, 1], stride=[2, 1]), data=Pointwise(device=device(type='cpu'), dtype=torch.int64, inner_fn=<function ReinterpretView.make_loader.<locals>.loader at 0x7f6b85644790>, ranges=[u0, 1]))])
    ),
    FixedLayout('cpu', torch.int64, size=[u0], stride=[**1**]),
    origins=OrderedSet([select])
  )
)
```

To fix this issue, we produce the right stride in the lowering of `squeeze`.
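
For reference, a quick check of the eager-mode semantics the lowering must match (standard PyTorch API):

```python
import torch

cat = torch.randn(2, 2)             # contiguous: size [2, 2], stride [2, 1]
view = cat.select(1, 0)             # same op as torch.ops.aten.select.int(cat, 1, 0)
print(view.size(), view.stride())   # torch.Size([2]) (2,) -- stride 2, not 1
```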

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_unbacked_symints.py -k test_issue_143498
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146746
Approved by: https://github.com/jgong5, https://github.com/sanchitintel, https://github.com/eellison
2025-02-15 01:33:04 +00:00
bf0c89a72f [dynamo] fix error message when logging graph that contains hops (#147227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147227
Approved by: https://github.com/zou3519
2025-02-15 00:53:44 +00:00
933f921b36 [PT][FSDP] support custom all reduce hook across FSDP units (#147114)
This change adds an API `set_all_reduce_hook` to the `FSDPModule` to
support customized all reduce either in native HSDP (2d mesh) setup or custom HSDP (1d FSDP + custom AR across replicas)
* For native HSDP, the original AR would still run as is and this hook allows for additional gradient modification post all reduce.
* For custom HSDP, the original AR will be skipped and all the logic is instead expected to be executed in the hook.

The custom hook is expected to perform operations in place (no return value).

Example basic usage:
```
model = ...
fully_shard(model, mesh=...)
model.set_all_reduce_hook(my_hook)
```

By default, the hook will run in the default all reduce stream post reduce scatter.
When native HSDP is NOT enabled, the custom hook can be specified to run in a custom stream. This custom stream will also be synchronized post reduce scatter similarly. See tests for examples.

Test Plan: CI

Differential Revision: D68255583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147114
Approved by: https://github.com/awgu
2025-02-15 00:38:00 +00:00
a9ae3340ca Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-02-15 00:07:33 +00:00
49727bbc9d Turn on prologue fusion (#147008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147008
Approved by: https://github.com/masnesral
2025-02-14 23:36:21 +00:00
76f57e184a [dynamo] Make SliceVariable a subclass of VariableTracker (#147046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147046
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819, #146995
2025-02-14 23:22:27 +00:00
a5c0dab900 [AOTInductor] Guard RAII_cpuMalloc with macro (#147150)
Summary: Silence RAII_cpuMalloc(size_t) defined but not used [-Wunused-function]

Test Plan: Existing tests

Differential Revision: D69623481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150
Approved by: https://github.com/henrylhtsang
2025-02-14 23:21:35 +00:00
1224765286 [cond] make cond call fake kernel in dynamo (#147045)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147045
Approved by: https://github.com/zou3519
ghstack dependencies: #146954
2025-02-14 23:13:15 +00:00
85a82c5bc8 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-14 23:13:14 +00:00
eecee5863e Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use pinned version of NCCL rather then submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 21:23:19 +00:00
d38db94689 [inductor][refactor] Move _compile_file to cpp_builder (#147202)
Summary: To further conslidate cpp build logic into cpp_builder

Test Plan: CI

Differential Revision: D69595327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147202
Approved by: https://github.com/yushangdi
2025-02-14 21:02:30 +00:00
dd86491b35 [cutlass backend][BE] refactor tests to remove duplicate logic (#146743)
Doing many things here:
* remove duplicate hip checking logic
* check for CUDA in setup
* remove CUTLASS_DIR setting. That is not needed when building from source and fbcode anymore
* fix some typing errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146743
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78
2025-02-14 20:50:27 +00:00
6f035d8462 [torch] Make amdsmi cdll hook private (#147207)
Summary: https://github.com/pytorch/pytorch/actions/runs/13314282597/job/37186177974 yelled at me for landing a seemingly public API that's not exported. It's a private API, so let's prepend `_` to make that clear

Test Plan: CI

Differential Revision: D69665234

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147207
Approved by: https://github.com/PaulZhang12
2025-02-14 20:30:48 +00:00
272ead7b5e Make fx.node.map_arg() and .map_aggregate() generic (#146248)
## What's the problem?

The popular `fx.node.map_arg()` and `fx.node.map_aggregate()` apply operations recursively on `dict`s, `tuples`, `list`s, etc, and return a new collection of the same type.

Unfortunately, their base input type is `Argument`, which is [very unspecific indeed](5d55a6585d/torch/fx/node.py (L48-L58)): most type information is just thrown away at the call site of either of these functions, as far as the type checker goes.

As `torch` moves to a more typed code base, this would force innocent, unsuspecting developers to add logically unnecessary casts or `# type: ignore` statements.

## What's the solution?

Making these two `node.map_*` functions generic on the first argument and return type means that type information is preserved for the type checker. (The signature of the other parameter, the function that visits the nodes and subnodes, has not changed, nor should it.)
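
A small usage example of the now-generic helpers (standard torch.fx API; the traced function is trivial and purely illustrative):

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.node import map_arg

def f(x):
    return x + 1

gm = symbolic_trace(f)
output_node = list(gm.graph.nodes)[-1]

# map_arg walks nested tuples/lists/dicts and applies the function to each Node;
# with the generic signature, the container's static type is preserved for the
# type checker instead of collapsing to the broad Argument alias.
names = map_arg(output_node.args, lambda n: n.name)
print(names)
```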

## Won't it break everything?

It doesn't break the type checker - one place needed an extra hint.

There have been code breakages; one has been resolved, and at least one new one has appeared... we'll see!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146248
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2025-02-14 19:25:32 +00:00
58f654b5ad [ONNX] Consolidate constants to a single location (#147166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147166
Approved by: https://github.com/titaiwangms
ghstack dependencies: #147164, #147165
2025-02-14 19:08:19 +00:00
765bc30ab9 [ONNX] Set warning stacklevel so it appears at the torch.onnx call site (#147165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147165
Approved by: https://github.com/Skylion007
ghstack dependencies: #147164
2025-02-14 19:04:43 +00:00
9a1eac6704 [ONNX] Handle number of outputs in builder (#147164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147164
Approved by: https://github.com/titaiwangms
2025-02-14 19:03:51 +00:00
5517eb4398 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 260b21b8bca6edd3e0b89b800d6efa8243f0d122.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to let me resubmit  ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2660053270))
2025-02-14 18:58:18 +00:00
aac5d1a289 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))
2025-02-14 18:31:54 +00:00
20a9938069 try print stacktrace for error (#147061)
Differential Revision: D69573525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147061
Approved by: https://github.com/Skylion007
2025-02-14 18:28:03 +00:00
8b5ee275fb [MPS] Fix cholesky_ex for empty inputs (#147159)
By making sure that `info` is actually initialized if the input is empty (but there is no need to do anything about `out`, as it's guaranteed to be an empty tensor).

Also move the output resizing logic before the `input.numel()` check.
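
A quick check of the corner case (shown with the generic API; the expected empty-input behavior here is an assumption, and the fix itself targets MPS):

```python
import torch

A = torch.empty(0, 0)
L, info = torch.linalg.cholesky_ex(A)
print(L.shape, info)  # expected: torch.Size([0, 0]) and an initialized zero `info`
```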

Fixes https://github.com/pytorch/pytorch/issues/147128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147159
Approved by: https://github.com/albanD
2025-02-14 17:44:08 +00:00
0d16188c06 [CI] Use job name to index into test times json (#147154)
When the test times are generated, the generator doesn't know what the build environment is because it's an environment variable. But when we index into the test times, we (previously) didn't know what the job name was. These are usually the same, but when they differ we end up using the default, which can result in unbalanced sharding.

The job name was added to most of the CI environments at some point, but I didn't realize it, so we can now update this code to use the job name instead, so that the generation and the indexing match.

Also upload the stats workflow for mps.

Checked that inductor_amx doesn't use the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147154
Approved by: https://github.com/huydhn
2025-02-14 17:06:56 +00:00
e8fbc86de0 Make torch.cuda.gds APIs public (#147120)
Follow up to https://github.com/pytorch/pytorch/pull/145748 that turned USE_CUFILE on for CUDA 12.6 and 12.8 binaries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147120
Approved by: https://github.com/albanD
2025-02-14 17:06:50 +00:00
c3853d924f Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor the bulk of mm_common to allow for better device-specific specialisation; e.g. in https://github.com/pytorch/pytorch/pull/143286 we require heavy conditionalisation to get ROCm-specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class; in the future we plan to bring in the flex attention configurations as well, so that all of the tuning config logic for templated triton kernels is handled in this file.
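
A hedged structural sketch of the approach (class names follow the list above; the config fields and values are assumptions, not the real defaults):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BaseConfigHeuristic:
    # (BLOCK_M, BLOCK_N, BLOCK_K, num_stages, num_warps) candidates for mm templates
    mm_configs: List[Tuple[int, int, int, int, int]] = field(
        default_factory=lambda: [(64, 64, 32, 2, 4), (128, 64, 32, 3, 4)]
    )

    def get_mm_configs(self) -> List[Tuple[int, int, int, int, int]]:
        return list(self.mm_configs)

class CUDAConfigHeuristic(BaseConfigHeuristic):
    pass  # NVIDIA keeps the shared defaults

class ROCmConfigHeuristic(BaseConfigHeuristic):
    def get_mm_configs(self) -> List[Tuple[int, int, int, int, int]]:
        # Device-specific tweaks live here instead of if/else branches in mm_common.
        return [(128, 128, 32, 2, 8)] + super().get_mm_configs()
```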

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-14 17:01:06 +00:00
e06ee4aa9f Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)"
This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6.

Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))
2025-02-14 16:44:46 +00:00
059dfe2081 Revert "update kineto submodule (#147015)"
This reverts commit d1997b610f5b974af7ebad6b9903d2d8f751d927.

Reverted https://github.com/pytorch/pytorch/pull/147015 on behalf of https://github.com/atalman due to broke windows builds ([comment](https://github.com/pytorch/pytorch/pull/147015#issuecomment-2659730304))
2025-02-14 16:11:08 +00:00
06f4a5c0e5 Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073)
Should resolve: https://github.com/pytorch/pytorch/issues/144768
We use one common nccl version for cuda builds 12.4-12.8 : ``NCCL_VERSION=v2.25.1-1``
For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1``
We use pinned version of NCCL rather then submodule.
Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj
2025-02-14 15:29:59 +00:00
cefd9805de Add RAISE_VARARGS 0 (#146493)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146493
Approved by: https://github.com/zou3519
ghstack dependencies: #146498, #146492
2025-02-14 13:37:23 +00:00
134723ee1c Add WITH_EXCEPT_START opcode (#146492)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146492
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #146498
2025-02-14 13:37:23 +00:00
dbb86b78ad Add sys.exc_info and sys.exception (#146498)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146498
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-02-14 13:37:14 +00:00
ea188ac0c7 [export] Add meta for aten.bincount (#147129)
Fixes https://github.com/pytorch/pytorch/issues/147094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147129
Approved by: https://github.com/pianpwk
2025-02-14 10:33:54 +00:00
de26ddfbdc Update torch-xpu-ops commit pin (#146671)
Update the torch-xpu-ops commit to [80c375570e2b6b2989a8610da1871f8a50dfddc7](80c375570e), includes:

- Aten operator coverage improvement
- SYCL kernel optimization
- Nested Tensor OPs support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146671
Approved by: https://github.com/EikanWang
2025-02-14 09:30:36 +00:00
bd019c0bb4 [Inductor][CPP] Fix node name for wgt delete (#147056)
**Summary**
This is a regression issue caused by a change in the FX node name. In commit 71010bf0972834e35a155e6a187e5c6649a5a36b, both the node name and target for the `get_attr` node in `V.graph.graph.nodes` were `_frozen_param2`. However, in the latest main, the node name has changed to `_reorder_linear_weight`. This PR fixes the regression by using the node's target instead of its name.
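
For illustration, a minimal sketch of the difference (the helper below is hypothetical, not the pass's actual code):

```python
# Looking up the frozen-weight get_attr node by target rather than by name:
# the node's name can be rewritten by later passes (e.g. to
# "_reorder_linear_weight"), but node.target still names the attribute.
def find_frozen_weight_node(gm, attr_name="_frozen_param2"):
    for node in gm.graph.nodes:
        if node.op == "get_attr" and node.target == attr_name:
            return node  # robust to renames
        # fragile alternative that this PR moves away from:
        # if node.name == attr_name: ...
    return None
```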

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_weight_prune
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147056
Approved by: https://github.com/jgong5
2025-02-14 06:27:41 +00:00
10bc8f25b2 [MPS][BE] Migrate polar to use functor (#147184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147184
Approved by: https://github.com/dcci
ghstack dependencies: #147182, #147183
2025-02-14 06:25:36 +00:00
278ffd84fc [MPS][BE] Add copysign integral flavors as functor (#147183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147183
Approved by: https://github.com/dcci
ghstack dependencies: #147182
2025-02-14 06:25:36 +00:00
2ef51cfb9d [BE][MPS] Infer results of functor (#147182)
Do not assume that the functor will return the same result type as its arguments, but rather dynamically infer it using `decltype` and `::metal::declval`
This is a no-op that prepares for the migration of `copysign` with integral arguments, which would return a float
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147182
Approved by: https://github.com/dcci
2025-02-14 06:25:27 +00:00
331d5cf560 [inductor] [cpp] Support vectorization for score and mask in FlexAttention CPU (#143638)
## Description
We generate vectorized kernel for score and mask in FlexAttention with this PR.

## Modification
The main change include:
- For the input and output buffers of the mask and score functions, instead of passing scalars, we pass tensors to them.
- For the mask function, the original function which works on a scalar only includes the logic of calculating the mask value. This PR adds the logic of applying the mask to the qk_data tensor into the graph and then leverages the CPP backend to generate vectorized kernels.
  The original mask graph:
  ```python
  def mask_fn(b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      return mask
  ```
  The converted_mask_graph should be:
  ```python
  def converted_mask_fn(qk_data, b, h, q_idx, kv_idx):
      mask = q_idx >= kv_idx
      qk_data = torch.where(mask, qk_data, torch.full_like(qk_data, -float("inf")))
      return qk_data
  ```

## Benchmark
For q, k, v of shape `[1, 32, 1024, 128]`, using 40 CPU cores, we observe over a 20x speedup compared with the non-vectorized version for both `is_causal` = `False` and `True`.

## Test plan
The existing FlexAttention UTs (`test/inductor/test_flex_attention.py`, `test/inductor/test_flex_decoding.py`) can cover the change in this PR.

## Output code

**Code before this PR (scalar version):**

```cpp
// apply score mod function
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* in_ptr0 = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    accum_t* out_ptr0 = in_ptr0;
    {
        {
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0L)];
                out_ptr0[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    }
}
// Apply block mask, fill unused with -inf
for (int64_t row = 0; row < cur_qSplitSize; ++row) {
    for (int64_t col = 0; col < cur_kvSplitSize; col++) {
    std::vector<int64_t> b_idx = {i};
    std::vector<int64_t> h_idx = {j};
    std::vector<int64_t> q_idx = {m+row};
    int64_t phisical_kv_idx = n+col;
    if (use_kv_indice) {
        phisical_kv_idx= *kv_logical_data * kvBlockSize + col;
    }
    std::vector<int64_t> kv_idx = {phisical_kv_idx};
    accum_t* qk_block = qk_data + row * cur_kvSplitSize + col;
    auto in_ptr1 = b_idx.data();
    auto in_ptr2 = h_idx.data();
    auto in_ptr3 = q_idx.data();
    auto in_ptr4 = kv_idx.data();

    std::vector<int64_t> temp = {0};
    int64_t* out_ptr1 = temp.data();
    {
        {
            {
                auto tmp0 = static_cast<bool>(true);
                out_ptr1[static_cast<int64_t>(0L)] = tmp0;
            }
        }
    }

    *qk_block = *out_ptr1 != 0
                    ? *qk_block
                    : -std::numeric_limits<accum_t>::infinity();
    }
}
```

**Code after this PR (vectorized):**
```cpp
accum_t* in_ptr0 = qk_data;

auto in_ptr1 = b_idx.data();
auto in_ptr2 = h_idx.data();
auto in_ptr3 = q_idx.data();
auto in_ptr4 = kv_idx.data();

// apply score mod function
{

    accum_t* out_ptr0 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                        tmp0.store(out_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(cur_kvSplitSize + ((-16L)*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))));
                    }
                }
            }
        }
    }

}

// Apply block mask, fill unused with -inf
{

    accum_t* out_ptr1 = in_ptr0;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(cur_qSplitSize); x0+=static_cast<int64_t>(1L))
        {
            for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(cur_kvSplitSize); x1+=static_cast<int64_t>(16L))
            {
                {
                    if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))))))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0), static_cast<int64_t>(16));
                        auto tmp1 = static_cast<bool>(true);
                        auto tmp2 = -std::numeric_limits<float>::infinity();
                        auto tmp3 = at::vec::VecMask<float,1>::from(tmp1);
                        auto tmp4 = at::vec::Vectorized<float>(tmp2);
                        auto tmp5 = decltype(tmp0)::blendv(tmp4, tmp0, tmp3.template cast<float,1>());
                        tmp5.store(out_ptr1 + static_cast<int64_t>(x1 + cur_kvSplitSize*x0));
                    }
                    if(C10_UNLIKELY(x1 >= static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L)))) && x1 < static_cast<int64_t>(cur_kvSplitSize)))
                    {
                        for (int64_t x1_tail = static_cast<int64_t>(16L*(c10::div_floor_integer(static_cast<int64_t>(cur_kvSplitSize), static_cast<int64_t>(16L))));x1_tail < static_cast<int64_t>(cur_kvSplitSize); x1_tail++)
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)];
                            auto tmp1 = static_cast<bool>(true);
                            auto tmp2 = -std::numeric_limits<float>::infinity();
                            auto tmp3 = tmp1 ? tmp0 : tmp2;
                            out_ptr1[static_cast<int64_t>(x1_tail + cur_kvSplitSize*x0)] = tmp3;
                        }
                    }
                }
            }
        }
    }

}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143638
Approved by: https://github.com/jgong5, https://github.com/drisspg, https://github.com/leslie-fang-intel
2025-02-14 05:26:18 +00:00
ce38bfd299 [executorch hash update] update the pinned executorch hash (#147157)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147157
Approved by: https://github.com/pytorchbot
2025-02-14 05:04:17 +00:00
92f669e39c [BE] Use c10::multiply_integers in cholesky_impl (#147163)
That replaces an explicit for loop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147163
Approved by: https://github.com/huydhn
2025-02-14 03:59:17 +00:00
2d089a5697 [dynamo] Remove unintended lru_cache (#147147)
I forgot to remove it while adding the frozenset __contains__ method in this PR
- https://github.com/pytorch/pytorch/pull/146062?fbclid=IwZXh0bgNhZW0CMTEAAR3S_qq8bYxO7pDuHqpr2X-vqkXQrY0KtT14z46bfuRDYikjJBet3uKF2dE_aem_o1c7I4eawKyaEsfiWhnTmw

This was causing a memory leak
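
For context, `functools.lru_cache` applied to an instance method keeps a strong reference to `self` inside its global cache, so instances are never collected; a minimal illustration (the class below is a made-up stand-in, not the actual dynamo variable tracker):

```python
import functools
import gc
import weakref


class FrozensetTracker:  # made-up stand-in
    @functools.lru_cache(maxsize=None)  # cache key includes `self` -> leak
    def call_contains(self, item):
        return item in {1, 2, 3}


obj = FrozensetTracker()
obj.call_contains(1)
ref = weakref.ref(obj)
del obj
gc.collect()
print(ref() is None)  # False: the lru_cache still keeps the instance alive
```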

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147147
Approved by: https://github.com/williamwen42
2025-02-14 03:55:39 +00:00
6344ca1dd4 [BE][Ez]: Apply FURB188: use str remove(pre|suf)fix (#146997)
Since we are on 3.9, we can use this nice str builtin, which is more readable and more efficient.
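
For example (available since Python 3.9):

```python
name = "aten::add.Tensor"
print(name.removeprefix("aten::"))   # 'add.Tensor'
print(name.removesuffix(".Tensor"))  # 'aten::add'

# the pre-3.9 spelling that FURB188 replaces:
prefix = "aten::"
stripped = name[len(prefix):] if name.startswith(prefix) else name
```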

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146997
Approved by: https://github.com/XuehaiPan, https://github.com/cyyever, https://github.com/jansel
2025-02-14 03:38:07 +00:00
cyy
d473c212fd Remove code for Python < 3.9 (#147097)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147097
Approved by: https://github.com/albanD
2025-02-14 03:22:49 +00:00
880e176544 [inductor] Fix for pattern file contains 'getitem' fails during impor… (#144980)
…t of the pattern module

  For example, any pattern module that has the following pattern generated fails to import because
  the name getitem is undefined.

  native_dropout_default = CallFunction(aten.native_dropout.default, div_Tensor_1, KeywordArg('dropout_p'), True, _users=2)
  getitem = CallFunction(getitem, native_dropout_default, 0)

  This fix resolves the error.

Fixes #144674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144980
Approved by: https://github.com/eellison
2025-02-14 02:30:24 +00:00
0b84311842 [export] Generate printers/parsers for serialization enum values. (#147126)
Summary:
Generate two helper functions for enum classes in generated_serialization_types.h

printEnum: will convert enum values into strings.
parseEnum: will convert strings into enum values.

Test Plan: CI

Differential Revision: D69604850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147126
Approved by: https://github.com/yiming0416
2025-02-14 02:14:35 +00:00
05001f0459 Add Structured Tracing for Traced Graph Edge Details for AC Debugging (#146634)
Summary:
Updating the structured trace infrastructure so that we are able to output to Zoomer and have an E2E solution.

Context Doc: https://docs.google.com/document/d/1T6omIBEWVhbOiwDLSLffgQwjxiT2rQv8QvvQwXkw4fY/edit?usp=sharing

Test Plan:
### Testing Structured Log + tlparse locally

Command:
```
TORCH_TRACE=/data/users/basilwong/fbsource/fbcode/log_torch_trace buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -- mode=local_fb_fm_v4 launcher.num_workers=2
```

Torch Trace Logs (local then sent to paste): P1686419449
```
cat log_torch_trace/dedicated_log_torch_trace_rank_0_2lg012xo.log | pastry
P1686419449
```

tlparse output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

tlparse graph edge details output: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpyiv5wj/rank_1/9_0_0/joint_graph_information_397.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

Differential Revision: D61557220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146634
Approved by: https://github.com/jansel, https://github.com/Yuzhen11
2025-02-14 02:04:26 +00:00
486fc12d7e torch: Log a unified waitcounter for torch.compile and triton.autotune (#146723)
Summary: Add a second more generic waitcounter to torch.compile. We'll keep expanding this as new generic pytorch compilation sites show up.

Test Plan: Waitcounter only change, relying on existing tests.

Differential Revision: D69215401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146723
Approved by: https://github.com/davidberard98
2025-02-14 02:04:13 +00:00
f0bdc27f74 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. This PR also updates the various UTs related to FP8 to support CPU tests.
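
A rough usage sketch on CPU (argument names, required layouts, and dtype combinations below are assumptions; the private `_scaled_mm` signature has changed across releases):

```python
import torch

# FP8 inputs plus per-tensor scales; shapes chosen as multiples of 16
a = torch.randn(64, 128).to(torch.float8_e4m3fn)
b = torch.randn(128, 32).to(torch.float8_e4m3fn)
scale_a = torch.tensor(1.0)  # per-tensor scale for a
scale_b = torch.tensor(1.0)  # per-tensor scale for b

out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                       out_dtype=torch.bfloat16)
print(out.shape)  # torch.Size([64, 32])
```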

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-14 02:03:53 +00:00
c5a9e4a6a0 [Inductor][CPP] Fix a CPP GEMM Template output data type issue (#146958)
**Summary**
Issue found when fixing https://github.com/pytorch/ao/issues/1662. A FP32 GEMM with an epilogue node `to_fp16` resulted in [generated code](https://gist.github.com/leslie-fang-intel/464fb112abdb105818ae09b057350e84), which failed to compile. The root cause is that we used the slice of global buffer `Y` as the output of micro GEMM instead of a `local buffer`. However, due to the `to_fp16` epilogue node, the global buffer `Y` has a float16 data type, leading to the failure. This fix will ensure the use of a local buffer in such cases.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_linear_to_lowp_fp
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146958
Approved by: https://github.com/jgong5
2025-02-14 01:40:08 +00:00
d3524ecdd6 [Break XPU] Align meta calculation for fft_r2c with _fft_r2c_mkl (#146763)
Fix #146761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146763
Approved by: https://github.com/jansel
ghstack dependencies: #146762, #145248, #146880
2025-02-14 01:39:18 +00:00
ade5af9430 [XPU] Align XPU convolution_backward output layout between fake tensor and real output tensor. (#146880)
Fix #146879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146880
Approved by: https://github.com/EikanWang, https://github.com/jansel
ghstack dependencies: #146762, #145248
2025-02-14 01:39:18 +00:00
9befdf565a [Break XPU][Inductor UT] Set input tensors to corresponding device for test case in test_aot_indutor.py (#145248)
Fix #145247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145248
Approved by: https://github.com/desertfire, https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #146762
2025-02-14 01:39:11 +00:00
972e927134 [Break XPU][Inductor UT] Fix XPU Inductor UT failures introduced from community. (#146762)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146762
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/jansel
2025-02-14 01:38:50 +00:00
6419076db9 [torch][amdsmi] Look for amdsmi in ROCM_HOME/ROCM_PATH before using rpath (#147117)
Summary: ROCm uses ROCM_HOME/ROCM_PATH to specify which version of rocm the user wants to use. This is especially important in multi-version setups. Let's respect that behavior when loading amdsmi.

Test Plan:
CI
```
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL MSCCL_ALGO_DIR=~/2fbsource/third-party/rccl/develop/tools/msccl-algorithms RCCL_MSCCLPP_THRESHOLD=(math '128*1024*1024')  RCCL_MSCCLPP_ENABLE=1 ENABLE_MSCCLPP=1 buck2 run fbcode//mode/opt-amd-gpu -m rocm621 fbcode//accelerators/workloads/microbench:bench_comm -- --shape moe_17b --comm_algo nccl_allreduce
```

Differential Revision: D69597647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147117
Approved by: https://github.com/malfet
2025-02-14 01:11:59 +00:00
20a369aa3a [Intel GPU] Avoid copy when the input of Matmul is broadcasted (#143784)
Avoid a copy when the input of Matmul is 3D and broadcasted on the batch dim. oneDNN supports implicit broadcast semantics, i.e., src can be broadcast into weight if the corresponding dimension in src is 1 (and vice versa). On Max 1100, timm resmlp_12_224 amp_fp16 inference with bs=128 can improve from 42 ms to 13.7 ms with torch.compile and from 57.5 ms to 32 ms in eager mode.
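
For reference, the shape pattern this targets looks roughly like the following (sizes and the 1-sized batch dim are illustrative):

```python
import torch

a = torch.randn(128, 196, 384)  # batched src
b = torch.randn(1, 384, 384)    # batch dim of 1, broadcast across the 128 batches
out = torch.matmul(a, b)        # on XPU this used to trigger an explicit copy of b
print(out.shape)                # torch.Size([128, 196, 384])
```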

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143784
Approved by: https://github.com/EikanWang
2025-02-14 00:48:07 +00:00
057bcd3a45 [ca] eliminate duplicate getitem graph nodes for shape inputs (#146875)
should reuse existing proxies instead of creating new ones

before: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpL7hmHe/0_-_-_0/compiled_autograd_graph_3.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0];  getitem_3 = None
        getitem_4 = sizes[1];  getitem_4 = None
        getitem_5 = sizes[2];  getitem_5 = None
        getitem_6 = sizes[3];  getitem_6 = None
        getitem_7 = sizes[4];  getitem_7 = None
        getitem_8 = sizes[5];  getitem_8 = None
        getitem_9 = sizes[6];  getitem_9 = None
        getitem_10 = sizes[7];  getitem_10 = None
        getitem_11 = sizes[8];  getitem_11 = None
        getitem_12 = sizes[9];  getitem_12 = None
        getitem_13 = sizes[10];  getitem_13 = None
        getitem_14 = sizes[11];  getitem_14 = None
        getitem_15 = sizes[12];  getitem_15 = None
        getitem_16 = sizes[13];  getitem_16 = None
        getitem_17 = sizes[14];  getitem_17 = None
        getitem_18 = sizes[15];  getitem_18 = None
        getitem_19 = sizes[0]
        getitem_20 = sizes[1]
        getitem_21 = sizes[2]
        getitem_22 = sizes[3]
        getitem_23 = sizes[4]
        getitem_24 = sizes[5]
        getitem_25 = sizes[6]
        getitem_26 = sizes[7]
        getitem_27 = sizes[8]
        getitem_28 = sizes[9]
        getitem_29 = sizes[10]
        getitem_30 = sizes[11]
        getitem_31 = sizes[12]
        getitem_32 = sizes[13]
        getitem_33 = sizes[14]
        getitem_34 = sizes[15];  sizes = None
```

after: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpCo5T6B/0_-_-_0/compiled_autograd_graph_1.txt?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100
```python
class CompiledAutograd0(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem = inputs[0]
        getitem_1 = inputs[1]
        getitem_2 = inputs[2];  inputs = None
        getitem_3 = sizes[0]
        getitem_4 = sizes[1]
        getitem_5 = sizes[2]
        getitem_6 = sizes[3]
        getitem_7 = sizes[4]
        getitem_8 = sizes[5]
        getitem_9 = sizes[6]
        getitem_10 = sizes[7]
        getitem_11 = sizes[8]
        getitem_12 = sizes[9]
        getitem_13 = sizes[10]
        getitem_14 = sizes[11]
        getitem_15 = sizes[12]
        getitem_16 = sizes[13]
        getitem_17 = sizes[14]
        getitem_18 = sizes[15];  sizes = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146875
Approved by: https://github.com/jansel
ghstack dependencies: #146720, #146735
2025-02-13 21:41:33 +00:00
76dacd5fc7 [ca] log graph before reodering passes (#146735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146735
Approved by: https://github.com/jansel
ghstack dependencies: #146720
2025-02-13 21:41:33 +00:00
cdbf677cdd Remove outdated comment in ATen/mkl/Sparse.h about lack of Windows support (#147125)
Fixes #147124.

* #102604 added support for Intel oneMKL Sparse BLAS APIs so there was an outdated comment left around in the codebase that can now be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147125
Approved by: https://github.com/janeyx99
2025-02-13 21:34:05 +00:00
1f41ceb713 [BE][Ez]: Enable ruff rule banning print in assert (#146615)
Enables a few ruff rules
* Ban print statements within asserts (likely bugs)
* ~Use string for Decimal literal to prevent loss of precision~
* ~Do not use default args for __post__init__ in dataclasses, they likely were meant to go into the factory method, the __init__, or somewhere else. The default values are useless here.~

Wait until ruff upgrade for the last 2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146615
Approved by: https://github.com/jansel
2025-02-13 21:14:00 +00:00
5469e5c556 [export] Minor fix to locals (#146955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146955
Approved by: https://github.com/bobrenjc93
2025-02-13 20:29:15 +00:00
7b4efb492b [inductor][refactor] Make _compile_file only used for fbcode (#147106)
Summary: _compile_file in codecache.py only handles specific cpp compilation in fbcode. The next step is to consolidate it with cpp_builder.

Test Plan: CI

Differential Revision: D69592025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147106
Approved by: https://github.com/yushangdi
2025-02-13 20:22:31 +00:00
2d3db4509a fix pt2e block wise quantization test (#147035)
Differential Revision: D69559217

https://github.com/pytorch/pytorch/pull/145941 breaks the unit test added for prepare pt2e + block-wise quantization. This PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147035
Approved by: https://github.com/andrewor14
2025-02-13 19:44:56 +00:00
b0553cee6b [Utilization] post-test-process workflow (#145310)
# Overview
Add reusable workflow to trigger the post-test right after each test job is complete.

Companion PRs to set up the runner permissions:
Add m fleet instances: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595/files
Add to lix fleet: https://github.com/pytorch/ci-infra/pull/322/files

Currently I turn on the debug flag for testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145310
Approved by: https://github.com/huydhn
2025-02-13 18:51:19 +00:00
260b21b8bc [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and then filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed way less "C++ compile error". This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-13 18:36:16 +00:00
92d448ff62 Add self to CODEOWNERS for fx/proxy.py; warn against adding new node arg types (#147031)
Not sure if there's a better way

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147031
Approved by: https://github.com/StrongerXi
ghstack dependencies: #147016, #147012, #147013
2025-02-13 18:21:21 +00:00
9a883007a2 Revert "Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)"
This reverts commit c7515da7b00de40942c83dc5856b6daec727e280.

Reverted https://github.com/pytorch/pytorch/pull/140979 on behalf of https://github.com/huydhn due to This change has been reported to break internal code ([comment](https://github.com/pytorch/pytorch/pull/140979#issuecomment-2657361940))
2025-02-13 18:04:26 +00:00
65e8862b9a Revert "[cond] make cond re-dispatch in proxy mode (#146954)"
This reverts commit 2ce6de2415fb6592dd4447ebea334fd12b8c31ea.

Reverted https://github.com/pytorch/pytorch/pull/146954 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I need to revert it to cleanly revert 140979 ([comment](https://github.com/pytorch/pytorch/pull/146954#issuecomment-2657357742))
2025-02-13 18:02:33 +00:00
1f8ff6812d [Fix]: Disable KleidiAI if unsupported gcc/clang compiler is detected (#146836)
Fixes: https://github.com/pytorch/pytorch/issues/146740

Description:
1. KleidiAI officially supports GCC>=11 and Clang>=11. Certain hardware features are tied to compiler version and KleidiAI compilation will fail in such cases.

Change-Id: Ib43d6b5bf66ef5ea48c481a2774801c573ec205c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146836
Approved by: https://github.com/malfet
2025-02-13 17:49:26 +00:00
447a142de2 support input mutations on tangents in compile (#141131)
Fixes https://github.com/pytorch/pytorch/issues/141111. We previously supported mutations on saved activations that happened in the backward. This PR extends the support to tangents

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141131
Approved by: https://github.com/zou3519
2025-02-13 17:48:56 +00:00
7077d0ac8c [DCP] Introduce modules metadata in the storage_meta (#146654)
Summary: Introduce the list of modules in the storage_meta which is shared between the planner and the storage writer. We will use it to let the storage writer know about the modules in the state dict and create module directories in the checkpoint.

Test Plan: UTs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146654
Approved by: https://github.com/MeetVadakkanchery
2025-02-13 17:44:30 +00:00
938209fb6f Revert "Use 2022 as default VC_YEAR for windows builds (#147053)"
This reverts commit 858bc0cea50614d1e190e6991d974ddb0f53fc88.

Reverted https://github.com/pytorch/pytorch/pull/147053 on behalf of https://github.com/atalman due to Broke windows tests ([comment](https://github.com/pytorch/pytorch/pull/147053#issuecomment-2657239501))
2025-02-13 17:09:37 +00:00
683178fabc [cuda] fix printing of num_gpus (#146838)
Previously, on machines with fewer than 8 GPUs, the device==7 case would
trigger the assert inside getDeviceProperties and print `num_gpus=BEL`,
which is ASCII for 7.
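
For reference, a tiny illustration of why the count showed up as `BEL`:

```python
# ASCII code 7 is the BEL control character, so pushing the integer 7 through
# a character conversion prints an (often invisible) bell rather than the digit.
print(repr(chr(7)))    # prints '\x07'
print(chr(7) == "\a")  # True
```
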
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146838
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-02-13 15:23:35 +00:00
020232ec9f [Submodule]: Update KleidiAI submodule to v1.3.0 (#146480)
Change-Id: I687255982c72ee7daca438a15b718f07298963cc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146480
Approved by: https://github.com/digantdesai, https://github.com/malfet
2025-02-13 15:23:04 +00:00
df776d64f7 chore: fix typos in error messages in FSDP (#146805)
Fixes two small typos in FSDP error messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805
Approved by: https://github.com/awgu, https://github.com/Skylion007
2025-02-13 15:22:13 +00:00
345f556628 Fix DispatchStub.cpp compilation for gcc 14 (#146512)
Otherwise I get the following error:

```bash

.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: error: no matching function for call to ‘find(std::array<c10::DeviceType, 7>::const_iterator, std::array<c10::DeviceType, 7>::const_iterator, const c10::DeviceType&)’
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/include/c++/14/bits/locale_facets.h:48,
                 from /usr/include/c++/14/bits/basic_ios.h:37,
                 from /usr/include/c++/14/ios:46,
                 from /usr/include/c++/14/ostream:40,
                 from .../intel-xpu-backend-for-triton/pytorch/c10/core/DeviceType.h:13,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.h:3,
                 from .../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:2:
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: ‘template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)’
  435 |     find(istreambuf_iterator<_CharT> __first,
      |     ^~~~
/usr/include/c++/14/bits/streambuf_iterator.h:435:5: note:   template argument deduction/substitution failed:
.../intel-xpu-backend-for-triton/pytorch/aten/src/ATen/native/DispatchStub.cpp:152:18: note:   mismatched types ‘std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >’ and ‘const std::array<c10::DeviceType, 7>::value_type*’ {aka ‘const c10::DeviceType*’}
  152 |     if (std::find(supported_devices.begin(), supported_devices.end(), device_type) == supported_devices.end()) {
      |         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146512
Approved by: https://github.com/Skylion007
2025-02-13 15:21:59 +00:00
7c3b2a29ec [subclass] testing WrapperSubclass respect outer_size, outer_stride (#146897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146897
Approved by: https://github.com/bdhirsh
2025-02-13 15:21:19 +00:00
e2479d7809 Update slow tests (#146822)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146822
Approved by: https://github.com/pytorchbot
2025-02-13 15:20:58 +00:00
aeabbffe15 Disable test with dynamo for schema gen (#146865)
Fixes https://github.com/pytorch/pytorch/issues/141202.

1. So we skip the schema gen tests under dynamo. https://github.com/pytorch/pytorch/issues/141202 fails in a weird way: it claims the node is an integer, but we have isinstance checks [here](https://github.com/pytorch/pytorch/blob/main/torch/_library/utils.py#L234-L241). This is probably dynamo messing with the stack, and checking fx.Node isn't really what dynamo is designed for.
2. We move some of the legitimate cond tests out of schema gen and back into the control flow tests. Also rename _test_export to a longer, more descriptive name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146865
Approved by: https://github.com/zou3519
2025-02-13 15:20:52 +00:00
67c4c39b4f [docs] Minor fixes to export and aoti docs (#144513)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144513
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-02-13 15:19:35 +00:00
d1997b610f update kineto submodule (#147015)
Fix https://github.com/pytorch/kineto/issues/1032
See https://github.com/pytorch/kineto/pull/1035 for testplan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147015
Approved by: https://github.com/sraikund16, https://github.com/Skylion007
2025-02-13 15:13:18 +00:00
8d94eb1e3b [BE]: Make OrderedSet reversible (#146904)
It's rather trivial to make OrderedSet reversible, so let's do it and unlock that additional functionality for downstream users.
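
A minimal sketch of the idea, assuming a dict-backed ordered set (the real PyTorch class differs in detail):

```python
from typing import Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class MiniOrderedSet(Generic[T]):
    """Illustrative dict-backed ordered set; dicts preserve insertion order."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        self._d = dict.fromkeys(items)

    def add(self, item: T) -> None:
        self._d[item] = None

    def __iter__(self) -> Iterator[T]:
        return iter(self._d)

    def __reversed__(self) -> Iterator[T]:
        # the whole change, conceptually: delegate to the dict's reverse iterator
        return reversed(self._d)


s = MiniOrderedSet([3, 1, 2])
print(list(reversed(s)))  # [2, 1, 3]
```
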
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146904
Approved by: https://github.com/eellison
2025-02-13 15:11:48 +00:00
858bc0cea5 Use 2022 as default VC_YEAR for windows builds (#147053)
New Windows AMI does not have Visual Studio 2019. Hence use 2022 as default.
See: https://github.com/pytorch/test-infra/pull/6226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147053
Approved by: https://github.com/huydhn
2025-02-13 14:37:55 +00:00
f95bdf5e6c Make GetCPUAllocatorMaybePinned to be Device-Agnostic (#146687)
----

- Keep cuda first to preserve BC
- Remove cuda first if it is possible to have only one accelerator at a time in the future
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146687
Approved by: https://github.com/ngimel
2025-02-13 13:09:48 +00:00
e21181642f [AOTInductor] Align behavior between CPU and GPU (#145459)
Summary:
(1) Make sure CPU and GPU don't have different implementations and behavior when called from the same path and API. The only difference between CPU and GPU after this PR should be the hardware they run on.
(2) This PR fixes an invalid memory access when it == constants_map.end()
(3) This PR resolves T179437596

Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu

Differential Revision: D68540744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459
Approved by: https://github.com/desertfire, https://github.com/hl475
2025-02-13 09:50:18 +00:00
ca3aabc8e6 [Inductor][CPU] Add a lowering pass for _weight_int4pack_mm_for_cpu (#145250)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a lowering pass for `torch.ops.aten_weight_int4pack_mm_for_cpu`. This op is used for WoQ int4 in Torchao. The lowering pass is a prerequisite for max-autotune, which is planed to be enabled for this op in subsequent PRs.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int4
python test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145250
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #145245
2025-02-13 08:40:12 +00:00
17d3a69c32 [Intel GPU] fix memory leak in deconv backward (#144385)
Fixes #143807

We need to manage the oneDNN scratchpad in PyTorch; otherwise oneDNN will always allocate scratchpad memory during primitive execution, which causes a memory leak.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144385
Approved by: https://github.com/liangan1, https://github.com/EikanWang
2025-02-13 07:41:34 +00:00
43496e9b90 [NJT] fix flop counter for SDPA & test (#147032)
Fixes 3 issues:
1. The test wasn't actually testing SDPA: both were checking cuda, and the inputs to SDPA were not transposed.
2. FlopCounterMode has been renamed _FlopCounterMode (and a wrapper named FlopCounterMode has been added)
3. offsets_to_list also needs to ignore the actual offset values if offsets is a meta tensor.

Differential Revision: [D69558785](https://our.internmc.facebook.com/intern/diff/D69558785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147032
Approved by: https://github.com/jbschlosser
2025-02-13 07:14:58 +00:00
tim
b9a22b3f37 bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps (#146623)
This PR addresses an issue in the MPS backend for `_scaled_dot_product_attention_math_mps` where a 3d input like (num_heads, seq_len, query_dim) cannot be automatically treated as (1, num_heads, seq_len, query_dim), as it is on cpu or cuda. This is fixed by adding a util function to ensure a 4d shape.

The issue was found in https://github.com/hiyouga/LLaMA-Factory/issues/6835, in [transformers qwen2_vl](1590c66430/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py (L373C14-L373C93)), 3d q/k/v were passed into sdpa function, which lead to an error.

Considering consistency, since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to maintain the same intuition across all platforms.

---
reproduce code:
```
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1

q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"),
    k.to("cpu"),
    v.to("cpu"),
    attention_mask.to("cpu"),
    dropout_p=0.0
)

if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"),
        k.to("mps"),
        v.to("mps"),
        attention_mask.to("mps"),
        dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```

error outputs:
```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

hardware and envs:
```
torch               2.6.0
apple m3 max
```

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146623
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-13 07:00:51 +00:00
17a808557c [MPS] cholesky ex version (#146799)
PR #145701 didn't have an experimental version of cholesky. This PR adds that version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146799
Approved by: https://github.com/malfet
2025-02-13 07:00:21 +00:00
4879f8f919 [TP] Add warning when module is distributed twice (#147006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147006
Approved by: https://github.com/XilunWu
2025-02-13 06:49:17 +00:00
3e4172d985 [BE][Ez]: Update fmtlib submodule to 11.1.3 (#146985)
This submodule update just fixes a bunch of miscellaneous issues: ABI compatibility, compiler warnings, workarounds for older compilers, performance, and edge cases in formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146985
Approved by: https://github.com/drisspg
2025-02-13 06:47:11 +00:00
aa20b4b6cf Friendly handle mem_get_info's runtime error message (#146899)
# Motivation
Handle the runtime error message gracefully if the device doesn't support querying the available free memory. See https://github.com/intel/torch-xpu-ops/issues/1352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146899
Approved by: https://github.com/EikanWang
2025-02-13 06:26:19 +00:00
66fb10fc53 [BE][OpInfo] Introduce generic dtypesIf (#146905)
Use `__setattr__` and `__getattribute__` to route the existing `dtypesIfXYZ` attributes through the generic mechanism, which will allow for their subsequent incremental elimination

Also, the type annotation for OpInfo is a sham: it claims that `dtypes` and `dtypesIfXYZ` must be of type `_dispatch_dtypes`, but in reality they are converted to sets in post-init.
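
One way to wire this up, as a hedged sketch (the real OpInfo also overrides `__setattr__` and keeps backward-compatible defaults; everything below is illustrative):

```python
class OpInfoSketch:
    """Illustrative only: route dtypesIfXYZ lookups through one mapping."""

    def __init__(self, dtypes, dtypes_if=None):
        self.dtypes = set(dtypes)
        self._dtypes_if = dict(dtypes_if or {})  # e.g. {"CUDA": {...}}

    def __getattribute__(self, name):
        if name.startswith("dtypesIf"):
            per_device = object.__getattribute__(self, "_dtypes_if")
            device = name[len("dtypesIf"):]
            return per_device.get(device, object.__getattribute__(self, "dtypes"))
        return object.__getattribute__(self, name)


op = OpInfoSketch(dtypes={"float32"}, dtypes_if={"CUDA": {"float32", "float16"}})
print(op.dtypesIfCUDA)  # {'float16', 'float32'} (set order may vary)
print(op.dtypesIfXPU)   # falls back to {'float32'}
```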

Test Plan:
 - Check that `op_db[0].dtypesIfCUDA` and others shows the same values as before, by running the following script
 ```python
from torch.testing._internal.common_methods_invocations import op_db
print({name: getattr(op_db[0], f'dtypesIf{name}') for name in ['CUDA', 'ROCM', 'XPU', 'Hpu']})
```
 - CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146905
Approved by: https://github.com/janeyx99
2025-02-13 05:33:17 +00:00
43eb39d7c8 [executorch hash update] update the pinned executorch hash (#145128)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145128
Approved by: https://github.com/pytorchbot
2025-02-13 05:06:44 +00:00
88d0bb0fee [aoti_debug_printer][BE] explicitly dumping float32, bfloat16, float16 data type (#147020)
Summary:
per request, explicitly dumping the float dtypes for aten tensors in debug printing summary info.

can be useful in identifying issues such as "wrong AOTI Lowering precisions"

Test Plan:
```
 AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm
```

Differential Revision: D69547344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147020
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
2025-02-13 04:41:00 +00:00
2ff3fdfdae [audio hash update] update the pinned audio hash (#146738)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146738
Approved by: https://github.com/pytorchbot
2025-02-13 04:29:46 +00:00
936df4571b Update test_c10d_object_collectives.py with DistributedTestBase class (#145056)
# MOTIVATION
To generalize distributed test cases for non-CUDA devices, we are leveraging the DistributedTestBase class introduced in [PR #138216](https://github.com/pytorch/pytorch/pull/138216). This new class is derived from MultiProcessTestCase and abstracts the creation/deletion of process groups and other functionality for specific devices. In this PR, we extend the scope of these tests to support HPUs.

# CHANGES

- Replaced MultiProcessTestCase with the DistributedTestBase class.
- Extended test functionality to include support for HPUs.
- Utilized instantiate_device_type_tests with targeted attributes to generate device-specific test instances.
- Applied the skipIfHPU decorator to skip tests that are not yet compatible with HPU devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145056
Approved by: https://github.com/kwen2501, https://github.com/guangyey
2025-02-13 03:57:59 +00:00
a9598337b7 [Optimus] Include more corner cases in the select cat aten pass (#146662)
Summary: Thanks to Shuai for reporting the bug in the pattern. We found there's a typo in the pass, where we should make sure all the selects will go to the cat node.

Test Plan:
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad

Buck UI: https://www.internalfb.com/buck2/2cd0888e-d803-43a8-8530-d97e6bc281b3
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6192449699305108
Network: Up: 110KiB  Down: 35KiB  (reSessionID-687be0fa-031a-47a0-8780-5ab4cf4bbd94)
Executing actions. Remaining     0/4                                                                              6.6s exec time total
Command: test.     Finished 2 local
Time elapsed: 2:12.0s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D69278487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146662
Approved by: https://github.com/Microve
2025-02-13 03:40:26 +00:00
6ca497a8e5 Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/huydhn
2025-02-13 03:29:39 +00:00
c159723c39 Fix meta impl for topk (#147017)
The k argument of topk is always size-like in this context, so we should use torch._check_is_size. Fixes some of the issues in https://github.com/pytorch/pytorch/issues/146990
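
A hypothetical sketch of what using the size-like check looks like in a meta-style shape function (the real registration differs in detail):

```python
import torch


def meta_topk_sketch(x: torch.Tensor, k: int, dim: int = -1):
    # treat k as size-like so the symbolic-shapes machinery may assume 0 <= k
    torch._check_is_size(k)
    torch._check(
        k <= x.size(dim),
        lambda: f"selected index k={k} out of range for dim of size {x.size(dim)}",
    )
    shape = list(x.shape)
    shape[dim] = k
    values = x.new_empty(shape)
    indices = x.new_empty(shape, dtype=torch.long)
    return values, indices
```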

Differential Revision: [D69545983](https://our.internmc.facebook.com/intern/diff/D69545983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147017
Approved by: https://github.com/ydwu4
2025-02-13 03:18:47 +00:00
821422018a [FlexAttention] Make zero_length sequence handiling better (#147010)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147010
Approved by: https://github.com/Chillee
2025-02-13 03:18:24 +00:00
54e28b2a71 [BE] Turn nextafter into functor (#147018)
This functor is a bit more involved as nextafter is missing for MacOS13
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147018
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993, #147023
2025-02-13 02:10:29 +00:00
aaa46c0625 Add missing autoreleasepool around runUniqueGraph to prevent leaks (#145512)
References were held onto longer than needed. Added an autoreleasepool around runUniqueGraph to allow the memory to be freed.

Fixes #145151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145512
Approved by: https://github.com/malfet

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-13 01:58:18 +00:00
e0ca041ae3 [BE] Toward Metal Iterator (step 2) (#147023)
Add a dense flavor of the binary ops, i.e. if the iterator is contiguous, do not build indices but rather run a different flavor using the same functor. This results in an almost 100% perf gain for a binary operation over 1 mln elements with `torch.fmax`, as one can see from the table below, collected on an M4 Pro Mini using the following benchmarking script
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

| Dtype  | Time before | Time After |
| ------|------------ | ---------- |
| float32  | 0.84 msec  | 0.66 msec |
| float16  |  0.49 msec |  0.23 msec |
| bfloat16  | 0.48 msec  | 0.22 msec |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147023
Approved by: https://github.com/dcci
ghstack dependencies: #146965, #146993
2025-02-13 01:50:43 +00:00
80f146dedf Update addbmm, addmm, addmv and baddbmm description (#146689)
Fixes #146611, following #146482

## Test Result

![image](https://github.com/user-attachments/assets/5c1749be-1f10-4e80-a284-b1929ca340eb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146689
Approved by: https://github.com/mikaylagawarecki
2025-02-13 01:30:50 +00:00
5dab0aeef0 [SkipFiles] Some more cleanup (#147013)
This isn't a no-op, but I think it's fine. It changes the case where a
function f1 in a module in MOD_SKIPFILES calls a function f2 in one of
the deleted modules. Previously f2 would have been skipped; now f2 gets
inlined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147013
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016, #147012
2025-02-13 01:18:47 +00:00
fddaa2958b [SkipFiles] Some more cleanup (#147012)
I think these are all no-ops.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147012
Approved by: https://github.com/yanboliang
ghstack dependencies: #147016
2025-02-13 01:18:47 +00:00
87ebd77b34 Add some more docs to trace_rules.py (#147016)
After discussing with Yanbo, we wanted to record the behavior so we
don't need to rederive it in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147016
Approved by: https://github.com/yanboliang
2025-02-13 01:18:39 +00:00
b77a6eb184 [dynamo] Fix tensordict regression (#146995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146995
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146819
2025-02-13 00:59:59 +00:00
2ce6de2415 [cond] make cond re-dispatch in proxy mode (#146954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146954
Approved by: https://github.com/zou3519
2025-02-13 00:50:33 +00:00
67cbbb29e0 [export] Dedup expression_created logs (#146859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146859
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533, #146534, #146858
2025-02-13 00:21:34 +00:00
59bc5d0d71 [tlparse] Add stacktrace filter utility (#146858)
Added a utility function for capturing the user stack and framework stacktrace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146858
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146532, #146533, #146534
2025-02-13 00:21:34 +00:00
43f5566c92 [export] Add additional tlparse logging (#146534)
Added some additional logging so we can also run tlparse on generic export errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146534
Approved by: https://github.com/pianpwk
ghstack dependencies: #146532, #146533
2025-02-13 00:21:34 +00:00
b4bdbce1ac [export] Use custom stream logger in draft-export (#146533)
Using a custom logger so that we can store our own buffer to dedup logs that look the same. The schema for deduping is as follows:

```python
        if key == "missing_fake_kernel":
            return hash((key, data["op"]))  # Same ops get deduped
        elif key == "mismatched_fake_kernel":
            return hash((key, data["op"], data["reason"]))  # Same op and reason for errors get deduped
        elif key == "propagate_real_tensors":
            return hash((key, json.dumps(data["stack"])))  # Guards appearing on the same stacktrace get deduped
        elif key == "create_unbacked_symbol":
            return hash((key, json.dumps(data["stack"])))  # Unbacked symbols appearing on the same stacktrace get deduped
```

Notably, guards appearing on the same stacktrace get deduped. This is because there are some cases in PT2I models where a piece of code which creates a new unbacked symint + runs into a DDE gets called 800 times, causing 800 new symints to be created, and 800 propagate_real_tensor errors that are all the same expression. This is hard to look at, so we should just deduplicate this.

The con of this is that if there exist multiple DDEs on the same stacktrace, we will only show the first issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146533
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #146532
2025-02-13 00:21:34 +00:00
be387f57b1 [symbolic shapes] Log SymNode id for provenance (#146532)
We can use the SymNode id to point us back to how previous expressions were created, and construct this nice tree in tlparse:
<img width="761" alt="image" src="https://github.com/user-attachments/assets/531b03e8-4398-4d0a-bd11-16078256041c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146532
Approved by: https://github.com/bobrenjc93
2025-02-13 00:21:34 +00:00
21c2565f35 Document dynamo (#146736)
Many files in dynamo are currently lacking file/module-level documentation, which makes it hard to know what they do at a glance and without digging into the code. This fixes that.

Note: documentation was AI-generated and could be incorrect, please review carefully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146736
Approved by: https://github.com/jansel, https://github.com/StrongerXi, https://github.com/anijain2305, https://github.com/zou3519
2025-02-13 00:02:21 +00:00
0344bf8a5a [cuDNN] cuDNN to 9.7.1.26 for CUDA 12.8 (#146957)
rebasing for https://github.com/pytorch/pytorch/pull/146717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146957
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/nWEIdia, https://github.com/atalman

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-12 23:43:34 +00:00
d5a2e4c754 [oncall] Change error message to be more readable (#146934)
Summary:
During oncall, got a debug, where the error message is a bit ambiguous, due to multiple colons, and full line cutoff
```
AssertionError: Expected order: 1 for the component: remote_request_only to be >= 2, the max order for all its
```

Update the error message to something like
```
AssertionError: Component remote_request_only order must be >= max order of its upstream components, got component order=1 and max=2
```

Test Plan: CI

Differential Revision: D69482789

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146934
Approved by: https://github.com/ColinPeppler
2025-02-12 23:33:09 +00:00
ad4e5bf705 cpp_wrapper: handle mixed-device C-shim fallbacks (#146449)
Fixes an error from test_torch, where a CUDA cpp_wrapper run called a CUDA native C-shim kernel with two CPU tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146449
Approved by: https://github.com/desertfire
2025-02-12 23:21:04 +00:00
076215944a Turn on autograd local caches in fbcode (#146996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146996
Approved by: https://github.com/jamesjwu
2025-02-12 23:04:39 +00:00
c60f587c04 Fix shape_inference for V-schedules (#147000)
I was hitting a hang in shape_inference when testing v-shaped schedules with >2 ranks in titan.

`self.next_rank` and `self.prev_rank` are used in shape inference but are not accurate for v-shaped schedules:
bfcce6984b/torch/distributed/pipelining/stage.py (L1325-L1326)

Will clean up / delete the use of next_rank / prev rank in follow up PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147000
Approved by: https://github.com/wconstab
2025-02-12 22:56:46 +00:00
f954aac6be Add make_dynamo_test (#146491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146491
Approved by: https://github.com/zou3519, https://github.com/anijain2305, https://github.com/malfet
2025-02-12 22:54:29 +00:00
fd21126007 [ONNX] Deprecation message follow up (#147005)
Follow up on https://github.com/pytorch/pytorch/pull/146923 to address comments.

This pull request includes updates to the `torch/onnx` module, focusing on deprecations and documentation improvements. The most important changes involve moving version change notes within the `export` function, updating deprecation messages, and removing example code in the `dynamo_export` function.

Documentation and Deprecation Updates:

* [`torch/onnx/__init__.py`](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184): Moved version change notes to the correct location within the `export` function's docstring. Updated the deprecation note for the `dynamo_export` function to version 2.7 and removed example code from its docstring. [[1]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L172-L184) [[2]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R349-R357) [[3]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L434-R430) [[4]](diffhunk://#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553L445-L475)

* [`torch/onnx/utils.py`](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114): Enhanced deprecation messages for several functions (`select_model_mode_for_export`, `disable_apex_o2_state_dict_hook`, `setup_onnx_logging`, `unconvertible_ops`) to provide clearer guidance on their removal and suggest copying logic if needed. [[1]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL111-R114) [[2]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL148-R151) [[3]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL166-R173) [[4]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1180-R1189) [[5]](diffhunk://#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL1190-R1199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147005
Approved by: https://github.com/titaiwangms
2025-02-12 22:48:56 +00:00
f655f840b8 [ONNX][dort] Remove reference to onnxscript rewriter (#147003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147003
Approved by: https://github.com/titaiwangms, https://github.com/gramalingam, https://github.com/shubhambhokare1
2025-02-12 22:02:07 +00:00
995f607c74 fix doc string (#146968)
Fixes a wrong function name in doc string

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146968
Approved by: https://github.com/zackycao, https://github.com/H-Huang
2025-02-12 21:43:16 +00:00
06a07f6018 [BE] Towards MetalTensorIterator (#146993)
Further refactor binary kernels to replace individual implementation with a binary_indexing_kernel template that takes functors that implement the logic.

According to godbolt, such refactoring should have no impact on performance, as the compiler, through dead code elimination, should just replace the functor with a direct call to the underlying function, as one can see for the clang CPU compiler here: https://godbolt.org/z/8dxv5jvz7. But to be on the safe side, run the following benchmark
```python
import torch

from timeit import default_timer
from itertools import product
from torch.utils.benchmark import Measurement, Timer

def bench_binary(
    n,
    binary_func,
    dtype=torch.float32,
) -> Measurement:
    t = Timer(
        stmt=f"f(x, y);f(x, y); f(x, y); torch.mps.synchronize()",
        setup=f"x, y=torch.rand((2, {n}), dtype={dtype}, device='mps').unbind(0)",
        globals = {'f': binary_func},
        language="python", timer=default_timer
    )
    return t.blocked_autorange()

if __name__ == "__main__":
    n = 1024**2
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        eager_t = bench_binary(n, torch.fmax, dtype)
        use_msec = eager_t.mean > 1e-4
        multiplier = 1e3 if use_msec else 1e6
        uname = "msec" if use_msec else "usec"
        print(f"torch.fmax()x3 {str(dtype):>14} {eager_t.mean*multiplier:>7.2f} {uname}")
```

That reports roughly identical before and after times (1 msec for float32 and .5 msec for float16)

Another interesting quirk: functors cannot be in an anonymous namespace, otherwise they will not be visible from the library, as one can see by running the following Swift sample (filed FB16490467 to clarify whether this is supported)
```swift
let shader_source = """
struct add_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a + b);
  }
};

namespace {
struct sub_functor {
  template <typename T>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(a - b);
  }
};
} // anonymous namespace

template <typename T, typename F>
kernel void binary_executor(
    constant T* input [[buffer(0)]],
    constant T* other [[buffer(1)]],
    device T* out [[buffer(2)]],
    uint tid [[thread_position_in_grid]]) {
  F f;
  out[tid] = f(input[tid], other[tid]);
}

template
[[host_name("add_float")]] kernel void binary_executor<float, add_functor>(constant float*, constant float *, device float*, uint);

template
[[host_name("sub_float")]] kernel void binary_executor<float, sub_functor>(constant float*, constant float *, device float*, uint);
"""

import Metal
guard let device = MTLCopyAllDevices().first else { fatalError("No Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:MTLCompileOptions())

// Expect two kernels to be printed, but see only one, with functor in global namespace
for kernel_name in library.functionNames {
  print(kernel_name)
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146993
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146965
2025-02-12 21:40:40 +00:00
de964b9f8b dont specialize symints when testing truthiness (#146731)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146731
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #146642, #146729
2025-02-12 20:57:10 +00:00
5cda021cac support meta_tensor.to(device='cpu') under fake_mode (#146729)
Fixing this is actually a bit annoying:

(1) FakeTensorMode sees a function where all of its inputs are real tensors, so it tries to run the real compute before converting the output to a FakeTensor

(2) we don't actually want this, because the "real compute" is supposed to error normally when you do `meta_tensor.to(device='cpu')`. Instead, we want FakeTensor to skip constant prop and run the normal FakeTensor implementation, which will not error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146729
Approved by: https://github.com/zou3519, https://github.com/SherlockNoMad, https://github.com/albanD
ghstack dependencies: #146642
2025-02-12 20:57:10 +00:00
ec0b318ddb [poc] force UntypedStorage.from_buffer(buf) to return meta storage under FakeTensorMode (#146642)
context here: https://fb.workplace.com/groups/326136610199609/permalink/495389539940981/

This PR is an attempt to make it such that if you create a tensor from an external buffer (using `UntypedStorage.from_buffer(buf)`), we can generate a proper fake tensor for you out of the box.

The annoying bit is that there are not any dispatcher ops to interpose on and change behavior. So instead, I took the manual C binding and tweaked the storage device to be "meta" if we see an active fake mode.

Put "poc" in the title since I... think this is hopefully reasonable, but I can be convinced that it's not :)

```
from torch._subclasses.fake_tensor import FakeTensorMode
import pickle
import io
import torch
from contextlib import nullcontext

use_fake_tensor = True
with FakeTensorMode() if use_fake_tensor else nullcontext():
    obj = [1, 2]
    f = io.BytesIO()
    pickle.Pickler(f).dump(obj)
    byte_storage = torch.ByteStorage._from_buffer(f.getvalue())  # type: ignore[attr-defined]

    t = torch.ByteTensor(byte_storage)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146642
Approved by: https://github.com/zou3519
2025-02-12 20:57:10 +00:00
8a975cb247 Revert "[cutlass backend] Do not change dtype of GEMM template (#146877)"
This reverts commit 5f2714d5e7cded0eb553d5915002e03c22e01e34.

Reverted https://github.com/pytorch/pytorch/pull/146877 on behalf of https://github.com/henrylhtsang due to mistake on logging ([comment](https://github.com/pytorch/pytorch/pull/146877#issuecomment-2654648949))
2025-02-12 19:26:45 +00:00
0de27ee7e0 Let _create_cpu_state_dict and _copy_state_dict support DTensor (#146852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146852
Approved by: https://github.com/d4l3k
2025-02-12 18:43:52 +00:00
352484cc83 [BE] Unify kernel templates instantiation (#146965)
By defining a `REGISTER_BINARY_OP` template that can be used to register fmin, fmax, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146965
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-12 18:40:45 +00:00
7f62616a58 [ONNX][reland2] Create deprecation warning on dynamo_export (#146923)
Reland two PRs
- https://github.com/pytorch/pytorch/pull/146425
- https://github.com/pytorch/pytorch/pull/146639

Fixed by removing the deprecation warning on a base class `ExportOptions`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146923
Approved by: https://github.com/titaiwangms
2025-02-12 18:28:37 +00:00
5f2714d5e7 [cutlass backend] Do not change dtype of GEMM template (#146877)
I think this is a change in the right direction.

Right now, when we try to find a cutlass gemm, we generate a bunch of gemm templates and filter out those that don't fit. For example, if we are doing a bf16 x bf16 matmul, the gemm template for fp32 x fp32 is generated and filtered out.

However, for the dtype of bias, we would attempt to modify the dtype of the gemm template. I think this is a bad idea, since (1) the usable template is also being generated, and (2) this messes with the configuration name of the template.

I tested this offline. There isn't much difference in performance. However, with instantiation level 2222, I noticed far fewer "C++ compile error" failures. This is probably due to using the right template?

Follow-ups are needed:
1. benchmark and dashboard
2. check our logic for setting alignment

with my change
https://www.internalfb.com/intern/paste/P1729604119/

without my change
https://www.internalfb.com/intern/paste/P1729624806/

Differential Revision: D69085556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146877
Approved by: https://github.com/ColinPeppler
2025-02-12 18:16:49 +00:00
bfcce6984b [ROCm][TunableOp] Close offline tuning results file when offline tuning is disabled. (#146574)
This PR is to fix UT breakage that has been reported internally and is considered high priority. When `tunable.record_untuned_enable(False)` is invoked, we flush the results of the untuned gemm file.

Offline tuning I/O currently doesn't have a member function to set the untuned results filename or to write untuned results to file. When performing back-to-back unit tests, the same ofstream ends up getting reused between UTs. Due to the way the UTs are executed, this can lead to unexpected failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146574
Approved by: https://github.com/jeffdaily
2025-02-12 18:03:06 +00:00
04011304e5 Update dynamo expected 20250210 (#146856)
Update all the ci accuracy expect values to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146856
Approved by: https://github.com/yanboliang
2025-02-12 18:01:20 +00:00
d6513f3246 [dynamo] Support list subclasses and fix dict subclasses mutation bugs (#146819)
This PR adds support for list subclasses. Among other things, it includes:

1) Tracking the mutations on internal vts like `_dict_vt` and `_list_vt` using sources. This helps identify if there was a mutation in the underlying data structures, and we need to reconstruct it.
2) `UserDefinedObjectVariable` now has a new method - `is_modified` which `side_effect` infra relies upon to check mutations in the underlying vts (like `_dict_vt`).
3) `reconstruction` logic ensures that we use `dict.__getitem__` and `list.__getitem__` methods. This is super important because we don't want to call the overridden `__getitem__` methods.
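For illustration, a minimal plain-Python sketch (not Dynamo internals) of why reconstruction must go through the built-in accessors rather than the overridden ones:

```python
class MyList(list):
    def __getitem__(self, idx):
        # user override with different semantics / side effects
        print("overridden __getitem__ called")
        return super().__getitem__(idx) * 2

xs = MyList([1, 2, 3])
print(xs[0])                    # 2 -- goes through the override
print(list.__getitem__(xs, 0))  # 1 -- bypasses the override, as reconstruction needs to
```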

If this PR is hard to review, please let me know. I can break it into several small PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146819
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-02-12 17:46:02 +00:00
6c81435f16 [ONNX] Update CI transformers cache (#146926)
The cached models are outdated because the related tests are all deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146926
Approved by: https://github.com/justinchuby
2025-02-12 17:02:43 +00:00
b894c2824b [ONNX] Support custom axis name through dynamic_shapes (#146321)
Fixes #143443

This PR aims to support custom dynamic axis naming through dynamic_shapes. Currently, _Dim and _DimHint do not support dynamic axis naming (#144273).

1. **The original dynamic shapes guarantee**
The axis renaming is only applied when the dynamic shapes include strings instead of only _Dim and _DimHint. Thus, there will not be any behavior inconsistent with torch.export.export if the given dynamic_shapes follow the torch.export.export format.
2. _DimHint.AUTO is applied to the axes that are specified with custom names to avoid an exporter crash. (_DimHint.DYNAMIC crashes when the export fails.)
3.  There's no need to handle cases where kwargs are out of order with the model signature,
    as torch.export.export supports dynamism only when kwargs and dynamic_shapes are provided in order.
    49082f9dba/torch/export/_trace.py (L2034)
4. If `torch.onnx.ExportedProgram` finds the axes share the same constraints, they will have the same name (e.g. s0, s1, ...). Therefore, even if the ONNX users specify them with different custom names, they won't be respected.

Example model:
```python
        class NestedModel(torch.nn.Module):
            def forward(
                self,
                x: torch.Tensor,
                ys: list[torch.Tensor],
                zs: dict[str, torch.Tensor],
                c: torch.Tensor,
            ):
                y = ys[0] + ys[1] + zs["a"] + zs["b"]
                w = 5
                if x.shape[0] < 3 and c.shape[0] != 4:
                    return x + w, x + y, c
                else:
                    return x - w, x - y, c

        input = (
            torch.ones(5),
            [torch.zeros(5), torch.ones(5)],
            {"a": torch.zeros(5), "b": torch.ones(5)},
            torch.ones(6),
        )

        dynamic_shapes = (
            {0: torch.export.Dim("dim_x", min=3)},  # _Dim
            [("custom_name_axis_ys_0",), (torch.export.Dim.AUTO,)],  # custom name
            {
                "a": {0: torch.export.Dim.AUTO},
                "b": ("custom_name_axis_zs_b_0",),
            },  # _DimHint
            {0: "custom_name_axis_c_0"},  # custom name
        )

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146321
Approved by: https://github.com/justinchuby
2025-02-12 17:00:03 +00:00
9abaaad6a8 [pytree][Easy] preserve dict keys in insertion order in CXX pytree (#130140)
`optree` and JAX pytree traverse a `dict` in sorted key order (see [Key Ordering for Dictionaries](https://github.com/metaopt/optree#key-ordering-for-dictionaries)), while in PyTorch's Python pytree we traverse the `dict` in insertion order. See also:

- #114392

This aligns the behavior of CXX pytree with Python pytree.
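A minimal illustration (plain Python dicts, not the pytree APIs themselves) of the two traversal orders being aligned here:

```python
d = {"b": 1, "a": 2}
print(list(d))    # insertion order: ['b', 'a']  (Python pytree behavior, now also CXX pytree)
print(sorted(d))  # sorted key order: ['a', 'b'] (optree / JAX default)
```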

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130140
Approved by: https://github.com/zou3519
2025-02-12 16:41:49 +00:00
1f8ff94d4f PEP585: Add noqa to necessary tests (#146391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146391
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-02-12 15:29:50 +00:00
b61032fcf7 [BE][Ez]: Remove unnecessary type ignores from orderedset (#146902)
After #145783, we can remove some type ignores from the ordered set class
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146902
Approved by: https://github.com/eellison
2025-02-12 15:00:13 +00:00
ce80865f13 Revert "Replace is_same with is_same_v for concise syntax (#145450)"
This reverts commit 5205158c1b0bc5c390b2a9d83fe3b2ec5edbe3f2.

Reverted https://github.com/pytorch/pytorch/pull/145450 on behalf of https://github.com/jeanschmidt due to testing to see if reverting would fix timeout in inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/145450#issuecomment-2653645466))
2025-02-12 13:01:32 +00:00
b0042286d4 [Dynamo] Allow dynamo to handle str.xxx() (#146587)
Fixes #146350

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146587
Approved by: https://github.com/zou3519
2025-02-12 08:54:10 +00:00
98e16012ec [Quant][CPU] add a wrapper op for _weight_int4pack_mm_for_cpu with tensor args (#145245)
**Summary**
It's part of the task to enable max-autotune with GEMM template for WoQ INT4 GEMM on CPU.

This PR adds a wrapper op in the `quantized` namespace for `torch.ops.aten._weight_int4pack_mm_for_cpu`, whose arguments are all tensors. It will be used in Inductor lowering with max-autotune, where scalar arguments are difficult to handle.
The new op is not registered to
- `aten` because it will require changing `native_functions.yaml`, which is not recommended.
- `quantized_decomposed` because it will only have a Python implementation, which cannot be used for cpp wrapper in Inductor.

**Test plan**
```
python test/test_linalg.py -k test__int4_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145245
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2025-02-12 08:46:38 +00:00
ac0f206f3c [dtensor] fix side-effect on dtype for _like ops (#146869)
fixes https://github.com/pytorch/pytorch/issues/146749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146869
Approved by: https://github.com/yifuwang, https://github.com/janeyx99, https://github.com/ngimel
2025-02-12 08:42:14 +00:00
d774a6333d [StaticRuntime] Support a new pattern for ClipRangesToGatherToOffsets (#146931)
Summary:
Support the following new pattern for ClipRangesToGatherToOffsets:

Before optimization:
```
%18267 : Tensor, %18268 : Tensor = fb::clip_ranges_gather(%int_77.1, %getitem_2484.1, %493)
%getattr_368.1 : int = prim::dtype(%18267)
%to_443.1 : Tensor = aten::to(%18268, %getattr_368.1, %self._maybe_compute_kjt_to_jt_dict.is_weighted, %self._maybe_compute_kjt_to_jt_dict.is_weighted)
%lengths_to_offsets_490.1 : Tensor = fb::lengths_to_offsets(%to_443.1, %8)
```

After optimization:
```
%18297 : int = prim::dtype(%int_77.1)
%18298 : Tensor, %18299 : Tensor = fb::clip_ranges_gather_to_offsets(%int_77.1, %getitem_2484.1, %493, %8, %18297)
```

Reviewed By: garroud

Differential Revision: D69373835

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146931
Approved by: https://github.com/hanyilou123
2025-02-12 08:19:41 +00:00
ae5cc19ba7 [pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866)
This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass.

Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice.

On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR).
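As a rough sketch of how one could time this path (hypothetical sizes, assumes a CUDA device; not the exact benchmark harness used for the table below):

```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(1024, 8192, device="cuda", requires_grad=True)
y = torch.softmax(x, dim=-1)
grad = torch.randn_like(y)

t = Timer(
    stmt="y.backward(grad, retain_graph=True); torch.cuda.synchronize()",
    globals={"y": y, "grad": grad, "torch": torch},
)
print(t.blocked_autorange())
```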

**Important Note**: Currently the softmax backward pass consists of an [element-wise multiply operator](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)), followed by [this function](7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%.

```
Baseline						This PR
N	size	FP32 bandwidth	FP16 bandwidth		N	size	FP32 bandwidth	FP16 bandwidth		fp32 diff	fp16 diff
0	256	134.340966	70.042039		0	256	133.70146	70.342753		-0.48%	0.43%
1	512	233.501185	129.945803		1	512	234.057145	132.933066		0.24%	2.30%
2	1024	340.667966	229.280464		2	1024	338.833265	226.441699		-0.54%	-1.24%
3	2048	379.643726	337.452058		3	2048	399.559017	338.432284		5.25%	0.29%
4	4096	416.597537	383.625364		4	4096	428.252403	396.137506		2.80%	3.26%
5	6000	431.198241	384.384384		5	6000	457.744577	406.06275		6.16%	5.64%
6	8192	462.811252	427.292573		6	8192	474.791032	428.281563		2.59%	0.23%
7	10000	464.258731	429.050294		7	10000	483.7643	446.849381		4.20%	4.15%
8	10013	465.199701	429.824179		8	10013	464.904407	428.72184		-0.06%	-0.26%
9	10240	477.07359	428.853737		9	10240	485.317024	444.902586		1.73%	3.74%
10	11000	473.038785	430.778663		10	11000	488.161438	453.462162		3.20%	5.27%
11	12000	474.342475	432.594814		11	12000	490.532418	458.427653		3.41%	5.97%
12	16384	487.468854	473.611576		12	16384	488.154406	476.264631		0.14%	0.56%
13	20000	482.029793	465.666186		13	20000	482.147092	483.886193		0.02%	3.91%
14	24000	478.368093	474.159464		14	24000	478.364948	491.447921		0.00%	3.65%
15	32000	476.523796	473.18868		15	32000	476.523796	474.398962		0.00%	0.26%
16	32768	476.104723	477.493634		16	32768	476.704463	477.330606		0.13%	-0.03%
17	36864	477.900663	475.472787		17	36864	477.973279	475.728454		0.02%	0.05%
18	40960	477.707561	475.559064		18	40960	478.445017	476.088067		0.15%	0.11%
19	45056	479.169812	475.865134		19	45056	479.143266	475.878202		-0.01%	0.00%
20	49152	477.804907	475.382982		20	49152	477.868404	475.976377		0.01%	0.12%
21	65536	481.274125	478.171806		21	65536	481.537733	478.703926		0.05%	0.11%
22	66000	481.64652	480.095457		22	66000	481.856013	480.466388		0.04%	0.08%
23	68608	481.745774	479.034704		23	68608	481.917596	478.856209		0.04%	-0.04%
24	80000	483.409361	480.356529		24	80000	483.330481	480.375277		-0.02%	0.00%
25	98304	480.736301	481.396882		25	98304	480.789858	481.320143		0.01%	-0.02%
```

NCU profiler shows lower DRAM fetches with the new kernel:

![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72)

NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%).

I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files:

![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc)

libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%.

I measured the compilation time for incremental development:

```
touch ./aten/src/ATen/native/cuda/SoftMax.cu
time python setup.py develop
real    0m10.083s
user    0m8.197s
sys     0m3.149s
```

Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels:

```
# baseline nvcc time for SoftMax.cu
real    0m35.341s
user    0m33.801s
sys     0m1.289s

# this PR's nvcc time for SoftMax.cu
real    0m36.513s
user    0m34.722s
sys     0m1.408s
```

So the `nvcc` time increases by about 1 second, or ~3% of the baseline.

[1] `nvcc` command is here:
```
# This is the nvcc command
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda 
-DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866
Approved by: https://github.com/ngimel
2025-02-12 07:54:41 +00:00
8c80c13b34 [CD] Add python 3.13t build for xpu (#146614)
Fixes #146451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146614
Approved by: https://github.com/atalman
2025-02-12 07:01:36 +00:00
b30bad710d Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-12 05:36:27 +00:00
6105b6f15f Revert "Update octokit/request-action to 2.4.0 (#146940)"
This reverts commit 7aa629f1268f6944eee6e49e43071b4342bf1669.

Reverted https://github.com/pytorch/pytorch/pull/146940 on behalf of https://github.com/huydhn due to This does not work ([comment](https://github.com/pytorch/pytorch/pull/146940#issuecomment-2652691614))
2025-02-12 05:21:43 +00:00
5a1c7c424d Fix standalone runner for CUTLASS auto-tuning backend (#146764)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146764
Approved by: https://github.com/henrylhtsang
ghstack dependencies: #146755
2025-02-12 04:42:08 +00:00
eb655a2d5f Fix CUTLASS 2.x kernels for auto-tuning (#146755)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146755
Approved by: https://github.com/henrylhtsang
2025-02-12 04:42:07 +00:00
683bb1242c [export][ez] Update tag_ for union setters. (#146912)
Summary: ez fix to set tag for union type fields.

Test Plan: CI

Differential Revision: D69467715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146912
Approved by: https://github.com/yiming0416
2025-02-12 03:52:36 +00:00
06f8f9a017 Update instructions about faster linker (#146750)
This PR adds instructions to specify linker via cmake env `CMAKE_LINKER_TYPE` and also adds `mold` as a linker alternative.

Since 3.29, cmake introduced [`CMAKE_LINKER_TYPE`](https://cmake.org/cmake/help/latest/variable/CMAKE_LINKER_TYPE.html) that can specify linker without overwriting `ld` file or changing build script.

`mold` is already stable and **the fastest** (afaict) linker out there, and also easier to install compared with `lld`. So I added it here. After switching to `mold`, the time of linking `libtorch_cuda.so` has been reduced from ~7s to ~0.6s locally.

Also note `gold` has been marked deprecated recently[1].

[1] https://lwn.net/Articles/1007541/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146750
Approved by: https://github.com/albanD
2025-02-12 03:14:08 +00:00
28a2ab6b84 Clear CompiledTritonKernel cache after each inductor compile (#146925)
Fix a bug introduced by D69123174: because triton kernels are now returned directly by the worker, each future created for a triton kernel should only be used once per compile. Otherwise, a long-running process that does something like the following:

```
compiled_1 = torch.compile(mode="max-autotune", fullgraph=True)(fn)
# run compiled_1
out_compiled = compiled_1
compiled_2 = torch.compile(mode="max-autotune", fullgraph=True)(fn2)
```

Where fn and fn2 are very similar (i.e., they would generate the same triton kernel source code), this would result in us using the launcher from the first autotuning run, setting the launcher to None after running, and then using the same future/kernel again without regenerating the launcher.

Found this bug testing internal inference models.

This does not remove the caching support for @eellison's caching for prologue benchmarking, because that happens under the same compile: https://github.com/pytorch/pytorch/pull/143408

Differential Revision: [D69476856](https://our.internmc.facebook.com/intern/diff/D69476856/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D69476856/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146925
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #146417
2025-02-12 02:38:42 +00:00
0acbf8039a [BE] Unskip some tensor creation tests on Mac (#146952)
Followup after https://github.com/pytorch/pytorch/pull/145367

One should never use skip, but rather xfail; otherwise one never knows when the test is finally fixed.

`test_float_to_int_conversion_finite` was fixed on MacOS a while back (presumably since the time Intel builds were disabled), while `test_float_to_int_conversion_nonfinite` is fixed by https://github.com/pytorch/pytorch/pull/145367, which selects architecture-appropriate reference values for the Arm ISA

Note that the results of a floating-point to integral type cast are undefined if the floating point value is outside the integral type's dynamic range
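A small illustration of that note (the exact integers printed are architecture-dependent, which is precisely why the reference values had to be selected per ISA):

```python
import torch

t = torch.tensor([float("inf"), float("-inf"), float("nan")])
print(t.to(torch.int64))  # undefined behavior: results differ between x86 and Arm
```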

"Fixes" https://github.com/pytorch/pytorch/issues/38752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146952
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-02-12 01:59:15 +00:00
78ebd3c502 Revert commit that removed windows testing in VS2019-> update (#146920)
This reverts commit b57b38b52ede2af27d4eb1bf6ba63868a3ee7553.

The reverted commit removed Windows testing for the VS build; that testing needs to be added back with the updated VS2022 build

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146920
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet
2025-02-12 01:12:05 +00:00
df5e232563 [BE] Delete NCCL slimming (#146943)
It was added by https://github.com/pytorch/pytorch/pull/35843 and served its purpose when everything was linked statically in libtorch_cuda.so, but for all our releases it's no longer relevant as nccl is now a dynamic dependency of libtorch_cuda.so

Besides, it does not work with the CXX11 ABI anyway, and creates problems with newer versions of NCCL, when two `collectives.o` objects are packaged into the library archive.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146943
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-02-12 00:35:55 +00:00
a58f421f4b [CUDA][CUDNN][SDPA] Pass dropout seed and offset to cuDNN in int64 (#146734)
Workaround for limitation in cuDNN that does not accept dropout seed/offset in `int32` for SM 10.0 kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146734
Approved by: https://github.com/Skylion007
2025-02-12 00:24:38 +00:00
281249ba54 [torch][amdsmi] Avoid ODR violation when loading amdsmi (#146324)
Summary:
amdsmi bundles its own copy of `libamd_smi.so`. When you're interacting with `amdsmi` from *only* python that's fine, but when you try to interact with `libamd_smi.so` from native code too this poses a problem, because from native code you'll be linking against the copy of `libamd_smi.so` from the SDK.

This means you'll end up with 2 copies of `libamd_smi.so` in your process, and potentially (Murphy's law says you will, as does our CI) violate ODR.

In order to avoid this issue from the PT side of the world we can hook the `dlopen("path/to/bundled/libamd_smi.so")` and try to use the already loaded/SDK version of `libamd_smi.so` first, before proceeding to use the `path/to/bundled/libamd_smi.so`.

Test Plan: CI, inspect process using libamd_smi.so from native + python and observe only a single copy loaded

Differential Revision: D69064038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146324
Approved by: https://github.com/malfet
2025-02-12 00:01:02 +00:00
7aa629f126 Update octokit/request-action to 2.4.0 (#146940)
The current version 2.1.0 has disappeared since yesterday:

* https://github.com/pytorch/pytorch/actions/workflows/upload-torch-dynamo-perf-stats.yml
* https://github.com/pytorch/pytorch/actions/workflows/upload-test-stats.yml

The latest version is 2.4.0 https://github.com/octokit/request-action
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146940
Approved by: https://github.com/izaitsevfb
2025-02-11 23:50:24 +00:00
f50d359ce2 [ c10d ] modify API to get device string from device with torch.device (#146290)
Modify the ```get_default_backend_for_device()``` API to extract the device string using ```torch.device()```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146290
Approved by: https://github.com/guangyey, https://github.com/H-Huang
2025-02-11 23:30:57 +00:00
3a29992ee6 [associative_scan] Lifted arguments (#140043)
This PR implements lifted arguments for associative_scan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140043
Approved by: https://github.com/ydwu4
2025-02-11 23:25:55 +00:00
f59a56e56f [ARM] Fix test_float_to_int_conversion_nonfinite (#145367)
We have broken tests on Aarch64 which are not enabled upstream, this PR will fix and enable those tests.

```
AssertionError: Tensor-likes are not equal!

Mismatched elements: 2 / 3 (66.7%)
Greatest absolute difference: 1 at index (1,)
Greatest relative difference: 1.0842021724855044e-19 at index (1,)

To execute this test, run the following from the base repo dir:
    python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_float_to_int_conversion_nonfinite_cpu_int64

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145367
Approved by: https://github.com/malfet
2025-02-11 22:22:10 +00:00
a20055288f [DTensor][Test] Create a simple unit test for tensordot (#146514)
Fixes #ISSUE_NUMBER

The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
2025-02-11 21:57:56 +00:00
443437648a Revert "Introduce new template heuristic for triton autotune configs (#144985)"
This reverts commit 69301fb10eb3f7fd49af5c681a2e386af115baba.

Reverted https://github.com/pytorch/pytorch/pull/144985 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I think it needs a small tweak to avoid breaking some internal code ([comment](https://github.com/pytorch/pytorch/pull/144985#issuecomment-2652021045))
2025-02-11 20:42:41 +00:00
b1ff90ae8a remove Windows XPU build workaround. (#144644)
From the RFC: https://github.com/pytorch/pytorch/issues/141946
Fixes https://github.com/pytorch/pytorch/issues/134989

After we land these fixing PRs:
1. https://github.com/pytorch/pytorch/pull/142245
2. https://github.com/pytorch/pytorch/pull/141943

We can remove the Windows XPU workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144644
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/gujinghui, https://github.com/atalman
2025-02-11 20:39:51 +00:00
664550ecbf [export] Serialize special values of float into strings for json. (#146490)
Summary: Currently inf is serialized as Infinity in JSON, which is not standard compliant. Instead, we will encode all special floating point values as strings and handle them at the JSON layer.
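For context, a minimal sketch of the issue with Python's standard `json` module (not the export serializer itself) and the string-encoding approach described above:

```python
import json

print(json.dumps({"v": float("inf")}))  # '{"v": Infinity}' -- not valid per the JSON standard
print(json.dumps({"v": "Infinity"}))    # '{"v": "Infinity"}' -- standard-compliant string encoding
```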

Test Plan:
see D69060784
CI

Differential Revision: D69186425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146490
Approved by: https://github.com/yiming0416
2025-02-11 20:01:27 +00:00
110638f702 [inductor] skip _test_insignificant_strides on rocm (#146849)
Check https://github.com/pytorch/pytorch/issues/146848 , the rocm kernel for _scaled_dot_product_attention does not match the meta kernel regarding output shape. cuda kernel is fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146849
Approved by: https://github.com/eellison, https://github.com/atalman, https://github.com/jansel
ghstack dependencies: #145904
2025-02-11 19:55:43 +00:00
b18e3c01aa [Inductor] Unifiy Low Precision FP Legalization for to_dtype_bitcast & constant (#144646)
The upcast in `to_dtype_bitcast()` breaks following operations that only work with the target type (I use `bitwise_and` in the updated UT).
![image](https://github.com/user-attachments/assets/77a6f3b6-b5e7-4ed8-ab65-09d76f077376)

This PR fixes this problem. Let's check the CI results to make sure it doesn't bring accuracy problems.

- Unified the type promotion of low-precision FP operations in the legalize func, grouping ops into sources (whose results may be promoted) and sinks (whose input may be cast back). (The terms _source_ and _sink_ are from [graph theory](https://en.wikipedia.org/wiki/Directed_graph#Indegree_and_outdegree).)

## Test
```bash
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_bfloat16_to_int16_cpu
pytest -vs test/inductor/test_torchinductor.py::CpuTests::test_float32_to_int32_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144646
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-11 19:45:04 +00:00
af349047c3 [FlexAttention] Bug fix broken flag (#146872)
# Summary

I somehow broke this... I think claude was trippin

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146872
Approved by: https://github.com/BoyuanFeng
2025-02-11 19:42:37 +00:00
ebd992724f Implement serializable getattr support for tensor subclasses (#145772)
builtins.getattr is not serializable, so we replace it with a custom op that has more refined schema.

Differential Revision: [D68899421](https://our.internmc.facebook.com/intern/diff/D68899421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145772
Approved by: https://github.com/bdhirsh
2025-02-11 19:05:14 +00:00
d5d3bdb55a Fix var CUDA_PATH_V128 in cuda128.bat file (#146906)
Followup after: https://github.com/pytorch/pytorch/pull/146653
This should fix upcoming CUDA 12.8 windows builds.
Issue found during pytorch-canary Windows AMI test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146906
Approved by: https://github.com/malfet, https://github.com/tinglvv
2025-02-11 18:43:55 +00:00
c7515da7b0 Implement cuda graphs implementation of torch.cond and torch.while_loop (#140979)
This is a new PR for #130386 , which got stale and was closed. Since I force-pushed to that branch in order to rebase it on top of main, the PR can no longer be reopened, according to https://github.com/isaacs/github/issues/361

I fixed the possibly-not-warmed-up problem described here: https://github.com/pytorch/pytorch/pull/130386/files#r1690856534

Since starting this, torch.cond and torch.while_loop now apparently have support for backward passes. I will look into what it might take to support that.
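For reference, a minimal `torch.cond` sketch (eager semantics only; it does not exercise the CUDA graphs path this PR adds):

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

x = torch.randn(4)
out = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
print(out)
```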

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140979
Approved by: https://github.com/eqy, https://github.com/eellison
2025-02-11 18:16:15 +00:00
e3839bd603 [BE] Strip #pragma once when embedding the headers (#146871)
This eliminates compiler warning, for example when compiling Metal shader with embedded headers
```
 with program_source:6:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:81:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:588:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:719:9: warning: #pragma once in main file [-Wpragma-once-outside-header]
#pragma once
        ^
program_source:829:29: error: use of undeclared identifier 'r0_2'
        auto tmp8 = in_ptr2[r0_2 + 768*x0];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146871
Approved by: https://github.com/dcci
2025-02-11 16:49:00 +00:00
861bf892fb Set USE_CUFILE=1 by default and add pypi package to binary build matrix (#145748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145748
Approved by: https://github.com/atalman
2025-02-11 15:49:01 +00:00
5235a18cd6 [SkipFiles] remove some more stuff from MOD_SKIPLIST (#146876)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146876
Approved by: https://github.com/anijain2305
ghstack dependencies: #146854
2025-02-11 15:00:56 +00:00
fc5913b6bf [StaticRuntime] Fix a bug that memory planner ignores subblocks (#146728) (#146855)
Summary:

When a Static Runtime graph node has sub-blocks, the memory planner does not consider the sub-blocks' inputs as inputs of the node. As a result, such inputs' lifetimes are incorrect and the corresponding tensor memory is released earlier than required, causing errors.

Differential Revision: D69195886

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146855
Approved by: https://github.com/swolchok
2025-02-11 13:59:54 +00:00
cyy
15635b14ce [4/N] Remove unnecessary once flag usage (#146783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146783
Approved by: https://github.com/albanD
2025-02-11 13:55:06 +00:00
69301fb10e Introduce new template heuristic for triton autotune configs (#144985)
Initial PR to refactor the bulkiness of mm_common and allow for better device-specific specialisation; e.g. in https://github.com/pytorch/pytorch/pull/143286 we require heavy conditionalisation to get ROCm-specific optimisations in.

This PR introduces a new file `torch/_inductor/template_heuristics.py` which implements device specific subclasses for autotune configs:
- CPUConfigHeuristic()
- CUDAConfigHeuristic()
- ROCmConfigHeuristic()
- XPUConfigHeuristic()

These subclasses are integrated as part of the `InductorChoices` class, which will be the interface for the kernel files to access the configs.

The mm_common, mm_plus_mm and conv configurations are implemented in this class, in the future we plan to bring in flex attention configurations also so all of the tuning config logic for templated triton kernels are handled in this file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144985
Approved by: https://github.com/jansel
2025-02-11 10:48:09 +00:00
229fb0bc83 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .requires_grad (#146742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146742
Approved by: https://github.com/zou3519
ghstack dependencies: #146571, #146741
2025-02-11 05:39:07 +00:00
f2da810516 [Dynamo][autograd.Function] Relax backward speculation strict mode: support .data (#146741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146741
Approved by: https://github.com/zou3519
ghstack dependencies: #146571
2025-02-11 05:39:07 +00:00
29523aa113 [Dynamo][autograd.Function] Relax backward speculation strict mode a bit (#146571)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146571
Approved by: https://github.com/zou3519
2025-02-11 05:39:00 +00:00
a7fe384d0e Remove torch._higher_order_ops from MOD_SKIPLIST (#146853)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146853
Approved by: https://github.com/williamwen42
2025-02-11 04:38:26 +00:00
001ebbf734 [MTIA] (4/n) Implement PyTorch APIs to query/reset device peak memory usage (#146751)
Summary: Public summary (shared with Github): This diff updates the unit test for the PyTorch API "reset_peak_memory_stats".

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_reset_peak_memory_stats
```

https://www.internalfb.com/intern/testinfra/testrun/9007199321947161

Reviewed By: yuhc

Differential Revision: D68989900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146751
Approved by: https://github.com/nautsimon
2025-02-11 03:51:48 +00:00
23524699d5 Only call triton in worker process, kick off worker processes earlier, during inductor codegen (#146417)
### Big idea
This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in memory future cache, considerably increasing the parallelism.
### Implementation Overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
- Extend @eellison's future cache to a class, mostly as a refactor
- Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to hit the cache on a cold start.
In the diffs after this, I will add more to CompiledTritonKernels so that TritonBundler automatically populates the in-memory cache with the existing triton kernels on a warm start, avoiding calling triton altogether on warm starts.
Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.
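A generic sketch of the in-memory future cache idea (illustrative names only, not Inductor's actual `CompiledTritonKernels`/`LambdaFuture` classes):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)
_future_cache: dict[str, Future] = {}

def expensive_compile(source: str) -> str:
    return f"compiled<{hash(source)}>"  # stand-in for triton.compile on a worker

def compile_async(source: str) -> Future:
    # kicked off early, during codegen, as soon as the kernel source is known
    if source not in _future_cache:
        _future_cache[source] = _pool.submit(expensive_compile, source)
    return _future_cache[source]

# later, when the generated module is loaded, the same future is reused:
kernel = compile_async("kernel_src").result()
```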

Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417
Approved by: https://github.com/jansel
2025-02-11 03:46:16 +00:00
fe94ece375 Revert "Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)"
This reverts commit 3d604b17d91b928c850ded83b2ec25ea066bb3f6.

Reverted https://github.com/pytorch/pytorch/pull/141791 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/141791#issuecomment-2649717140))
2025-02-11 03:17:59 +00:00
30cbf13544 [PGNCCL] Associate tensor allocation support with NCCL version (#146842)
This is a forward fix to #146589.
For NCCL versions lower than 2.19, the previous PR would see `RuntimeError: NCCL mem allocator is not supported in this NCCL version`.
This PR gates the support by checking link-time NCCL version via `ncclGetVersion`.
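A rough Python-side analogue of the gate (the actual check in this PR is done in C++ via `ncclGetVersion` against the link-time NCCL):

```python
import torch

# Only enable the NCCL memory allocator path when NCCL is new enough.
if torch.distributed.is_nccl_available() and torch.cuda.nccl.version() >= (2, 19):
    use_mem_allocator = True
else:
    use_mem_allocator = False  # fall back: allocator support unavailable on this NCCL
```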

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146842
Approved by: https://github.com/XilunWu, https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #146589
2025-02-11 02:52:52 +00:00
1d81ecfc54 Rename PrimHOPBase to BaseHOP + minor changes (#146727)
This PR:
- renames PrimHOPBase to BaseHOP
- changes the backward pass to always return a tuple (to match the
  forward pass).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146727
Approved by: https://github.com/ydwu4
2025-02-11 02:43:37 +00:00
275c034b16 [SkipFiles] remove some stuff from MOD_SKIPLIST (#146854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146854
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2025-02-11 01:34:46 +00:00
5205158c1b Replace is_same with is_same_v for concise syntax (#145450)
Replace `std::is_same<T, U>::value` with `std::is_same_v` for concise and consistent syntax with other code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145450
Approved by: https://github.com/Skylion007
2025-02-11 01:34:15 +00:00
f38f1dcd82 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 103c8b44bcb6fbf30b5411c5af19d312427525e7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/huydhn due to This change has been reverted internally D69129334 but the OSS revert failed https://github.com/pytorch/pytorch/pull/146437 ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2649610877))
2025-02-11 01:26:36 +00:00
0c9fdd6cfb [Docs] Fix description of input in torch.addbmm() (#146664)
Fixes #146613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146664
Approved by: https://github.com/mikaylagawarecki
2025-02-11 01:22:09 +00:00
2fafcd37c3 Revert "cpp_wrapper: Precompile device-specific header files (#144002)"
This reverts commit de6efa1feb0e8c9073640a77afdec1a53a477aed.

Reverted https://github.com/pytorch/pytorch/pull/144002 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this breaks some inductor tests running internally ([comment](https://github.com/pytorch/pytorch/pull/144002#issuecomment-2649569562))
2025-02-11 00:42:22 +00:00
d763093b49 [MPS] fix lu factor for large tensors with bs>1 (#146753)
Try this:
```python
import torch

batch_size = 2
A = torch.eye(256, device="mps")[None, :, :].expand(batch_size, -1, -1) + 0.1 * torch.randn((batch_size, 256, 256), device="mps")
A_cpu = A.cpu()
LU_cpu, pivots_cpu = torch.linalg.lu_factor(A_cpu)
LU, pivots = torch.linalg.lu_factor(A)
torch.testing.assert_close(LU.cpu(), LU_cpu)
```
You'll get a huge difference in the LU tensors
<img width="706" alt="Screenshot 2025-02-08 at 12 14 39" src="https://github.com/user-attachments/assets/b45f2b3c-e0a5-49c8-aa07-42792150b781" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146753
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-11 00:37:07 +00:00
937b41e3b5 Refactoring pipeline parallelism test cases to be device agnostic [1/n] (#146472)
In this series of PRs we intend to refactor the pipeline parallelism test cases to be completely device agnostic.

These changes include the following approaches:

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies

This should result in improved usability for all devices

For this PR we have shown support for the following devices:

- CPU (wherever applicable)
- CUDA
- HPU
- XPU

To add another device, users can simply append it to the device list
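A minimal sketch of the `torch.get_device_module(device)` pattern mentioned above (assumes a PyTorch version that exposes this helper):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
mod = torch.get_device_module(device)  # e.g. torch.cuda for "cuda", torch.cpu for "cpu"
print(mod)
```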

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146472
Approved by: https://github.com/H-Huang
2025-02-11 00:13:23 +00:00
b6273d7f4b [ROCm] Update periodic.yml to use 2GPU runners (#146839)
Temporary fix for rocm workflow.
The 4-GPU runners have all been taken offline due to a network timeout issue, so we aren't able to run any periodic jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146839
Approved by: https://github.com/jeffdaily
2025-02-10 23:41:11 +00:00
aa1622c0b6 Support ignoring parameters in FSDP2 (#146631)
Differential Revision: D69153051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146631
Approved by: https://github.com/awgu
2025-02-10 23:20:28 +00:00
c2bf3be011 [inductor] Remove _get_grid_fn_str (#146800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146800
Approved by: https://github.com/yanboliang
2025-02-10 23:14:30 +00:00
0d5fb0941f [cutlass backend] check against arch >= 100 (#145812)
Summary:
Want to add a guard against silent fallback to SM90.

GenerateSM100 was just added 3 days ago. https://github.com/NVIDIA/cutlass/blame/main/python/cutlass_library/generator.py#L8896

It should show up in CUTLASS 3.8 (not pinned yet).

Test Plan: ci

Differential Revision: D68748705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145812
Approved by: https://github.com/chenyang78, https://github.com/ColinPeppler, https://github.com/Aidyn-A
2025-02-10 22:41:08 +00:00
bab35eb26a fix intermediate debug information with cpp_wrapper (#145527)
Summary: before fix, code like:
```cpp
    aoti_torch_print_tensor_handle(buf0, "after_launch - triton_poi_fused_randn_0 - buf0");
    aoti_torch_print_tensor_handle(buf1, "after_launch - triton_poi_fused_randn_0 - buf1");
    printf("[  after_launch - triton_poi_fused_randn_0 - 0: %ld  ]", 0); printf("
");
    printf("[  after_launch - triton_poi_fused_randn_0 - 1228800L: %ld  ]", 1228800L); printf("
");
```
was generated, which is a syntax error.

Test Plan:
New unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145527
Approved by: https://github.com/desertfire
2025-02-10 22:24:26 +00:00
681894546b Fix bazel job after #144489 (#146840)
This is currently failing in trunk with the following error https://github.com/pytorch/pytorch/actions/runs/13246034191/job/36972742610

### Testing

Bazel job passing https://github.com/pytorch/pytorch/actions/runs/13247495161/job/36977571965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146840
Approved by: https://github.com/atalman
2025-02-10 22:17:36 +00:00
652880e840 Fix logging and test files which misspell "precision" (#146113)
Noticed this while working on something, decided to submit a quick fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146113
Approved by: https://github.com/drisspg
2025-02-10 21:54:16 +00:00
e65b89e4cd [Feat]: Improve KleidiAI 4 bit kernel performance (#146476)
Description:
1. New thread blocking accelerates GEMVs
2. We increase throughput of the lhs quant pack + matmul pipeline by decoupling two operations.
3. The new blocking strategy blocks `out_feature` to accelerate GEMVs

Perf improvements:
12% speedup in the LLM prefill phase and up to 16% speedup in the autoregressive phase

Perf Benchmarking : https://github.com/pytorch/pytorch/issues/143289#issuecomment-2545773370

Change-Id: Ie574ff8459fdb75701ae366158b4e118c70694e4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146476
Approved by: https://github.com/malfet
2025-02-10 21:30:57 +00:00
4d626c261b Fix workarea compute in lapackSyevd (#146456)
Workspace-query APIs return floating-point values that can lose precision when converted back to int. Solve this by using `nextafter` and `ceil`.
Add a regression test.
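A minimal Python sketch of the rounding idea, assuming nothing about the actual C++ LAPACK wrapper; it only illustrates why `nextafter` plus `ceil` avoids undershooting when a work size comes back as a float.

```python
import math

def work_size_to_int(lwork: float) -> int:
    # A float work-size query can come back fractionally below the true integer
    # requirement. Nudge it up to the next representable float, then take the
    # ceiling, so the conversion never rounds the workspace down.
    return int(math.ceil(math.nextafter(lwork, math.inf)))

print(work_size_to_int(2097151.999999))  # 2097152, not 2097151 as plain int() truncation would give
print(work_size_to_int(64.0))            # 65; overshooting by one element is harmless for a workspace
```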

Fixes #145801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146456
Approved by: https://github.com/malfet
2025-02-10 21:29:48 +00:00
8f073065d5 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expression to be returned from cond_fn.

The AOTI-generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
2025-02-10 21:25:40 +00:00
97d4753bd3 [hop][inductor] don't promote arg type for cond and while_loop (#146660)
HOP subgraph codegen assumes argument types are not promoted; otherwise we might generate a wrong kernel.

Differential Revision: [D69279031](https://our.internmc.facebook.com/intern/diff/D69279031)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146660
Approved by: https://github.com/zou3519, https://github.com/eellison
2025-02-10 21:24:52 +00:00
da216baaa2 Optimize inductor Self typing (#146669)
Replace method return type with `Self` typing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146669
Approved by: https://github.com/jansel
2025-02-10 20:39:56 +00:00
86b52f4209 Fix lint (#146846)
Fixes #ISSUE_NUMBER (https://github.com/pytorch/pytorch/actions/runs/13248382636/job/36980294598)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146846
Approved by: https://github.com/huydhn, https://github.com/clee2000
2025-02-10 20:00:29 +00:00
3d604b17d9 Exclude upsample_bilinear2d.vec from default core ATen decomposition table (#141791)
As upsample_bilinear2d.vec is a core ATen op, it should not be decomposed by default in the export path. Because the operator has CompositeImplicitAutograd dispatch, its decomposition is registered by default. This change adds an override list for CIA decompositions being registered in the default decomp table.
In the long-term, we likely will want to exclude decompositions for all core-tagged CIA ops, but this will require all consumers to be ready to handle the remaining three ops: upsample_nearest2d.vec, avg_pool1d, and adaptive_avg_pool1d. Until they are ready, I believe an explicit override list is the safest option.

Additionally, I've also removed the ExecuTorch XNNPACK delegate ConvertToUpsampleBilinear2d pass, as the pass breaks (and is not needed) given that the op is not decomposed. The purpose of this pass was originally to pattern match the decomposition and un-decompose it, but this is no longer necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141791
Approved by: https://github.com/tugsbayasgalan, https://github.com/digantdesai
2025-02-10 19:30:19 +00:00
97f6480cf5 Fix an issue where functional collectives don't force fx stride on inputs when compiled (#146467)
Fixes https://github.com/pytorch/pytorch/issues/146416

Also added contiguity checks in the C++ functional collective ops to prevent striding issues introduced during compilation from manifesting as silent correctness issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146467
Approved by: https://github.com/Chillee, https://github.com/lw, https://github.com/shunting314
2025-02-10 19:15:49 +00:00
3822a88d21 [symbolic shapes] Log symnode id (#146583)
We want to log the SymNode id, which will help us with provenance tracking between created expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146583
Approved by: https://github.com/bobrenjc93
2025-02-10 19:13:06 +00:00
b45e6fa707 Cleanup VS 2019 refs in pytorch (#145863)
Related to: https://github.com/pytorch/pytorch/issues/128835
Follow up on PR: https://github.com/pytorch/pytorch/pull/145319
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145863
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn, https://github.com/atalman
2025-02-10 19:05:35 +00:00
c02a1ecc1d [export][ez] Allow math.trunc for serialization. (#146715)
Summary: as title.

Test Plan: CI

Differential Revision: D69317084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146715
Approved by: https://github.com/angelayi
2025-02-10 19:05:07 +00:00
9b7d050600 Move capture_provenance to make_node_impl (#146625)
Previously we were only logging `make_user_impl` calls, which only get triggered for operations done on Python SymInts, not C++ SymInts. `make_node_impl`, in contrast, gets triggered for both Python and C++ SymInt operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146625
Approved by: https://github.com/bobrenjc93
2025-02-10 19:00:51 +00:00
0486a996d2 [sigmoid] Implement a OSS only model runner. (#146440)
Summary: Implement an OSS version of the model runner with clean dependencies. The new OSS model runner only removes Thrift and uses just the JSON header to load the model.

Test Plan: Test will be added in the next diff separately. (D69060784)

Differential Revision: D68846877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146440
Approved by: https://github.com/SherlockNoMad
2025-02-10 18:54:05 +00:00
519f547d05 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 18:34:59 +00:00
ad847da0cf [cutlass backend] fix bug for accumulator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-10 18:20:58 +00:00
ddcc97bb8c Make sure cutlass kernel .cu file has configuration name and nvcc compile command (#146668)
I think it's good to have everything in the .cu file, especially the nvcc compile command.

Technically, the configuration name can be found in the template already, so let me know if you think it's not needed.

Differential Revision: D69281295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146668
Approved by: https://github.com/chenyang78
2025-02-10 18:16:44 +00:00
6b3f51f870 use None to slice when list has one element only (#146638)
When autotune_num_choices_displayed is None and the list of choices has length 1, slicing with `[:-1]` means getting all elements except the last one, which resulted in an empty list.

Slicing with `[:None]` works.
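A small illustration of the slicing behavior described above (plain Python, no inductor code involved):

```python
choices = ["only_choice"]

# [:-1] drops the last element, so a one-element list becomes empty:
print(choices[:-1])    # []

# [:None] is equivalent to [:], keeping every element:
print(choices[:None])  # ['only_choice']
```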

Differential Revision: D69265168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146638
Approved by: https://github.com/drisspg
2025-02-10 18:15:45 +00:00
374b762bbf [ez][BE] get rid of the extra printf('\n') (#146726)
Summary: as title

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100a @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_cuda
```

Differential Revision: D69328701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146726
Approved by: https://github.com/ColinPeppler
2025-02-10 17:45:55 +00:00
5fd15a04b7 [ROCm] Enable inductor-periodic testing for MI300 (#144594)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144594
Approved by: https://github.com/malfet, https://github.com/huydhn

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-02-10 17:42:09 +00:00
b8261358ca Revert "windows Magma build for cu128 (#146653)"
This reverts commit d0e70c4fd33d9accca2c66203c19372733a83ea1.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/jeanschmidt due to Seems to have broken some windows tests, reverting to see if it gets green ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2648769150))
2025-02-10 17:36:32 +00:00
cbbb11d967 [dynamo][user-defined] Unify standard and non-standard __new__ codebase (#146737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146737
Approved by: https://github.com/jansel
ghstack dependencies: #146677
2025-02-10 17:31:13 +00:00
ee8a06f1f6 [dynamo][user-defined] Use class.__new__ instead of special casing (#146677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146677
Approved by: https://github.com/jansel
2025-02-10 17:31:13 +00:00
de6efa1feb cpp_wrapper: Precompile device-specific header files (#144002)
This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone.

Differential Revision: [D69185685](https://our.internmc.facebook.com/intern/diff/D69185685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144002
Approved by: https://github.com/desertfire
2025-02-10 17:13:09 +00:00
3cadce7af2 [NJT] Fix inference mode for composite implicit ops without nested-specific kernel (#146633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146633
Approved by: https://github.com/jbschlosser
2025-02-10 16:59:48 +00:00
dfe3b64282 [mps] Implement eager support for spherical_bessel_j0 (#146818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146818
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-10 16:58:05 +00:00
5f621c5879 [MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#146710)
Summary: Public summary (shared with Github): This diff implements a C++-Python binding to enable `reset_peak_memory_stats`.

Test Plan: The test is implemented in the following diff.

Reviewed By: yuhc

Differential Revision: D68988673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146710
Approved by: https://github.com/nautsimon
2025-02-10 16:57:09 +00:00
68c9e22ef7 FSDP: avoid resetting version counter of all_gather_output in inference_mode (#146709)
Summary:
FSDP needs to hide VC bumps on its allgather buffer, but it does not need to do this if the allgather buffer was generated under inference mode.

more details here: https://www.internalfb.com/diff/D69115649?dst_version_fbid=1316814572779281&transaction_fbid=849120230625711

Test Plan: CI

Differential Revision: D69311496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146709
Approved by: https://github.com/awgu
2025-02-10 16:56:40 +00:00
6aa924af68 Revert "[ONNX] Create deprecation warning on dynamo_export (#146425)"
This reverts commit 41e6d189a39a40b237ab9b9ab195cec1194b331b.

Reverted https://github.com/pytorch/pytorch/pull/146425 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/146425#issuecomment-2648472579))
2025-02-10 15:54:34 +00:00
1557b7bf9a Revert "[ONNX] Adjust and add deprecation messages (#146639)"
This reverts commit 63c2909ae3e293dee96bca5af88bc51d8ca0ce10.

Reverted https://github.com/pytorch/pytorch/pull/146639 on behalf of https://github.com/atalman due to Sorry Need to revert https://github.com/pytorch/pytorch/pull/146425 ([comment](https://github.com/pytorch/pytorch/pull/146639#issuecomment-2648465047))
2025-02-10 15:51:52 +00:00
a36c22f2ed further scheduler changes for invoke_quant: prologue low prec, (slightly) more aggressive fusion (#145104)
Respect invoke_quant low-precision options; also, be more aggressive in attempting fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145104
Approved by: https://github.com/shunting314, https://github.com/jansel
ghstack dependencies: #139102
2025-02-10 15:50:19 +00:00
899066eedf Fix round(...) with constants (#146495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146495
Approved by: https://github.com/anijain2305
2025-02-10 15:08:09 +00:00
611ca163fd [MPS] Add bilineard2d_aa implementation (#145526)
An interesting quirk of the algorithm, which is not very well documented, is that the value of align_corners is ignored in antialias mode; see the arguments of
e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)

Error out on the uint8 implementation (as it relies on very fragile integer arithmetic), since it's not implemented on any other accelerator devices at the moment.
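A hedged sketch checking the quirk noted above with the public `torch.nn.functional.interpolate` API; per that note, antialiased bilinear output should not depend on `align_corners` (tensor sizes here are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 32, 32)

# With antialias=True the align_corners setting is expected to be ignored:
a = F.interpolate(x, size=(16, 16), mode="bilinear", antialias=True, align_corners=True)
b = F.interpolate(x, size=(16, 16), mode="bilinear", antialias=True, align_corners=False)
print(torch.allclose(a, b))  # expected True, per the note above
```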
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526
Approved by: https://github.com/dcci
2025-02-10 15:03:14 +00:00
d0e70c4fd3 windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2025-02-10 13:48:55 +00:00
6f15a609d3 Test typing of arithmetic operators on Tensor (see #145838) (#146426)
See #145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146426
Approved by: https://github.com/Skylion007
2025-02-10 12:19:56 +00:00
c24038025d [ROCm] Unskip std:bad_alloc failures (#146407)
Flaky MI300 issue related to memory usage should now be resolved after https://github.com/pytorch/pytorch/actions/runs/13007160888?pr=145829.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146407
Approved by: https://github.com/jeffdaily
2025-02-10 11:01:56 +00:00
c88ae00692 fix: replace stderr with stdout for download messages in hub.py (#146475)
This PR addresses an issue where download logs in `hub.py` are sent to `stderr` instead of `stdout`. Hence, when running models with workers, these messages are incorrectly categorized as errors, leading to confusion.
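A minimal, generic sketch of the distinction (not the hub.py code itself): informational progress messages belong on stdout, while stderr is typically treated as an error stream by log collectors.

```python
import sys

# Treated as an error by many log pipelines, even though it is only progress info:
print("Downloading: model.pth ... 42%", file=sys.stderr)

# Keeping informational download messages on stdout avoids that misclassification:
print("Downloading: model.pth ... 42%", file=sys.stdout)
```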
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146475
Approved by: https://github.com/mikaylagawarecki
2025-02-10 10:46:10 +00:00
6667e5d786 [dim order] solve broken doc (#146641)
Differential Revision: [D69265340](https://our.internmc.facebook.com/intern/diff/D69265340/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146641
Approved by: https://github.com/svekars, https://github.com/Jack-Khuu
2025-02-10 07:51:26 +00:00
c4d835fbab [DTensor][conv] add DTensor convolution_backward op support for case where the input Tensor has requires_grad=False (#142278)
Fixes #142058

## Summary
DTensor's `convolution_backward` op throws an exception when the input tensor has `requires_grad=False`, which happens if the conv layer is the first layer in the model.

The ATen convolution_backward op usually returns 3 tensors (grad_input, grad_weight, grad_bias), and `grad_input` is actually an Optional[Tensor] which can be `None` in the case mentioned above.

However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` exists.

## Fix
Allow `grad_input` to be `None` for the `convolution_backward` op.

## Test
`pytest test/distributed/tensor/test_convolution_ops.py`

## Follow-up
The current implementation of DTensor conv op also ignores `output_mask` and this may need further care.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
2025-02-10 07:06:40 +00:00
effc545274 [DDP] Use NCCL allocated memory for gradient bucket (#146589)
So that NVLink SHARP comes with zero-copy on H100+ platforms for DDP applications:
less SM usage and less memory contention between NCCL kernels and compute kernels.

Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```

Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
2025-02-10 05:23:11 +00:00
387c993c3b [ca] remove private API: _compiled_autograd_should_lift (#146720)
Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag, which tries to inline in make_fx style during CA's initial pass. There's no more usage internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720
Approved by: https://github.com/zou3519
2025-02-10 04:29:57 +00:00
e8304f08fe Fix torch.take_along_dim param type and default description (#146474)
## Changes

- Change type description to `LongTensor`, consistent with [`torch.take`](https://pytorch.org/docs/stable/generated/torch.take.html)
- Add `dim` param default value description (see the sketch below)
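For reference, a small usage sketch of `torch.take_along_dim` showing the LongTensor indices and the default `dim=None` behavior documented by this change:

```python
import torch

x = torch.tensor([[10., 30., 20.]])
idx = torch.argsort(x, dim=1)               # LongTensor of indices
print(torch.take_along_dim(x, idx, dim=1))  # tensor([[10., 20., 30.]])

# With dim omitted (default None), both tensors are treated as flattened:
print(torch.take_along_dim(x, torch.tensor([2])))  # tensor([20.])
```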

## Test Result

**Before**
![image](https://github.com/user-attachments/assets/720ce158-2bc1-48b5-a188-56fcc7188d96)

**After**
![image](https://github.com/user-attachments/assets/05fe20bd-9476-4b97-ac2b-9b161d6532a1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146474
Approved by: https://github.com/mikaylagawarecki
2025-02-10 01:19:30 +00:00
298226f358 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42

Co-authored-by: Gabriel Ferns <gabeferns@meta.com>
2025-02-10 00:44:23 +00:00
2a55311773 [cuda] Simplify the sinc function a bit. (#146774)
`else` after `return` can be removed & the indentation can be reduced, for readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146774
Approved by: https://github.com/malfet
2025-02-09 20:09:34 +00:00
b133907d0a Update strided test to float32 (#146748)
Fixes #146377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146748
Approved by: https://github.com/BoyuanFeng, https://github.com/leijurv
2025-02-09 17:41:35 +00:00
91c4bf39d3 [mps] Add a shader for spherical_bessel_j0. (#146771)
In preparation for adding the operation to inductor/eager.
Adapted from the CUDA version of the shader.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146771
Approved by: https://github.com/malfet
2025-02-09 05:11:17 +00:00
0e83e7d56e [EZ] Add logic to build Metal shader with debug info (#146768)
By appending `-frecord-sources -gline-tables-only` to the compilation command

Helpful when debugging shaders compiled into libtorch

Test plan: Run
`python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm`
And then run the following to capture the shader and check that it contains debug info
```python
import torch
import os
os.environ["MTL_CAPTURE_ENABLED"]="1"
inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32)
with torch.mps.profiler.metal_capture("bilinear2d"):
    out = torch.nn.functional.interpolate(inp, scale_factor=(1.7,0.9), mode="bilinear")
```
<img width="769" alt="image" src="https://github.com/user-attachments/assets/e0316c1c-07a4-4da5-97b9-886c56857c1d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768
Approved by: https://github.com/dcci
2025-02-08 23:40:23 +00:00
6a9a02acbe Set enable_faithful_generator_behavior flag to True (#142513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142513
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420, #145223
2025-02-08 22:42:12 +00:00
580a305681 Raise MutationError if there are side effects when returning generator (#145223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145223
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424, #144420
2025-02-08 22:42:12 +00:00
68cfd36c11 Add CLEANUP_THROW bytecode (#144420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144420
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423, #144424
2025-02-08 22:42:12 +00:00
53ab82d8f5 Implement generator.throw(exception) (#144424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144424
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422, #144423
2025-02-08 22:42:12 +00:00
8ee095f7c1 Implement generator.close() (#144423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144423
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421, #144422
2025-02-08 22:42:12 +00:00
ca9b16e070 Implement generator.send(..) (#144422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144422
Approved by: https://github.com/zou3519
ghstack dependencies: #141055, #144421
2025-02-08 22:42:12 +00:00
d798831167 Implement generator.__iter__() (#144421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144421
Approved by: https://github.com/zou3519
ghstack dependencies: #141055
2025-02-08 22:42:12 +00:00
8603a1c870 Support generators (#141055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141055
Approved by: https://github.com/zou3519
2025-02-08 22:42:12 +00:00
ade8fee512 Use c10 version of half/bfloat16 in executorch (#144111)
Summary:
X-link: https://github.com/pytorch/executorch/pull/7040

Accomplished by importing relevant files from c10 into
executorch/runtime/core/portable_type/c10, and then using `using` in
the top-level ExecuTorch headers. This approach should keep the
ExecuTorch build hermetic for embedded use cases. In the future, we
should add a CI job to ensure the c10 files stay identical to the
PyTorch ones.
ghstack-source-id: 260047850
exported-using-ghexport

Test Plan: builds

Differential Revision: D66106969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111
Approved by: https://github.com/malfet
2025-02-08 22:40:14 +00:00
92b7e610ab [Inductor changes] Invoke Quant (#139102)
Adds a `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).

The primary motivations are

- Unifying scattered reasoning for quant operators throughout the code base

- Ease of pattern matching - see this very large pattern match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)). Compare it to the pattern I have in the tests:

```
        @register_graph_pattern(
            CallFunction(
                torch.ops.aten.mm,
                CallFunction(
                    torch.ops.higher_order.invoke_quant,
                    Ignored(),
                    Ignored(),
                    Ignored(),
                    scheme="nf4",
                ),
                Arg(),
            ),
            pass_dict=test_pass,
        )
```

- Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul.

Example graph:

``` Python
 ===== AFTER POST GRAD =====
 /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
         # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4');  repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
             # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1);  mul = arg1_1 = None
            return add
```

The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.

I wasn't sure exactly how the inductor-specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta on the node itself. And, following that, I wanted the invocation of the HOP to match how it will show up in the graph, so I decided to have it be an object that is then invoked for the tracing.

```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
TODO: do not require packing the args in a tuple; will do so following https://github.com/pytorch/pytorch/pull/139162.

Feedback welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
2025-02-08 19:30:19 +00:00
a1bfb39a31 [Inductor] Expand Identity ops prior to block pattern matching (#146000)
# Feature

Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity`'s.

This PR adds a few features to achieve this invariance.
 - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression (see the sketch after this list).
 -  Preprocess the expression with this expansion prior to pattern matching.
 - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
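A hypothetical sympy sketch of the Identity-stripping idea; the class and `replace` call here are illustrative stand-ins, not inductor's actual `expr.expand(identity=True)` implementation:

```python
import sympy

class Identity(sympy.Function):
    """Stand-in for a wrapper that groups terms but should not block matching."""

x, y = sympy.symbols("x y")
expr = Identity(x) + 2 * Identity(y)

# Strip the wrappers so indexing-pattern analysis sees the underlying expression:
stripped = expr.replace(Identity, lambda arg: arg)
print(stripped)  # x + 2*y
```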

# Test plan
This PR adds a few new unit tests:
 - Added a unit test specifically for `expr.expand(identity=True)`.
 - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.

I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
2025-02-08 18:11:53 +00:00
eee5622b98 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297
2025-02-08 18:00:49 +00:00
c098385cb3 [inductor] Refactor CaptureIndexing into global scope (#146297)
And inline SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257, #146282
2025-02-08 18:00:49 +00:00
d35f6b2339 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255, #146257
2025-02-08 18:00:40 +00:00
06604c4ec1 [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes the typing workarounds that were needed because it wasn't one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254, #146255
2025-02-08 18:00:30 +00:00
403db2faee [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
2025-02-08 18:00:17 +00:00
0e31e5932b [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146252
2025-02-08 18:00:08 +00:00
71498aeae3 [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
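A hypothetical, simplified sketch of the two dispatch styles described above (class and op names here are illustrative, not the real inductor handlers):

```python
class GetattrHandler:
    # Old style: every unknown attribute is turned into an op that forwards
    # to _default, which is hard to type-check.
    def _default(self, name, args):
        return f"{name}{args}"

    def __getattr__(self, name):
        return lambda *args: self._default(name, args)

class DefaultHandler:
    # New style: an explicit method per op, each forwarding to _default.
    def _default(self, name, args):
        return f"{name}{args}"

    def add(self, *args):
        return self._default("add", args)

    def mul(self, *args):
        return self._default("mul", args)

print(GetattrHandler().add(1, 2))  # add(1, 2)
print(DefaultHandler().add(1, 2))  # add(1, 2)
```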

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
2025-02-08 18:00:00 +00:00
46e83bb637 Fix linter F821 error (#146665)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-08 07:19:37 +00:00
a3ca5c7f4e remove incorrect warnings from min/max documentation (#146725)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725
Approved by: https://github.com/wdvr, https://github.com/malfet
2025-02-08 05:10:08 +00:00
63c2909ae3 [ONNX] Adjust and add deprecation messages (#146639)
Adjust and add deprecation messages to torch.onnx utilities and verification methods because they are only related to TorchScript and are obsolete.

Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639
Approved by: https://github.com/titaiwangms
2025-02-08 05:09:16 +00:00
2328dcccb9 [MPSInductor] Implement Welford reduction (#146703)
Still a work in progress: the fallback works as expected, but the custom shader does not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703
Approved by: https://github.com/jansel, https://github.com/dcci
2025-02-08 05:00:00 +00:00
69feef5a94 Fix broken meta function for flex-attention backwards (#146563)
# Summary

Fixes https://github.com/pytorch/pytorch/issues/146377

So what was the original problem: we were codegening a really weird epilogue:

```Python
        # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
        # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
        xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2
        tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask)
        x5 = (xindex % ks3)
        tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last')
        tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask)
 ```

 This epilogue was writing to and then reading from overlapping regions of memory, causing a race condition.

 ### Why were we generating this epilogue

 During the lowering we created a buffer with a different size/stride from the expected return strides. I think this added an implicit node for permuting this wrongly strided output to the expected one from the meta function. The scheduler for some reason thought it was okay to fuse this into the epilogue; to be honest, I don't know why.

 This fixes the broken meta function and the original repro. I will add a test, but it is hard to trigger; better than nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563
Approved by: https://github.com/Chillee
2025-02-08 04:13:52 +00:00
9c78fb920d Fix assertion failure in gemm template lowering (#146353)
Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an [assert](fd515e4f59/torch/_inductor/codegen/common.py (L1181)) that a buffer was previously removed.

The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.

The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.

Differential Revision: D68814625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-02-08 01:52:20 +00:00
cyy
6cb2f737ee Enable Windows tests (#146666)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666
Approved by: https://github.com/albanD
2025-02-08 00:55:20 +00:00
0ab67299c3 [MPS] lu unpack (#146681)
Implements the lu_unpack function on MPS. Haven't added new tests because they are covered by removing lu_unpack from UNIMPLEMENTED_XFAILLIST in test_mps, exercised via the `test_output_match` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-08 00:16:17 +00:00
803661526e Update ET pin to 41e7ffa (#145831)
ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework.

This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831
Approved by: https://github.com/metascroy
2025-02-07 23:52:20 +00:00
dcac3c3e06 [MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659)
Summary:
Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated".

Nit: The file previously contained two unit tests with the same name (due to wrong revert); I deleted a deprecated one to revamp the correct version.

Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```

https://www.internalfb.com/intern/testinfra/testrun/12103424065182810

Reviewed By: yuhc

Differential Revision: D68988435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659
Approved by: https://github.com/nautsimon
2025-02-07 23:06:35 +00:00
fa34128435 revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453)
Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass.

Test Plan:
With the change, build of APS model using rcclexp can now pass:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`

Reviewed By: c-p-i-o

Differential Revision: D69149588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
2025-02-07 22:43:52 +00:00
103c8b44bc move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

 D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-07 22:41:19 +00:00
45d35f5f5a Clean up op BC check list (#146577)
Summary: Remove the expired ones

Test Plan: ci

Differential Revision: D69226556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577
Approved by: https://github.com/hl475
2025-02-07 22:40:49 +00:00
908133f682 [TreeSpec] Add custom comparision function (#146442)
Summary:
https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls.

However, this made tests flaky when comparing TreeSpec for objects in local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'.

Type comparison will yield False when the local scopes are different, due to lru_cache.

Since this comparison is only used for testing purposes, we only test whether the str(type) values are equal.
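A hedged, standalone illustration of why type comparison fails for locally defined classes while their string forms still match (this is not the TreeSpec code itself):

```python
def make_inner():
    class Inner:
        pass
    return Inner

A = make_inner()
B = make_inner()

print(A == B)            # False: two distinct class objects from separate calls
print(str(A) == str(B))  # True: both are "<class '__main__.make_inner.<locals>.Inner'>"
```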

Test Plan:
```
PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py
```

Differential Revision: D69137706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442
Approved by: https://github.com/angelayi
2025-02-07 22:39:21 +00:00
91dfa82981 [FlexAttention] Fix dynamic shapes in max-autotune (#146657)
# Fixes
https://github.com/pytorch/pytorch/issues/146624

### Updated

From offline discussion, going with size_hint.

However, this does incur guards. I couldn't really think of a fancy way to do this. I was going to do `V.graph.sizevars.size_hint` with some default for num blocks, but we ultimately need some information about the input.

I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe).

 For instance, in the repro, we quickly hit the recompile limit.
 ```Shell
 torch._dynamo hit config.recompile_limit (8)
   function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161)
   last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546
To log all recompilation reasons, use TORCH_LOGS="recompiles".
To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657
Approved by: https://github.com/Chillee, https://github.com/yanboliang
2025-02-07 22:34:28 +00:00
579b9f2ed9 [inductor] Better exception error messages for cache_on_self (#146652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652
Approved by: https://github.com/yanboliang
2025-02-07 21:22:21 +00:00
04ce02182b [inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-02-07 21:21:21 +00:00
80a1696679 Revert "[cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)"
This reverts commit 5f0901e57341eb9865102c1caa3d986a0c4ae3bd.

Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))
2025-02-07 21:04:23 +00:00
206ad9f4ad [cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554)
This PR does a few things:
* Set fallback to ATen to False for most tests. Without this, a lot of tests would fail silently since they just use ATen.
* Disable two subprocess-related broken tests. They would crash in the subprocess; more investigation needed.
* Remove/disable the tests on A100. Let me elaborate a bit more.

There are two types of A100 tests:
* Normal tests that also test A100, e.g., mm, addmm, bmm. However, since the shift to CUTLASS 3x, they don't work anymore. GenerateSM80 would generate ops that use CUTLASS 2x, but they get filtered out since they are of GemmKind.Universal, while only GemmKind.Universal3x is supported in the 3x template.
* Tests for A100 only. The mixed mm and sparse semi-structured tests have been failing with "TypeError: can't multiply sequence by non-int of type 'str'" for a while. Disabled them for now. Do let us know if you care about them @alexsamardzic

Differential Revision: D69209929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554
Approved by: https://github.com/chenyang78
2025-02-07 19:59:28 +00:00
f17109bd96 Revert "windows Magma build for cu128 (#146653)"
This reverts commit 9e27d36e2b2a4f037a7e448c2f87a9ebb0d6e628.

Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))
2025-02-07 19:37:16 +00:00
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
b60f630de8 fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650)
Summary:
needed for https://github.com/pytorch/pytorch/pull/146513
Test Plan:
the existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650
Approved by: https://github.com/xmfan
2025-02-07 18:25:00 +00:00
9e27d36e2b windows Magma build for cu128 (#146653)
https://github.com/pytorch/pytorch/issues/145570

removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate with` .github/scripts/windows/cuda_install.bat`. The later one is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653
Approved by: https://github.com/atalman
2025-02-07 18:09:30 +00:00
23af9dde4d distributed/serialization: add experimental streaming torch.save/load methods (#146555)
Summary:

This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor.

This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility.

Security-wise, this fully supports weights_only and defaults to it being True. It does use pickle for some metadata, but that metadata is loaded with weights_only as well.
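The new streaming helpers themselves aren't reproduced here; as a hedged point of reference, this is the kind of in-memory state-dict round trip they target, sketched with the standard torch.save/torch.load and weights_only=True:

```python
import io

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Serialize the state dict into an in-memory buffer (stand-in for a stream).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)

# Deserialize with weights_only=True, the safe default this PR also adopts.
buffer.seek(0)
state = torch.load(buffer, weights_only=True)
model.load_state_dict(state)
print(sorted(state.keys()))  # ['bias', 'weight']
```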

Adapted from:

https://github.com/pytorch/torchft/pull/101

https://github.com/pytorch/torchft/pull/54

Test Plan:

pytest test/distributed/test_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555
Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki

Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>
2025-02-07 18:08:11 +00:00
68631f6e87 PyWork: preserve Python reference counting when used in functional collectives (#146376)
@fegin  found an issue where torchft is not compatible with functional collectives.

Found in https://github.com/pytorch/torchtitan/pull/806

The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug.

PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way pybind works, the Python object owns the C++ object rather than there being some form of shared ownership. Thus the PyWork Python object gets collected when returned to C++ from the PyProcessGroup, while the C++ PyWork object still exists. When the PyWork object is then used, this causes a deadlock, as the corresponding Python object no longer exists.

To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues, since we can now hold ownership in C++ of both the Python and C++ objects.

To make this cleaner we introduce a `WORK_OVERRIDE` macro, a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork`, and use it for all collectives in PyProcessGroup.

Test plan:

```
cd pytorch
pytest test/distributed/test_c10d_functional_native.py
```

```
cd torchft
pytest torchft/process_group_test.py -k functional -v -x -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376
Approved by: https://github.com/yifuwang
2025-02-07 18:07:53 +00:00
76c8a2dc48 Fix get_top() to return the base level event of the stack, not the most recently started event (#146649)
`get_top()` is really confusing when talking about a stack, because it can mean either the most recently started event on the stack or the top-level event in Perfetto (which displays the stack upside down). Rename it to `get_outermost` and fix the bug associated with it, so that it returns the correct value from the stack.

Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event:
```
tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes | tee out.log
```
<img width="1281" alt="image" src="https://github.com/user-attachments/assets/4eeb371a-4d81-415a-acc4-7d303a4b2a93" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649
Approved by: https://github.com/masnesral, https://github.com/anijain2305
2025-02-07 18:04:50 +00:00
f138b18d18 [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, like block size etc.

## Test program

Note, install triton before proceeding (pip install triton)

triton_test.py>>>
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Run it and we should get a trace file my_kernel_trace.json

Output has triton event with the kernel_kwargs attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-02-07 17:44:30 +00:00
ee45ea599d [dynamo] Actionable message on recompilations for fullgraph=True (#146550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
ghstack dependencies: #146553
2025-02-07 17:28:43 +00:00
fa0956951c [dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-02-07 17:28:43 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
076717785c Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222)"
This reverts commit 5ecdc428b230ab5ba44a90678f1c905e314f6ccb.

Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))
2025-02-07 16:19:41 +00:00
eqy
5d7532140f [CUDA][CUDA Graphs] Fix debug mode warning message (#145996)
The real method is `enable_debug_mode()`, `_cuda_enable_graphs_debug_mode` does not exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996
Approved by: https://github.com/ptrblck, https://github.com/eellison
2025-02-07 08:04:49 +00:00
002accfb8d Check meta strides for expanded dims in effn_attn_bias (#146054)
With `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of such a dim at 0 to avoid materializing a larger tensor than we need. Previously, we had checked the stride of the tensor, but if it is not realized, that will not work, so we should check the strides of the meta as well.
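A small, generic illustration of the stride-0 property of expanded dims that this check relies on (shapes here are arbitrary examples, not the attention bias itself):

```python
import torch

bias = torch.randn(8, 1, 128, 128)
expanded = bias.expand(8, 16, 128, 128)

# The expanded dim keeps stride 0, so no extra memory is materialized:
print(bias.stride())      # (16384, 16384, 128, 1)
print(expanded.stride())  # (16384, 0, 128, 1)
```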

Note: getting the exact ordering of realizing/slicing/requiring_exact_strides was a little tricky. I commented to @exclamaforte on an example unable-to-fuse message you get if you do it incorrectly.

Fix for https://github.com/pytorch/pytorch/issues/145760

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054
Approved by: https://github.com/shunting314
2025-02-07 06:35:57 +00:00
71e8a2bda4 Expand inductor codegen dtype asserts, fix scan (#146067)
We were codegening intermediary dtype asserts in some places but not all. This expands the assertions and fixes a newly failing assertion in

`TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067
Approved by: https://github.com/shunting314, https://github.com/jansel
2025-02-07 06:35:47 +00:00
cyy
f6bd20e8a2 Enable TemporaryFileName tests on Windows (#146311)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311
Approved by: https://github.com/albanD
2025-02-07 06:06:18 +00:00
1c872803cb [export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378)
Adds `dtrace_structured` logging so that when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols is logged, e.g. from the draft export CLI report (soon to be added to tlparse):
1. Guard added:
```
1. Constraint violation error.
    The specified input dynamic_shapes spec was found to be incorrect during tracing.
    Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}.
    This occured at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward:
            assert a.shape[0] == 3

        Locals:
            a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0)

        Symbols:
           s0: L['args'][0][0].size()[0]
...
```

2. Real tensor propagation:
```
1. Data dependent error.
    When exporting, we were unable to evaluate the value of `u2 < 0`.
    This was encountered 8 times.
    This occurred at the following stacktrace:
        File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward:
            return res[:c_item]

        Locals:
            res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0)
            c_item: u2
...
```

Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reporting by `TracingContext`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378
Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93
2025-02-07 05:46:05 +00:00
bc40ccf6aa [BE]: Inline special functions for MPS (#146627)
These header functions should be inlined for consistency and to avoid translation unit / symbol issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627
Approved by: https://github.com/dcci
2025-02-07 05:15:15 +00:00
ecf44d1002 Fixed a typo in dataset.py (#146600)
Changed word 'Mult' to 'Multi'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600
Approved by: https://github.com/Skylion007
2025-02-07 05:09:51 +00:00
41e6d189a3 [ONNX] Create deprecation warning on dynamo_export (#146425)
Reland #146003

Deprecation of `torch.onnx.dynamo_export`:

* [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425
Approved by: https://github.com/titaiwangms, https://github.com/atalman
2025-02-07 04:20:46 +00:00
cyy
fa0592b568 Remove some NOLINT (#146610)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-07 01:50:06 +00:00
624d94bdb8 [MPS] Extend torch.special.sinc to complex (#146648)
And to integral data types as well

Was too lazy to deduce the formula myself (or write a sympy script), but ChatGPT did a decent job of it, though it forgot that the input must be multiplied by $$\pi$$:
```math
\text{Re}\left(\text{sinc}(x + i y)\right) = \frac{\sin(x)\cosh(y) x - \cos(x)\sinh(y) y}{x^2 + y^2}
```
```math
\text{Im}\left(\text{sinc}(x + i y)\right) = \frac{\cos(x)\sinh(y) x + \sin(x)\cosh(y) y}{x^2 + y^2}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146648
Approved by: https://github.com/dcci
2025-02-07 01:12:37 +00:00
9ea1823f96 [ROCm][Windows] Remove external linkage from an anonymous namespace (#146607)
Fixes a clang-cl compiler error related to an attempt to export a symbol that doesn't have any external linkage, since it's declared within a local anonymous namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607
Approved by: https://github.com/jeffdaily
2025-02-06 23:48:20 +00:00
3379c65de6 [ROCm][Windows] Fix unrecognized _BitScanReverse intrinsic (#146606)
Since PyTorch with ROCm on Windows is built with clang-cl and not MSVC, the intrinsics used are different and hence an attempt to compile with `_BitScanReverse` fails. However, a call to `__builtin_clz` which follows in the subsequent preprocessor branch is correctly recognized by the clang-cl compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146606
Approved by: https://github.com/jeffdaily
2025-02-06 23:47:18 +00:00
0d8fc00e0a [ROCm][Windows] Fix isnan integer overload errors on MS STL (#146605)
Microsoft's STL has a problem with the integer overloads of std::fpclassify used by std::isnan and std::isinf. These functions need a cast to double to work correctly. Otherwise, the call fails with an "ambiguous call to overloaded function" error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146605
Approved by: https://github.com/jeffdaily
2025-02-06 23:44:11 +00:00
3f5ed05688 [Windows][ROCm] Fix c10 hip tests (#146599)
- Solves a problem related to .hip source files being ignored by the build system when HIP language is not enabled in CMake.
- Also ensures that the test executables link to an appropriate CRT Runtime Library and hence have access to all the necessary symbols. Previously, there were many problems related to linkage errors.
- Moves part of Linux-related hipBLASLt changes in `LoadHIP.cmake` under the UNIX conditional branch, as these aren't supported on Windows yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146599
Approved by: https://github.com/jeffdaily
2025-02-06 23:41:25 +00:00
e13a544b54 fix tf32 issue in test_inductor_freezing.py unit tests (#146444)
The test is hitting numerical mismatches in NVIDIA internal CI. Add the tf32_on_and_off decorator and update the check to assertEqual.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146444
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/eqy
2025-02-06 23:34:28 +00:00
eqy
7bd7f735d4 [CUDA][SDPA] Compute reference in test_triton_scaled_dot_product_attention_block_size_16_cuda_float32 in float64 (#146461)
Seems to currently fail with mismatches in the 1e-4 range presumably due to sdpa calling into the `MATH` backend here which is less fused than a triton kernel. Doing the ref computation in `float64` appears to fix it.
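
A hedged sketch of the float64-reference pattern described above (shapes and tolerances are illustrative, not the ones from the actual test):

```python
import torch

# small (batch, heads, seq, dim) inputs; runs on CPU as well
q, k, v = (torch.randn(2, 4, 16, 8) for _ in range(3))
out = torch.nn.functional.scaled_dot_product_attention(q, k, v)

# compute the reference in float64, then compare in the original dtype
ref = torch.nn.functional.scaled_dot_product_attention(q.double(), k.double(), v.double())
torch.testing.assert_close(out, ref.to(out.dtype), atol=1e-4, rtol=1e-4)
```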

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146461
Approved by: https://github.com/drisspg
2025-02-06 23:28:56 +00:00
2834fe5e93 [inductor] Fix test error test_force_cutlass_backend_aoti_cexpr_codegen (#146564)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend -- --exact 'caffe2/test/inductor:cutlass_backend - test_force_cutlass_backend_aoti_cexpr_codegen (caffe2.test.inductor.test_cutlass_backend.TestCutlassBackend)'
```

Differential Revision: D69219873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146564
Approved by: https://github.com/yanboliang
2025-02-06 23:02:41 +00:00
0c81b398ab [BE][Ez]: Enable some additional pylint ruff warnings (#146609)
Some additional code hardening with some pylint warnings in ruff that usually indicate bugs.  All code currently conforms nicely to them, but this will ensure these errors can be detected statically before running / creating tests.

The following rules:
* Ban walrus operators where they would have no effect over regular assignment; making intention more clear.
* Statically check for the common error of forgetting to put parens after the `super` call, which will cause an attribute error
* Ban bad string literal args to builtins `open`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146609
Approved by: https://github.com/aorenste
2025-02-06 21:58:08 +00:00
99dd846672 [torch] fix builds for older pybind (#146630)
Summary:
some versions of pybind we build with don't have `py::set_error`.

So just use the underlying python C API.

Test Plan: unit tests

Differential Revision: D69254629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630
Approved by: https://github.com/colin2328, https://github.com/ngimel
2025-02-06 21:22:00 +00:00
3008368b12 Honor Dr.CI classification results on auto commit hash update (#146337)
Disabling `ignore_flaky_failures` was a safer choice, but it seems that this option doesn't work with the current state of the CI. For example, https://github.com/pytorch/pytorch/pull/125806 hasn't been merged since May because there would always be a failure of one type or another. This effectively disables the automated mechanism.

My proposal here is to relax this rule and allow the bot to merge auto commit hash updates with `@pytorchbot merge` like a regular PR. Then we will at least have something working. If this causes issues, we can revert it and try the longer route of improving CI reliability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146337
Approved by: https://github.com/clee2000
2025-02-06 20:33:38 +00:00
44b69b80c2 [ROCm][TunableOp] Future proof TunableOp unit test. (#146548)
TunableOp UT will fail because the regular expression in the test will not work for future versions of ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146548
Approved by: https://github.com/jeffdaily
2025-02-06 20:26:02 +00:00
5cc1b54a91 [2/N][cp][example] flex attention in context parallel (backward pass) (#146397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146397
Approved by: https://github.com/fegin
ghstack dependencies: #145896
2025-02-06 19:50:02 +00:00
6220c64aea [1/N][cp][example] flex attention in context parallel (forward pass) (#145896)
**Description**
This is an example of how FlexAttention can be used in a context parallel fashion. Right now it's only a flex_attention call with collectives added and has no load balancer, but we're about to add the missing parts step by step:
1. backward pass
2. static load balancing for causal masking
3. dynamic load balancing for other general maskings
4. automatic collective insertion solution
5. non-intrusive context parallel APIs

**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/tensor/examples/flex_attention_cp.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145896
Approved by: https://github.com/fegin, https://github.com/Skylion007
2025-02-06 19:50:02 +00:00
5ecdc428b2 [while_loop][inductor] support sym expression as cond_fn output (#146222)
As titled. Previously, we only supported a tensor output from cond_fn; this PR also allows a shape expr to be returned from cond_fn.

aoti generated output code looks like:
```
V0203 11:28:05.750000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]     bool buf7_cond_result;
....
(while_loop_cond_graph_0_arg2_1_handle);
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         buf7_cond_result = u0 + u1 < 10L;
V0203 11:27:59.336000 2611693 torch/_inductor/compile_fx.py:1091] [1/0] [__output_code]         if (!buf7_cond_result) break;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146222
Approved by: https://github.com/desertfire
ghstack dependencies: #146194, #146195
2025-02-06 19:39:55 +00:00
1b879fd0ea [Inductor] Add a JIT Inductor unit test following #146293 (#146529)
Summary: To follow up https://github.com/pytorch/pytorch/pull/146293, add a JIT Inductor unit test. Other Triton template may need similar fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146529
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-02-06 19:21:15 +00:00
992388c100 [inductor] use ftz variant of exp (#146216)
Inductor generated exp op is compiled as the following ptx snippet by Triton.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.

Let Inductor generate the FTZ variant if the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel, https://github.com/eellison
2025-02-06 19:12:35 +00:00
9ee506bd93 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-02-06 19:04:50 +00:00
eqy
07b214402a [CUDA][B200] Update the number of threads in avg_pool2d backward for SM 10.0 (#145669)
Fixes register count issue when launching on SM 10.0, originally authored by @bilal2vec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145669
Approved by: https://github.com/nWEIdia, https://github.com/ngimel
2025-02-06 18:57:33 +00:00
99ddbb4802 [dynamo][fullgraph] Do not skip frame with fullgraph=True (#146527)
Earlier, if there were no ops in the graph, fullgraph=True would also fall back to eager. This hides issues in testing, where we silently fall back to eager and do not test the optimized bytecode. As can be seen in the PR, I had to fix several tests when I forced the use of the optimized bytecode in the absence of a graph. A few failing tests will be fixed in follow-up PRs.
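
A hedged illustration of the kind of frame this affects (the function is a contrived placeholder):

```python
import torch

@torch.compile(fullgraph=True)
def returns_constant(x):
    # no tensor ops -> an empty FX graph; per the summary above, such frames
    # previously fell back to eager silently even with fullgraph=True
    return 42

print(returns_constant(torch.randn(2)))
```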

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146527
Approved by: https://github.com/zou3519, https://github.com/StrongerXi
2025-02-06 18:56:07 +00:00
15b1ac3e86 Add torch.func.debug_unwrap (#146528)
Use it to unwrap any functorch-wrapped tensor. I don't recommend using
the output in a program since it breaks the semantics of the transforms,
but it seems useful for debugging.

I will note that some people have wanted to get intermediate values out
of an e.g. grad transform, so this might be a way to do that...
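
A small sketch of the debugging pattern described above, assuming the API is exposed as `torch.func.debug_unwrap(tensor)` as the title suggests:

```python
import torch
from torch.func import grad

def f(x):
    # inside the transform, x is a functorch-wrapped tensor; debug_unwrap peels
    # the wrapper(s) off so the underlying values can be inspected/printed
    raw = torch.func.debug_unwrap(x)
    print(type(raw), raw)
    return (x * x).sum()

grad(f)(torch.randn(3))
```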

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146528
Approved by: https://github.com/Chillee
2025-02-06 18:48:09 +00:00
49082f9dba parallelize sort (#142391)
- use __gnu_parallel::sort for gcc compilations
- add a parallelized version of std::sort and std::stable_sort for non gcc compilations

Using __gnu_parallel::sort:
provides ~3.7x speed up for length 50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64

The performance is measured using the following script:
```python
import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391
Approved by: https://github.com/malfet
2025-02-06 18:06:40 +00:00
7725d0ba12 [METAL] inline bfloat min/max (#146588)
After a recent commit 36c6e09528a7e071edecde083254da70cba26c95, building from source with `python setup.py develop` leads to an error due to multiple symbols for min/max:
```
FAILED: caffe2/aten/src/ATen/kernels_bfloat.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_bfloat.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_bfloat.metallib BinaryKernel_31.air Bucketization_31.air CrossKernel_31.air FusedOptimizerOps_31.air Gamma_31.air HistogramKernel_31.air Im2Col_31.air Indexing_31.air LinearAlgebra_31.air Quantized_31.air RMSNorm_31.air RenormKernel_31.air Repeat_31.air SpecialOps_31.air TriangularOps_31.air UnaryKernel_31.air UnfoldBackward_31.air UpSample_31.air
LLVM ERROR: multiple symbols ('_ZN3c105metal3minIDF16bEEN5metal9enable_ifIXgssr5metalE19is_floating_point_vIT_EES4_E4typeES4_S4_')!
```

This PR fixes that.

@malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146588
Approved by: https://github.com/FFFrog, https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:57:31 +00:00
e2e265e27b [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-02-06 17:27:07 +00:00
1090e58687 [mps] Remove a stale comment. (#146619)
The implementation of the function was moved to a shader, but the comment was left there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146619
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-06 17:25:29 +00:00
46390e9a37 [mps] Implement support for sinc() operator (inductor and eager). (#146539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146539
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 16:37:27 +00:00
a14c780c4c [dynamo] fix dynamo_compile logging on RecompileLimitExceeded (#146544)
Logging branches depending on whether RecompileLimitExceeded was raised. If we exceed the limit, we fall back to eager before even trying to analyze the frame. We handle RecompileLimitExceeded outside of the try/catch/finally that edits the metrics context:
72405b0c0f/torch/_dynamo/convert_frame.py (L908-L935).

dynamo_config and recompile_reason are both known before we raise RecompileLimitExceeded, so we can add them with the rest of the "common" metrics, which are logged on metric_context decorator exit, which is always called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146544
Approved by: https://github.com/masnesral
2025-02-06 16:20:42 +00:00
6ff3383157 Enable CUPTI on Windows (#141454)
Fixes:
- https://github.com/pytorch/pytorch/issues/93855

The PR enables CUPTI on Windows and enables unit tests to check CUDA profiling events.
Additionally, the changes can be verified using the following script:

```
import torch
from torch.profiler import profile, ProfilerActivity

def check_cupti_enabled():
    # Check if CUDA is available
    if not torch.cuda.is_available():
        print("CUDA is not available on this system.")
        return False

    # Create a simple CUDA tensor
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.randn(1000, 1000, device="cuda")

    try:
        # Use PyTorch profiler to perform a basic check
        with profile(activities=[ProfilerActivity.CUDA]) as prof:
            z = x @ y  # Simple CUDA operation

        # Print profiling results
        print("CUPTI is enabled and profiling works.")
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
        return True
    except RuntimeError as e:
        # If profiling fails, CUPTI is likely not set up correctly
        print("Error: CUPTI might not be enabled or accessible.")
        print(f"Details: {e}")
        return False

if __name__ == "__main__":
    if check_cupti_enabled():
        print("CUPTI is properly configured in PyTorch.")
    else:
        print("CUPTI is not configured correctly. Check your CUDA installation.")
```

Sample output:
```
CUPTI is enabled and profiling works.
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
     sgemm_128x128x8_NN_vec         0.00%       0.000us         0.00%       0.000us       0.000us       2.086ms       100.00%       2.086ms       2.086ms             1
                   cudaFree         9.67%       9.816ms         9.67%       9.816ms       9.816ms       0.000us         0.00%       0.000us       0.000us             1
     cudaDeviceGetAttribute         0.01%      10.000us         0.01%      10.000us       0.476us       0.000us         0.00%       0.000us       0.000us            21
    cudaGetDriverEntryPoint         0.00%       1.700us         0.00%       1.700us       0.850us       0.000us         0.00%       0.000us       0.000us             2
       cudaGetSymbolAddress        85.15%      86.438ms        85.15%      86.438ms      86.438ms       0.000us         0.00%       0.000us       0.000us             1
                 cudaMalloc         0.43%     433.300us         0.43%     433.300us     144.433us       0.000us         0.00%       0.000us       0.000us             3
           cudaLaunchKernel         2.61%       2.648ms         2.61%       2.648ms       2.648ms       0.000us         0.00%       0.000us       0.000us             1
      cudaDeviceSynchronize         2.13%       2.163ms         2.13%       2.163ms       2.163ms       0.000us         0.00%       0.000us       0.000us             1
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 101.511ms
Self CUDA time total: 2.086ms

CUPTI is properly configured in PyTorch.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141454
Approved by: https://github.com/malfet
2025-02-06 15:58:20 +00:00
FEI
8a4dd763b8 [CCA] remove TODO for hardware_destructive_interference_size (#145591)
@zyan0 @albanD  @houseroad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145591
Approved by: https://github.com/albanD
2025-02-06 14:41:25 +00:00
ed309b9156 Re-add stft option to align window for center = false (#146379)
Skips advancing the fc window on https://github.com/pytorch/pytorch/pull/145437, since I just found that there were non-trivial efforts to do so a while ago that were eventually reverted: https://github.com/pytorch/pytorch/pull/73434

Works around the issue by keeping the stft sans center overload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146379
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-02-06 14:07:13 +00:00
1b79d47635 Revert "[dynamo] check for incompatible configs (#146513)"
This reverts commit aab7925418be561a8af6adfcb8cf009a8786c31b.

Reverted https://github.com/pytorch/pytorch/pull/146513 on behalf of https://github.com/atalman due to inductor/test_fuzzer.py::TestConfigFuzzer::test_config_fuzzer_dynamo_bisect [GH job link](https://github.com/pytorch/pytorch/actions/runs/13174131431/job/36772837627) [HUD commit link](4a545eb85d) ([comment](https://github.com/pytorch/pytorch/pull/146513#issuecomment-2639860568))
2025-02-06 13:42:25 +00:00
340cfe4f28 [dynamo][fbcode] Turn on inline_inbuilt_nn_modules (#145407)
As title.

Some internal testing at https://fb.workplace.com/groups/241460628989036/permalink/411650015303429/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145407
Approved by: https://github.com/ezyang, https://github.com/jansel
2025-02-06 13:18:35 +00:00
bd7d4fb2b5 Revert "[DTensor][Test] Create a simple unit test for tensordot (#146514)"
This reverts commit 1f8baf09ea598c97f30731ddb8328b6aa8d31fe9.

Reverted https://github.com/pytorch/pytorch/pull/146514 on behalf of https://github.com/albanD due to The lint failures that you ignored are real right? ([comment](https://github.com/pytorch/pytorch/pull/146514#issuecomment-2639554636))
2025-02-06 11:26:43 +00:00
4a545eb85d Fix torch.nn.functional.one_hot param num_classes optional description (#146470)
The `torch.nn.functional.one_hot` [documentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html) describes the `num_classes` param as required (not optional), but users can call the function without passing it.

![image](https://github.com/user-attachments/assets/4e6d4feb-691f-451f-95b5-4ac11bac7bc2)

```python
>>> import torch
>>> a = torch.arange(0, 5) % 3  # [0,1,2,0,1]
>>> torch.nn.functional.one_hot(a)
tensor([[1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]])

```

`num_classes` has a default value of -1

93d98aca31/aten/src/ATen/native/native_functions.yaml (L6154-L6157)
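
For comparison, the same call with `num_classes` passed explicitly; the default of -1 infers the width from the data:

```python
>>> import torch
>>> a = torch.arange(0, 5) % 3  # [0, 1, 2, 0, 1]
>>> torch.nn.functional.one_hot(a, num_classes=5)
tensor([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0]])
```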

## Test Result

![image](https://github.com/user-attachments/assets/2c7203b7-6226-4ebc-84c8-cbf912fc48e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146470
Approved by: https://github.com/albanD
2025-02-06 07:48:05 +00:00
aab7925418 [dynamo] check for incompatible configs (#146513)
internal: https://fb.workplace.com/groups/1075192433118967/permalink/1599802033991335/

Assuming flags don't change during compilation, we shouldn't allow incompatible configs to be set at torch.compile wrap time.

Not in this PR: For flags that need to change during compilation, we'd have to be strict about where they can be used in the compile lifecycle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146513
Approved by: https://github.com/williamwen42
2025-02-06 07:39:52 +00:00
eqy
5f0901e573 [cuBLAS][cuBLASLt] Unify cuBLASLt workspaces with cuBLAS workspaces (#145130)
As `cuBLAS` workspaces are already per-stream, there shouldn't be kernel execution overlap with `cuBLASLt` kernels.

This PR reuses `cuBLAS` workspaces for `cuBLASLt` for the following benefits:

+ caching (`cuBLAS` workspaces were already cached, so now we get that for `cuBLASLt`)
+ "free" workspace size bump for `cuBLASLt` `cuBLASLt` workspace sizes were previously smaller than those for `cuBLAS` by default which potentially hurts performance, and we encountered difficulty in increasing the size due to downstream OOMs , see also #120925
+ fixes behavior broken behavior with the memtracker; https://github.com/pytorch/pytorch/pull/139442 attempted to handle peaky allocation behavior that broke memtracker equivalence tests but it didn't seem to fully work, here the cached/reused `cuBLAS` workspace seems to fix it
+ one environment variable to rule them all: `CUBLAS_WORKSPACE_CONFIG` applies directly to `cuBLASLt` without a confusing `CUBLASLT_WORKSPACE_SIZE` that users would also need to consider
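
A hedged sketch of the last point, using the standard `CUBLAS_WORKSPACE_CONFIG` environment variable (the `:4096:8` value is just an illustrative size:count setting; see the cuBLAS docs):

```python
import os

# must be set before the first cuBLAS call in the process
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
c = a @ b  # on CUDA this hits cuBLAS/cuBLASLt, which now share the same workspace setting
print(c.shape)
```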

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145130
Approved by: https://github.com/ngimel
2025-02-06 05:57:33 +00:00
36c6e09528 [MPSInductor] Fix min/max for bfloat16 (#146552)
By introducing a full specialization that upcasts everything to float, as bfloat does not have a native min/max

Test by running `test_min_max_reduction`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146552
Approved by: https://github.com/dcci
2025-02-06 05:15:00 +00:00
1f8baf09ea [DTensor][Test] Create a simple unit test for tensordot (#146514)
Fixes #ISSUE_NUMBER

The dims and shape of the tensors are from a specific Shampoo use case. We want to create a unit test for it to make sure there are no regressions for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146514
Approved by: https://github.com/tianyu-l
2025-02-06 05:09:34 +00:00
e01a5e9e1e Small improvements to NJT matrix multiplies (#146405)
Fixes #146404

Adds changes to the matmul and matmul_backward operation for nested jagged tensors, to support back propagation when the output is a regular strided tensor.
This required adding support for the nested matmul operation to work when the nested tensor wasn't 'self', i.e.
`A@B` where `A` isn't nested but `B` is.

The operation schemas had to be updated to reflect that either input can be a strided tensor instead (and the gradient), so an extra assertion is added in an edge case where neither input is nested.

Unit tests are also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146405
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser
2025-02-06 04:51:12 +00:00
389c5c0842 print out partial fx graph for all data-dependent errors (#146363)
The previous implementation didn't catch the following type of errors

```
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not extract specialized integer from data-dependent expression u2 (unhinted: u2).  (Size-like symbols: none)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146363
Approved by: https://github.com/angelayi, https://github.com/bdhirsh
ghstack dependencies: #146298, #146296
2025-02-06 04:21:34 +00:00
425804db2b [torch] fix exception types in custom class magic setattr/getattr (#146516)
Summary:
`c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`).

Some Python functions like `hasattr` rely on the type of the thrown exception being correct.
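
A pure-Python illustration of why the thrown exception type matters to `hasattr` (independent of the torch bindings themselves):

```python
class RaisesWrongType:
    def __getattr__(self, name):
        raise RuntimeError(f"no attribute {name}")  # not an AttributeError

class RaisesAttributeError:
    def __getattr__(self, name):
        raise AttributeError(name)

print(hasattr(RaisesAttributeError(), "x"))  # False: hasattr swallows AttributeError
try:
    hasattr(RaisesWrongType(), "x")          # the RuntimeError leaks out instead of returning False
except RuntimeError as exc:
    print("hasattr leaked:", exc)
```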

We don't need the full generality of those macros, so just do a targeted error type conversion here.

Test Plan: added unit test

Differential Revision: D69197217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516
Approved by: https://github.com/zdevito
2025-02-06 02:14:11 +00:00
3a6a203b98 [dynamic shapes][real tensor tracing] propagate unbacked hint when creating mod replacement (#146381)
Fixes data-dependent errors for 2 PT2I models in draft export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146381
Approved by: https://github.com/angelayi
2025-02-06 01:48:40 +00:00
c5062cca98 [export] make stack_trace optional in insert_custom_op_guards (#146438)
Summary: Fixes 1 PT2I exportability error

Test Plan: -

Differential Revision: D69132186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146438
Approved by: https://github.com/yiming0416, https://github.com/angelayi
2025-02-06 01:48:26 +00:00
6a985d8b2e Make inductor_utils.requires_gpu accept MPS (#145156)
Not yet ready to set HAS_GPU to true, but we can unskip tests that require a GPU.
(Noticed while running test_mps_basics.py that `test_scalar_cpu_tensor_arg` is getting skipped)

- Replace `GPU_TYPE` with `self.device` in `test_custom_op_fixed_layout_sequential`, `test_inductor_layout_optimization_input_mutations`, and `test_mutable_custom_op_fixed_layout2`; otherwise the GPU tests are just running for _cpu suffixes.
- Tweak `test_tmp_not_defined_issue3` to work correctly on CPU, by defining `test_device` and `test_device_0`
- UnXFail `test_mutable_custom_op_fixed_layout2_dynamic_shapes` as it should just work on CPU
- Add `skip_if_no_triton` decorator and decorate `test_reduction_config_limit` with it, as it does not need CPU nor GPU, but rather a triton backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145156
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/jansel
2025-02-06 01:14:36 +00:00
0dc03134d9 [MPS] linalg solve implementation (#146531)
Fixes #98222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146531
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-06 00:57:49 +00:00
495049860b [BE][Metal] Fix signed unsigned comparison warning (#146549)
I wish I knew how to extract Metal warnings during JIT compilation, but https://developer.apple.com/documentation/metal/mtldevice/makelibrary(source:options:)?changes=_7&language=objc is a lie as `error:` stays `nil` unless shader compilation fails. But when it does, the following warnings are thrown:
```
program_source:666:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:677:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:688:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:699:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:710:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {
                     ~~~ ^ ~~~~
program_source:723:26: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
  for (auto idx = 1; idx < size; ++idx) {

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146549
Approved by: https://github.com/dcci
2025-02-06 00:40:17 +00:00
e0cf519ade Revert "[inductor] Refactor op handlers part 2 (#146252)"
This reverts commit 13f0436abdff0386f33c7a8c25caa66e9af16dbd.

Reverted https://github.com/pytorch/pytorch/pull/146252 on behalf of https://github.com/atalman due to Sorry need to revert, failing internally ([comment](https://github.com/pytorch/pytorch/pull/146252#issuecomment-2638305417))
2025-02-06 00:04:04 +00:00
c7087d6b14 [BE][EZ][Metal] Do not pass tensor length as arg (#146522)
All devices capable of running Metal-2 support nonuniform threadgroup sizes; see https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf for more detail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146522
Approved by: https://github.com/dcci
ghstack dependencies: #146521
2025-02-06 00:03:41 +00:00
54ef029532 [BE][EZ][Metal] Mark constant inputs as constant (#146521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146521
Approved by: https://github.com/dcci
2025-02-06 00:03:41 +00:00
2001066c61 Revert "[inductor] Refactor op handlers part 3 (#146254)"
This reverts commit 8e9bda8d895e80da0fe480d02e100bae8332ed57.

Reverted https://github.com/pytorch/pytorch/pull/146254 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146254#issuecomment-2638300857))
2025-02-05 23:59:50 +00:00
72405b0c0f [ca] refactor compile reasons and log to tlparse (#146386)
This PR accumulates compile reasons inside each CacheNode and logs them to tlparse on each CA compile. It defines a compile as an autograd structure change, and a recompile as a dynamic shape change.

sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

for compiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]"
]
```

for recompiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]",
  "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386
Approved by: https://github.com/jansel
ghstack dependencies: #146229
2025-02-05 23:33:21 +00:00
68304dba7a Revert "[inductor] Refactor op handlers part 4 (#146255)"
This reverts commit 7aced455c542f629ffcd4f79c6af259bb966add8.

Reverted https://github.com/pytorch/pytorch/pull/146255 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146255#issuecomment-2638258089))
2025-02-05 23:24:20 +00:00
49effa0deb Revert "[inductor] Refactor op handlers part 5 (#146257)"
This reverts commit d3dd3eeb7f599a2816ba1a067a8fa5a1bb1c84c3.

Reverted https://github.com/pytorch/pytorch/pull/146257 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146257#issuecomment-2638251994))
2025-02-05 23:20:38 +00:00
93e1e6e07c Revert "[inductor] Minor compile time optimizations in DefaultHandler (#146282)"
This reverts commit b8a529cca18ae4d21b1681c5ea3a40635aba5a83.

Reverted https://github.com/pytorch/pytorch/pull/146282 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146282#issuecomment-2638239575))
2025-02-05 23:13:08 +00:00
7dc5cfe2ad Revert "[inductor] Refactor CaptureIndexing into global scope (#146297)"
This reverts commit 7288950bcd4c5851e003dded6ce87da643b93e49.

Reverted https://github.com/pytorch/pytorch/pull/146297 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146297#issuecomment-2638234829))
2025-02-05 23:10:08 +00:00
9555bfce88 Revert "[inductor] Pre-populate cache for simplify_with_ranges return value (#146373)"
This reverts commit 84ba9c6e7844a0b457bc64ca70a9c8cf3655d03d.

Reverted https://github.com/pytorch/pytorch/pull/146373 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/146252 ([comment](https://github.com/pytorch/pytorch/pull/146373#issuecomment-2638232033))
2025-02-05 23:07:08 +00:00
8af31e30d7 [Codemod][AddExplicitStrictExportArg] caffe2/torch (#146439)
Differential Revision: D69068432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146439
Approved by: https://github.com/avikchaudhuri
2025-02-05 22:56:54 +00:00
97b64f2e5c Fix workflow for closing nonexistent disable issues (#146447)
The workflow could not update issues because it didn't have permissions, and it looked green because it didn't check return codes.

Tested by running the workflow and seeing that issues did get closed
Fixes https://github.com/pytorch/pytorch/issues/145382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146447
Approved by: https://github.com/huydhn
2025-02-05 22:29:05 +00:00
9b6d680131 Remove stage_index_to_group_rank from schedule (#146217)
This PR allows schedules loaded via CSV to automatically set their `stage_index_to_group_rank` and removes the `stage_index_to_group_rank` argument from the `PipelineScheduleMulti` constructor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146217
Approved by: https://github.com/wconstab
ghstack dependencies: #146193
2025-02-05 21:26:45 +00:00
4ee7d0de86 Add generate_stage_to_rank_mapping utility (#146193)
We use `stage_index_to_group_rank` in the stage to determine which send/recv ops to issue, and in the schedule for IR generation. However, we don't need to expose this as an argument in our schedule class, so this stack of PRs removes it.

This PR creates a `stage_index_to_group_rank` utility function and removes the arg for the ZBV schedule. In a following PR I will add code to infer the `stage_index_to_group_rank` for the CSV schedule path, and we will be able to remove this argument from our classes entirely.

Related comment from @wconstab https://github.com/pytorch/torchtitan/issues/774#issuecomment-2619793741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146193
Approved by: https://github.com/wconstab
2025-02-05 21:26:45 +00:00
98b5d455fd [opcheck] Improve error reporting; allow atol/rtol overrides (#146488)
This PR improves opcheck to:
1. directly use torch.testing.assert_close (without a msg override).
   This allows it to print the absolute and relative differences and the
   number of mismatched elements.
2. take in an atol/rtol tolerance (for when someone just wants to use
   opcheck in their testing).
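
A hedged sketch of the new tolerance overrides, using a toy custom op registered via `torch.library.custom_op` (the op name, the toy schema, and the exact `atol`/`rtol` parameter names are assumptions based on the summary above):

```python
import torch
from torch.library import custom_op, opcheck

# illustrative custom op; "mylib::scale" is a made-up name
@custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    return torch.empty_like(x)

# atol/rtol are the overrides described in point 2 above
opcheck(scale, (torch.randn(3), 2.0), atol=1e-4, rtol=1e-4)
```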

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146488
Approved by: https://github.com/williamwen42
2025-02-05 21:25:06 +00:00
1f6b566d74 [ONNX] Bump onnx and onnxscript versions in CI (#146097)
Bump onnx onnxscript==0.1 in CI; Skipped onnxruntime 1.19 because it has regression on avgpool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146097
Approved by: https://github.com/malfet
2025-02-05 21:00:25 +00:00
9da376daa6 Add retain-output argument (#145921)
This PR adds a retain-output argument, which enables appending to the existing output file if it exists instead of deleting it and creating a new one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145921
Approved by: https://github.com/jansel
2025-02-05 19:45:09 +00:00
dd349207c5 Add check that envvar configs are boolean (#145454)
So we don't get unexpected behavior when higher typed values are passed in
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145454
Approved by: https://github.com/c00w, https://github.com/jamesjwu
2025-02-05 19:40:10 +00:00
9091096d6c Refactoring Distributed test cases to be device agnostic [1/n] (#145222)
In this series of PRs we intend to refactor distributed test cases to make them completely device agnostic.

These changes will include the following approaches to do the same :

- Allowing for multiple device types using instantiate_device_type_test
- Replacing calls to cuda stream with torch.get_device_module(device) wherever it applies
- Skipping set up steps required while using MultiProcessTestCase with DistributedTestBase (#138216) wherever applicable
- Replacing explicit calls to distributed backend (NCCL,HCCL,etc) with get_default_backend_for_device (#140536).

This should result in a significant improvement in usability across all devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145222
Approved by: https://github.com/kwen2501
2025-02-05 18:47:09 +00:00
eqy
6f7fda3f49 Bump nn.functional.conv3d tolerances for test_comprehensive (#135719)
`float16` tolerance was previously set to `1e-5` which seemed very low
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719
Approved by: https://github.com/Chillee, https://github.com/albanD
2025-02-05 18:34:12 +00:00
d2a2b9f8a7 Fix constants with non-functional operators (#145593)
Previously, in the non-strict path, we always error when trying to in-place update a constant tensor because those constant tensors are not actually wrapped by functional tensors. This is correct behaviour in torch.compile, because dynamo makes all constant tensors into buffers and AOTDispatcher just lifts them and wraps them in functional tensors. However, in non-strict, there is no such step that registers constants as buffers, so AOTDispatcher panics when it sees these dangling constant tensors when functionalizing.

Due to a recent change in the IR, this is no longer an issue in the non-strict path because we don't call AOTDispatcher at the training IR level, but now it is a problem for both strict and non-strict when we lower to inference (lowering to inference is very similar to non-strict tracing). As a result, we have at least one external (https://github.com/pytorch/pytorch/issues/141336) and internal issue reported due to this difference.

To fix this, there are two ways:
1. Make functionalization be aware of constant tensors and map them to functional tensors on the fly. This makes functionalization invariant uglier and could potentially open up a gate for more nasty bugs.
2. Special-case this in export. This seems more aligned with what dynamo does today, so I think we should do it this way. I think the current state could benefit from more refactors to make run_decompositions more similar to strict export (because both of them now handle this constant-registering logic), but it is a bit complicated to do now because the strict export version of this logic is also incomplete (it doesn't take into account the export graph renaming pass, etc.). I will follow up with more refactors after this PR (T213466691) to unblock users faster.

For future reference:

Why are we not doing "turn constants into non-persistent buffers and never de-register them"? The reason is that some internal models rely on module.to reliably moving params/buffers to the correct device. As a result, buffers are moved while constants are not. In the composability meeting, we agreed that export won't do device-agnostic tracing going forward (it will provide a way to specify a FakeTensor on CPU that can be configured to run on GPU), so after that is done, we can always turn constants into non-persistent buffers, which will simplify export's constant handling.

Differential Revision: [D68610739](https://our.internmc.facebook.com/intern/diff/D68610739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145593
Approved by: https://github.com/avikchaudhuri
2025-02-05 17:44:19 +00:00
44248c44eb [ROCm] miopen benchmark behavior now better aligns with cudnn (#145294)
The default benchmark setting is now false. With the new miopen behavior, when benchmarking is disabled, any shape that doesn't have a find hit triggers a quick search (same behavior as the prior default) and uses that result. When benchmark is enabled, it performs an exhaustive search and updates any DBs. miopen immediate mode is still available and is used when deterministic is true and benchmark is false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145294
Approved by: https://github.com/BrianHarrisonAMD, https://github.com/malfet
2025-02-05 17:19:53 +00:00
f27220e32a Revert "Move get accelerator to use build time flags when possible (#146098)"
This reverts commit 157d81c201715f84ead21d0ee420669ab7f58c04.

Reverted https://github.com/pytorch/pytorch/pull/146098 on behalf of https://github.com/atalman due to Failing internally, sorry need to revert ([comment](https://github.com/pytorch/pytorch/pull/146098#issuecomment-2637443675))
2025-02-05 16:39:37 +00:00
f55c0af37f [inductor] Support non-power-of-2 cooperative RSPLIT (#145689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145689
Approved by: https://github.com/eellison
2025-02-05 16:36:53 +00:00
db22e9d5a2 Implement blend operation for float, double, int in VEC ATen backend for SVE (#146479)
- Added support for SVE vectorized blend operation for float, double, int8_t, int16_t, int32_t and int64_t data types.
- Utilizes SVE ACLE intrinsics (svcntb, svcntw, svcmpne, svsel) to handle different vector lengths (VL) dynamically.
-  Ensured compatibility with SVE128, SVE256, and SVE512 hardware configurations.
-  Enabled back blend SVE vec tests

**Testing:**
**a) Float DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/0.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**b) Double DType:**
./vec_test_all_types_SVE256 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 3 machine (SVE256)
./vec_test_all_types_SVE128 --gtest_filter=BitwiseFloatsAdditional2/1.Blend    [Test Passed] on Graviton 4 machine (SVE128)

**c)Int DType:**
python3 test/inductor/test_cpu_repro.py CPUReproTests.test_vec_remainder
[Test Passed] on Graviton 3 machine (SVE256) and on Graviton 4 machine (SVE128)
<img width="661" alt="grv4_test_case_passed" src="https://github.com/user-attachments/assets/5572fcc0-a861-4bd6-bf9e-356219ffe656" />

Fixes https://github.com/pytorch/pytorch/issues/146309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146479
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-05 16:29:13 +00:00
cd6c0707a8 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixes the following issue when compiling this program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-05 15:43:05 +00:00
1bb977a2a4 [auto_functionalized] Support Tensor(a!)[]? (#145400)
Summary:
This is just updating some of the checks to allow the Tensor(a!)[]? type
through.

Fixes #144072

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145400
Approved by: https://github.com/laithsakka
2025-02-05 14:52:39 +00:00
282d185ec1 Revert "[inductor] use ftz variant of exp (#146216)"
This reverts commit b0b3fe8bcf00f30513e9bb3e197ea4cbcc2beef0.

Reverted https://github.com/pytorch/pytorch/pull/146216 on behalf of https://github.com/atalman due to inductor/test_op_completeness.py::TestOpCompleteness::test_triton_overrides [GH job link](https://github.com/pytorch/pytorch/actions/runs/13152430750/job/36702812599) [HUD commit link](b0b3fe8bcf) ([comment](https://github.com/pytorch/pytorch/pull/146216#issuecomment-2636961317))
2025-02-05 14:13:45 +00:00
8a2000fd42 [MPS] Implement support for zeta (both eager and inductor). (#146465)
A test was failing in inductor (`test_pointwise_zeta`), and I realized the operation was also missing from eager.
Implemented for both, leveraging the kernel. Happy to split in two (one PR for eager, one for inductor) if folks prefer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146465
Approved by: https://github.com/malfet
2025-02-05 13:55:50 +00:00
fd0cd6a08f [ROCm][TunableOp] Improve identification of fastest solution (#144942)
This PR addresses some stability issues with identifying the fastest solution on AMD GPUs, particularly the MI300.

Changes include:
- An improved timer, StreamTimerNoSync
- More aggressive skipping of slow solutions
- Additional statistics that can be used for diagnostics with PYTORCH_TUNABLEOP_VERBOSE=3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144942
Approved by: https://github.com/jeffdaily
2025-02-05 11:16:49 +00:00
e20b0c82d1 [ca] no longer require is_traceable annotations for c++ autograd functions (#146229)
This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-05 08:49:17 +00:00
cyy
6293d1446b [2/N] Remove NOLINT suppressions (#146402)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402
Approved by: https://github.com/soulitzer
2025-02-05 08:38:52 +00:00
e5ea7e9cdc add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-05 08:31:38 +00:00
b0b3fe8bcf [inductor] use ftz variant of exp (#146216)
Inductor generated exp op is compiled as the following ptx snippet by Triton.

```
        mul.f32         %f74, %f83, 0f3FB8AA3B;
        ex2.approx.f32 %f73, %f74;
```

But if we enable --use_fast_math in nvcc, exp in CUDA is compiled as
```
	mul.ftz.f32 	%f2, %f1, 0f3FB8AA3B;
	ex2.approx.ftz.f32 	%f3, %f2;
```
which uses the FTZ variant.

Let Inductor generate the FTZ variant if the use_fast_math config is true.

I see a 4% speedup for the two-pass prepare_softmax kernel; online softmax should be affected more since it does more computation per second (>10% in my testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146216
Approved by: https://github.com/jansel
2025-02-05 07:35:43 +00:00
clr
93d98aca31 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If an nn.Module getattr call throws, we should make sure that we don't crash with an internal error.

Note that I couldn't figure out how to test this, so advice would be awesome.  I have my best case attempt at  https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-02-05 05:49:32 +00:00
eb832b7bcc [export] Fix draft-export logging (#146106)
Summary: Fix issue where the lazyTraceHandler does not exist

Test Plan: CI

Differential Revision: D68928070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146106
Approved by: https://github.com/yiming0416
2025-02-05 05:49:22 +00:00
f242da41c7 Revert "move and fix logic to update unbacked bindings (#146115)"
This reverts commit 0144613e6ff6e018ca41085d1509dcceb80987f7.

Reverted https://github.com/pytorch/pytorch/pull/146115 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/146115#issuecomment-2635695958))
2025-02-05 04:51:39 +00:00
cyy
c6ea4425e5 Enable some tests on Windows (#146243)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146243
Approved by: https://github.com/albanD
2025-02-05 03:54:28 +00:00
f35e60b21c Revert "[cutlass backend] fix bug for accuminator dtype (#146356)"
This reverts commit 7c8ec84dab7dc10d4ef90afc93a49b97bbd04503.

Reverted https://github.com/pytorch/pytorch/pull/146356 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some slow cutlass tests are failing ([comment](https://github.com/pytorch/pytorch/pull/146356#issuecomment-2635594712))
2025-02-05 03:01:50 +00:00
3c0d2bc262 Revert "[Testing] Reduce test_exp flakiness (#146436)"
This reverts commit 4c5a9a5f949ef3019fc3ef095034ccfc973ff13d.

Reverted https://github.com/pytorch/pytorch/pull/146436 on behalf of https://github.com/huydhn due to Some test_exp2 starts failing in trunk I think ([comment](https://github.com/pytorch/pytorch/pull/146436#issuecomment-2635591878))
2025-02-05 02:58:53 +00:00
aafaf4016f [MPS] Add error checking when dispatching kernel (#146458)
Check that the thread-group size does not exceed the maximum thread-group size.
Add a regression test to validate that.
This makes failures like https://github.com/pytorch/pytorch/issues/146430 much easier to detect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146458
Approved by: https://github.com/dcci
2025-02-05 02:56:40 +00:00
9e45bc82e9 [aarch64] CUDA 12.8 aarch64 builds to nightly binaries (#146378)
https://github.com/pytorch/pytorch/issues/145570

Adding CUDA 12.8 and keeping 12.6 for the sbsa build; supported CUDA_ARCH: 9.0, 10.0, 12.0.

Refactors the binaries matrix for the CUDA sbsa build. Previously cuda-aarch64 was hardcoded to CUDA 12.6; it now reads 12.6 and 12.8. New build naming example: [manywheel-py3_9-cuda-aarch64-12_8-build](https://github.com/pytorch/pytorch/actions/runs/13132625006/job/36640885079?pr=146378#logs)

TODO: once 12.8 is stable, remove 12.6 in sbsa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146378
Approved by: https://github.com/atalman
2025-02-05 02:55:21 +00:00
001ad5bef5 [MPSInductor] Scope-down test_prod running in MPS (#146460)
Multi-stage reductions are not yet a thing, but the original `test_prod` just returned 0 for large reductions, so failures were reported as flaky. If one runs the same test with `MTL_DEBUG_LAYER=1`, the failure is obvious:
```
2025-02-04 11:51:30.034 Python[16594:289093] Metal API Validation Enabled
test_prod (__main__.MPSBasicTests.test_prod) ... -[MTLDebugComputeCommandEncoder _validateThreadsPerThreadgroup:]:1266: failed assertion `(threadsPerThreadgroup.width(1) * threadsPerThreadgroup.height(2050) * threadsPerThreadgroup.depth(1))(2050) must be <= 1024. (device threadgroup size limit)'
```

Fixes https://github.com/pytorch/pytorch/issues/146430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146460
Approved by: https://github.com/dcci
2025-02-05 01:47:01 +00:00
52aaadf379 [BE][Ez]: Enable ruff rule E731. use def instead of anonymous lambda (#146410)
Not sure why this isn't enabled, only 1 fix is needed and it supports autofixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146410
Approved by: https://github.com/aorenste, https://github.com/albanD
2025-02-05 01:44:41 +00:00
0e060342b6 [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte, https://github.com/jansel
2025-02-05 01:42:33 +00:00
616ac94175 [Dynamo] Fix spammy optimizer warning (#146374)
Fixes https://discuss.pytorch.org/t/torch-compile-optimizer-step-generates-excessive-warning-messages/216067/7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146374
Approved by: https://github.com/anijain2305
2025-02-05 01:03:49 +00:00
8177fc4d33 Make regex error catching compatible with Python 3.12+. (#145945)
In Python 3.12, the error message has changed from "Can't pickle local object" to "Can't get local object".
The old regex would no longer catch the error.

This PR makes it compatible with Python 3.12 while remaining backward compatible.
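
A hedged sketch of a pattern that accepts both wordings (the actual regex used in the PR may differ):

```python
import re

# accept both the pre-3.12 and 3.12+ wordings of the pickling error
_LOCAL_OBJ_RE = re.compile(r"Can't (?:pickle|get) local object")

assert _LOCAL_OBJ_RE.search("AttributeError: Can't pickle local object 'f.<locals>.g'")
assert _LOCAL_OBJ_RE.search("AttributeError: Can't get local object 'f.<locals>.g'")
```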

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145945
Approved by: https://github.com/H-Huang
2025-02-05 00:57:36 +00:00
9d5bf38dec [cpp_builder] refactor to reduce libcudart_static logs (#146394)
Want to reduce logs from `log_msg = f'"libcudart_static.a" not found under {path}'`, which was added in https://github.com/pytorch/pytorch/pull/142175

Differential Revision: D69096354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146394
Approved by: https://github.com/benjaminglass1, https://github.com/chenyang78
2025-02-05 00:41:30 +00:00
658e22d495 Revert "add support for capturing provenance of unary operations (#146413)"
This reverts commit bc33d993acdff2637bc6aee5e604fb969b11fc13.

Reverted https://github.com/pytorch/pytorch/pull/146413 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but some export tests are failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/146413#issuecomment-2635440261))
2025-02-05 00:32:40 +00:00
6e03f4f90e [export] Include metadata in FlatArgsAdapter (#146107)
Summary:
With https://github.com/pytorch/pytorch/pull/145956, which introduces
storing a list of namedtuple field names when serializing, we now want to
expose this list to the args adapter so that APS can utilize this information
and remove extraneous inputs.

Test Plan: No-op

Differential Revision: D68928416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146107
Approved by: https://github.com/pianpwk
2025-02-05 00:29:58 +00:00
84ba9c6e78 [inductor] Pre-populate cache for simplify_with_ranges return value (#146373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282, #146297
2025-02-04 23:36:44 +00:00
7288950bcd [inductor] Refactor CaptureIndexing into global scope (#146297)
And inline SimplifyIndexing into CaptureIndexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257, #146282
2025-02-04 23:36:44 +00:00
b8a529cca1 [inductor] Minor compile time optimizations in DefaultHandler (#146282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255, #146257
2025-02-04 23:36:34 +00:00
d3dd3eeb7f [inductor] Refactor op handlers part 5 (#146257)
This makes OpHandler just a normal class using inheritance, and removes the typing workarounds that were needed because it wasn't one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254, #146255
2025-02-04 23:36:25 +00:00
7aced455c5 [inductor] Refactor op handlers part 4 (#146255)
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.

Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252, #146254
2025-02-04 23:36:17 +00:00
8e9bda8d89 [inductor] Refactor op handlers part 3 (#146254)
Fixes type errors that arise from typing `V.ops`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226, #146235, #146252
2025-02-04 23:36:09 +00:00
13f0436abd [inductor] Refactor op handlers part 2 (#146252)
This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`.
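A simplified sketch of the two patterns (hypothetical names, not the actual Inductor classes):
```
class GetattrHandler:
    # Old pattern: every unknown op name is routed through __getattr__,
    # which is opaque to static type checkers.
    def __getattr__(self, name):
        def inner(*args, **kwargs):
            return self._default(name, args, kwargs)
        return inner

    def _default(self, name, args, kwargs):
        return f"{name}({args}, {kwargs})"


class DefaultHandler:
    # New pattern: every op gets an explicit method that forwards to
    # _default(), so the full op surface is visible to typing tools.
    def _default(self, name, args, kwargs):
        return f"{name}({args}, {kwargs})"

    def add(self, *args, **kwargs):
        return self._default("add", args, kwargs)

    def mul(self, *args, **kwargs):
        return self._default("mul", args, kwargs)
```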

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252
Approved by: https://github.com/yanboliang
ghstack dependencies: #146225, #146226, #146235
2025-02-04 23:36:01 +00:00
67be5953fe [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-04 23:35:53 +00:00
ed03f9ca10 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-04 23:35:43 +00:00
5cac550ddf [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-04 23:35:33 +00:00
7c8ec84dab [cutlass backend] fix bug for accumulator dtype (#146356)
Will add unit tests for accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146356
Approved by: https://github.com/Chillee
2025-02-04 22:10:17 +00:00
13e17aa106 Make the CUTLASS swizzle options configurable and default to 2. (#146088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146088
Approved by: https://github.com/henrylhtsang, https://github.com/mlazos
2025-02-04 22:07:26 +00:00
aac0577796 [TEST][Sparse] Force CUTLASS backend in TestSparseSemiStructuredCUTLASS (#146398)
We have noticed a discrepancy between the ways `test_sparse_semi_structured.py` is invoked, and in some of them the tests falsely fail because they were attempting to run on the wrong backend. All because `SparseSemiStructuredTensor._FORCE_CUTLASS = True` was never set in the setup of `TestSparseSemiStructuredCUTLASS` as it was in its `TestSparseSemiStructuredCUSPARSELT` counterpart 8444fe019a/test/test_sparse_semi_structured.py (L1039-L1046)

When I run tests via pytest, just by sheer luck it calls `test_values_backend_cutlass_cuda`, which sets the backend to CUTLASS bb4bd5f00b/test/test_sparse_semi_structured.py (L475) before `test_conversions_all_patterns_cuda_*`:
```
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_values_backend_cutlass_cuda PASSED [0.0071s]                                                                                          [ 72%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_bfloat16 PASSED [0.0484s]                                                                        [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_float16 PASSED [0.0041s]                                                                         [ 73%]
test/test_sparse_semi_structured.py::TestSparseSemiStructuredCUTLASSCUDA::test_conversions_all_patterns_cuda_int8 PASSED [0.0079s]                                                                            [ 73%]
```
In this scenario everything is good.

But when run as `python test/test_sparse_semi_structured.py -v -k cuda`, the order of the tests is not the same, and the cuSPARSELt backend is set just before running `test_conversions_all_patterns_cuda_*`, which causes failures:
```
test_cusparselt_backend_cuda (__main__.TestSparseSemiStructuredCUSPARSELTCUDA.test_cusparselt_backend_cuda) ... ok
...
test_conversions_all_patterns_cuda_bfloat16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_bfloat16) ... FAIL
test_conversions_all_patterns_cuda_float16 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_float16) ... FAIL
test_conversions_all_patterns_cuda_int8 (__main__.TestSparseSemiStructuredCUTLASSCUDA.test_conversions_all_patterns_cuda_int8) ... ERROR
```
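A hedged sketch of the kind of fix (simplified and hypothetical; the actual diff may differ): pin the CUTLASS backend in the test class setup so the result no longer depends on test order.
```
from torch.sparse import SparseSemiStructuredTensor
from torch.testing._internal.common_utils import TestCase

class TestSparseSemiStructuredCUTLASS(TestCase):
    def setUp(self):
        super().setUp()
        # Pin the backend so it does not depend on which test ran first.
        SparseSemiStructuredTensor._FORCE_CUTLASS = True
```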

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146398
Approved by: https://github.com/Skylion007, https://github.com/jcaip, https://github.com/eqy
2025-02-04 22:07:12 +00:00
317dae95fa cpp_wrapper: fix CPU cpp_wrapper and max-autotune tests (#145683)
Both of these tests mostly failed due to incorrect assumptions about the generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145683
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654, #145655
2025-02-04 22:05:59 +00:00
e2a029054d cpp_wrapper: enable all CPU repro tests (#145655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654
2025-02-04 22:05:59 +00:00
9873319a42 cpp_wrapper: fix set_.source_Tensor lowering (#145654)
Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654
Approved by: https://github.com/desertfire
ghstack dependencies: #145095
2025-02-04 22:05:59 +00:00
7c0fe7a045 cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095)
Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied.
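For context, a small illustration of the eager-mode behavior being mirrored: conjugation and negation are lazy dispatch-key views that can be materialized by cloning (only public torch APIs used; this is not code from the PR):
```
import torch

x = torch.randn(3, dtype=torch.complex64)
y = x.conj()                 # lazy: only the conjugate dispatch key is set
print(y.is_conj())           # True
z = y.resolve_conj()         # materialize: clone with the key cleared
print(z.is_conj())           # False; values equal y
```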

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095
Approved by: https://github.com/desertfire
2025-02-04 22:05:58 +00:00
09b0dfdc90 [metal] Add a missing cast to make the call to copysign unambiguous. (#146422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146422
Approved by: https://github.com/Skylion007, https://github.com/Samkm0084
2025-02-04 22:04:25 +00:00
clr
4e194bbfd6 dynamo: fsdp throw unimplemented vs attribute error (#146188)
Rather than throwing a full exception for FSDP, just return unimplemented
and respect the user options (i.e. fullgraph vs. graph break).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146188
Approved by: https://github.com/jansel
2025-02-04 21:45:55 +00:00
4c5a9a5f94 [Testing] Reduce test_exp flakiness (#146436)
By setting `reference_in_float` to false, as `exp(a + b)` could yield significantly different results than `exp(a.half() + b.half())`, as one can see in the following example (which happens to be the random values generated by the MacOS RNG for this test)

```
>>> import torch
>>> x=torch.tensor(2.5599, dtype=torch.half)
>>> y=torch.tensor(0.6970, dtype=torch.half)
>>> (x + y).exp()
tensor(26., dtype=torch.float16)
>>> (x.float() + y.float()).exp()
tensor(25.9799)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146436
Approved by: https://github.com/dcci
2025-02-04 21:24:08 +00:00
bc33d993ac add support for capturing provenance of unary operations (#146413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146413
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 21:16:15 +00:00
07b9fe0690 [Trace PyDispatcher] Add CustomFunctionHigherOrderOperatorVariable (#146272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146272
Approved by: https://github.com/zou3519
ghstack dependencies: #146270, #146271
2025-02-04 20:55:51 +00:00
d23e4f8109 use DTRACE_ENV_VAR as the trace logs directory if set (#146412)
```
(/home/bobren/local/a/pytorch-env) [7:47] devgpu035:/home/bobren/local/a/pytorch TORCH_DTRACE=/tmp/bb python r1.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146412
Approved by: https://github.com/angelayi
ghstack dependencies: #145848
2025-02-04 20:54:28 +00:00
7f65a20884 [BE]: Enable ruff SLOT checks (#146276)
This enables a check that a class which only inherits from immutable classes like str, tuple, and NamedTuple also defines `__slots__`, so it doesn't allocate memory unnecessarily. This also ensures contributors think about how they define classes that subclass NamedTuple and str, of which we have many in our codebase.
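A minimal, hypothetical illustration of what the SLOT rules flag (not code from this PR):
```
# SLOT000: subclass of str without __slots__ allocates a per-instance __dict__.
class TokenBad(str):
    pass

# Fixed: empty __slots__ keeps instances as lean as plain str.
class TokenGood(str):
    __slots__ = ()
```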

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146276
Approved by: https://github.com/aorenste
2025-02-04 19:18:23 +00:00
3525b834f0 [MPSInductor] Implement argmax/argmin (#146429)
TODOs:
 - Find test with NaN
 - Report internal compiler error when running `test_argmax_argmin1` (which is actually due to not enough shared memory)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146429
Approved by: https://github.com/dcci
ghstack dependencies: #146423, #146428
2025-02-04 19:16:06 +00:00
c591ad0c03 dump partial fx graph to stderr when dynamo tracing fails with guard on data-dependent (#146296)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data dependent error

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs)
```

This is what it now prints out

```
class GraphModule(torch.nn.Module):
    def forward(self, L_cache_: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", L_update_: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", L_pos_: "i64[1][1]cpu"):
        l_cache_ = L_cache_
        l_update_ = L_update_
        l_pos_ = L_pos_

         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        getitem: "i64[][]cpu" = l_pos_[0];  l_pos_ = None
        item: "Sym(u0)" = getitem.item();  getitem = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = None
        _check = torch._check(le);  le = _check = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0
        _check_1 = torch._check(ge);  ge = _check_1 = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        before: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = l_cache_.narrow(2, 0, item);  before = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = l_cache_.narrow(2, add_1, sub_1);  l_cache_ = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/torch/_dynamo/utils.py", line 3075, in run_node
    return getattr(args[0], node.target)(*args[1:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)

Caused by: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))  # r1.py:21 in forward (utils/_stats.py:27 in wrapper)
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146296
Approved by: https://github.com/zou3519
ghstack dependencies: #146298
2025-02-04 19:12:39 +00:00
8f861a8dfb [experimental] filter logs by subgraph (#146047)
```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[1/0]" python r4.py
```

```
TORCH_LOGS="dynamo" TORCH_LOGS_TRACE_ID_FILTER="[0/0],[1/0_1]" python r4.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146047
Approved by: https://github.com/laithsakka
2025-02-04 19:11:44 +00:00
7d60235aa6 [Metal] Small speedup for sum/prod (#146428)
As they cannot really be invoked over empty arrays
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146428
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #146423
2025-02-04 19:10:33 +00:00
b1663b31e1 [Metal][BE] Add #pragma once to all headers (#146423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146423
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-02-04 19:10:33 +00:00
292af3cc89 [BE][Ez]: ISC001 Auto concatenate implicit one line strings (#146408)
Apply the ruff rule about implicit string concatenation; this autofixes strings that are all of the same type and on the same line. These lines were likely broken up by autoformatters in the past. All fixes are automated using the autofixes in ISC001.
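A small, hypothetical before/after of what ISC001 autofixes:
```
# Before: two implicitly concatenated literals on one line (ISC001).
msg = "failed to load " "the requested module"

# After the autofix: a single literal.
msg = "failed to load the requested module"
```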

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146408
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2025-02-04 19:07:04 +00:00
f38a2ea0d4 [Dynamo] Better unsupported message for Fake Tensor Exception (#146357)
I cannot repro this. But this line shows up in internal logs, and I want
to know what the exception is and the context inside it. All of the
exceptions_allowed_to_be_fallback are dataclasses, so they should print
nicely.

Test Plan:
- code reading

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146357
Approved by: https://github.com/williamwen42
2025-02-04 18:52:11 +00:00
b0fe975521 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before this PR, we were getting an undefined symbol error in the output code when an unbacked symint was **only** used in the hop, because we didn't correctly record the dependency on the unbacked symbols for hops and the symbol got DCEed accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-02-04 18:47:34 +00:00
157d81c201 Move get accelerator to use build time flags when possible (#146098)
This PR does two main things (they are in a single PR to show how the newly added APIs are used).

- Add isBuilt and isAvailable APIs to the AcceleratorHook interface. See the inline docs for their exact semantics
- Use the newly added isBuilt for the accelerator check to ensure it does not poison fork

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146098
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/EikanWang

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-02-04 18:23:24 +00:00
23fffb54d5 Use OrderedSet in _functorch/partitioners (#146102)
In an attempt to make partitioning more deterministic, change all sets in partitioners.py to OrderedSets. Note that this change does not fix the non-determinism we're seeing in the internal model, but let's at least eliminate this potential source of non-determinism before investigating any changes to the mincut approach.
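To illustrate the determinism concern, a small sketch using `dict.fromkeys` as a stand-in for an insertion-ordered set (not the `OrderedSet` type used in the codebase):
```
# Iteration order of a plain set of strings can change between interpreter
# runs when hash randomization is enabled; an insertion-ordered container
# always replays the order elements were added in.
nodes = ["matmul_1", "add_2", "relu_3"]

unordered = set(nodes)           # order may vary across runs
ordered = dict.fromkeys(nodes)   # ordered-set stand-in: insertion order

print(list(ordered))             # always ['matmul_1', 'add_2', 'relu_3']
```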
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146102
Approved by: https://github.com/oulgen
2025-02-04 17:43:07 +00:00
53759ccca8 [AOTI] Fix an unaligned memory access issue in mm_template (#146293)
Summary: Fixes a corner case in the Triton MM template, where the dimension M (dynamic size) being smaller than BLOCK_M (and similarly for the N dimension) can trigger an unaligned memory access error.

Differential Revision: D69034578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146293
Approved by: https://github.com/chenyang78, https://github.com/jansel
2025-02-04 17:12:04 +00:00
87a63a9886 Add @nikitaved to torch.linalg CODEOWNERS/persons_of_interest (#141803)
As per title. I hope there is no objection :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141803
Approved by: https://github.com/albanD
2025-02-04 16:11:31 +00:00
e9f6e273e7 [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145916
2025-02-04 16:05:39 +00:00
7a5239afd7 [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
2025-02-04 16:05:39 +00:00
5d81bc3696 [MPSInductor] Implement prod reduction (#146396)
Mostly reusing `sum` reduction logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146396
Approved by: https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380, #146389
2025-02-04 14:08:04 +00:00
bbe95341d9 [MPSInductor] Implement min and max reductions (#146389)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146389
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370, #146380
2025-02-04 14:04:10 +00:00
106acf0eec Revert "[aoti] Assign proxy call args by name, and support default values. (#146263)"
This reverts commit 11f69808c64a65c68a4452250ba7719dcff27c78.

Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))
2025-02-04 12:57:55 +00:00
e0f22e54e8 [ROCm][TunableOp] Support leading dimensions in TunableOp signature. (#146358)
This is a feature enhancement that:
- May improve performance by distinguishing GEMMs with different leading dimensions.
- Fixes correctness issues reported by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146358
Approved by: https://github.com/jeffdaily
2025-02-04 10:27:43 +00:00
cyy
3f63f2bced Use std::string_view in tests (#146120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146120
Approved by: https://github.com/albanD
2025-02-04 09:51:36 +00:00
8444fe019a [export] Fix requires_grad deserialization (#146351)
Test Plan: CI

Differential Revision: D69072095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146351
Approved by: https://github.com/zhxchen17
2025-02-04 08:02:38 +00:00
bb4bd5f00b [Metal][BE] Fix the arguments of polygamma (#146382)
In the public API, order comes before input, while here they're reversed. Match them for consistency (and make this less error-prone).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146382
Approved by: https://github.com/jansel, https://github.com/malfet
2025-02-04 06:40:34 +00:00
54ceb7c565 [MPSInductor] Add support for sum reduction (#146380)
- Add a `threadgroup_sum` template to `c10/metal/reduction_utils.h` that so far uses a barrier to compute the reductions

TODOs:
 - Implement efficient reduction using cooperative functions such as `simd_shuffle_down`
 - Figure out how to merge several sum reduction together
 - Implement `reduction_store` that will only write results from the first thread

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146380
Approved by: https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #146369, #146370
2025-02-04 06:23:44 +00:00
cyy
1c16cf70c3 Apply ruff fixes to tests (#146140)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146140
Approved by: https://github.com/albanD
2025-02-04 05:41:01 +00:00
cyy
71e3575525 Remove unactivated test (#146233)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146233
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-04 05:26:04 +00:00
e68f5087d8 update _unsafe_set_version_counter to accept lists of tensors (#137921)
See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without wanting to suffer the overhead of going from python->C++ separately for every tensor in the list.

I left the binding in pybind, and used a `std::vector`. if we **really** need to optimize overhead even further, we could write a manual cpython binding.

I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921
Approved by: https://github.com/awgu, https://github.com/albanD
2025-02-04 04:51:11 +00:00
425aca40a4 Fix random crash in PyPer (#146327)
Summary: PyPer saw random crashes when writing into the ET file. This diff checks that the output file is in a valid state before writing into it, and catches the exception if something bad happens, instead of crashing.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D69065509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327
Approved by: https://github.com/sraikund16
2025-02-04 04:50:40 +00:00
0c37c332da [export] Additionally save pytree namedtuple field names (#145956)
If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context=”class_fqn”, children_spec=[*, *])`

The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field?

This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able to just drop those additional fields and the program will run successfully. This is needed/used in APS, where they use the unflattener's adapter to adapt the inputs based on the previously saved treespecs.

There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter.
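For illustration, the per-type field-name list being saved is essentially the namedtuple's `_fields` (names below are hypothetical, and this is not the actual serialized format):
```
from collections import namedtuple

Batch = namedtuple("Batch", ["ids", "mask"])
print(Batch._fields)   # ('ids', 'mask')
# An adapter that knows the originally saved field names can drop any fields
# added to the class later and still feed the program its expected inputs.
```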
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956
Approved by: https://github.com/zhxchen17
2025-02-04 04:42:30 +00:00
487400f47f [dynamo] Support functools.partial variables through inspect.signature (#146339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146339
Approved by: https://github.com/jansel
ghstack dependencies: #146322, #146116
2025-02-04 04:39:39 +00:00
9756c7d788 [benchmark] Remove ONNX (#146325)
ONNX exporter experiments in benchmark is obsolete and unmaintained. This PR removes it to unblock https://github.com/pytorch/pytorch/pull/146003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146325
Approved by: https://github.com/titaiwangms
2025-02-04 04:02:47 +00:00
a79d8f8ba4 [ROCm] Tune 3d tensor sums when not using fastest dimension (#146170)
Tune 3d tensor sums when not using fastest dimension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146170
Approved by: https://github.com/jeffdaily
2025-02-04 04:02:16 +00:00
7997ecf809 [BE] reduce log spew from test_triton_kernels.py (#145895)
One of the tests in this file was setting `self._logging.set_logs(output_code=True)`, which would cause logs to be printed for the rest of the tests in this file.

This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
2025-02-04 03:44:23 +00:00
5f53889850 [dynamo][builtin-skipfiles-cleanup] Remove inspect (#146116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146116
Approved by: https://github.com/williamwen42, https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #146322
2025-02-04 03:36:07 +00:00
762a05b3b3 [DCP] Remove all-gather of state dict keys (#145998)
The original `_all_gather_keys` call was for a safety check, but it could be costly as things scale, and it blocks the CPU.

Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have same set of keys, otherwise the API may hang.

In addition, we move the check to a utility function: `utils.assert_same_keys`. Users uncertain whether their state dict keys match across ranks can optionally call this API to check.

Resolves #145965 (as a workaround).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
2025-02-04 03:16:13 +00:00
7f796eb8b7 Revert "[inductor] Add typing to common.KernelArgs (#145916)"
This reverts commit 68cf36d5ab6165372160f65eb84e13d0f8dbc5dc.

Reverted https://github.com/pytorch/pytorch/pull/145916 on behalf of https://github.com/atalman due to Failing internally, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/145916#issuecomment-2632715678))
2025-02-04 03:07:12 +00:00
d3c7e4bb9c Revert "[inductor] Add typing to common.CSE (#145993)"
This reverts commit 8c657ae4be55c6133307ad278c1740af5db133a7.

Reverted https://github.com/pytorch/pytorch/pull/145993 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/145993#issuecomment-2632712384))
2025-02-04 03:04:01 +00:00
ecbc725fad Revert "[inductor] Finish typing common.py (#146225)"
This reverts commit 3a67c0e48d29578aeeaa872275e730020bb5cbc2.

Reverted https://github.com/pytorch/pytorch/pull/146225 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146225#issuecomment-2632709707))
2025-02-04 03:01:36 +00:00
0061eb5b70 Revert "[inductor] Refactor CSEProxy into global scope (#146226)"
This reverts commit 18380ab877711f2e651c69c78675f0d0b31d2ceb.

Reverted https://github.com/pytorch/pytorch/pull/146226 on behalf of https://github.com/atalman due to Sorry need to revert https://github.com/pytorch/pytorch/pull/145916 ([comment](https://github.com/pytorch/pytorch/pull/146226#issuecomment-2632707618))
2025-02-04 02:58:50 +00:00
cyy
f397c72697 Remove NOLINTNEXTLINE (#146238)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238
Approved by: https://github.com/albanD
2025-02-04 02:45:32 +00:00
5451c9b7c9 [MPSInductor] Add support for any reduction (#146370)
- Add `_new_accvar` function that creates a threadgroup variable
- As threadgroup variables cannot be initialized in place, add explicit initialization for the reduction var

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146370
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #146369
2025-02-04 02:45:03 +00:00
71179772cd [MPSInductor] Prep change for reduction support (#146369)
Add a `group_pos` parameter and set `group_size` when invoking reduction kernels.
Separate loads and stores, and insert a threadgroup barrier if a reduction is in place.

Should be a no-op right now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146369
Approved by: https://github.com/dcci, https://github.com/jansel
2025-02-04 02:38:07 +00:00
3dcbd04d1d [cutlass backend] Add instantiation level for generating configs (#146230)
Passing through instantiation level to generate more configs.

I do see some C++ compilation errors, but running is fine. Using 2222 generates 1k+ configs.

Differential Revision: D68989194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146230
Approved by: https://github.com/Chillee, https://github.com/mlazos
2025-02-04 02:36:04 +00:00
0e49f35e3d Integrate sympy expression provenance logging with structured logs (#145848)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145848
Approved by: https://github.com/angelayi
2025-02-04 01:21:37 +00:00
4168982dad PEP585: .github release triggers (#145708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145708
Approved by: https://github.com/malfet
2025-02-04 01:02:46 +00:00
cf6c5b8fa8 [mps/inductor] Adjust more tests that expect float64 as input. (#146366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146366
Approved by: https://github.com/malfet
2025-02-04 00:48:02 +00:00
2f40f789da Revert "[inductor] Refactor op handlers part 1 (#146235)"
This reverts commit 204be4e0a2e4509bd2457bfb295c429dd92c241f.

Reverted https://github.com/pytorch/pytorch/pull/146235 on behalf of https://github.com/atalman due to Breaks lint, sorry: Definition of polygamma in base class MetalOverrides is incompatible with definition in base class OpsHandler. Please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/146235#issuecomment-2632444514))
2025-02-04 00:00:08 +00:00
3aeccf2a28 DeepSpeed github repo move sync (#146320)
DeepSpeed has moved to a new repo on github https://github.com/deepspeedai/DeepSpeed

This PR updates this repo to use the new URL.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146320
Approved by: https://github.com/awgu
2025-02-03 23:20:49 +00:00
204be4e0a2 [inductor] Refactor op handlers part 1 (#146235)
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.

Interestingly this is a small compile time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%

WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%

WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.

please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
2025-02-03 23:15:13 +00:00
18380ab877 [inductor] Refactor CSEProxy into global scope (#146226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146226
Approved by: https://github.com/shunting314
ghstack dependencies: #146225
2025-02-03 23:15:13 +00:00
0bc036a9e9 use copy2d in h2d/d2h copy when possible (#146256)
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking

In this PR copy2d behaves in exactly the same way as copy does wrt to those additional syncs, except it calls a different underlying cuda call.

Tests are added for multiple cases that go through copy2d and for cases that avoid the copy2d pattern due to unsatisfied conditions.
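A hedged example of the kind of pattern that can take the 2D-copy path: a strided (row-pitched) pinned-memory view copied host-to-device. This is illustrative only; whether copy2d is actually used depends on the conditions above.
```
import torch

if torch.cuda.is_available():
    # Pinned host tensor; the column slice keeps rows contiguous but strided.
    src = torch.randn(1024, 1024, pin_memory=True)[:, :512]
    dst = torch.empty(1024, 512, device="cuda")
    dst.copy_(src, non_blocking=True)   # may be lowered to a single 2D memcpy
    torch.cuda.synchronize()
```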

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 23:07:54 +00:00
35af193408 [easy] Add type annotation for autotune_num_choices_displayed (#146323)
Test Plan: ci

Differential Revision: D69064447

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146323
Approved by: https://github.com/ColinPeppler
2025-02-03 23:04:21 +00:00
0463cb6ca5 [mps/inductor] Add support for digamma(). (#146292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146292
Approved by: https://github.com/malfet, https://github.com/jansel
2025-02-03 22:48:13 +00:00
178531c95e [ONNX] torch.onnx.export(dynamo=True) changes optimization to default (#146187)
Fixes #145897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146187
Approved by: https://github.com/justinchuby
2025-02-03 22:44:54 +00:00
d69c181d77 log out partial fx graph when guard on data-dependent during non-strict tracing (#146298)
As discussed with @avikchaudhuri and @bdhirsh last week, this can be quite useful when debugging.

The following code produces a data dependent error

```
import torch
from torch import nn

# UserError: Could not guard on data-dependent expression Eq(507 - u0, 0) (unhinted: Eq(507 - u0, 0)).  (Size-like symbols: u0)
class Repro(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, cache, update, pos):
        _, _, max_seq_len, _ = cache.shape
        _, _, seqlen, _ = update.shape

        pos_item = pos[0].item() # u0
        torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        torch._check(pos_item >= 0)
        before = cache.narrow(2, 0, pos_item)

        # FAIL
        # Laith: why can't we make unbacked expressions size-like?
        after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))

        # PASS
        end = torch.tensor(max_seq_len - pos_item - seqlen).item()
        after = cache.narrow(2, (pos_item + seqlen), end)

        return torch.cat([before, update, after], dim=2)

repro = Repro()

bsz = 1
n_heads = 4
max_seq_len = 512
head_dim = 64
seqlen = 5
pos_item = 1

cache = torch.zeros(bsz, n_heads, max_seq_len, head_dim)
update = torch.ones(bsz, n_heads, seqlen, head_dim)
pos = torch.tensor([pos_item])
example_inputs = (cache, update, pos)

torch.export.export(repro, example_inputs, strict=False)
```

This is what it now prints out

```
class GraphModule(torch.nn.Module):
    def forward(self, arg0_1: "f32[1, 4, 512, 64][131072, 32768, 64, 1]cpu", arg1_1: "f32[1, 4, 5, 64][1280, 320, 64, 1]cpu", arg2_1: "i64[1][1]cpu"):
         # File: /data/users/bobren/a/pytorch/r1.py:14 in forward, code: pos_item = pos[0].item() # u0
        select: "i64[][]cpu" = torch.ops.aten.select.int(arg2_1, 0, 0);  arg2_1 = None
        item: "Sym(u0)" = torch.ops.aten.item.default(select);  select = None

         # File: /data/users/bobren/a/pytorch/r1.py:15 in forward, code: torch._check(pos_item + seqlen <= max_seq_len) # u0 + 502 <= 507
        add: "Sym(u0 + 5)" = item + 5
        le: "Sym(u0 + 5 <= 512)" = add <= 512;  add = le = None

         # File: /data/users/bobren/a/pytorch/r1.py:16 in forward, code: torch._check(pos_item >= 0)
        ge: "Sym(u0 >= 0)" = item >= 0;  ge = None

         # File: /data/users/bobren/a/pytorch/r1.py:17 in forward, code: before = cache.narrow(2, 0, pos_item)
        narrow: "f32[1, 4, u0, 64][131072, 32768, 64, 1]cpu" = torch.ops.aten.narrow.default(arg0_1, 2, 0, item);  narrow = None

         # File: /data/users/bobren/a/pytorch/r1.py:21 in forward, code: after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
        add_1: "Sym(u0 + 5)" = item + 5
        sub: "Sym(512 - u0)" = 512 - item;  item = None
        sub_1: "Sym(507 - u0)" = sub - 5;  sub = None
        narrow_1 = torch.ops.aten.narrow.default(arg0_1, 2, add_1, sub_1);  arg0_1 = add_1 = sub_1 = narrow_1 = None

Traceback (most recent call last):
  File "/data/users/bobren/a/pytorch/r1.py", line 45, in <module>
    torch.export.export(repro, example_inputs, strict=False)
  File "/data/users/bobren/a/pytorch/torch/export/__init__.py", line 368, in export
    return _export(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 2079, in _export
    return _export_for_training(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1044, in wrapper
    raise e
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1017, in wrapper
    ep = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/exported_program.py", line 117, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1944, in _export_for_training
    export_artifact = export_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1879, in _non_strict_export
    aten_export_artifact = _to_aten_func(  # type: ignore[operator]
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1665, in _export_to_aten_ir_make_fx
    gm, graph_signature = transform(_make_fx_helper)(
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1809, in _aot_export_non_strict
    gm, sig = aot_export(wrapped_mod, args, kwargs=kwargs, **flags)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1585, in _make_fx_helper
    gm = make_fx(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2194, in wrapped
    return make_fx_tracer.trace(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2132, in trace
    return self._trace_inner(f, *args)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 2103, in _trace_inner
    t = dispatch_trace(
  File "/data/users/bobren/a/pytorch/torch/_compile.py", line 51, in inner
    return disable_fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_dynamo/eval_frame.py", line 749, in _fn
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1136, in dispatch_trace
    graph = tracer.trace(root, concrete_args)  # type: ignore[arg-type]
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1692, in trace
    res = super().trace(root, concrete_args)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 834, in trace
    (self.create_arg(fn(*args)),),
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1191, in wrapped
    out = f(*tensors)  # type:ignore[call-arg]
  File "<string>", line 1, in <lambda>
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1488, in wrapped_fn
    return tuple(flat_fn(*args))
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/utils.py", line 184, in flat_fn
    tree_out = fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_functorch/_aot_autograd/traced_function_transforms.py", line 879, in functional_call
    out = mod(*args[params_len:], **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/export/_trace.py", line 1793, in forward
    tree_out = mod(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 811, in module_call_wrapper
    return self.call_module(mod, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1762, in call_module
    return Tracer.call_module(self, m, forward, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 529, in call_module
    ret_val = forward(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/_symbolic_trace.py", line 804, in forward
    return _orig_module_call(mod, *args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1749, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/nn/modules/module.py", line 1760, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/r1.py", line 21, in forward
    after = cache.narrow(2, (pos_item + seqlen), (max_seq_len - pos_item - seqlen))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1239, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1286, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_export/non_strict_utils.py", line 654, in __torch_function__
    return func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 866, in handler
    return torch._library.utils.handle_dispatch_mode(
  File "/data/users/bobren/a/pytorch/torch/_library/utils.py", line 296, in handle_dispatch_mode
    return curr_mode.__torch_dispatch__(op_overload, overload_types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 1341, in __torch_dispatch__
    return proxy_call(self, func, self.pre_dispatch, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/proxy_tensor.py", line 910, in proxy_call
    out = func(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_ops.py", line 749, in __call__
    return self._op(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1369, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 2282, in _dispatch_impl
    decomposition_table[func](*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_decomp/decompositions.py", line 759, in slice_forward
    return self.as_strided(sizes, strides, storage_offset)
  File "/data/users/bobren/a/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1267, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1808, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1370, in _cached_dispatch_impl
    entry = self._make_cache_entry(state, key, func, args, kwargs, output)
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1640, in _make_cache_entry
    output_info = self._get_output_info_for_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1583, in _get_output_info_for_cache_entry
    synth_output = self._output_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1738, in _output_from_cache_entry
    return self._get_output_tensor_from_cache_entry(
  File "/data/users/bobren/a/pytorch/torch/_subclasses/fake_tensor.py", line 1709, in _get_output_tensor_from_cache_entry
    empty.set_(storage, storage_offset, shape, stride)
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/sym_node.py", line 564, in guard_size_oblivious
    r = self.shape_env.evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/recording.py", line 263, in wrapper
    return retlog(fn(*args, **kwargs))
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6468, in evaluate_expr
    return self._evaluate_expr(
  File "/data/users/bobren/a/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6658, in _evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Ne(507 - u0, 1) (unhinted: Ne(507 - u0, 1)).  (Size-like symbols: u0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146298
Approved by: https://github.com/bdhirsh
2025-02-03 22:16:03 +00:00
0da07a6d1d [dynamo][skip-function] Add missing unimplemented line (#146322)
This is a missing line from the merged PR in the stack below. Let's try to get this in quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146322
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
2025-02-03 22:11:55 +00:00
00dc5b10f6 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 2fd1b6b3610eb84cd615360a8fd23756a7f2e743.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))
2025-02-03 22:04:28 +00:00
15e12d5ec3 [Trace PyDispatcher] Support temporarily_pop_interpreter_stack ctx manager (#146271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146271
Approved by: https://github.com/zou3519
ghstack dependencies: #146270
2025-02-03 21:47:54 +00:00
bd8d7b1b74 [Dynamo][Trace PyDispatcher] Remove disable from HigherOrderOperator.__call__ (#146270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146270
Approved by: https://github.com/zou3519
2025-02-03 21:47:54 +00:00
fd73ae2068 [Utilization] Convert timestamp to str for datetime64 (#145985)
Convert all timestamps (float) to int timestamps during the data pipeline for the DB type datetime64.
Float does not work when trying to insert into ClickHouse using jsonExtract.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145985
Approved by: https://github.com/huydhn
2025-02-03 21:05:18 +00:00
1d4adf4e1f [dynamo] log recompile reason to dynamo_compile (#146117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146117
Approved by: https://github.com/bobrenjc93
2025-02-03 21:04:04 +00:00
11f69808c6 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixes the following issue when compiling this program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-03 20:15:59 +00:00
e67ce67498 [cutlass backend] update try_import_cutlass to accommodate pip install (#145891)
The goal of this PR is to provide 3 ways for people to try out CUTLASS backend:
1. fbcode / internal
2. pip install torch (nightly) and pip install nvidia-cutlass
3. build from source

Below I go into more detail on the combos of building from source versus installing via pip for torch and cutlass.

repro:
```
import torch
import torch.nn as nn

import torch._inductor.config as config

config.force_disable_caches = True
config.max_autotune = True
config.max_autotune_gemm_backends = "CUTLASS"
# the following is only needed if you use a custom cutlass library
# config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass"

class TestModule(nn.Module):
    def forward(self, A, B):
        return A @ B

model = TestModule().cuda()
M, K, N = 2048, 2048, 2048
A = torch.randn(M, K).cuda().half()
B = torch.randn(K, N).cuda().half()

C = torch.compile(model, fullgraph=True)(A, B)
```

## pre-requisite
This assumes you have the right CUDA toolkit; 12.4 is recommended. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are set correctly.

## combo 1: pip install torch + pip install nvidia-cutlass
Check https://pytorch.org/get-started/locally/ for **nightly** install command.
```
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install nvidia-cutlass
```
Then try running the script above. It should work.

## combo 2: build torch from source + pip install nvidia-cutlass
This is pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences.

## combo 3: build torch from source + use pytorch/third_party/cutlass
This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own.

## combo 4: any torch version + cutlass library from somewhere else
This is probably the only case where you need to pass in cutlass_dir. Just set cutlass_dir to the cutlass repo checkout. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891
Approved by: https://github.com/Chillee, https://github.com/ColinPeppler
2025-02-03 20:05:41 +00:00
f237172768 Fix not inlining functions used in metal files (#146316)
Fixes issue when building PyTorch with Xcode installed after https://github.com/pytorch/pytorch/pull/146231
```
FAILED: caffe2/aten/src/ATen/kernels_basic.metallib /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen/kernels_basic.metallib
cd /Users/Irakli_Salia/Desktop/pytorch/build/caffe2/aten/src/ATen && xcrun metallib -o kernels_basic.metallib BinaryKernel_30.air Bucketization_30.air CrossKernel_30.air FusedOptimizerOps_30.air Gamma_30.air HistogramKernel_30.air Im2Col_30.air Indexing_30.air LinearAlgebra_30.air Quantized_30.air RMSNorm_30.air RenormKernel_30.air Repeat_30.air SpecialOps_30.air TriangularOps_30.air UnaryKernel_30.air UnfoldBackward_30.air UpSample_30.air
LLVM ERROR: multiple symbols ('_ZN3c105metal4zetaEff')!
[3835/5420] Building CXX object c10/test/CMakeFiles/c10_small_vector_test.dir/util/small_vector_test.cpp.o
ninja: build stopped: subcommand failed.
```

AI to @malfet: Add linter that ensures that `c10/metal/` headers do not have any functions there, only templates
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146316
Approved by: https://github.com/malfet, https://github.com/atalman
2025-02-03 19:33:52 +00:00
674e0b668a Add non-strict export while_loop test back (#146195)
This is fixed by https://github.com/pytorch/pytorch/pull/145762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146195
Approved by: https://github.com/zou3519
ghstack dependencies: #146194
2025-02-03 19:28:22 +00:00
1138d0c4f6 [hop] enable while_loop return torch.ones with unbacked symbol expression. (#146194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146194
Approved by: https://github.com/zou3519
2025-02-03 19:28:22 +00:00
57b1fc35f6 [dynamo] Disable compiling on elementwise_type_promotion_wrapper (#146219)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146219
Approved by: https://github.com/zou3519
ghstack dependencies: #146075, #146283
2025-02-03 18:02:48 +00:00
64fc9ff09c Revert "[ONNX] Create deprecation warning on dynamo_export (#146003)"
This reverts commit e6c39d37e90242692cf25ea849abd47d11932cd7.

Reverted https://github.com/pytorch/pytorch/pull/146003 on behalf of https://github.com/atalman due to Broke internally ([comment](https://github.com/pytorch/pytorch/pull/146003#issuecomment-2631599314))
2025-02-03 17:17:14 +00:00
041e08f9dc Add buffers to parameterization rule (#145991)
Differential Revision: [D68959513](https://our.internmc.facebook.com/intern/diff/D68959513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145991
Approved by: https://github.com/bdhirsh
2025-02-03 16:49:03 +00:00
c0979d72b5 Revert "[hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)"
This reverts commit 68a363548409a3ff17965770304ee5e12fe718d9.

Reverted https://github.com/pytorch/pytorch/pull/143456 on behalf of https://github.com/atalman due to New tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/143456#issuecomment-2631475900))
2025-02-03 16:25:58 +00:00
01554c7b5a fix incorrect literal strings / accidental tuples (#146037)
* `expr,` is short for `(expr,)`
* literal strings over multiple lines need to escape the newline `\` or use `(...)`.
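
A minimal illustration of the two pitfalls above (variable names are just for the example):
```python
# A trailing comma turns an intended string into a one-element tuple:
msg = "some error message",
assert isinstance(msg, tuple)          # not a str!

# A literal string split across lines needs explicit continuation or parentheses:
msg = (
    "first half of a long message "
    "second half of the message"       # implicit concatenation inside (...)
)
assert isinstance(msg, str)
```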

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146037
Approved by: https://github.com/Skylion007
2025-02-03 15:08:11 +00:00
550441a87b Update slow tests (#146301)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146301
Approved by: https://github.com/pytorchbot
2025-02-03 11:37:16 +00:00
08b14936ae Disable has_relational_guards check for dict_tag optimization for now (#146232)
has_relational_guards almost always evaluates to true, which leads to a
slowdown in guard evaluation at runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232
Approved by: https://github.com/anijain2305
2025-02-03 07:56:06 +00:00
e3643e1e0e [MPS] Add linalg det and fix lu factor for non contiguous tensors (#146279)
Requested in #77764

This PR adds support for linalg.det on MPS and fixes lu_factor for non-contiguous tensors; the previous implementation crashed on any kind of non-contiguous tensor with the following error:
```
-[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort      python det.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-02-03 06:06:43 +00:00
1580f47bf4 [export][ez] Fix generated header file. (#146208)
Summary: as title.

Test Plan: CI

Differential Revision: D68978788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208
Approved by: https://github.com/yiming0416
2025-02-03 06:01:05 +00:00
cyy
7b512095ef Enable some tests on MacOS (#146268)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146268
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-03 05:04:24 +00:00
fa48757180 [dynamo] misc fixes for inspect (#146283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146283
Approved by: https://github.com/jansel
ghstack dependencies: #146075
2025-02-03 04:26:10 +00:00
cyy
6ac8bc0cd2 Remove unused import in tests (#146266)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146266
Approved by: https://github.com/Skylion007
2025-02-03 03:40:18 +00:00
d80eef7c6d [inductor] Guard a member variable with a define. (#146278)
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:

/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
  412 |   size_t blob_size_;
      |          ^
1 warning generated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-02-03 02:20:08 +00:00
c0ec2e0a0d [dynamo][functions] Improve getattr on functions (#146075)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146075
Approved by: https://github.com/jansel
2025-02-03 02:01:57 +00:00
d28fe3ed47 [metal] Move digamma to special_math.h (#146284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146284
Approved by: https://github.com/jansel
2025-02-03 01:29:14 +00:00
1f21f699ba [metal] Refactor digamma in preparation for moving it. (#146281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146281
Approved by: https://github.com/jansel
2025-02-02 23:54:45 +00:00
511d0dd558 [Dynamo][Trace PyDispatcher] Support calling id function over class (#146269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146269
Approved by: https://github.com/anijain2305
2025-02-02 22:29:30 +00:00
02fd4868d6 Fix unreachable code (#146262)
Fixes #146261

Removed unreachable code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146262
Approved by: https://github.com/Skylion007
2025-02-02 21:35:26 +00:00
5d55a6585d [MPS] lu factor ex implementation (#144651)
Implements `torch.linalg.lu_factor_ex`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144651
Approved by: https://github.com/malfet
2025-02-02 15:09:49 +00:00
0144613e6f move and fix logic to update unbacked bindings (#146115)
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here).

This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde.

Test Plan: added test

Differential Revision: D68880766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
2025-02-02 10:43:55 +00:00
a44a8a7d3a [audio hash update] update the pinned audio hash (#145988)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145988
Approved by: https://github.com/pytorchbot
2025-02-02 04:19:29 +00:00
cyy
8543d8395b [2/N] Enable ruff F841 on distributed tests (#146132)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146132
Approved by: https://github.com/Skylion007, https://github.com/rec
2025-02-02 03:44:48 +00:00
cef856faa9 [dynamo][enum] Trace through enum.py for enum construction (#146070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146070
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258, #146214
2025-02-02 03:12:36 +00:00
31fb691782 [dynamo] Graph break on tensor.retain_grad (#146214)
Fixes https://github.com/pytorch/pytorch/issues/146212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146214
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198, #146258
2025-02-02 03:12:36 +00:00
529eb8d558 [dynamo] Add return to python_type (#146258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146258
Approved by: https://github.com/jansel
ghstack dependencies: #146062, #146198
2025-02-02 03:12:36 +00:00
7854299b27 [mps/inductor] Implement support for polygamma(). (#146259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146259
Approved by: https://github.com/jansel
2025-02-02 01:54:23 +00:00
d89c7ea401 add WaitCounter type interface and get rid of type errors (#146175)
Summary: as titled

Differential Revision: D68960123

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146175
Approved by: https://github.com/andriigrynenko, https://github.com/Skylion007
2025-02-01 23:24:52 +00:00
3a67c0e48d [inductor] Finish typing common.py (#146225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146225
Approved by: https://github.com/Skylion007
2025-02-01 22:53:35 +00:00
dca5cc0255 [mps] Move polygamma to special_math.h. (#146253)
In preparation to implement it in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146253
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 21:45:23 +00:00
07dbd539b4 [BE][Ez]: Make c10/special arrays constexpr (#146246)
No reason to have array creation overhead for these constexpr arrays. This is better because it guarantees the array is not duplicated across templates or translation units unless necessary and allows the compiler to do static compile time bounds checking (even in loop based accesses)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146246
Approved by: https://github.com/dcci, https://github.com/malfet
2025-02-01 21:03:18 +00:00
d4ad7b91ad [mps] Move zeta() to special_math.h. (#146231)
In preparation for implementing digamma/polygamma

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146231
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-01 19:22:59 +00:00
f97307f463 [Docs] Add clarification for target types in CrossEntropyLoss doc (#145444)
The CrossEntropyLoss function requires that targets for class indices are provided as a long dtype and class probabilities as a float dtype.

The CrossEntropyLoss function distinguishes between the two scenarios (indices and probabilities) by comparing shapes: when the input and target shapes are the same, the target is treated as probabilities; otherwise it is treated as class indices, as already covered in the doc. The related code is here,
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624

I think the current documentation is good, but it can confuse users about the expected types, as reported in the issues, so this PR adds a bit more clarification.
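
A small example of the two target conventions described above (shapes and names are illustrative):
```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.randn(4, 5)  # (batch, num_classes), float

# Class indices: shape (batch,), integer (long) dtype
target_indices = torch.randint(0, 5, (4,), dtype=torch.long)
print(loss_fn(logits, target_indices))

# Class probabilities: same shape as the input, float dtype
target_probs = torch.softmax(torch.randn(4, 5), dim=1)
print(loss_fn(logits, target_probs))
```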

Fixes #137188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145444
Approved by: https://github.com/mikaylagawarecki
2025-02-01 18:55:58 +00:00
5ed5793016 Temp disable MKL in DistributionKernels.cpp (#146174)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

This is a reland of https://github.com/pytorch/pytorch/pull/132532 but excluding internal builds that already have some hardcoded values

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146174
Approved by: https://github.com/ngimel
2025-02-01 18:53:11 +00:00
e56dcf2772 [CPUInductor] Fix SVE256 detection (#146207)
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces is with stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256
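
For reference, a quick way to check what the stable API reports (the returned string is platform-dependent, e.g. an AVX or SVE variant):
```python
import torch

# Prints a capability string such as "DEFAULT", "AVX2", "AVX512" or an SVE variant.
print(torch.backends.cpu.get_cpu_capability())
```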

Fixes https://github.com/pytorch/pytorch/issues/145441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
8c657ae4be [inductor] Add typing to common.CSE (#145993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145993
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915, #145916
2025-02-01 16:34:18 +00:00
68cf36d5ab [inductor] Add typing to common.KernelArgs (#145916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145916
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914, #145915
2025-02-01 16:34:18 +00:00
8e56d713c9 [inductor] Add typing to common.OpDecompositions (#145915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145915
Approved by: https://github.com/yanboliang
ghstack dependencies: #145913, #145914
2025-02-01 16:34:11 +00:00
79f9f62e3a [inductor] Combine regexp checks in OpOverrides.paren (#145914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145914
Approved by: https://github.com/Skylion007
ghstack dependencies: #145913
2025-02-01 16:34:03 +00:00
4c004caa76 [inductor] Add types to DeviceOpOverrides (#145913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145913
Approved by: https://github.com/Skylion007
2025-02-01 16:33:49 +00:00
0f768c7866 Barebones flat_apply HOP (#146060)
This PR:
- adds pytree.register_constant for registering a class to be treated as
  a constant by torch.compile/torch.fx
- adds a very barebones flat_apply HOP. This should be sufficient to get
  mark_traceable working. A lot more work is necessary to get the custom
  operator case working (when make_fx sees a custom operator with PyTree
  arg types, it needs to emit a call to the flat_apply HOP).
- I expect the flat_apply HOP to change a lot, I want to ship this in
  the current state to unblock the mark_traceable and custom ops
  work.
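
A minimal sketch of how the register_constant API from the first bullet might be used, assuming it just takes the class; the `Config` class here is made up for illustration:
```python
import torch.utils._pytree as pytree

class Config:
    def __init__(self, eps: float):
        self.eps = eps

    # Value-based equality and hashing for the constant.
    def __eq__(self, other):
        return isinstance(other, Config) and other.eps == self.eps

    def __hash__(self):
        return hash(self.eps)

# Treat Config instances as opaque constants rather than pytree containers.
pytree.register_constant(Config)
```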

Test Plan:
- It's kind of difficult to test the barebones flat_apply HOP "works" so
  I added a really simple test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146060
Approved by: https://github.com/StrongerXi, https://github.com/yanboliang
ghstack dependencies: #146059
2025-02-01 16:17:48 +00:00
373606928b Add torch.utils._pytree.register_dataclass (#146059)
This is an API that registers a dataclass as a pytree node.
It directly calls torch.export.register_dataclass, but we should
eventually inline that implementation here. I want to use this API for
something in compile and feel weird calling
torch.export.register_dataclass.
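
A minimal usage sketch, assuming the new API takes just the dataclass type (the `Point` class below is made up for illustration):
```python
from dataclasses import dataclass

import torch
import torch.utils._pytree as pytree

@dataclass
class Point:
    x: torch.Tensor
    y: torch.Tensor

pytree.register_dataclass(Point)

leaves, spec = pytree.tree_flatten(Point(torch.ones(2), torch.zeros(2)))
print(leaves)                               # [tensor([1., 1.]), tensor([0., 0.])]
print(pytree.tree_unflatten(leaves, spec))  # Point(x=..., y=...)
```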

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146059
Approved by: https://github.com/StrongerXi, https://github.com/angelayi, https://github.com/yanboliang
2025-02-01 16:17:48 +00:00
cyy
2fd1b6b361 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-02-01 12:33:41 +00:00
2b00d211f0 Build RowwiseScaledMM.cu for SM89 (#145676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy
2025-02-01 11:44:58 +00:00
f40e013787 Fix aten.to when input is a tensor constant (#146220)
Summary:
Fix aten.to when input is a tensor constant.

In this case, `args_unwrapped` could just be a constant, so not a functional tensor.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export  -- -r  tensor_constant_aten_to
```

Differential Revision: D68984244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146220
Approved by: https://github.com/JacobSzwejbka
2025-02-01 11:07:33 +00:00
30f091da44 add speculation log divergence test (#145659)
Followup from a SEV. Confirmed that this breaks when stacked on top of https://github.com/pytorch/pytorch/pull/145660 (offending PR that caused the SEV)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145659
Approved by: https://github.com/laithsakka
2025-02-01 09:39:22 +00:00
a4e4368157 add node mapping processing (#146103)
Summary:
Add `node_mapping = create_node_mapping(pre_grad_graph_id, inductor_post_to_pre_grad_nodes, debug_info)`, to produce a `inductor_provenance_tracking_node_mappings.json` file. This file will be used by the provenance tracking highlighter tool to create provenance visualization.

The `inductor_triton_kernel_to_post_grad_nodes.json` and `inductor_provenance_tracking_node_mappings.json` files are not dumped if they are both empty, so they were removed from some of the `test_structured_trace` tests.

Test Plan:
CI
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r graph_provenance

buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing

python test/dynamo/test_structured_trace.py
```

Differential Revision: D68190173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146103
Approved by: https://github.com/chenyang78
2025-02-01 08:29:29 +00:00
f38d5b4a74 Update TorchBench commit to main (#145455)
I'm adding sam2 to TorchBench https://github.com/pytorch/benchmark/issues/2566, so, as part of that, I'm updating PyTorch CI to use latest TorchBench commit.

The corresponding change from TorchBench is https://github.com/pytorch/benchmark/pull/2584

The main thing to call out is that the newer transformers version added by https://github.com/pytorch/benchmark/pull/2488 regresses several models. This needs further investigation, so I'm pinning the version to unblock this change.

* `hf_Roberta_base` is a new model added by https://github.com/pytorch/benchmark/pull/2279; not sure why it fails accuracy on A10G, but it works fine on A100
* `speech_transformer` failures are pre-existing trunk failures, i.e. https://github.com/pytorch/pytorch/actions/runs/13040114684/job/36380989702#step:22:2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145455
Approved by: https://github.com/kit1980
2025-02-01 06:44:26 +00:00
a97a906dd9 Add "//caffe2:libtorch" to minifier TARGET file (#146203)
Summary: as title. To avoid errors like "undefined symbol: aoti_torch_device_type_cpu" when compiling minifier_launcher.py

Test Plan: CI

Differential Revision: D68978430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146203
Approved by: https://github.com/desertfire
2025-02-01 05:37:23 +00:00
bcd0ba0f69 Adding the best autotuner config (#146121)
Summary: Add logging of the best config chosen among the autotune configs.

Test Plan:
Testing in Mast : aps-omnifmv1-5_32_test_with_best_config-c5e9ceccf8

 {F1974838864}

Reviewed By: oulgen

Differential Revision: D68931164

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146121
Approved by: https://github.com/oulgen
2025-02-01 03:43:33 +00:00
549e230c33 [draft_export] Clear pending unbacked symbols when overriding mismatched fake kernels (#146089)
Summary:
When encountering a mismatched fake kernel that also creates unbacked symbols, draft export will fail with `PendingUnbackedSymbolNotFound` error.

Clearing `shape_env.pending_fresh_unbacked_symbols` fixes this issue.

Test Plan:
```
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_override_mismatched_fake_kernel_with_unbacked_symbols
```

Differential Revision: D68920990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146089
Approved by: https://github.com/pianpwk
2025-02-01 03:32:50 +00:00
cyy
4d2056efb5 Enable ruff F841 on numpy tests (#146126)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146126
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:07:28 +00:00
cyy
985a78e9df Enable ruff F841 on distributed tests (#146131)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146131
Approved by: https://github.com/rec, https://github.com/albanD
2025-02-01 03:06:16 +00:00
1de41e6918 [dynamo][exceptions][3.10] Clean symbolic stack on exception handling (#146198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146198
Approved by: https://github.com/williamwen42
ghstack dependencies: #146062
2025-02-01 02:51:44 +00:00
6023684311 [export] Fix symfloat serialization (#146112)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146112
Approved by: https://github.com/pianpwk
2025-02-01 02:28:44 +00:00
8326d27093 [inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515)
This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants).
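
An illustrative shape of the metadata described above; the kernel argument names are made up:
```python
# Illustration only -- argument names are hypothetical.
triton_meta = {
    "signature": {
        "in_ptr0": "*fp32",
        "out_ptr0": "*fp32",
        "xnumel": "i32",
        "stride": "constexpr",   # specialized constant (value 1): also marked "constexpr"
        "XBLOCK": "constexpr",   # tl.constexpr argument
    },
    "constants": {
        "stride": 1,             # value-1 "specialized" constant
        "XBLOCK": 1024,          # tl.constexpr value
    },
}
```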

Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between.

To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515
Approved by: https://github.com/SamGinzburg
2025-02-01 02:11:48 +00:00
6e734bab93 execution trace export supports gzip format (#146179)
As above, this allows the Chakra Execution Trace observer to compress output files.
Usage is straightforward: just add a ".gz" suffix to the output file name
```
from torch.profiler import ExecutionTraceObserver

et = ExecutionTraceObserver()
et.register_callback("my_trace.json.gz")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146179
Approved by: https://github.com/shengfukevin, https://github.com/davidberard98, https://github.com/sraikund16
2025-02-01 01:25:25 +00:00
57c45340e7 include entire GraphModule instead of current node when erroring inside of fx interpreter (#146197)
This seems like it would make it easier to diagnose PT2 issues where the user cannot easily repro, and we need more info in the backtrace, e.g. in https://github.com/pytorch/pytorch/issues/134182#issuecomment-2628076114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146197
Approved by: https://github.com/jamesjwu
2025-02-01 01:09:27 +00:00
73d90d66a4 Cap size of thread pool in select_algorithm to cpu count (#146071)
Summary: With the changes from https://github.com/pytorch/pytorch/pull/144829, we can see more autotune configs, and the size of the pool can get out of hand when using the cutlass backend.

See internal discussion at: https://fburl.com/workplace/7g4vz0zy

Test Plan: `python test/inductor/test_cutlass_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146071
Approved by: https://github.com/Chillee
2025-02-01 00:41:36 +00:00
cde5ddfd14 fix internal error with reorder submodules (#146181)
Test Plan: hard to isolate as small repro

Differential Revision: D68963033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146181
Approved by: https://github.com/angelayi
2025-02-01 00:30:42 +00:00
35f113e2a0 torch/nn/utils/rnn.py: docs: improvements (#138628)
Fix constants highlighting in generated documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138628
Approved by: https://github.com/mikaylagawarecki
2025-02-01 00:10:30 +00:00
a78c796f0b [AOTI] Support composed dynamic shape constraint (#146044)
Summary: Fixes https://github.com/pytorch/pytorch/issues/145500. When export takes a dynamic shape constraint as an expression containing a symbol, we should be able to solve the symbol at run time.
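
For context, a hedged sketch of the kind of constraint involved, where a dimension is expressed in terms of a symbol (the model and sizes are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

s = Dim("s", min=1, max=512)
# Dimension 0 is constrained to the expression 2*s, so the runtime has to solve
# for s from the actual input size when running the AOTI-compiled model.
ep = export(M(), (torch.randn(8, 4),), dynamic_shapes={"x": {0: 2 * s}})
```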

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146044
Approved by: https://github.com/angelayi
ghstack dependencies: #146043
2025-02-01 00:02:12 +00:00
43372e70c2 enhance logging of statically-known evaluations by adding size_oblivious(..) (#145354)
after the diff
```
[0/0_1] eval size_oblivious(Eq(s1, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(u0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0, 1)) == False [statically known]
[0/0_1] eval size_oblivious(Eq(s0*s1*u0, 0)) == False [statically known]
```
before
```
[0/0_1] eval (Eq(s1, 1)) == False [statically known]
[0/0_1] eval (Eq(u0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0, 1)) == False [statically known]
[0/0_1] eval (Eq(s0*s1*u0, 0)) == False [statically known]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145354
Approved by: https://github.com/ezyang
2025-01-31 23:26:37 +00:00
f25f1163dc [dynamo] Support frozenset({..}).__contains__ (#146062)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146062
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-01-31 23:22:58 +00:00
eb029fba13 [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794)
Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-01-31 22:39:56 +00:00
af2a39849d [AOTI] Refactor codegen_input_symbol_assignment (#146043)
Summary: Extract the common logic for size and stride symbol generation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146043
Approved by: https://github.com/angelayi
2025-01-31 21:55:18 +00:00
c39c679813 Revert "Tensor .cuda() very slow with specific array sizes (#138964)"
This reverts commit 98f87edd233ea69cee5f3e73e9eb4b5ab77aa744.

Reverted https://github.com/pytorch/pytorch/pull/138964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but some slow test start failing after this lands ([comment](https://github.com/pytorch/pytorch/pull/138964#issuecomment-2628455198))
2025-01-31 21:48:51 +00:00
a7cc6d3e84 Manylinux 2.28 migration - remove pre-cxx11 abi libtorch builds (#146200)
Related to: https://github.com/pytorch/pytorch/issues/123649
Removing pre-cxx11 abi builds.
As per announcement : https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146200
Approved by: https://github.com/kit1980, https://github.com/huydhn
2025-01-31 21:43:12 +00:00
8203894eff Resolve affine quantization namespace collision with torchao (#145941)
Summary:
https://github.com/pytorch/pytorch/pull/141421
duplicated affine quantization custom ops from torchao into
the PT2E quantization flow, but these ops are registered under
the same namespace with the same name, causing "Duplicate
registration" errors for the new ops for use cases that import
from both repos. This commit fixes this by moving the PT2E
versions of the ops to a new namespace. In the long term,
we expect to migrate PT2E into torchao so users can migrate
back to the old namespace if they wish to.

Test Plan: python test/test_quantization.py -k test_channel_group_quantization

Differential Revision: D68838437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145941
Approved by: https://github.com/cccclai
2025-01-31 21:29:47 +00:00
781aceee9c [dynamo] Revert abc change due to internal failures (#146177)
xref - https://www.internalfb.com/tasks/?t=191383874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146177
Approved by: https://github.com/StrongerXi
ghstack dependencies: #146141
2025-01-31 21:28:06 +00:00
a0d1393b1a [MTIA][FSDP2] Enable MTIA device in FSDP2 library code (#145842)
Differential Revision: D68560256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145842
Approved by: https://github.com/chaos5958, https://github.com/nautsimon
2025-01-31 21:21:00 +00:00
06850e624a [ca][hop] test CA on all HOPs (#145429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145429
Approved by: https://github.com/zou3519
ghstack dependencies: #145422
2025-01-31 20:45:22 +00:00
2e197c8a2d [dynamo][hop] test torch.compiling all HOPs (#145422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145422
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-31 20:45:22 +00:00
5b1abdbf5d [dynamo] remove always-failing eval_frame.c debug check (#145982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145981
2025-01-31 20:40:59 +00:00
49df8de8be [dynamo] disable eval_frame callback in _TorchDynamoContext __enter__/__exit__ (#145981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145981
Approved by: https://github.com/jansel
2025-01-31 20:40:59 +00:00
3a4e7a589b [CI][Distributed] Fix edge case: One rank case (Rank 0) should get [False, False] (#146099)
To match the expected tensor (i.e. 2nd element in the array). Making rank0 receive [False, False]

Fixes one of the issues reported in #146094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146099
Approved by: https://github.com/eqy
2025-01-31 20:31:13 +00:00
8b8c596503 Remove trivial dispatch_key_allowlist_check function (#146169)
Hmmm...this _is_ removing a public function from a public C++ file. But the GH counts for this function total 83, seemingly all copying pytorch: https://github.com/search?q=dispatch_key_allowlist_check&type=code&p=1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146169
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-01-31 19:59:40 +00:00
ec2522e200 [MPS] optimize cholesky (#145722)
Followup to #145701

Optimizes the SYRK and TRSM kernels of the Cholesky decomposition on MPS. For the SYRK kernel it does matmuls with Apple's simdgroup matrices instead of a tiled implementation, and for the TRSM kernel we do vectorized loads. This PR also puts the command encoder inside the stream queue dispatch (as discussed on the last PR).

Script to collect perf
```
import torch
import numpy as np
import time
import csv

matrix_sizes = [512, 1024, 2048, 4096]
batch_sizes = [1, 2, 4, 8, 16]
num_runs = 10
warmup_runs = 3

def create_spd_matrix(n, batch_size):
    torch.manual_seed(42)
    A = torch.randn(batch_size, n, n, dtype=torch.float32)
    return A @ A.transpose(-2, -1) + n * torch.eye(n).expand(batch_size, -1, -1)

def run_cholesky_mps(A):
    torch.mps.synchronize()
    start = time.perf_counter()
    b = torch.linalg.cholesky(A, upper=False)
    torch.mps.synchronize()
    end = time.perf_counter()
    return b, end - start

results = {
    'N': [],
    'batch_size': [],
    'mean_time': [],
    'std_time': []
}

for n in matrix_sizes:
    for batch_size in batch_sizes:
        print(f"\nBenchmarking N={n}, batch_size={batch_size}")

        try:
            A_cpu = create_spd_matrix(n, batch_size)
            A_mps = A_cpu.to("mps")

            for _ in range(warmup_runs):
                _, _ = run_cholesky_mps(A_mps)

            times = []
            for _ in range(num_runs):
                _, t = run_cholesky_mps(A_mps)
                times.append(t)

            mean_time = np.mean(times)
            std_time = np.std(times)

            results['N'].append(n)
            results['batch_size'].append(batch_size)
            results['mean_time'].append(mean_time)
            results['std_time'].append(std_time)

            print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

        except RuntimeError as e:
            print(f"Error for N={n}, batch_size={batch_size}: {e}")
            continue

with open('cholesky_benchmark_times.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['N', 'batch_size', 'mean_time', 'std_time'])
    for i in range(len(results['N'])):
        writer.writerow([
            results['N'][i],
            results['batch_size'][i],
            results['mean_time'][i],
            results['std_time'][i]
        ])
```

Observed speedups on M1 Pro
![cholesky_speedup](https://github.com/user-attachments/assets/be3edb1a-8b4a-4039-9d7f-9b9a10f1c83a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145722
Approved by: https://github.com/malfet
2025-01-31 19:52:31 +00:00
6a0138fcc1 Torch device backend autoload fix (#145611)
This causes an import failure if an external backend imports a module that uses `torch._as_tensor_fullprec` when it is being loaded.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145611
Approved by: https://github.com/albanD
2025-01-31 19:27:42 +00:00
cyy
18380836eb Remove outdated test skipif conditions for Python3.9 (#146144)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146144
Approved by: https://github.com/albanD
2025-01-31 19:01:04 +00:00
68a3635484 [hop][inductor] track the dependency on unbacked symbols correctly with constant_args for hops (#143456)
Before this PR, we got an undefined symbol error in the output code when an unbacked symint was **only** used in the HOP, because we didn't correctly record the dependency on unbacked symbols for HOPs and the symbol got DCE'd accidentally.

This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
2025-01-31 18:29:27 +00:00
aad9f44b2e [export] Sync model container types to schema.py (#145959)
Summary: Synced from D68840230

Test Plan: No behavior changes to existing API. Will be tested internally.

Differential Revision: D68846532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959
Approved by: https://github.com/yiming0416
2025-01-31 18:17:56 +00:00
16f44fee25 Revert "[inductor/profiler] add kernel kwargs instrumentation (#145573)"
This reverts commit 720b8d0d8dac98f89499bc6b251d1f34dbf68dfe.

Reverted https://github.com/pytorch/pytorch/pull/145573 on behalf of https://github.com/ZainRizvi due to Sorry, but this is failing internally. It's a bit weird since this PR doesn't really appear related at first glance, but despite retries it fails pretty consistently. Please see D68930742 for details ([comment](https://github.com/pytorch/pytorch/pull/145573#issuecomment-2628013872))
2025-01-31 18:13:23 +00:00
67ed47d886 Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-31 17:51:27 +00:00
d0748566b4 s390x ci: ensure CI starts correctly if token pipe is not removed (#145840)
Mark stop actions as "may fail".
The container is expected to stop on its own in the normal case.

Remove "may fail" mark from token generation steps.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145840
Approved by: https://github.com/huydhn
2025-01-31 17:46:09 +00:00
44ecbcbd5a s390x: disable test_model_exports_to_core_aten.py test (#145835)
It often gets killed by OOM.
Disable it while investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145835
Approved by: https://github.com/huydhn
2025-01-31 17:45:10 +00:00
667b94d1c2 [hotfix][dynamo] Skip linecache due to a flaky issue (#146141)
A large number of jit + dynamo wrapped tests fail in linecache tracing.
We need further debugging. Skipping for now to stem the bleeding.

https://github.com/pytorch/pytorch/issues/146076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146141
Approved by: https://github.com/StrongerXi
2025-01-31 17:45:06 +00:00
c3f71eb61b Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit e2917245fb0c0b6aab216e7a0a254b80e7a9e78f.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error.  @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))
2025-01-31 17:43:09 +00:00
f5a61ba0a3 Revert "inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)"
This reverts commit d100e9ae744322a74d9fd05d0851caaf36f19c24.

Reverted https://github.com/pytorch/pytorch/pull/145122 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. See D68924977 for details ([comment](https://github.com/pytorch/pytorch/pull/145122#issuecomment-2627880860))
2025-01-31 17:39:23 +00:00
eb5a0718c2 S390x nightly builds timeouts (#146041)
Sometimes the build times out at the end.
This should be fixed by the increased timeout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146041
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-01-31 17:29:11 +00:00
001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies  on the previous PR in this stack, where storage order was changed to non lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offsets of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine them.
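
A hedged usage sketch (the checkpoint path is hypothetical; the checkpoint must have been saved with the new `.format_version` entry):
```python
import torch
from torch.utils.serialization import config

config.load.calculate_storage_offsets = True
state_dict = torch.load("checkpoint.pt", mmap=True)
```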

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.

Note that we always use `miniz` writer  in the zip64 mode per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)) A zipfile record written by miniz looks as such

```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer  |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`

- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for the filename and zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size` as the first 4 bytes
- The local dir footer size is determined based on [this snippet ](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following
- We keep track of where the "cursor" is in the file using `current_offset`, after each persistent_load call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00
98f87edd23 Tensor .cuda() very slow with specific array sizes (#138964)
### **Pull Request: Optimized Non-Contiguous Tensor Copy for CPU to GPU in PyTorch**

#### **Summary**
This PR addresses the performance issue identified in [#111570](https://github.com/pytorch/pytorch/issues/111570), where non-contiguous tensors took significantly longer to transfer from CPU to GPU. Through detailed tracing of the call flow, we identified that PyTorch was creating temporary contiguous buffers for non-contiguous tensor transfers, which introduced unnecessary overhead.

#### **Tracing the Issue**
To pinpoint the cause of the slowdown, we followed the call flow from Python’s `tensor.cuda()` method through PyTorch’s backend, ultimately identifying `copy_kernel_cuda` as the key function responsible for CPU-to-GPU tensor transfers. Here’s a summary of the tracing process:

1. **Python Call: `tensor.cuda()`**
   - Starting from Python, the `cuda()` method initiates the tensor transfer to the GPU.

2. **`TensorBody.h: cuda()`**
   - The `cuda()` method calls `to()`, specifying the target device as CUDA.

3. **`Tensor.cpp: TensorBase::to()`**
   - The `to()` function prepares device and data type options before invoking `_ops::to_dtype_layout::call()`.

4. **Operator Call: `_ops::to_dtype_layout::call()`**
   - This operator dispatches the request to the backend-specific function responsible for managing the transfer.

5. **`Copy.cpp: copy_()`**
   - The `copy_()` function performs preliminary checks (e.g., zero-tensor immutability) and proceeds to call `copy_impl()`.

6. **`Copy.cpp: copy_impl()`**
   - This function sets up a tensor iterator and dispatches the copy operation to the appropriate backend through `copy_stub`.

7. **Dispatch to CUDA: `copy_stub`**
   - The dispatch mechanism routes the call to the CUDA-specific function, `copy_kernel_cuda`.

8. **`Copy.cu: copy_kernel_cuda()`**
   - Here, we identified that PyTorch was creating temporary contiguous buffers for 1D and 2D non-contiguous tensors, which slowed down the copy process. This behavior is managed by the `copy_requires_temporaries()` function.

#### **Solution**
To address this, we modified `copy_kernel_cuda` to handle non-contiguous 1D and 2D tensors directly by using `cudaMemcpy2DAsync`, which allows efficient, stride-aware memory transfers without temporary buffers. Here’s why this approach improves performance:

- **Efficiency of `cudaMemcpy2DAsync`**: This CUDA function is optimized for pitched (stride-based) memory transfers, allowing it to handle non-contiguous data layouts effectively by specifying memory strides for source and destination tensors.
- **Reduction of Overhead**: By directly copying non-contiguous tensors without intermediate buffers, we eliminate extra memory allocation and achieve faster CPU-to-GPU transfers.
- **Asynchronous Execution**: `cudaMemcpy2DAsync` enables asynchronous transfer on the CUDA stream, further improving performance by taking advantage of CUDA's optimized memory handling for non-contiguous layouts.

#### **Performance Results**

In my testing, I created tensors of size `327680 x 2000` and used slices for transfer performance measurements. The tests show that the average time for transferring a non-contiguous slice (e.g., rows 10,000 to 50,000) from CPU to GPU now closely matches the contiguous case. This improvement indicates that the updated implementation effectively addresses the performance discrepancy. Below are the measured times and validation checks:

```plaintext
Average time for contiguous slice (rows 10,000-50,000): 66 ms
Average time for non-contiguous slice (rows 10,000-50,000): 66 ms

Validation of contiguous and non-contiguous tensor copies:
 PASS: Tensor shapes match.
 PASS: Tensor contiguity matches.
 PASS: Tensor contents match.
 PASS: Tensor data types match.

 Success: Both contiguous and non-contiguous tensors were copied correctly to the GPU.
```
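
A minimal timing sketch along the same lines (not the author's exact script; it assumes a CUDA device is available):
```python
import time
import torch

def time_copy(t, iters=10):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        t.cuda()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000  # ms per copy

full = torch.randn(327680, 2000)
contiguous = full[:40_000].clone()   # contiguous block of rows
strided = full[:80_000:2]            # every other row -> non-contiguous view
assert not strided.is_contiguous()

print(f"contiguous:     {time_copy(contiguous):.1f} ms")
print(f"non-contiguous: {time_copy(strided):.1f} ms")
```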

#### **Conclusion**
This PR resolves the identified performance issue by eliminating the need for temporary buffers in non-contiguous 1D and 2D tensor transfers, ensuring faster and more efficient copies from CPU to GPU. Future optimizations could further enhance performance for higher-dimensional non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138964
Approved by: https://github.com/jeffdaily

Co-authored-by: Natalia Gimelshein <ngimel@gmail.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-31 17:05:02 +00:00
2d6f6637d3 Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in insertion order (i.e. numerically sorted).

This makes the order storages are written in the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads.
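
A tiny illustration of the ordering difference (string sort vs. insertion/numeric order):
```python
keys = [str(i) for i in range(12)]

print(sorted(keys))  # lexicographical: ['0', '1', '10', '11', '2', '3', ...]
print(keys)          # insertion order, i.e. numeric: ['0', '1', '2', ..., '11']
```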

* __->__ #143879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-31 17:00:23 +00:00
9232355bb0 Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix (#145792)
https://github.com/pytorch/pytorch/issues/145570

Adding cuda 12.8.0 x86 builds first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145792
Approved by: https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2025-01-31 16:12:02 +00:00
a7c2d85c18 Add overloads to diagonal docs (#144214)
Fixes #126827. Refactored doc to demonstrate when none of the optional values are passed in. Added another example so that all overloads of the function are covered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144214
Approved by: https://github.com/albanD
2025-01-31 15:53:59 +00:00
2af876707b [AOTI] Fix a memory leak in package boxed_run (#146100)
Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100
Approved by: https://github.com/cpuhrsch
2025-01-31 13:32:28 +00:00
7b07415aaa [export] nested terms in nn_module_stack deserialization (#145901)
Summary: accounting for terms like "getattr(getattr(a[0], b), c)".

Test Plan: test_serialize

Differential Revision: D68784736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145901
Approved by: https://github.com/angelayi
2025-01-31 10:00:13 +00:00
1f1a9965d5 fix a small typo in comments (#145323)
A minor typo fix.
The description was confusing with the typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145323
Approved by: https://github.com/Skylion007
2025-01-31 06:45:44 +00:00
c55af2b567 [CMake] Delete Caffe2 inspect_gpu binary (#146105)
It's unbuildable right now because the headers it depends on are gone.

Fixes https://github.com/pytorch/pytorch/issues/146042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146105
Approved by: https://github.com/atalman, https://github.com/seemethere
2025-01-31 06:42:52 +00:00
e84bf88dde [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a re-base PR to my previous one #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after | time before | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-01-31 06:42:08 +00:00
eeb5e1bf20 [AOTI] Cache treespec_loads calculation (#145815)
Summary: Treespec can be reused instead of being recalculated from its string representation on every AOTI module call. Using the cached result saves 0.2ms per module call.

Test Plan:
Before:
{F1974751578}

After:
 {F1974751667}

Differential Revision: D68749539

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145815
Approved by: https://github.com/henrylhtsang
2025-01-31 06:38:21 +00:00
57d8278ab9 pickler for GraphModule (#141659)
Pickling a GraphModule needs some special handling to wrap things that normally can't be pickled. Async compile needs to pass GraphModules across a wire, so we need to be able to serialize them; add some helpers to enable that.
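
As a rough illustration of the baseline behavior these helpers build on (a sketch, not the async-compile path itself):

```python
import pickle
import torch
import torch.fx

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

gm = torch.fx.symbolic_trace(M())
# Simple GraphModules already round-trip through pickle by regenerating forward()
# from their stored Python code; the helpers in this PR cover attached objects
# that pickle cannot handle natively.
gm2 = pickle.loads(pickle.dumps(gm))
x = torch.randn(4)
assert torch.equal(gm(x), gm2(x))
```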

Differential Revision: [D68921318](https://our.internmc.facebook.com/intern/diff/D68921318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-31 05:34:28 +00:00
f9227e7c33 Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087)
Summary: TSIA

Test Plan: Tested up the stack but existing unittests

Reviewed By: suo

Differential Revision: D68917233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087
Approved by: https://github.com/suo
2025-01-31 05:08:11 +00:00
cf2de4e230 Introduce aoti_call_delegate HOP (#145630)
Summary:
Previously, the aoti compile node was represented as a kernel-less custom op in the exported program. The node was not eagerly runnable, even though eager execution is a common practice for numerical validation during lowering.

I introduce a new HOP to address this.

The schema is following
```
aoti_call_delegate(lower_module: AOTInductorEPModule, original_gm: fx.GraphModule, weights: List[Tensor], inputs: List[Tensor])
```

There are a few problems exposed by HOP
- AOTI expects an FX graph with weights as getattr nodes, aka a stateful graph. The HOP expects graph_module arguments to be stateless, and the export serializer also expects a stateless graph. Currently, to make AOTI happy, I am making `original_gm` stateful and bypassing serialization for `original_gm`.
- As a result, the HOP is not re-traceable, as functionalization on a stateful graph module argument will fail.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//deeplearning/aot_inductor/cpu/test:cpu_lowering_utils_test

Reviewed By: zhxchen17

Differential Revision: D68359391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145630
Approved by: https://github.com/zou3519
2025-01-31 04:57:36 +00:00
f358d4d004 [ONNX] Migrate test_torch_export_with_onnxruntime.py to test_small_models_e2e.py (#146095)
With [the deprecation of torch.onnx.dynamo_export](https://github.com/pytorch/pytorch/pull/146003), this PR turns the torch.export related tests toward torch.onnx.export(..., dynamo=True), and places them in test_small_models_e2e.py

NOTE: test_exported_program_as_input_from_file and test_onnx_program_supports_retraced_graph are not kept, because they mostly test whether an exported program stays the same after save/load and retrace. In torch.onnx.export(..., dynamo=True), however, we focus more on the export from nn.Module to ONNX proto.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146095
Approved by: https://github.com/justinchuby
2025-01-31 03:40:26 +00:00
27e35de6c2 [export] Add distributed test (#146050)
Reland https://github.com/pytorch/pytorch/pull/145886
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146050
Approved by: https://github.com/avikchaudhuri
2025-01-31 02:56:42 +00:00
ffb424eab6 [dynamo/export] call local_scalar_dense when full() value is scalar tensor (#144999)
Fixes https://github.com/pytorch/pytorch/issues/144907
```
        class Foo(torch.nn.Module):
            def forward(self, val):
                return torch.full((80, 2), val, dtype=torch.float32)

        export(Foo(), args=(torch.tensor(1),))
```

When we have a `torch.full` call like above, where the fill value is a scalar Tensor and not a scalar value, the FX graph from `_dynamo.export()` contains a single node: the full op. We run into a `PendingUnbackedSymbolNotFound` error, because the `item()` call is implicit; the UnbackedSymInt is extracted but goes directly into the data of the output tensor value, and we're then unable to locate it when we try to compute unbacked bindings.

On the other hand, non-strict export doesn't face this, because an explicit `item()`, or `local_scalar_dense` node is inserted, and the unbacked binding is directly the example value of that node.

This adds a dynamo handler to imitate what happens in non-strict.
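
Conceptually, the handler makes strict export behave like the following hand-written decomposition (a sketch, not the actual dynamo code):

```python
import torch

def forward_decomposed(val: torch.Tensor) -> torch.Tensor:
    v = val.item()  # explicit aten._local_scalar_dense, giving the unbacked symbol a binding site
    return torch.full((80, 2), v, dtype=torch.float32)
```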

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144999
Approved by: https://github.com/angelayi
2025-01-31 02:45:43 +00:00
e01c898e51 [Customized Optimus] Add select cat aten pass (#145918)
Summary: This is a follow up work of D68695717, where we can further reduce the number of cat kernels in the backward by designing new aten pass in the aten level.

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_select_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/6943087f-91be-4dbd-9693-df0a11a50b73
Test UI: https://www.internalfb.com/intern/testinfra/testrun/11821949087998233
Network: Up: 101KiB  Down: 132KiB  (reSessionID-60e898af-f366-4247-a9f7-d8d7cd129fe0)
Analyzing targets. Remaining      0/78148
Executing actions. Remaining      0/476147
Command: test.     Finished 2 local
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

### how to add the config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
          "select_cat_aten_pass": {},
        }
```

{F1974778773}

baseline:

aps-recgpt_ranking_1115_pt2_optimus-e52c1f277e

proposal

aps-recgpt_ranking_1115_pt2_optimus-1b0047ee0e

Differential Revision: D68803384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145918
Approved by: https://github.com/Yuzhen11
2025-01-31 02:35:10 +00:00
08d88127fe Use Magma-cuda 12.8 for libtorch (#146019)
https://github.com/pytorch/pytorch/issues/145570

Build failure for libtorch wheel
```
CUDAContext.cpp:(.text+0x157): additional relocation overflows omitted from the output
/usr/bin/ld: failed to convert GOTPCREL relocation; relink with --no-relax
collect2: error: ld returned 1 exit status
```

Unsure if this is related, fixing as a start
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146019
Approved by: https://github.com/eqy
2025-01-31 02:19:23 +00:00
2811f33d12 Fix code cache + freezing compile-time regression (#145868)
Summary: The current implementation introduces a compile-time regression due to the overhead of hashing large constants. To support freezing+caching, we consider only the tensor metadata of frozen params, but we neglect to do the same for any constants created as a result of folding frozen params. This PR explicitly marks the constants created during freezing (and constant folding during freezing) and uses that info in the inductor cache to determine when to hash a tensor's value+metadata vs. metadata only.

Test Plan: `python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only alexnet --bfloat16 --cold-start-latency --print-compilation-time --inference --performance --freezing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145868
Approved by: https://github.com/eellison
2025-01-31 02:04:15 +00:00
bf9d053fb8 [Break XPU] Fix Inductor cuda bias UT (#145934)
# Motivation
[Break XPU] inductor ut: `inductor/test_inplace_padding.py::InplacePaddingTest::test_pad_non_zero - RuntimeError: Expected to find "empty_strided_cuda((2048, 2048), (2048, 1), torch.float32).as_strided((2048, 2047), (2048, 1))" but did not find it`

With this PR, `test_pad_non_zero` will pass on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145934
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/desertfire
2025-01-31 01:39:39 +00:00
ccd27e8129 Turn on fx graph cache and automatic dynamic pgo local caches in fbcode (#146065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146065
Approved by: https://github.com/jamesjwu
2025-01-31 01:11:48 +00:00
3fae5c8509 torchgen: support exception boundary for ExecuTorch functions (#144341)
Needed for ExecuTorch diff D67904052.

Differential Revision: [D67906411](https://our.internmc.facebook.com/intern/diff/D67906411/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144341
Approved by: https://github.com/Jack-Khuu
2025-01-31 01:05:21 +00:00
cyy
d94d816d96 Simplify handling of max jobs in CMake builds (#145820)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145820
Approved by: https://github.com/malfet
2025-01-31 00:55:39 +00:00
c70362fac8 [AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011)
[D68734067](https://our.internmc.facebook.com/intern/diff/D68734067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-31 00:48:51 +00:00
1e3d1738a4 [dynamo][polyfills]Support getrecursionlimit (#145989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145989
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987, #145994
2025-01-31 00:47:31 +00:00
e7bb608d02 [dynamo][dicts] Support construction of types.MappingProxyType (#145994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145994
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145986, #145987
2025-01-31 00:47:31 +00:00
4665bc2cc0 [dynamo][functions] Support id on function (#145987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145987
Approved by: https://github.com/StrongerXi, https://github.com/jansel, https://github.com/mlazos
ghstack dependencies: #145986
2025-01-31 00:47:23 +00:00
56307dc370 [dynamo][dicts] Raise exception on pop (#145986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145986
Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/StrongerXi, https://github.com/jansel
2025-01-31 00:47:13 +00:00
e6704a2447 Allow replacing unbacked with very large upperbound by returning no-op for FloorToInt(int) (#146001)
* Let's say x is an integer beyond 2^53 where Python floats lose precision i.e. can't increment by 1.
* Therefore, float(x) will lose precision and won't retain the exact value of x even though it's an integer.
* That means `FloorToInt(very_large_number)` will lose precision if we cast it to float
```
>>> int(float(1000000007999999992))
1000000008000000000
```

This means when we try to do this in set_replacement():
32bb6f83d5/torch/fx/experimental/symbolic_shapes.py (L6011-L6019)

We run into this:
```
TORCH_LOGS="+torch.fx.experimental.symbolic_shapes" pytest -s test_export.py -k test_replace_unbacked_with_very_large_upperbound

  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6258, in _maybe_guard_rel
    self._set_replacement(rhs, self._find(lhs), "trivial_rhs")
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 6039, in _set_replacement
    assert tgt_bound.issubset(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., size=(2*s0,)), FakeTensor(..., size=(u0,))), **{}):
tgt_bound=VR[4, 1000000008000000000] not a subset of src_bound=VR[4, 1000000007999999992]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146001
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145898
2025-01-31 00:25:20 +00:00
c72b536420 Add manual override flag for core ATen op detection during bc check (#146052)
Fixes https://github.com/pytorch/pytorch/issues/146049

Today the bc detection logic ignores the allow_list for core ATen ops (a PR landed 4 months ago to enable this). The problem is that if I have a PR that removes an op, the script can no longer check whether that op is a core ATen op (today we just error out).

With my fix: (1) we conservatively assume the op is a core ATen op in such cases, and (2) the user can specify in their ALLOW_LIST entry that their op is not a core ATen op.

Test plan:
- This is tested 2 PRs above

016bdafdcb/test/forward_backward_compatibility/check_forward_backward_compatibility.py (L129-L137)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146052
Approved by: https://github.com/albanD
2025-01-30 23:57:01 +00:00
720b8d0d8d [inductor/profiler] add kernel kwargs instrumentation (#145573)
## About

As above, record the kernel launch kwargs. These tend to be constexpr arguments to Triton kernels, such as block size.

## Test program

Note: install Triton before proceeding (`pip install triton`).

triton_test.py>>>
```
import torch
from torch.profiler import profile, ProfilerActivity

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b

def main():
    x = torch.randn(10, 10).cuda()
    y = torch.randn(10, 10).cuda()
    opt_foo = torch.compile(foo)
    z = opt_foo(x, y)

    # Profile the kernel function on the GPU
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True
    ) as prof:
        z = opt_foo(x, y)

    # Export the trace to a file
    prof.export_chrome_trace("my_kernel_trace.json")

if __name__ == "__main__":
    main()
```

Run it and we should get a trace file my_kernel_trace.json

Output has triton event with the kernel_kwargs attribute.
```
  {
    "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815,
    "ts": 2045246693014.959, "dur": 75.662,
    "args": {
      ...
      "kernel_backend": "triton",
      "num_warps": 4,
      "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)",
      "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py",
      "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor"
    }
  },
```

## Unit Test
Updated unit test:
```
pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573
Approved by: https://github.com/davidberard98, https://github.com/jansel
2025-01-30 23:51:44 +00:00
8117656162 nonzero_static with symint size (#146006)
Summary: Previously `nonzero_static` would force specialization on the `size` argument. This PR enables it to be used with a dynamic `size` argument.
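
For context, a minimal eager-mode sketch of the op (illustrative values):

```python
import torch

x = torch.tensor([0, 1, 0, 2, 3])
# `size` fixes the output length regardless of how many nonzeros there are;
# this PR allows that `size` to be a SymInt instead of forcing specialization.
out = torch.nonzero_static(x, size=4, fill_value=-1)
print(out)  # tensor([[ 1], [ 3], [ 4], [-1]])
```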

Test Plan: added test

Differential Revision: D68874784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146006
Approved by: https://github.com/angelayi
2025-01-30 23:42:42 +00:00
9fdc20809a [PGNCCL] Simplify support macro definition (#145964)
- Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
ghstack dependencies: #145893
2025-01-30 23:26:32 +00:00
4280232f21 Revert "Advance past fc window for stft center (#145437)"
This reverts commit 3ef1551f5a745c1d37ff421eb4678814ef4483e4.

Reverted https://github.com/pytorch/pytorch/pull/145437 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks some slow trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145437#issuecomment-2625840742))
2025-01-30 23:14:16 +00:00
f85e4c1360 Enable C++ API parity tests on AArch64 (#145370)
Re-enables C++ API parity tests on AArch64 which now pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145370
Approved by: https://github.com/albanD
2025-01-30 22:42:49 +00:00
2f60f12f8b [Torch] Extract arange_out resizing logic into a helper function that can be used by other devices (#145747)
Summary: We want to use the resizing implementation for arange_out in other devices (in this case MTIA), to make sure that the computations match and to avoid off-by-one-errors.

Test Plan: Existing CI tests pass.

Differential Revision: D68694489

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145747
Approved by: https://github.com/mortzur
2025-01-30 22:37:00 +00:00
99a0940991 [MPS] Fix regression in con-contig bitwise ops (#146085)
Caused by https://github.com/pytorch/pytorch/pull/128393, which changed the semantics of `needsGather` and resulted in silent correctness errors on MacOS-15+ if the output tensor is non-contiguous

Fixes https://github.com/pytorch/pytorch/issues/145203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146085
Approved by: https://github.com/dcci
2025-01-30 22:36:56 +00:00
e2917245fb [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-01-30 22:33:50 +00:00
7391cea857 Revert "[triton] Update pin to tip of 3.2 release (#145867)"
This reverts commit 5e5da9bd9afdbb51da3dcc39947347279ccd9130.

Reverted https://github.com/pytorch/pytorch/pull/145867 on behalf of https://github.com/ZainRizvi due to Sorry, this PR may have been written correctly, but something is clearly broken with the infra that's making CI very unhappy with this new triton version.  Since this has been blocking viable/strict upgrades for a couple days now, I'm reverting this PR.  I'll sync with @atalman on how we should fix this. ([comment](https://github.com/pytorch/pytorch/pull/145867#issuecomment-2625720817))
2025-01-30 22:24:09 +00:00
23695ea002 Fix dynamo use of list[int] in graph break (#145554)
This reintroduces the change backed out by #145393 and fixes the underlying problem.

Although using a BuiltinVariable was better than nothing when we saw a GenericAlias, it had problems if there was a graph break and we had to reconstruct the original Python code, which BuiltinVariable did as a simple `list` instead of a `list[int]`.

This changes it to use a TypingVariable instead and then teaches TypingVariable how to reconstruct.

Original commit changeset: 77b9193acb23

python test/dynamo/test_repros.py ReproTests.test_graph_break_on_jit_isinstance

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145554
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552, #145553
2025-01-30 22:21:40 +00:00
fbb076cc45 Fix call to create_load_global (#145553)
There is no version of create_load_global() that takes three parameters - any use of this function will fail. I think this is probably the correct fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145553
Approved by: https://github.com/anijain2305
ghstack dependencies: #145551, #145552
2025-01-30 22:21:40 +00:00
ccbbc88bbb Turn on mypy for _dynamo/variables/builtin.py (#145552)
The fact that mypy errors were ignored was hiding several bugs in builtin.py (for example the previous diff's incorrect override and use of `call_getattr`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145552
Approved by: https://github.com/anijain2305, https://github.com/Skylion007
ghstack dependencies: #145551
2025-01-30 22:21:32 +00:00
f3120f6d26 Remove incorrect BuiltinVariable.call_hasattr() (#145551)
BuiltinVariable.call_hasattr() overrides the base class - but actually behaves differently. The base is `obj.call_hasattr(tx, attr)` but BuiltinVariable's version is `<unused>.call_hasattr(tx, obj, attr)`.

The BuiltinVariable version is used as a pattern from `call_self_handler()` for `BuiltinVariable(hasattr)`. I think the other version is just used for internal `hasattr(obj, name)` so I renamed that one to `call_obj_hasattr`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145551
Approved by: https://github.com/anijain2305
2025-01-30 22:21:19 +00:00
clr
d100e9ae74 inductor: Don't throw an internal error when a nn.module is missing a attribute (#145122)
If an nn.Module getattr call throws, we should make sure that we don't crash with an internal error.

Note that I couldn't figure out how to test this, so advice would be awesome.  I have my best case attempt at  https://github.com/pytorch/pytorch/pull/145799, but it doesn't seem to reproduce the crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145122
Approved by: https://github.com/jansel
2025-01-30 21:55:29 +00:00
08ff11e9d0 initialize device when pinning memory on this device, short circuit is_pinned if device is not initialized (#145752)
Do not land
RFC
potential fix for #144687

Now `.is_pinned(device="cuda")` does not initialize the device and thus doesn't poison the fork (but it complains about the `device` arg being deprecated). To avoid needing the `device=` arg, we'd need to fix get_accelerator to not initialize the device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145752
Approved by: https://github.com/albanD

Co-authored-by: albanD <albandes@fb.com>
2025-01-30 21:37:29 +00:00
1252c1933d Update to remind users to use torch.compile template (#145960)
Users have been submitting fuzzer issues without meeting the requirements outlined in the torch.compile issue template. This updates the note to remind users to use the torch.compile template for torch.compile bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145960
Approved by: https://github.com/eellison
2025-01-30 21:34:40 +00:00
d14046b58d Update fuzzer guidance to include rng (#145962)
Add another condition to fuzzer issue guidance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145962
Approved by: https://github.com/eellison
2025-01-30 21:33:57 +00:00
7e7341bddd [hop] fix unbacked_bindings meta for while_loop (#143559)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143559
Approved by: https://github.com/zou3519
2025-01-30 21:33:09 +00:00
9f9904172d [scan] scan dim handling in user-facing scan() (#145179)
This PR introduces handling of the scan dim in the user-facing scan() call. Internally, the scan dim is always shifted to dim 0 and the scan is then performed over that dim.

This is a follow-up PR from https://github.com/bohnstingl/pytorch/pull/3
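
The dim handling amounts to the usual move-to-front pattern, sketched here with `cumsum` standing in for the scan's combine function (this is not the scan() implementation itself):

```python
import torch

def scan_like_over_dim(x: torch.Tensor, dim: int) -> torch.Tensor:
    xs = torch.movedim(x, dim, 0)     # shift the scan dim to dim 0
    ys = torch.cumsum(xs, dim=0)      # perform the "scan" over dim 0
    return torch.movedim(ys, 0, dim)  # move the result back to the original layout

x = torch.randn(2, 5, 3)
assert torch.allclose(scan_like_over_dim(x, dim=1), torch.cumsum(x, dim=1))
```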

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145179
Approved by: https://github.com/ydwu4
2025-01-30 21:09:07 +00:00
70f6aaa786 [OSS] Add kwargs to fsspec reader/writer (#145845)
Summary: Add kwargs to fsspec reader/writer. This will be used when reading/writing from huggingface because it needs a token to access the repositories

Test Plan: https://fburl.com/anp/agkrlas1 ability to read write to hf with fsspec

Differential Revision: D68738777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145845
Approved by: https://github.com/mhorowitz
2025-01-30 21:00:58 +00:00
e6c39d37e9 [ONNX] Create deprecation warning on dynamo_export (#146003)
Deprecation of `torch.onnx.dynamo_export`:

* `torch/onnx/_internal/_exporter_legacy.py`: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`.

This PR also removes the `**_` kwarg on onnx.export such that users get an error when they supply an unexpected argument.

Updated to emit a DeprecationWarning because it is more appropriate: https://docs.python.org/3/library/exceptions.html#DeprecationWarning
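
For reference, the suggested replacement call looks roughly like this (a sketch; the model and inputs are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

# Instead of the deprecated torch.onnx.dynamo_export(M(), x):
onnx_program = torch.onnx.export(M(), (torch.randn(2, 3),), dynamo=True)
onnx_program.save("m.onnx")
```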
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146003
Approved by: https://github.com/titaiwangms
2025-01-30 20:13:32 +00:00
1fdb4d65c0 [MPS] Extend torch.mm/torch.bmm to integral types (#145809)
By using the `naive_mm` kernel, but making sure that accumulation is done in int32 for smaller int types (and in float for half and bfloat), as well as adding `naive_bmm`, which follows the same pattern.
Also removes a stale restriction on `torch.dot` (which works fine on MacOS-14/15).
This also enables integer op flavors for:
- `addmv`
- `einsum`
- `inner`
- `linalg.multi_dot`
- `matmul`
- `mv`
- `tensordot`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145809
Approved by: https://github.com/dcci
2025-01-30 19:35:25 +00:00
3ef1551f5a Advance past fc window for stft center (#145437)
Long overdue follow-up on https://github.com/pytorch/pytorch/pull/73432/files#diff-5f3d4caa0693a716fc46fd7f6339312f1b5f0bf89e3a3ff58e9dc13a9486b17aR719

ONNX stft doesn't support centering, [and all of the existing tests are for center = False](https://github.com/pytorch/pytorch/blob/main/test/onnx/test_pytorch_onnx_onnxruntime.py#L8026). I will open a follow-up issue to address this; this PR is just a nice-to-have.

Pr chain:
- -> [Advance past fc window for stft center #145437](https://github.com/pytorch/pytorch/pull/145437)
- [Add stft option to align window for center = false #145324](https://github.com/pytorch/pytorch/pull/145324)
- [Add istft option to align window for center = false](https://github.com/pytorch/pytorch/pull/145510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145437
Approved by: https://github.com/justinchuby, https://github.com/iseeyuan
2025-01-30 19:09:18 +00:00
a3698ebd5c [while_loop] specialize when cond_fn return constants (#144515)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144515
Approved by: https://github.com/zou3519
2025-01-30 19:02:34 +00:00
16420a78eb [AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039)
Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now.

Test Plan: CI

Differential Revision: D68871245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039
Approved by: https://github.com/yushangdi, https://github.com/hl475
2025-01-30 19:01:19 +00:00
d1143c4b37 [export] fix non-strict pre_dispatch exporting while_loop (#145762)
fix https://github.com/pytorch/pytorch/issues/145737.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145762
Approved by: https://github.com/tugsbayasgalan, https://github.com/zou3519, https://github.com/avikchaudhuri
2025-01-30 18:58:34 +00:00
clr
f746bb6311 config: Don't spam warnings about reference type configs (#145800)
Summary:
https://github.com/pytorch/pytorch/issues/145755

The is_dynamic check for reference types was subtly broken, causing log spam
after it was accessed

Added an explicit type for is_default for reference types to make sure this
behaviour is correct
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145800
Approved by: https://github.com/eellison
2025-01-30 18:57:16 +00:00
5a527fa5ee Make sure not using cpp wrapper when setting nvtx training annotation (#145538)
Longer term it would be good to add this as a feature to cpp_wrapper, but this makes sure it doesn't fail on main.

Not sure if this needs a test because it's not meant to compose, but will add one if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145538
Approved by: https://github.com/desertfire
2025-01-30 18:34:22 +00:00
3ee655e4d4 [async-TP] Fix scheduling in matmul+reduce-scatter for 2 ranks (#145846)
There's a sleep that is issued in order to "nudge" CUDA into making the right scheduling decision, but it is only issued on iteration number 2. However, when the world size is 2, we never reach that iteration, which led to suboptimal scheduling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145846
Approved by: https://github.com/yifuwang
2025-01-30 18:26:34 +00:00
51ee9b154e [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-30 18:19:00 +00:00
7796e308d0 Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
ghstack dependencies: #145953
2025-01-30 16:54:08 +00:00
967cf85f3a Revert "Update mi300 labels to account for multiple clusters. (#145923)"
This reverts commit 3e135993bd0fa08cbff565ae76bb15cb08e1d6d0.

Reverted https://github.com/pytorch/pytorch/pull/145923 on behalf of https://github.com/atalman due to reverting back to one cluster ([comment](https://github.com/pytorch/pytorch/pull/145923#issuecomment-2625022826))
2025-01-30 16:45:50 +00:00
1c3df9ca8c Fix signif_strides_equal for symints, dedupe (#145953)
Previous impl would take a size hint, which was failing internally with a
```
strides1 = [V.graph.sizevars.size_hint(strides1[i]) for i in non_1_indices]
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/torch/_inductor/sizevars.py", line 554, in size_hint
    return int(out)
  File "/dev/shm/uid-30083/6f57b5f9-seed-nspid4026541609_cgpid284393-ns-4026541967/sympy/core/expr.py", line 307, in __int__
    raise TypeError("Cannot convert symbols to int")
```

There are unbacked tests in test_triton which should exercise this, as well as other tests for these functions when they were added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145953
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-30 16:44:32 +00:00
aaddfc5a7f Add TORCHINDUCTOR_VEC_ISA_OK env var for vec_isa_ok (#134667)
Adds a `TORCHINDUCTOR_VEC_ISA_OK` env var for `vec_isa_ok`, for A/B testing purposes. Similar setup to `fx_graph_remote_cache`, allowing a default of `None`.

No tests were present for any other config settings here, nor for `vec_isa_ok` so I didn't add any.

Motivation:
PyTorch uses a filelock with a timeout to determine if the CPU supports particular intrinsics: pytorch/torch/_inductor/cpu_vec_isa.py
Therefore, if 2 processes are running, each process encounters the HAS_CPU test; if it cannot acquire the lock for checking vec_isa_ok, the main thread is put to sleep. Hence there is a bias towards non-sleeping processes, i.e. newly spawned processes, in acquiring the lock.

To avoid this, use an env variable so that each process is aware of the result without going through the check.
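
A minimal sketch of the A/B pattern this enables (assumed usage; the actual lookup lives in inductor's config machinery):

```python
import os

def vec_isa_ok_override():
    # Returns True/False when the env var is set, or None to fall back to the
    # lock-protected CPU capability check described above.
    val = os.environ.get("TORCHINDUCTOR_VEC_ISA_OK")
    if val is None:
        return None
    return val == "1"
```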

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134667
Approved by: https://github.com/eellison
2025-01-30 16:22:48 +00:00
5fa28bbe40 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 18a7a04c4adecda3be17dd364d48d484fd1dcdba.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))
2025-01-30 16:01:52 +00:00
50086ab537 [ONNX] Delete rename_dynamic_shapes_with_model_inputs (#146002)
Basically, this function brings more cons than pros.

It was nice to have automation that helps users convert the top-level keys of dynamic_shapes to arg names. However, this function has a bug when the model inputs happen to have the same count as dynamic_shapes:

```python
input_names
# 'input_ids', 'past_key_values.0.key', 'past_key_values.0.value', 'past_key_values.1.key', 'past_key_values.1.value', 'past_key_values.2.key', 'past_key_values.2.value', 'past_key_values.3.key', 'past_key_values.3.value', 'past_key_values.4.key', 'past_key_values.4.value', 'attention_mask', 'position_ids'

inspect.sig(model.forward).parameters
# mappingproxy(OrderedDict([('input_ids', <Parameter "input_ids: Optional[torch.LongTensor] = None">), ('past_key_values', <Parameter "past_key_values: Union[transformers.cache_utils.Cache, Tuple[Tuple[torch.Tensor]], NoneType] = None">), ('attention_mask', <Parameter "attention_mask: Optional[torch.FloatTensor] = None">), ('token_type_ids', <Parameter "token_type_ids: Optional[torch.LongTensor] = None">), ('position_ids', <Parameter "position_ids: Optional[torch.LongTensor] = None">), ('head_mask', <Parameter "head_mask: Optional[torch.FloatTensor] = None">), ('inputs_embeds', <Parameter "inputs_embeds: Optional[torch.FloatTensor] = None">), ('labels', <Parameter "labels: Optional[torch.LongTensor] = None">), ('use_cache', <Parameter "use_cache: Optional[bool] = None">), ('output_attentions', <Parameter "output_attentions: Optional[bool] = None">), ('output_hidden_states', <Parameter "output_hidden_states: Optional[bool] = None">), ('return_dict', <Parameter "return_dict: Optional[bool] = None">), ('cache_position', <Parameter "cache_position: Optional[torch.LongTensor] = None">)]))
```

In the above case, the given input_names follow the ONNX graph, while they happen to have the same length as the torch model's forward signature. This kind of case is difficult to detect and automate for users.

On the other hand, the error message from torch.export.export is informative enough that I believe users will know how to proceed from there:

```python

import torch

class Model(torch.nn.Module):
    def forward(self, x=None, y=None):
        return x + y

dim = torch.export.Dim("x", min=1, max=6)
onnx_program = torch.export.export(
    Model(),
    (),
    kwargs={"x": torch.randn(2, 3), "y": torch.randn(2, 3)},
    dynamic_shapes={"custom_input_x": {0: dim}, "custom_input_y": {0: dim}},
)

# torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['x', 'y'] of `inputs`, but here they are ['custom_input_x', 'custom_input_y']. Alternatively, you could also ignore arg names entirely and specify `dynamic_shapes` as a list/tuple matching `inputs`. For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146002
Approved by: https://github.com/justinchuby
2025-01-30 16:01:38 +00:00
894ef8c1e3 [torchbench] Inductor freezing bfloat16 conv folding needs high tolerance (#145623)
Issue:
https://github.com/pytorch/pytorch/issues/144888

Torchbench of the timm lcnet_050 model fails on accuracy with `--freezing` `--inference` `--bfloat16`:
`res_error==0.12`
If inductor convolution constant folding is turned off: `res_error==0.016`

`float16 error ~ 0.00669`
`float16 without conv folding ~ 0.0018`

Convolution folding increases the error by almost an order of magnitude.

I think we should revisit this and try to improve the accuracy of conv folding, e.g. by doing conv folding at compilation time in float64.

At the moment I am adding counters to identify whether convolution folding happened, and in the bfloat16 + conv folding case I increase the tolerance multiplier to the max level (10) to pass the accuracy test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145623
Approved by: https://github.com/eellison
2025-01-30 12:46:35 +00:00
ffa628169d [ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 (#145728)
The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not re-usable on newer devices without modifications. This PR adds a guard for this matmul to be sm_90 specific. Once the kernel is there, the guard may be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145728
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-30 11:19:58 +00:00
6bd19e65b1 add inductor_triton_kernel_mapping_post_grad.json to tlparse (#145954)
Landing D67612181 here. The original exported PR somehow fails OSS CI, but this one doesn't (though the PR content is the same).

Add debug trace artifact to inductor_triton_kernel_mapping_post_grad.json (debug artifact for provenance tracking) to tlparse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145954
Approved by: https://github.com/YUNQIUGUO
2025-01-30 06:18:48 +00:00
8a6e9a88e9 Let PYTORCH_NO_CUDA_MEMORY_CACHING has effect only when value is 1 (#145905)
Fixes #145661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145905
Approved by: https://github.com/eqy, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-30 05:11:10 +00:00
58cc6693cb [BE] Type annotate wrapper_benchmark.py and cuda_combined_scheduling.py (#145542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145542
Approved by: https://github.com/eellison
2025-01-30 03:53:52 +00:00
8cc6f17334 [CD] Install OpenMP from homebrew (#145889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145889
Approved by: https://github.com/atalman
ghstack dependencies: #145871, #145870
2025-01-30 03:19:51 +00:00
0d5f0a81c5 [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or by searching in `/opt/homebrew/opt/libomp` folder

Modify libomp bundling logic in setup.py to change absolute path to libomp.dylib to a relative one if necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007, https://github.com/atalman
ghstack dependencies: #145871
2025-01-30 03:19:51 +00:00
cyy
116af809eb Use std::string_view (#145906)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
933b6d9830 cpp_wrapper: enable in aarch64 and x86 nightly dashboard performance runs (#145791)
Adds `cpp_wrapper` mode to the nightly inductor benchmark runs, as well as optionally for manually triggered runs. This is justified by `aot_inductor` already being in those runs.

Additionally, re-enables `aot_inductor` in the nightly aarch64 runs. It was disabled 5 months ago to deal with a performance instability, which has likely gone away at this point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145791
Approved by: https://github.com/desertfire
2025-01-30 02:55:45 +00:00
32bb6f83d5 Make sure that benchmark_harness is set before running (#145532)
Running torch compile with these options causes an error, because the benchmark code isn't generated but is still called.
```
options={'profile_bandwidth_output': 'foo', 'benchmark_harness': False}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145532
Approved by: https://github.com/eellison
2025-01-30 01:25:53 +00:00
25ca05eebf [PGNCCL] Correct some ifdef's (#145893)
`create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893
Approved by: https://github.com/c-p-i-o
2025-01-30 01:05:21 +00:00
73dde451b7 [pytorch] Sprinkle in a few template keywords (#145877)
Summary:
These seem to be necessary to get compilation working on Windows with
CUDA 12.8. I'm not sure whether this means that all of the previous compilers
were broken, and the new one is better, or whether this is a regression in NVCC
12.8. Either way, as long as the CI passes for existing versions, this should
unblock us from CUDA 12.8 enablement on Windows.

See D68663662 for more details on the CUDA 12.8 enablement.

Test Plan: CI!

Reviewed By: akrieger

Differential Revision: D68787925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145877
Approved by: https://github.com/Skylion007
2025-01-30 00:57:40 +00:00
72699950b0 Copy model before benchmark warmup runs (#145858)
Fixes https://github.com/pytorch/pytorch/issues/144772

The eager warmup runs cause the model to change state, so that when we export it later, the model is different than when we export it directly out of the box. For some reason exporting the model with the changed state causes issues, but exporting the initial model is fine. This is why the accuracy checks pass but the performance check fails when exporting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145858
Approved by: https://github.com/desertfire
2025-01-30 00:36:33 +00:00
clr
6b41f310c2 config: Support str env variables (#145980)
Summary:
This allows us to use environment variables to set string values. We've added
tests for the specific functionality implemented here. Note that we already
accidentally started setting up configs to use this, so we're just adding the
feature.

Additionally, we're not fully validating the underlying type when we set the
value (and in general, doing so is more difficult than we would like). Let
me know if people feel strongly, and we can add a PR to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145980
Approved by: https://github.com/yushangdi, https://github.com/oulgen
2025-01-30 00:13:02 +00:00
a9ed7bd78e [utilization] pipeline to create clean db records (#145327)
upload_utilization_script to generate db-ready insert records to S3:
- generate two files, metadata and timeseries, in ossci-utilization buckets
- convert log records to db-format ones
- add a unit test job for tools/stats/

Related Prs:
setup composite action for data pipeline: https://github.com/pytorch/pytorch/pull/145310
add permission for composite action to access S3 bucket: https://github.com/pytorch-labs/pytorch-gha-infra/pull/595
add insert logic in s3 replicator: https://github.com/pytorch/test-infra/pull/6217
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145327
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-29 23:48:50 +00:00
18a7a04c4a [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 23:20:22 +00:00
b60120d0df Revert "[ATen][CUDA] Implement 128 bit vectorization v2 (#145746)"
This reverts commit 81685d81eb86595d169f55a564da26eaafb2ddf5.

Reverted https://github.com/pytorch/pytorch/pull/145746 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking in trunk. See functorch/test_ops.py::TestOperatorsCUDA::test_jvp_nn_functional_multi_head_attention_forward_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/13032483748/job/36358184032) [HUD commit link](81685d81eb) ([comment](https://github.com/pytorch/pytorch/pull/145746#issuecomment-2623108958))
2025-01-29 23:02:23 +00:00
521588519d re-use FloorDiv for RShift (#145898)
I encountered this C++ compilation error.
```
  579 |     int64_t var_6 = (static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))) | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0)) | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
      |                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                     |                                                                                                         |
      |                                                                     int64_t {aka long int}                                                                                    double
```

Then, I figured out where this std::floor came from with the help of Bob's guard provenance tool. It comes from RShift which is used in `triton.next_power_of_2`.

---
Before, we used `std::floor`
```
int64_t var_6 = (
   static_cast<int64_t>(std::floor((1.0/2.0)*u0)) |
   static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0)))))
   | std::floor((1.0/16.0)*(static_cast<int64_t>(std::floor((1.0/2.0)*u0))             # no cast to int here.
   | static_cast<int64_t>(std::floor((1.0/4.0)*static_cast<int64_t>(std::floor((1.0/2.0)*u0))))));
```

Now, we use `c10::div_floor_integer` instead
```
int64_t var_6 = (
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L))) |
   (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))) |
   (c10::div_floor_integer(static_cast<int64_t>((c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(2L)))
   | (c10::div_floor_integer(static_cast<int64_t>(u0), static_cast<int64_t>(8L)))), static_cast<int64_t>(16L)));
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145898
Approved by: https://github.com/desertfire, https://github.com/bobrenjc93
ghstack dependencies: #145802
2025-01-29 22:50:22 +00:00
3df961d99b give emulate_precision_casts an envar (#145948)
this was requested internally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145948
Approved by: https://github.com/mlazos
2025-01-29 22:43:32 +00:00
2e5886dcc4 Add fake_impl for unique_consecutive (#145649)
Summary:
It's fairly similar to torch.unique and torch.unique_dim.
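As a loose sketch of what a fake (meta) implementation with a data-dependent output size can look like, written here for a hypothetical custom op rather than the actual registration added in this PR:

```python
import torch
from torch.library import custom_op, register_fake

@custom_op("mylib::unique_like", mutates_args=())
def unique_like(x: torch.Tensor) -> torch.Tensor:
    # Eager implementation: the number of outputs depends on the data.
    return torch.unique_consecutive(x)

@register_fake("mylib::unique_like")
def _(x):
    # The output length is data-dependent, so allocate a fresh unbacked
    # symbolic size instead of a concrete one.
    ctx = torch.library.get_ctx()
    n = ctx.new_dynamic_size()
    return x.new_empty(n)
```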

Test Plan:
New test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145649
Approved by: https://github.com/ezyang, https://github.com/eellison
2025-01-29 22:33:16 +00:00
1e57154af3 Require that all HOPs be imported at import torch time (#145939)
E.g. torch.ops.higher_order.cond does not exist until it is imported,
which is bad if it shows up in an FX graph or is used in some code
somewhere.

This PR also makes some more HOPs get imported at `import torch` time.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145939
Approved by: https://github.com/ydwu4
ghstack dependencies: #145938
2025-01-29 22:27:52 +00:00
2141c1aebe Better hop_db comment; move test to a non-export test file (#145938)
Goal is for people to better test their HOPs.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145938
Approved by: https://github.com/ydwu4
2025-01-29 22:27:52 +00:00
e02c038a23 [dynamo][benchmarks] Stop benchmarking compile time of dead code (#145590)
FIXES https://github.com/pytorch/pytorch/issues/144775

See details on the problem: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2611699385
We fixed some silent incorrectness, but it results in fewer nodes being DCE'd. The benchmark iteration loop had some dead code that could contain side-effecting ops which aren't safe to DCE. The regression is expected.

This PR removes the compile time benchmarking of the dead code, which should reduce the noise of the benchmark and aligns with the benchmarking used by performance tests

New benchmark results:
```python
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,39.322364  # after https://github.com/pytorch/pytorch/pull/144319
cuda,BartForConditionalGeneration,1,pass,897,1,0,0,0,0,0,38.972257  # before https://github.com/pytorch/pytorch/pull/144319
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145590
Approved by: https://github.com/jansel
ghstack dependencies: #145447
2025-01-29 22:14:47 +00:00
793dfc27e0 [inductor] Add some typing to triton.py (#145688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145688
Approved by: https://github.com/Skylion007, https://github.com/eellison
ghstack dependencies: #145671, #145695
2025-01-29 21:56:40 +00:00
5db0ad92e3 [inductor] Remove mask_str from IndexingOptions (#145695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145695
Approved by: https://github.com/eellison
ghstack dependencies: #145671
2025-01-29 21:56:40 +00:00
23ff899164 [inductor] Fix handling of fixed XBLOCK larger than xnumel=1 (#145671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145671
Approved by: https://github.com/eellison
2025-01-29 21:56:32 +00:00
bb2fb554a9 [BE]: Update CUTLASS submodule to 3.7.0 (#145172)
* This has a couple of new features, but mostly has a lot of bugfixes for the prior releases
* This is the last Hopper-focused release of CUTLASS before Blackwell drops, so let's upgrade to it.
* Most of the remaining diff noise is copyright year updates on the CUTLASS submodule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145172
Approved by: https://github.com/eqy, https://github.com/henrylhtsang
2025-01-29 21:48:01 +00:00
d0aa1386b8 Disable AOTAutogradCache for triton version < 3.2 (#145937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145937
Approved by: https://github.com/bdhirsh
2025-01-29 21:32:16 +00:00
1185b81c51 Revert "[dynamo] Use polyfill to implement comparison operators (#144485)"
This reverts commit d1f82de2bf4ce4d4461791a9c9b2e759202db0bb.

Reverted https://github.com/pytorch/pytorch/pull/144485 on behalf of https://github.com/huydhn due to This seems to break dynamo tests in trunk after landing ([comment](https://github.com/pytorch/pytorch/pull/144485#issuecomment-2622893294))
2025-01-29 21:30:42 +00:00
953e80936e [linter] Grep linter batches long command (#145950)
If the command is too long, the linter fails with
```
Failed due to OSError:
[Errno 7] Argument list too long: 'grep'
```
Fix this by batching the command so it is shorter

The 750k limit was chosen because `getconf ARG_MAX` returns ~1M on my Mac. My guess is that most people shouldn't hit this unless they run --all-files and the path list is long.
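A rough sketch of the batching idea (illustrative only, not the linter's actual code; the byte limit and helper names are assumptions):

```python
import subprocess

MAX_ARG_BYTES = 750_000  # stay well under `getconf ARG_MAX`

def run_grep(pattern, files):
    # grep exits 1 when nothing matches, which is not an error here.
    proc = subprocess.run(["grep", "-nH", pattern] + files,
                          capture_output=True, text=True)
    return proc.stdout.splitlines()

def grep_in_batches(pattern, files):
    matches, batch, batch_bytes = [], [], 0
    for f in files:
        # +1 accounts for the separator between arguments.
        if batch and batch_bytes + len(f) + 1 > MAX_ARG_BYTES:
            matches += run_grep(pattern, batch)
            batch, batch_bytes = [], 0
        batch.append(f)
        batch_bytes += len(f) + 1
    if batch:
        matches += run_grep(pattern, batch)
    return matches
```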

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145950
Approved by: https://github.com/wdvr
2025-01-29 21:23:27 +00:00
a6e3f294f1 Don't use mypy daemon in CI (#145961)
This is an attempt to fix flaky mypy errors in CI that look like:

```
dmypy status --verbose
connection_name         : /var/folders/rf/qrn1jkgj0b9_tcznwp8ck46w0000gn/T/tmpjoqsid7_/dmypy.sock
pid                     :      32233
error                   :  timed out
Daemon is stuck; consider /Users/zainr/pytorch/venv/bin/dmypy kill
```

"Fix" it by not using the daemon at all, since it doesn't actually provide any perf benefits in CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145961
Approved by: https://github.com/malfet
2025-01-29 21:15:29 +00:00
40ccb7a86d cpp_wrapper: Move #includes to per-device header files (#145932)
Summary:
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts.

Co-authored-by: Benjamin Glass <[bglass@quansight.com](mailto:bglass@quansight.com)>

Differential Revision: D68656960

Pulled By: benjaminglass1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932
Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1

Co-authored-by: bglass@quansight.com <bglass@quansight.com>
2025-01-29 21:08:45 +00:00
8bd7bf3269 [Inductor-CPU] Add profiling support for codegened flex attention kernels (#145894)
### Summary

`RECORD_FUNCTION` wasn't present in codegened Inductor-CPU Flex Attention C++ kernels, so flex attention kernels weren't present in the PyTorch profiler profiling data.

Fixes #145825 by adding `RECORD_FUNCTION` calls in the codegened flex-attention kernels.

### Caveat

#### _Before_
No corresponding results in PyTorch profiler profiling data

#### _After_

| Inductor config settings |  What kernel name looks like in profiling data | Comments|
|-------------------|------------------------------------|--------------------|
| Env variable `TORCHINDUCTOR_CPP_WRAPPER=1` OR `inductor.config.cpp_wrapper=1` in python code | `graph_x_cpp_fused_y` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
|  `inductor.config.cpp.descriptive_names = "inductor_node"` but not CPP wrapper | `graph_x_kernel` | No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |
| Both `inductor_config.cpp.descriptive_names = "inductor_node"`  & Inductor CPP Wrapper | `graph_x_cpp_fused_flex_attention_y`| Easy to interpret data |
| Neither of the two configs  | `graph_x_kernel`| No way to tell from the profiling results if the kernel is a GEMM kernel or an attention kernel |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145894
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel
2025-01-29 20:54:46 +00:00
bb4964013f Add deterministic kernel for reflection2d (#136241)
Adds feature for #98925

Tests pass for both the existing reflectionpad2d kernel and the new one I added.

**Summary of the work:**

A simple conditional check for deterministic mode dispatches to a different kernel. This kernel does not use any atomic operations and produces deterministic results: instead of mapping each output to its input (a 1:1 relationship), it does the opposite and maps each input to all of its outputs (1-to-many). These operations are done in the same order on every execution, since the kernel simply traverses the data with a grid-stride loop and uses linearized indexing into the input tensor.

Each thread computes the four conditionals, which are then used to check whether the input element has an output in any of the 8 regions: top left, top, top right, left, right, bottom left, bottom, and bottom right.

I did not focus on performance for this PR, as that would expand the scope heavily, but I am happy to answer any performance questions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241
Approved by: https://github.com/eqy, https://github.com/albanD
2025-01-29 20:34:03 +00:00
2b8c28099a [OSS] Add no dist as an argument to DCP top level apis (#145754)
Summary: No-dist, for a non-distributed checkpoint, was a top level param in the past, but was removed. This was requested back in https://github.com/pytorch/pytorch/issues/125777 and will be needed for our torchtune changes to use DCP

Test Plan: existing tests pass

Differential Revision: D68714246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145754
Approved by: https://github.com/daulet-askarov
2025-01-29 20:33:37 +00:00
2d5d022594 Fix a number of flexattention issues (cse, cudagraph, etc.) (#145059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145059
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-29 20:27:39 +00:00
6aed6c042e [CD] Install ninja and setuptools from PyPI (#145871)
As well as typing extensions, they are available from PyPI, no reason to install them from Anaconda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
2025-01-29 19:47:16 +00:00
b80482988f Revert "[CMake] Find HomeBrew OpenMP on MacOS (#145870)"
This reverts commit c26bb9ba5bd40d256a25436212279bc7e4b436ae.

Reverted https://github.com/pytorch/pytorch/pull/145870 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
b52e8d521e Revert "[CD] Install ninja and setuptools from PyPI (#145871)"
This reverts commit eea7d395e5faa9a4be5b60f6668c0bdf5163e3a0.

Reverted https://github.com/pytorch/pytorch/pull/145871 on behalf of https://github.com/malfet due to Want to refine it a bit ([comment](https://github.com/pytorch/pytorch/pull/145870#issuecomment-2622659614))
2025-01-29 19:34:27 +00:00
082fab0fc7 [64-bit] Int64 casting for UpSampleNearest3D (#144865)
Fixes #144855

Follows approach in https://github.com/pytorch/pytorch/pull/141923 to use int64 types to increase INT_MAX limits
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144865
Approved by: https://github.com/eqy
2025-01-29 19:30:09 +00:00
1c9014a135 [export] Add tlparse to draft-export (#145810)
Dependent on https://github.com/ezyang/tlparse/pull/87/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145810
Approved by: https://github.com/pianpwk
2025-01-29 19:26:00 +00:00
6371c25b91 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 9fd6722fc9068eeaa176754acb315fc7e0f6416c.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))
2025-01-29 18:30:30 +00:00
e0525dbca9 Revert "inductor.config.descriptive_names = False is not actually supported (#145523)"
This reverts commit edf266e9bbbf6063f7c4a336ffb50234e11a0a82.

Reverted https://github.com/pytorch/pytorch/pull/145523 on behalf of https://github.com/ZainRizvi due to Hi, this breaks type checks internally. Can you please take a look? See D68801083 for details ([comment](https://github.com/pytorch/pytorch/pull/145523#issuecomment-2622510900))
2025-01-29 18:27:44 +00:00
284f217011 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 97b3b73f3e96bb8684064715b93c825ba0395475.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))
2025-01-29 18:24:29 +00:00
0d6343347f Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448)"
This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed.

Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))
2025-01-29 18:07:12 +00:00
1a613c3342 bump counters for unbacked binding names (#145882)
Instead of bumping symint counters as we process unbacked bindings during deserialization, it's better to bump them at the beginning, based on the symbols that existed in the original shape env before serialization. This allows symbols in unbacked bindings to have "gaps" that incremental bumping alone would not be able to match.

Why is bumping counters important at all? It is because when the shape env coming out of deserialization is used later for propagating symints, say in run_decompositions, we don't want new names to clash with existing names (bad things happen).

Differential Revision: [D68798191](https://our.internmc.facebook.com/intern/diff/D68798191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145882
Approved by: https://github.com/pianpwk
2025-01-29 17:46:21 +00:00
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

Simultaneously, **another motivation** is that users could also clear some computation hashes for added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
d1f82de2bf [dynamo] Use polyfill to implement comparison operators (#144485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144485
Approved by: https://github.com/jansel
2025-01-29 17:37:40 +00:00
3e135993bd Update mi300 labels to account for multiple clusters. (#145923)
We now have multiple Kubernetes clusters of mi300x resources, and this commit updates labels accordingly to target both clusters evenly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145923
Approved by: https://github.com/jeffdaily
2025-01-29 16:56:43 +00:00
4499d60d56 [dynamo][builin-skipfiles-cleanup] Remove types (#145909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145909
Approved by: https://github.com/zou3519
ghstack dependencies: #145856, #145875, #145878, #145892
2025-01-29 16:47:02 +00:00
ed141d7d1a dont assign a size to _assert_scalar in partitioner (#143877)
Fixes https://github.com/pytorch/pytorch/issues/143876

Open to other suggestions - we have an invariant that all nodes in our ATen graphs should have a `meta['val']` field, but I don't think this is actually true in all cases, so I just hardcoded the invariant to ignore `_assert_scalar()` (which is a "special" op used in dynamic shapes for runtime asserts, and doesn't have a meta['val'] field)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143877
Approved by: https://github.com/zou3519
2025-01-29 16:21:37 +00:00
3b3aac0cde Filter out iGPU if dGPU is found on XPU (#144378)
# Motivation
for https://github.com/pytorch/pytorch/issues/143914
On Windows, there are two separate SYCL platforms for iGPU and dGPU. To simplify the logic, we will exclude iGPUs when a dGPU is present. This ensures that all XPU devices enumerated by PyTorch share the same SYCL context.

Now I generalize the logic as below (an illustrative sketch follows the list):
1. We find the first L0 platform containing at least one dGPU and enumerate all dGPUs of that platform.
2. If no dGPU is found, we find the first L0 platform containing iGPU and enumerate all iGPUs of that platform.
3. No GPU is found (neither iGPU nor dGPU).
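A minimal sketch of that priority logic over a hypothetical list of platforms (the `platforms` structure and `is_dgpu` helper are assumptions for illustration, not the actual SYCL enumeration code):

```python
def pick_xpu_devices(platforms, is_dgpu):
    """platforms: list of per-platform device lists, in enumeration order."""
    # 1. First platform that contains at least one dGPU:
    #    expose all dGPUs of that platform.
    for devices in platforms:
        dgpus = [d for d in devices if is_dgpu(d)]
        if dgpus:
            return dgpus
    # 2. Otherwise, first platform that contains an iGPU: expose its iGPUs.
    for devices in platforms:
        igpus = [d for d in devices if not is_dgpu(d)]
        if igpus:
            return igpus
    # 3. No GPU found at all (neither iGPU nor dGPU).
    return []
```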
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144378
Approved by: https://github.com/EikanWang, https://github.com/gujinghui
2025-01-29 15:53:16 +00:00
5e5da9bd9a [triton] Update pin to tip of 3.2 release (#145867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145867
Approved by: https://github.com/Skylion007, https://github.com/htyu, https://github.com/exclamaforte
2025-01-29 15:17:58 +00:00
81685d81eb [ATen][CUDA] Implement 128 bit vectorization v2 (#145746)
This is a re-base PR to my previous one #141959.

Description from the original PR:

This PR implements 128-bit vectorization. It improves the performance of contiguous elementwise ops by 4-10% on Hopper H100.

<details>

<summary>The benchmark code used </summary>

```Python

import time
import torch
from torch.profiler import profile, ProfilerActivity

def benchmark(function, dtype=torch.float32, check_numerics=True, print_profile=False):
    device = torch.device("cuda")

    shapes = []
    for p in range(24, 30):
        shape = 1<<p
        shapes.append(shape)

    for shape in shapes:
        for _ in range(6):
            x = torch.randn(shape, device=device, dtype=dtype)
            y = function(x)

        if print_profile:
            x = torch.randn(shape, device=device, dtype=dtype)
            with profile(activities=[ProfilerActivity.CUDA], record_shapes=True) as prof:
                y = function(x)
            print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

        x = torch.randn(shape, device=device, dtype=dtype)
        torch.cuda.synchronize()
        t1 = time.perf_counter()
        for _ in range(6):
            y = function(x)
        torch.cuda.synchronize()
        t2 = time.perf_counter()
        perf_time = (t2 - t1) / 6

        print(f"{function.__name__}, {dtype}, {shape}, {perf_time}")
        if check_numerics:
            x_cpu = x.cpu()
            y_cpu = function(x_cpu).cuda()
            try:
                torch.testing.assert_allclose(y_cpu, y)
            except AssertionError as error:
                print("An exception occurred:", error)

def main():
    ops = [
            torch.relu,
            torch.sigmoid,
            torch.tanh,
            torch.nn.functional.gelu,
            torch.sin,
            torch.exp,
    ]

    dtypes = [
            torch.float16,
            torch.bfloat16,
            torch.float32,
    ]

    for op in ops:
        for dtype in dtypes:
            benchmark(op, dtype=dtype)
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```

</details>

<details>

<summary> Results </summary>

| op | dtype | size | time after | time before | % improvement |
| ---- | ---- | ---- | ---- | ---- | ---- |
| relu | torch.float16 | 33554432 | 4.84E-05 | 5.06E-05 | 4.66296539127052 |
| relu | torch.float16 | 67108864 | 9.22E-05 | 9.64E-05 | 4.56491432752297 |
| relu | torch.float16 | 134217728 | 0.000180343495837102 | 0.000187981834945579 | 4.23543919508829 |
| relu | torch.float16 | 268435456 | 0.000355071155354381 | 0.000370856161074092 | 4.44558942107169 |
| relu | torch.float16 | 536870912 | 0.000704489842367669 | 0.000736006341564159 | 4.47366268483987 |
| relu | torch.bfloat16 | 16777216 | 3.03E-05 | 3.04E-05 | 0.166504085842689 |
| relu | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.45848238875716 |
| relu | torch.bfloat16 | 67108864 | 9.32E-05 | 9.65E-05 | 3.56122651631445 |
| relu | torch.bfloat16 | 134217728 | 0.000180805509444326 | 0.000187998676362137 | 3.97840029317567 |
| relu | torch.bfloat16 | 268435456 | 0.000356242332297067 | 0.000371279485989362 | 4.22104627356745 |
| relu | torch.bfloat16 | 536870912 | 0.000708114336399982 | 0.000736773828975856 | 4.04729732229083 |
| relu | torch.float32 | 16777216 | 5.61E-05 | 5.61E-05 | 0.0442587268354941 |
| relu | torch.float32 | 33554432 | 9.33E-05 | 9.30E-05 | -0.259070913799022 |
| relu | torch.float32 | 67108864 | 0.000181321326332788 | 0.000181289506144822 | -0.0175490597877115 |
| relu | torch.float32 | 134217728 | 0.000356896334172537 | 0.000356570177245885 | -0.0913870206618981 |
| relu | torch.float32 | 268435456 | 0.000709421835684528 | 0.000707465515006334 | -0.275762681635911 |
| relu | torch.float32 | 536870912 | 0.00141372415237129 | 0.00141036518228551 | -0.237597276678471 |
| sigmoid | torch.float16 | 16777216 | 3.10E-05 | 3.16E-05 | 2.10012593866895 |
| sigmoid | torch.float16 | 33554432 | 4.91E-05 | 5.23E-05 | 6.37710600666122 |
| sigmoid | torch.float16 | 67108864 | 9.30E-05 | 0.000100057009452333 | 7.61866144555331 |
| sigmoid | torch.float16 | 134217728 | 0.000180928347011407 | 0.000194982004662355 | 7.76752669390248 |
| sigmoid | torch.float16 | 268435456 | 0.000355658994521946 | 0.00038468533117945 | 8.16128288742412 |
| sigmoid | torch.float16 | 536870912 | 0.000705982849467546 | 0.000764021339515845 | 8.22094900634937 |
| sigmoid | torch.bfloat16 | 16777216 | 3.08E-05 | 3.17E-05 | 2.90965915673149 |
| sigmoid | torch.bfloat16 | 33554432 | 4.87E-05 | 5.24E-05 | 7.63503884668234 |
| sigmoid | torch.bfloat16 | 67108864 | 9.33E-05 | 0.000100019678939134 | 7.21238137428013 |
| sigmoid | torch.bfloat16 | 134217728 | 0.000180786165098349 | 0.000194868014659733 | 7.78922964250206 |
| sigmoid | torch.bfloat16 | 268435456 | 0.000355564659306159 | 0.000384909333661199 | 8.25297835063321 |
| sigmoid | torch.bfloat16 | 536870912 | 0.000705831005082776 | 0.000764102345177283 | 8.2557070566308 |
| sigmoid | torch.float32 | 16777216 | 4.93E-05 | 5.65E-05 | 14.5314136197766 |
| sigmoid | torch.float32 | 33554432 | 9.32E-05 | 9.31E-05 | -0.120169865610833 |
| sigmoid | torch.float32 | 67108864 | 0.000181328505277634 | 0.000180455681402236 | -0.481349512069855 |
| sigmoid | torch.float32 | 134217728 | 0.000357362829769651 | 0.000356093340087682 | -0.35523831137877 |
| sigmoid | torch.float32 | 268435456 | 0.000708921831877281 | 0.000707052337626616 | -0.263709504574663 |
| sigmoid | torch.float32 | 536870912 | 0.00141358317341656 | 0.0014090768333214 | -0.318788464654745 |
| tanh | torch.float16 | 16777216 | 3.03E-05 | 3.03E-05 | -0.0912564658661808 |
| tanh | torch.float16 | 33554432 | 4.90E-05 | 5.07E-05 | 3.46644442974484 |
| tanh | torch.float16 | 67108864 | 9.30E-05 | 9.68E-05 | 3.99871369815531 |
| tanh | torch.float16 | 134217728 | 0.00018052199933057 | 0.000188717152923346 | 4.53969799978138 |
| tanh | torch.float16 | 268435456 | 0.000355684508879979 | 0.000373026006855071 | 4.8755280430115 |
| tanh | torch.float16 | 536870912 | 0.000706660988119741 | 0.000740105014604827 | 4.73268328765002 |
| tanh | torch.bfloat16 | 16777216 | 2.99E-05 | 3.03E-05 | 1.21049563135981 |
| tanh | torch.bfloat16 | 33554432 | 4.89E-05 | 5.06E-05 | 3.48836101041744 |
| tanh | torch.bfloat16 | 67108864 | 9.28E-05 | 9.69E-05 | 4.39944918036626 |
| tanh | torch.bfloat16 | 134217728 | 0.000180710999605556 | 0.000189167990659674 | 4.67984299382829 |
| tanh | torch.bfloat16 | 268435456 | 0.000356062994493792 | 0.000372666652159144 | 4.66312363882606 |
| tanh | torch.bfloat16 | 536870912 | 0.000707100164921333 | 0.000740134331863374 | 4.67178040408393 |
| tanh | torch.float32 | 16777216 | 5.61E-05 | 5.64E-05 | 0.439595755746353 |
| tanh | torch.float32 | 33554432 | 9.31E-05 | 9.31E-05 | 0.00287633090228212 |
| tanh | torch.float32 | 67108864 | 0.000181465332085888 | 0.000180895323865116 | -0.31411411437098 |
| tanh | torch.float32 | 134217728 | 0.000356963835656643 | 0.000356073161431899 | -0.249513854283251 |
| tanh | torch.float32 | 268435456 | 0.000709201170442005 | 0.00070707315656667 | -0.300057862849997 |
| tanh | torch.float32 | 536870912 | 0.00141367283261692 | 0.00141030051357423 | -0.238550176877922 |
| gelu | torch.float16 | 16777216 | 2.73E-05 | 3.17E-05 | 15.921079070745 |
| gelu | torch.float16 | 33554432 | 5.06E-05 | 5.55E-05 | 9.76345374333098 |
| gelu | torch.float16 | 67108864 | 9.65E-05 | 0.000106600326641152 | 10.4308039074712 |
| gelu | torch.float16 | 134217728 | 0.000187776672343413 | 0.000208565829476962 | 11.0712139447915 |
| gelu | torch.float16 | 268435456 | 0.000370216167842348 | 0.000412251994324227 | 11.3544005187205 |
| gelu | torch.float16 | 536870912 | 0.000737301345604161 | 0.000819394170927505 | 11.1342296895002 |
| gelu | torch.bfloat16 | 16777216 | 3.02E-05 | 3.08E-05 | 1.78405479367653 |
| gelu | torch.bfloat16 | 33554432 | 5.13E-05 | 5.69E-05 | 10.9929393318302 |
| gelu | torch.bfloat16 | 67108864 | 9.76E-05 | 0.00010968199543034 | 12.3420807512356 |
| gelu | torch.bfloat16 | 134217728 | 0.000189661824454864 | 0.000214487663470209 | 13.0895287371091 |
| gelu | torch.bfloat16 | 268435456 | 0.000374197009174774 | 0.000423670164309442 | 13.2211519391275 |
| gelu | torch.bfloat16 | 536870912 | 0.000743675006863972 | 0.000842577001700799 | 13.299088166737 |
| gelu | torch.float32 | 16777216 | 5.06E-05 | 5.04E-05 | -0.413385894716413 |
| gelu | torch.float32 | 33554432 | 9.31E-05 | 9.32E-05 | 0.134157041722546 |
| gelu | torch.float32 | 67108864 | 0.000181480175039421 | 0.000180836669945469 | -0.354586992112075 |
| gelu | torch.float32 | 134217728 | 0.000356874331676712 | 0.000356305002545317 | -0.159532104402047 |
| gelu | torch.float32 | 268435456 | 0.000708909006789327 | 0.000706991491218408 | -0.270488250615287 |
| gelu | torch.float32 | 536870912 | 0.00141321367118508 | 0.00140937082081412 | -0.271922813181618 |
| sin | torch.float16 | 16777216 | 3.04E-05 | 3.11E-05 | 2.21834939018859 |
| sin | torch.float16 | 33554432 | 4.85E-05 | 5.23E-05 | 7.72165512511596 |
| sin | torch.float16 | 67108864 | 9.31E-05 | 9.98E-05 | 7.24947099480072 |
| sin | torch.float16 | 134217728 | 0.000180371008658161 | 0.000194791161144773 | 7.99471744039613 |
| sin | torch.float16 | 268435456 | 0.000355454161763191 | 0.000384903668115536 | 8.28503630574026 |
| sin | torch.float16 | 536870912 | 0.000705183832906187 | 0.000764360166310022 | 8.39161799270973 |
| sin | torch.bfloat16 | 16777216 | 3.11E-05 | 3.10E-05 | -0.257677954940036 |
| sin | torch.bfloat16 | 33554432 | 4.89E-05 | 5.24E-05 | 7.34808420323539 |
| sin | torch.bfloat16 | 67108864 | 9.26E-05 | 0.000100248667877167 | 8.22347488801205 |
| sin | torch.bfloat16 | 134217728 | 0.000180674154156198 | 0.00019567032965521 | 8.30012215584937 |
| sin | torch.bfloat16 | 268435456 | 0.000355360486234228 | 0.000386023331278314 | 8.62865913118873 |
| sin | torch.bfloat16 | 536870912 | 0.00070483615854755 | 0.000766805159704139 | 8.79197248964745 |
| sin | torch.float32 | 16777216 | 5.67E-05 | 5.64E-05 | -0.441348534920039 |
| sin | torch.float32 | 33554432 | 9.34E-05 | 9.30E-05 | -0.496458540364117 |
| sin | torch.float32 | 67108864 | 0.000181706990891447 | 0.000180556671693921 | -0.633062708199702 |
| sin | torch.float32 | 134217728 | 0.000356894995396336 | 0.000356046327700218 | -0.237791985616354 |
| sin | torch.float32 | 268435456 | 0.000708777321657787 | 0.000707602652255446 | -0.165731798471427 |
| sin | torch.float32 | 536870912 | 0.00141263716310884 | 0.00140912582476934 | -0.248566187496451 |
| exp | torch.float16 | 16777216 | 3.00E-05 | 3.04E-05 | 1.40099098901014 |
| exp | torch.float16 | 33554432 | 4.86E-05 | 5.03E-05 | 3.44611943643906 |
| exp | torch.float16 | 67108864 | 9.37E-05 | 9.55E-05 | 1.96412400380129 |
| exp | torch.float16 | 134217728 | 0.000180913504057874 | 0.000187193179347863 | 3.47109262113439 |
| exp | torch.float16 | 268435456 | 0.00035607748820136 | 0.000369079003576189 | 3.65131630210701 |
| exp | torch.float16 | 536870912 | 0.000707551507124056 | 0.000732363162872692 | 3.50669251620789 |
| exp | torch.bfloat16 | 16777216 | 2.98E-05 | 3.04E-05 | 1.74345594341654 |
| exp | torch.bfloat16 | 33554432 | 4.88E-05 | 5.04E-05 | 3.40217856534821 |
| exp | torch.bfloat16 | 67108864 | 9.32E-05 | 9.62E-05 | 3.29219958210226 |
| exp | torch.bfloat16 | 134217728 | 0.000180999826019009 | 0.000187239318620414 | 3.44723679499521 |
| exp | torch.bfloat16 | 268435456 | 0.000355944503098726 | 0.000369370992605885 | 3.77207384585864 |
| exp | torch.bfloat16 | 536870912 | 0.000707135167128096 | 0.000733066000975668 | 3.66702648277075 |
| exp | torch.float32 | 16777216 | 4.89E-05 | 5.63E-05 | 15.1245314346532 |
| exp | torch.float32 | 33554432 | 9.34E-05 | 9.31E-05 | -0.259945454477446 |
| exp | torch.float32 | 67108864 | 0.000181152504713585 | 0.000180474346658836 | -0.374357536939058 |
| exp | torch.float32 | 134217728 | 0.000356771342922002 | 0.000355627329554409 | -0.3206573034212 |
| exp | torch.float32 | 268435456 | 0.000708404501589636 | 0.00070713268360123 | -0.179532736671163 |
| exp | torch.float32 | 536870912 | 0.00141283582585553 | 0.00140944866385932 | -0.23974208002295 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145746
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-01-29 13:32:59 +00:00
354fe48db9 Add magma cuda build 12.8 (#145765)
https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145765
Approved by: https://github.com/malfet
2025-01-29 08:43:38 +00:00
501c5972f0 [pytorch] raise exception when calling dim order on sparse tensor (#145888)
This diff introduces a change to the PyTorch library that raises an exception when calling the `dim_order` method on a sparse tensor.

Differential Revision: [D68797044](https://our.internmc.facebook.com/intern/diff/D68797044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145888
Approved by: https://github.com/Jack-Khuu
2025-01-29 06:15:44 +00:00
2e8c080ab1 [inductor][4/N] triton support post-#5512, fix constexpr signatures (#145583)
Prior to this PR, constexprs were appearing in signatures as `{.. "XBLOCK : tl.constexpr": "constexpr"}` when they really should appear as `{.. "XBLOCK": "constexpr"}`.

This PR represents the argument names as ArgName objects, which can optionally be marked as constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145583
Approved by: https://github.com/jansel
2025-01-29 05:46:05 +00:00
3f77002b96 [dynamo][builtin-skipfiles-cleanup] remove abc, enum, importlib (#145892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145892
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875, #145878
2025-01-29 05:30:06 +00:00
236793684d [dynamo][builtin-skipfiles-cleanup] Remove threading, _collections_abc, _weakrefset, threading (#145878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145878
Approved by: https://github.com/williamwen42, https://github.com/StrongerXi
ghstack dependencies: #145856, #145875
2025-01-29 05:30:06 +00:00
a479656cd2 [dynamo][builtin-skipfiles-removal] Remove logging (#145875)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145875
Approved by: https://github.com/williamwen42
ghstack dependencies: #145856
2025-01-29 05:29:58 +00:00
64ee57847b [dynamo][builtin-skipfiles-cleanup] Remove some builtins (#145856)
[dynamo][builtin-skipfiles-cleanup] Remove more builtins

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145856
Approved by: https://github.com/zou3519
2025-01-29 05:29:47 +00:00
7178b827d7 PEP585: Missed conversions (#145342)
Differential Revision: [D68785969](https://our.internmc.facebook.com/intern/diff/D68785969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145342
Approved by: https://github.com/bobrenjc93
2025-01-29 05:24:36 +00:00
8696e59ae2 add test for capture_dynamic_output_shape_ops=True changing expected output between eager and compiled versions (#145821)
Followup from https://github.com/pytorch/pytorch/issues/130290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145821
Approved by: https://github.com/eellison, https://github.com/ezyang
2025-01-29 04:36:32 +00:00
776bdb962c [ONNX] Support subgraphs with 1+ outputs (#145860)
Fixed a bug in _handle_output_node where additional output values were not added as graph outputs

Fixes #145734
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145860
Approved by: https://github.com/titaiwangms
2025-01-29 04:13:23 +00:00
cyy
fd515e4f59 Fix C++20 Wambiguous-reversed-operator warnings (#144126)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144126
Approved by: https://github.com/albanD
2025-01-29 03:13:57 +00:00
90a6db4a9c [be][pytorch] Fix backend in autocast (#145859)
Summary: fixing backend typo (BAKCNEDS -> BACKENDS)

Test Plan: ci

Differential Revision: D68573324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145859
Approved by: https://github.com/jvandebon
2025-01-29 03:13:08 +00:00
9be2e88d41 Fix lowering to inductor IR for triton CPU (#144389)
Example failing test:
 `pytest -s test_torchinductor_opinfo.py  -k test_comprehensive_special_polygamma_special_polygamma_n_0_cpu_float32` when using triton CPU.

Failure:
```shell
triton.compiler.errors.CompilationError: at 10:11:
def triton_poi_fused_polygamma_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 1.0
    tl.static_assert(tmp1.dtype == tl.float32)
    tmp2 = ops.polygamma(tmp1, tmp0)
           ^
NameError('ops is not defined')
```
This occurs because the registered triton fallbacks are not used during the lowering to inductor IR.

Marked the problematic code in the excerpt below from 6bc17b0725/torch/_inductor/lowering.py (L572)

```python
def make_pointwise(
    fn,
    override_return_dtype=None,
    override_device=None,
    override_fn_when_input_bool=None,
    override_fn_when_gpu_float64=None,
    allow_alpha=False,
    triton_fallback=None,
):
    def inner(*inputs: TensorBox, alpha=None):
        if triton_fallback is not None and any(
            isinstance(inp, IRNode) and is_triton(inp) for inp in inputs <--- is_triton should return True when using triton CPU
        ):
            assert not allow_alpha  # not implemented
            return triton_fallback(*inputs)

        inputs = promote_constants(inputs, override_return_dtype)
        if allow_alpha:
            if alpha is not None and alpha != 1:
                inputs = list(inputs)

```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144389
Approved by: https://github.com/jansel
2025-01-29 03:10:53 +00:00
50f834f134 [export] allow bit shift builtin ops (#145802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145802
Approved by: https://github.com/pianpwk
2025-01-29 03:05:48 +00:00
f4ca98950e Add CUDA 12.8 libtorch image (#145789)
https://github.com/pytorch/pytorch/issues/145570

Builds the CUDA 12.8 libtorch Docker image and deprecates 12.1 in the meantime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145789
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-29 02:59:37 +00:00
9330b6d098 Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Differential Revision: [D68751149](https://our.internmc.facebook.com/intern/diff/D68751149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-29 02:52:55 +00:00
9fd6722fc9 [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 02:48:56 +00:00
29521256e1 [Customized Optimus][Inductor] Add split cat pattern in aten level (#145721)
Summary:
Thanks to Microve for discovering that recGPT has some repeated similar kernels that might be optimized through Optimus. After investigation, I designed a pattern at the aten level to remove such redundant kernels.

trace: https://fburl.com/perfdoctor/82fauil7
tlparse: https://fburl.com/98q6tadx

Test Plan:
# unit test

```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_aten_passes -- test_split_cat_post_grad
```

Buck UI: https://www.internalfb.com/buck2/e8458d63-b8ca-498b-a731-77a83fb4d1cb
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16325548715106567
Network: Up: 341KiB  Down: 359KiB  (reSessionID-7d3de666-7fc1-4988-8d11-d75ba958016d)
Executing actions. Remaining     0/3
Command: test.     Finished 2 local
Time elapsed: 3:04.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# local run

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=local_recgpt_ranking_30x_v0_unified_seq_1115
```

https://www.internalfb.com/mlhub/pipeline/1630903954173593

# E2E

```
buck2 run @//mode/opt aps_models/ads/recgpt_exp:recgpt_launcher -- mode=mast_recgpt_ranking_30x_v0_unified_seq_1115 launcher.oncall=ads_model_platform launcher.data_project=ai_large_scale launcher.fbl_entitlement=ads_global_tc_training_efficiency launcher.tags=[ads_ranking_taxonomy_mc_qps_optimization] launcher.hardware=SMC_T20 launcher.job_name=recgpt_ranking_1115_pt2_with_optimus data_loader.dataset.table_ds=[2024-12-13,2024-12-14,2024-12-15,2024-12-16,2024-12-17,2024-12-18]
```

### How to add the config
Add the following patterns to the dynamo config

```
        post_grad_fusion_options: {
          "normalization_aten_pass": {},
          "split_cat_aten_pass": {},
        }
```


baseline:
aps-recgpt_ranking_1115_pt2_5-8cb4905c7d


proposal:

Differential Revision: D68695717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145721
Approved by: https://github.com/Yuzhen11
2025-01-29 01:59:06 +00:00
331f49057d Removes threadfence from topk kernel to improve AMD performance (#145536)
Also marginally improves cuda perf

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145536
Approved by: https://github.com/eqy
2025-01-29 01:29:15 +00:00
6f5c8fb128 [DTensor] Add pointwise ops strategy for aten.minimum (#145816)
Need it for Shampoo optimizer.
9c5700ad5e/matrix_functions.py (L240-L242)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145816
Approved by: https://github.com/XilunWu
2025-01-29 01:19:01 +00:00
15e37e4253 [export] don't always print GM in serdes logging (#145857)
Summary: Didn't realize print_readable() would also print, rather than just return the string.

Test Plan: .

Differential Revision: D68781525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145857
Approved by: https://github.com/angelayi, https://github.com/yiming0416
2025-01-29 01:03:02 +00:00
a24b25942a Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Case 1: no `eps` is passed to the Python frontend:
Use the `eps` associated with `opmath_t` rather than the `eps` associated with `scalar_t` for the intermediate computation.

Case 2: `eps` is passed to the Python frontend:
Avoid downcasting `eps` to `scalar_t` and then implicitly upcasting it again in the `rqrst_input` computation.
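A small reference sketch of the intended behavior (math-only illustration, not the actual kernel): keep the computation, including `eps`, in the higher-precision opmath type for fp16/bf16 inputs.

```python
import torch

def rms_norm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Upcast to float32 (the "opmath" type for fp16/bf16) so that eps is
    # never downcast to the storage dtype and then implicitly upcast again.
    x_f = x.float()
    rms = torch.rsqrt(x_f.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x_f * rms).to(x.dtype) * weight
```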

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/albanD
2025-01-29 01:01:44 +00:00
ae0f305bf9 [inductor] Make triton kernel autotune config defaults backward-compatible (#145494)
If a model was torch.packaged using triton<=3.1, any user-defined
autotuned kernels will have reps/warmups burned in with the old defaults
(100/25).  If this model is loaded with triton>=3.2, inductor's checks for
unsupported non-default autotune args will fail, because triton.Autotuner's
defaults for these parameters has changed to `None`.  Let's explicitly support
those values for backward compatibility with these older models.

Differential Revision: [D68561014](https://our.internmc.facebook.com/intern/diff/D68561014/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145494
Approved by: https://github.com/aorenste
2025-01-29 00:31:39 +00:00
9036a22c83 [Inductor][Triton] Change propagated dtype for fp16/bf16 unwrapped 0d tensors (#145613)
Fixes TestInductorOpInfoCPU.test_comprehensive_max_binary_cpu_float16 and related tests for Triton CPU. TestInductorOpInfoCPU is currently not run in the CI. See https://github.com/pytorch/pytorch/pull/144389#issuecomment-2608050755 for some additional context.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145613
Approved by: https://github.com/davidberard98, https://github.com/eellison, https://github.com/jansel
2025-01-29 00:23:44 +00:00
2f24f2eb46 Make sure to evaluate annotation strings in the context of where the prototype was created (#145667)
This was incorrectly evaluating the annotation in the context of infer_schema; make sure to evaluate annotation strings in the context where the prototype was created instead.
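A small illustration of the underlying Python behavior (generic, not the infer_schema code itself): string annotations must be resolved against the globals of the module where the function was defined, which is what `typing.get_type_hints` does.

```python
import typing

def infer_annotations(fn):
    # get_type_hints evaluates string annotations in fn.__globals__
    # (the namespace where fn was defined), not in the caller's namespace.
    return typing.get_type_hints(fn)

# "MyType" only resolves correctly in the defining module's globals.
class MyType: ...

def f(x: "MyType") -> "MyType":
    return x

print(infer_annotations(f))  # {'x': <class '__main__.MyType'>, 'return': ...}
```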

Fixes #145481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145667
Approved by: https://github.com/zou3519
2025-01-29 00:14:45 +00:00
82859f6185 [associative_scan] scan dim handling in user-facing associative_scan() (#139864)
This PR implements the user-facing dim change, i.e., that the scan dim provided by the user is always moved to dim 0 and then the associative_scan operation always operates on dim 0.
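A tiny sketch of the dim normalization (illustrative only; `scan_dim0` here is a naive reference loop, not the actual higher-order op):

```python
import torch

def scan_dim0(combine_fn, x: torch.Tensor) -> torch.Tensor:
    # Naive reference scan over dim 0.
    out = [x[0]]
    for i in range(1, x.shape[0]):
        out.append(combine_fn(out[-1], x[i]))
    return torch.stack(out)

def scan_along_dim(combine_fn, x: torch.Tensor, dim: int) -> torch.Tensor:
    # Move the user-provided scan dim to dim 0, scan there, then move it back.
    x0 = torch.movedim(x, dim, 0)
    return torch.movedim(scan_dim0(combine_fn, x0), 0, dim)

# e.g. a cumulative sum along dim=1:
y = scan_along_dim(torch.add, torch.ones(2, 5), dim=1)
```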

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139864
Approved by: https://github.com/ydwu4
2025-01-28 23:58:10 +00:00
7ca156f0ee partitioner: avoid inserting duplicates into heap (#145082)
Fixes https://github.com/pytorch/pytorch/issues/145081

This looks like it was a source of quadratic compile times in the torchtitan CP graphs. There's some code in the partitioner that iteratively adds users of a node to a heap and pops the earliest user. If you have long parallel chains of fusible ops that all eventually feed into some shared ops, then this can result in the following (a minimal dedup sketch appears after the list):
(1) a node getting added to the heap many times
(2) each time we pop that node, we add (duplicates of) each of that node's users to the heap
(3) repeat with each user
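A minimal sketch of the dedup idea (generic Python, not the partitioner's actual data structures): track what has already been pushed so the same node never sits in the heap more than once.

```python
import heapq
import itertools

def pop_users_in_order(start_nodes, users_of, priority):
    """Visit nodes and their users in priority order without pushing duplicates."""
    heap, seen, tie = [], set(), itertools.count()

    def push(node):
        if node not in seen:              # skip nodes already queued or visited
            seen.add(node)
            heapq.heappush(heap, (priority(node), next(tie), node))

    for n in start_nodes:
        push(n)

    order = []
    while heap:
        _, _, node = heapq.heappop(heap)
        order.append(node)
        for user in users_of(node):
            push(user)
    return order
```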

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145082
Approved by: https://github.com/xmfan
2025-01-28 23:44:45 +00:00
02dd7a7803 Extend abi-stable nitpick message to all the c stable files (#145862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145862
Approved by: https://github.com/ezyang
2025-01-28 23:22:23 +00:00
049f042e52 Update build_wheel.sh 2025-01-28 15:14:41 -08:00
eea7d395e5 [CD] Install ninja and setuptools from PyPI (#145871)
Rather than Conda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145871
Approved by: https://github.com/Skylion007
ghstack dependencies: #145870
2025-01-28 23:09:38 +00:00
c26bb9ba5b [CMake] Find HomeBrew OpenMP on MacOS (#145870)
Either via `OMP_PREFIX` envvar or just searching in that folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145870
Approved by: https://github.com/Skylion007
2025-01-28 23:09:37 +00:00
f388ba5986 Update CUDNN frontend submodule to 1.10.0 (#145780)
Update to cuDNN frontend 1.10. Most of this release is about supporting some new APIs needed for Blackwell integration, plus new features in the corresponding cuDNN version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145780
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/malfet
2025-01-28 22:54:24 +00:00
af43b445a5 [ONNX] Set USE_EXPERIMENTAL_LOGIC to True (#137296)
This sets dynamo_export to use the new export logic. The legacy dynamo export logic will be removed as a follow up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137296
Approved by: https://github.com/titaiwangms
2025-01-28 22:35:11 +00:00
5aa5a5763e [inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684)
Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by:

1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format.
2. Using that function to explicitly disable TF32 generation when calling Triton, where needed.

To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix.
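A rough sketch of the kind of capability check involved, using the long-standing `torch.cuda.get_device_capability` API (the helper name here is illustrative, not necessarily the function added by this PR):

```python
import torch

def device_supports_tf32(device=None) -> bool:
    # TF32 tensor cores were introduced with Ampere (compute capability 8.0).
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability(device)
    return major >= 8

# e.g. only allow TF32 math generation when the hardware supports it:
allow_tf32 = device_supports_tf32() and torch.backends.cuda.matmul.allow_tf32
```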

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684
Approved by: https://github.com/eqy
2025-01-28 22:01:08 +00:00
1ffed44b42 [aotinductor] update unbacked symint runtime assertion msg (#145569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145569
Approved by: https://github.com/chenyang78
2025-01-28 21:42:58 +00:00
a06a18b1bb [ATen] Implement exception handling for hipsolver APIs (#145839)
Summary: TSA

Test Plan: CI

Differential Revision: D68741194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145839
Approved by: https://github.com/Mellonta
2025-01-28 21:37:23 +00:00
9003d81144 change the test wheel to release wheel when release wheel available (#145252)
Change the test wheel to the release wheel when the release wheel is available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145252
Approved by: https://github.com/seemethere, https://github.com/atalman

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-28 21:23:53 +00:00
4f949f282d [c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855)
While working on PGNCCL I found that the code triggers some lint warnings, so this PR addresses them or adds lint suppressors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2025-01-28 21:19:49 +00:00
6bcb545d9c [CI][CUDA][cuSPARSELt] cusparselt 0.6.3 and cu121 related cleanups (#145793)
Make the CI cuSPARSELt installation consistent with the nightly binary.
Remove cu121-related docker build jobs and inductor runs. Update test failures relating to cu121.

Retry of https://github.com/pytorch/pytorch/pull/145696
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145793
Approved by: https://github.com/eqy, https://github.com/tinglvv
2025-01-28 21:01:58 +00:00
ccc2878c97 Fix fractional_max_pool lowering in inductor (#144395)
Fixes https://github.com/pytorch/pytorch/issues/141538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-28 21:00:18 +00:00
ef28df5c9e [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)
Reland of #137843 , after checking the code again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-28 20:51:49 +00:00
3481c2aec4 Revert "[dynamo] save/restore system random state more carefully (#145750)"
This reverts commit e3d3f2b22e4b75c64eaa2f940a2dd80c1e43435c.

Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))
2025-01-28 20:51:07 +00:00
28982ceb3b [aarch64] Rebuild everything with ArmPL (#145742)
Summary: Rebuild everything that used OpenBLAS with ArmPL

Test Plan: CI, prod test

Reviewed By: Nicoshev

Differential Revision: D68219559

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145742
Approved by: https://github.com/malfet
2025-01-28 20:48:42 +00:00
edf266e9bb inductor.config.descriptive_names = False is not actually supported (#145523)
Summary:
This config is not supported (it throws an error when set), and doesn't really make sense imo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145523
Approved by: https://github.com/eellison
2025-01-28 20:22:23 +00:00
515e55e692 Set -DPy_LIMITED_API flag for py_limited_api=True extensions (#145764)
This could be BC-breaking, because there was a period of time when we accepted py_limited_api=True but did not enforce the flag; now that we will start enforcing the flag, people's custom extensions may fail to build.

This is still strictly better behavior, as it is sketchy to claim CPython agnosticism without the flag, but calling this out as a potential source of complaints. Ways to mitigate this risk, plus reasons this may not be too big a deal (a generic setuptools sketch follows this list):
- People haven't known much about py_limited_api for extensions, due to the lack of Python docs, so usage is low right now.
- My current tutorial is set up to make new users of py_limited_api pass this flag, so it'd be a no-op for them.
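For context, a generic setuptools sketch of what building against the limited API looks like (plain CPython/setuptools usage, not PyTorch's cpp_extension machinery; the project and file names, and the 3.9 ABI version, are assumptions):

```python
from setuptools import Extension, setup

ext = Extension(
    "mymod._C",
    sources=["mymod/csrc/module.c"],
    # Compile against the stable ABI for CPython >= 3.9.
    define_macros=[("Py_LIMITED_API", "0x03090000")],
    py_limited_api=True,
)

setup(
    name="mymod",
    ext_modules=[ext],
    # Tag the resulting wheel as abi3 for cp39+.
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```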

Test plan:
* Locally I'm confident, as I tried rebuilding ao with this change and it reliably failed (because importing torch/extension.h is a no-no).
* Unit-test-wise, the normal python_agnostic test I added should work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145764
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2025-01-28 20:11:05 +00:00
8d91bfd965 [BE] Include CheckFunctionExists in FindBLAS.cmake (#145849)
It's used in the script, so it must be included
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145849
Approved by: https://github.com/Skylion007
2025-01-28 19:47:05 +00:00
eaff13275e [dynamo] Properly branch on an unspecialized NN module (#145786)
User defined NN module might have their own `__len__` or `__bool__`
methods which Dynamo needs to trace through, so that side effects and/or
reads to buffered writes are properly handled.

This patch removes the special `UnspecializedNNModuleVariable` branch in
Dynamo's branch handling, and lets these cases fall into the
`UserDefinedObjectVariable` branch, which handles the aforementioned
cases correctly.

Fixes #145284.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145786
Approved by: https://github.com/williamwen42
2025-01-28 19:45:17 +00:00
d9ffa5da65 Log info for AOTAutogradCache bypasses instead of warning (#145768)
Fixes #145767

FxGraphCache also logs to info instead of warning so lets do that

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145768
Approved by: https://github.com/eellison, https://github.com/bdhirsh
2025-01-28 19:25:36 +00:00
6c09954a9e Windows builds with VS2022 (#145319)
Fixes https://github.com/pytorch/pytorch/issues/128835
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145319
Approved by: https://github.com/huydhn
2025-01-28 19:07:24 +00:00
cbc4094298 [draft_export] add LOC for data-dep error logging (#145443)
Summary:
maybe this is too much info, but it's difficult to go through old draft export reports where the stack trace is out of sync with the current codebase. Data-dependent errors now look like:
```
2. Data dependent error.
    When exporting, we were unable to evaluate the value of `u306`.
    This occurred at the following stacktrace:
    File /data/users/pianpwk/fbsource/buck-out/v2/gen/fbcode/78204cab86e8a0fb/sigmoid/inference/ts_migration/__pt2i_readiness_main__/pt2i_readiness_main#link-tree/caffe2/torch/fb/training_toolkit/common/proxy_module_thrift/embedding_bag_proxy.py, lineno 109, in _forward_impl:
         `if offsets[-1] > len(input):`
    As a result, it was specialized to evaluate to `261`, and asserts were inserted into the graph.
    Please add `torch._check(...)` to the original code to assert this data-dependent assumption.
    Please refer to https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit#heading=h.boi2xurpqa0o for more details.
```

This would be even more helpful for reports on torch-packaged models, but that requires some more work on PT2I-specific stack trace processing

Test Plan: .

Differential Revision: D68534017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145443
Approved by: https://github.com/angelayi
2025-01-28 18:55:16 +00:00
c32bafeb0b [ROCm] Bump AOTriton to 0.8.2b (#145508)
We received reports that the AOTriton kernels mishandle the bias pointer, causing NaNs during fine-tuning of the llama3.2-11b vision model. This PR fixes the problem.

Note: this AOTriton 0.8.1b adds head dimension 512 support and thus the binary size increases, but it is considered experimental and will not be enabled right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145508
Approved by: https://github.com/jeffdaily
2025-01-28 18:34:25 +00:00
621604ce46 Maintain multiple configs (#145103)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Previously, we would finalize the config of a triton template after its first fusion. This maintains multiple configs, in case we epilogue fuse, then prologue fuse, and prologue fusion has a new, better config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145103
Approved by: https://github.com/jansel, https://github.com/shunting314
ghstack dependencies: #143408
2025-01-28 18:32:14 +00:00
eaec97ab1f [dynamo] Properly prune dead input cell object (#145781)
This patch models input cell object as "newly created" rather than
"pre-existing" python object (see added documentation for why this
actually captures the semantics more accurately).

This enables the `SideEffects.prune_dead_object_new` algorithm to prune
away writes to input cell objects which are no longer relevant; this
didn't happen prior to this patch because we modelled them as
pre-existing objects, which forces us to codegen their attribute
mutations.

Fixes #145564.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145781
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-28 18:28:13 +00:00
8e258e2ecd Parallelize epilogue/prologue benchmarking (#143408)
When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked.

This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations. And we resolve any pending fusions a node is in prior to attempting to fuse it with any other node.

Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this I added source code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel).

Compilation speedups:

<img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" />

This also should let us be a bit more aggressive with either configs, or benchmarking other fusions which are hard to determine profitability of.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-01-28 18:18:24 +00:00
3fd4691908 [MPS] Add op_math_t (#145808)
Similar to `at::opmath_t` to be used for reduction (and int mms)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145808
Approved by: https://github.com/dcci
2025-01-28 18:03:52 +00:00
5382ab57d7 Move trunk windows builds to CUDA-12.4 (#145844)
Same as : https://github.com/pytorch/pytorch/pull/130446

That should catch build regressions that were previously only detectable during the nightly builds for 12.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145844
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-28 18:00:51 +00:00
56915b093a Fix environment deployment spam (#145823)
With https://github.com/pytorch-labs/pytorch-gha-infra/pull/598 in place, the environment can now be removed.

Fixes https://github.com/pytorch/pytorch/issues/145704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145823
Approved by: https://github.com/clee2000
2025-01-28 17:46:31 +00:00
cfbb27462e Revert "[inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)"
This reverts commit b8087747f5ca7be0d37b1ac85dc0894f6a33e3a3.

Reverted https://github.com/pytorch/pytorch/pull/145373 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145373#issuecomment-2619674197))
2025-01-28 17:46:11 +00:00
dbef2a9bc9 Revert "Remove lexicographical sorting of storage keys in torch.save (#143879)"
This reverts commit 7db0afabaaff17dd37cf846cd786610ebf6aedd3.

Reverted https://github.com/pytorch/pytorch/pull/143879 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68746524 for details ([comment](https://github.com/pytorch/pytorch/pull/143879#issuecomment-2619661492))
2025-01-28 17:40:16 +00:00
097ccd9c39 Move ROCm MI300 jobs to unstable to make CI green (#145790)
This is a temporary change to reduce intermittent test failures. Jobs can be moved back once those machines get better runner isolation.

This also sneaks in a small fix so that all of the rocm jobs' build steps run on Linux Foundation runners (the get-label-type dependency). The inductor-rocm-mi300 workflow already had it, but it was missing in the rocm-mi300 workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145790
Approved by: https://github.com/yangw-dev
2025-01-28 17:25:15 +00:00
7eb51e5464 Ensure GPU isolation for kubernetes pod MI300 runners. (#145829)
Fixes the reason behind moving the tests to unstable initially. (https://github.com/pytorch/pytorch/pull/145790)
We ensure GPU isolation for each pod within Kubernetes by propagating the drivers selected for the pod from the Kubernetes layer up to the docker run in PyTorch here.
Now we stick with the GPUs assigned to the pod in the first place, and there is no overlap between the test runners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145829
Approved by: https://github.com/jeffdaily
2025-01-28 17:20:46 +00:00
cyy
c751541e79 Fix cppcoreguidelines-init-variables ignorance (#141795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141795
Approved by: https://github.com/albanD
2025-01-28 17:11:37 +00:00
ac87388e61 [AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639)
Summary: Remove #ifdef USE_CUDA macros through some refactor

Test Plan: Refactor code, existing tests.

Differential Revision: D68636743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639
Approved by: https://github.com/desertfire
2025-01-28 16:46:00 +00:00
6967ef1b07 [ROCm] fix test_cublas_workspace_explicit_allocation for gfx12 (#145227)
gfx12 passes the condition `torch.cuda.get_device_capability() >= (9, 4)` and uses `default_workspace_size=128MB`, but that is required only for MI300.
Fix the condition to use `("gfx94" in gcn_arch)` instead of `torch.cuda.get_device_properties()` to detect MI300.
Now `default_workspace_size=32MB` is used for gfx12 and the test passes.
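
A minimal sketch of the detection change described above (illustrative; `gcnArchName` is only meaningful on ROCm builds):

```python
import torch

# Illustrative MI300 detection via the GCN arch name instead of the compute capability.
props = torch.cuda.get_device_properties(0)
gcn_arch = getattr(props, "gcnArchName", "")           # e.g. "gfx942..." on MI300
is_mi300 = "gfx94" in gcn_arch                         # previously: get_device_capability() >= (9, 4)
default_workspace_size = (128 if is_mi300 else 32) * 1024 * 1024  # bytes
```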

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145227
Approved by: https://github.com/jeffdaily, https://github.com/eqy
2025-01-28 16:19:27 +00:00
80a0412b76 [dynamo][builtin-skipfiles-cleanup] Remove posixpath (#145828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145828
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753, #145826
2025-01-28 16:14:34 +00:00
6824a4a75d [dynamo][builtin-skipfiles-cleanup] Remove re (#145826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145826
Approved by: https://github.com/zou3519
ghstack dependencies: #145744, #145753
2025-01-28 16:14:34 +00:00
4307e6c008 [dynamo][builtin-skipfile-cleanup] Remove signal (#145753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145753
Approved by: https://github.com/zou3519
ghstack dependencies: #145744
2025-01-28 16:14:23 +00:00
3a56089217 fix unbacked + view incorrectness (#145548)
fix for https://github.com/pytorch/pytorch/issues/143498

We were incorrectly using contiguous strides for a non-contiguous tensor. There are two separate causes:

1. https://github.com/pytorch/pytorch/pull/110520 made it so we turn Views contiguous when they have unbacked symints, because
`dynamic_reshape_indexer below will fail due to the size_hint's inability to process unbacked SymInts`. Seems like something we should fix. Regardless, it will make the input contiguous if the input is unbacked to work around this.

2. We weren't actually making it contiguous! I filed an issue for this here: https://github.com/pytorch/pytorch/issues/145561.

This is still worth landing as a fix, even though we should address those issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145548
Approved by: https://github.com/desertfire
2025-01-28 16:03:45 +00:00
97b3b73f3e [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-01-28 15:21:12 +00:00
a08f7f3266 OpenReg: fix issue of pin_memory (#145046)
Fix issue of `pin_memory` when rewrapping a storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145046
Approved by: https://github.com/albanD
2025-01-28 09:41:04 +00:00
bdf6dfa17d [chore][ez] change alloc buffer size from 4000 to 4096 (#145759)
Summary:
Allocations typically happen as a power of 2 anyway.
Change the default alloc size to 4096 to eke out a bit more perf.

Test:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756, #145757
2025-01-28 09:14:07 +00:00
5c5306e8bc [dynamo][builtin-skiplist-cleanup] Remove weakref (#145744)
WeakKeyDictionary already works very nicely with the UserDefinedObject Variable Tracker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145744
Approved by: https://github.com/jansel
2025-01-28 07:55:12 +00:00
45f64e770a relax assertion to warning for unbacked binding names (#145777)
Summary:
Quick fix following up on https://github.com/pytorch/pytorch/pull/144894 to unblock internal tests.

Will keep investigating a more principled fix.

Test Plan: Failures in T213563826 now pass

Differential Revision: D68731710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145777
Approved by: https://github.com/angelayi
2025-01-28 07:52:40 +00:00
0a8a0ef767 [inductor] Fix crash running wrapper_benchmark with no device (#145644)
Fixes #145434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145644
Approved by: https://github.com/shunting314
2025-01-28 07:31:36 +00:00
a699034eec Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
2025-01-28 07:07:14 +00:00
0f5a68344a [BE][Inductor] Simplify custom_op tests (#145814)
Not sure what the motivation was behind repeating the same function over and over again for different backends.
Change `test_custom_op_[123]` from accepting separate (but identical) implementations for CPU, CUDA and XPU to taking just `fn` and `fn_meta` args.

Test that it is also extendable to MPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145814
Approved by: https://github.com/jansel
2025-01-28 05:58:51 +00:00
23eb0a3201 Improve typing in torch/types.py (#145237)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145237
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2025-01-28 05:29:12 +00:00
8e46d0f595 [BE]: Update typing of OrderedSet ancestor (#145783)
Now that we are on python 3.9 minimum version we can properly use Generics in the superclass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145783
Approved by: https://github.com/eellison
2025-01-28 04:43:49 +00:00
cyy
67fcc7cf02 [3/N] Remove unnecessary once flag usage (#145672)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145672
Approved by: https://github.com/albanD
2025-01-28 04:28:18 +00:00
01a4d86b31 add pt2 callbacks for backward pass and prevent duplicate callbacks (#145732)
Summary: This change adds callbacks for lazy backwards compilation while preventing duplicate callbacks from being fired.

Differential Revision: D68577593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145732
Approved by: https://github.com/mlazos
2025-01-28 03:50:02 +00:00
1a26cdd5cb [cond] remove warning for unsupported tuple returns (#145766)
I guess this is supported now
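
A small illustration of a `torch.cond` call whose branches return a tuple, assuming this is the case the removed warning covered:

```python
import torch

# Illustrative only: both branches return a tuple of tensors.
def true_fn(x):
    return x + 1, x * 2

def false_fn(x):
    return x - 1, x / 2

x = torch.randn(3)
a, b = torch.cond(x.sum() > 0, true_fn, false_fn, (x,))
```
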
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145766
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2025-01-28 03:13:36 +00:00
9010649292 Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)"
This reverts commit db3685a35cdce32622ab89f6c92e09d52210ff53.

Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))
2025-01-28 03:07:17 +00:00
78f02bf07c [bug] handle case when remote peer closes connection (#145757)
Summary:
In the case where the remote peer closes the connection, nread returns 0. In
this case, we still want to free up the allocated buffer.
Also, reorder the if so that the likely success case (nread > 0) is at
the top of the function with an early return.

Test Plan:
unit tests

Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756
2025-01-28 03:06:38 +00:00
4be831ba2d [draft_export] fix dense-in-memory check for inferring fakes (#145653)
Test Plan: fixes check for dense tensors with size-1 dimensions

Differential Revision: D68644028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145653
Approved by: https://github.com/zou3519
2025-01-28 02:52:14 +00:00
7c1fc0a047 Log cache state for AOTAutograd in title of file (#145715)
Differential Revision: [D68692755](https://our.internmc.facebook.com/intern/diff/D68692755/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145715
Approved by: https://github.com/bobrenjc93
2025-01-28 02:14:18 +00:00
78a94c9114 [inductor] Remove type ignores from scheduler.py (#145712)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145712
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
ghstack dependencies: #145692
2025-01-28 01:44:32 +00:00
2df2f9d895 [inductor] Change type of get_backend_features to OrderedSet (#145692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145692
Approved by: https://github.com/yanboliang
2025-01-28 01:44:32 +00:00
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not called (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
e3d3f2b22e [dynamo] save/restore system random state more carefully (#145750)
Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up.

Note: I have verified that the previously failing internal tests now pass internally.

Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750
Approved by: https://github.com/StrongerXi
2025-01-28 01:34:13 +00:00
f16ce3c7e9 Refactor fuzzer and add support for Dynamo (#145565)
## Summary:
Dynamo now works with config fuzzer.

For BE week, we also found and fixed 5 different bugs (in inductor):
- https://github.com/pytorch/pytorch/pull/145426
- https://github.com/pytorch/pytorch/pull/145523
- https://github.com/pytorch/pytorch/pull/145527
- https://github.com/pytorch/pytorch/pull/145532
- https://github.com/pytorch/pytorch/pull/145538

## Test Plan:
New Dynamo Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145565
Approved by: https://github.com/masnesral
2025-01-28 00:44:27 +00:00
6eb74fbec6 Updates NCCL user buffer registration test for NCCL 2.24.3 (#145285)
NCCL 2.24.3 changed the content of the debug output for NVLS registration. We use this debug output in our test suite to check if NVLS was successfully registered or not. Hence we need to specialize for the NCCL version in the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145285
Approved by: https://github.com/kwen2501
2025-01-28 00:24:53 +00:00
5a4d959cdb [dynamo] Properly model torch profiler context objects (#145537)
Prior to this patch, Dynamo conveniently modelled torch profiler context
objects (e.g., `torch.profiler.profile`) as `NullContextVariable`
because `torch.compile` ignores the effect of these profiler contexts.

However, the semantics of these profiler contexts diverge from
`contextlib.nullcontext` in the `__enter__` function, where the former
returns `self` and the latter returns `None`. This causes subtle errors,
as observed in #125021.
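
A quick illustration of the `__enter__` difference described above:

```python
import contextlib
import torch

with contextlib.nullcontext() as a:
    pass
print(a)        # None -- nullcontext.__enter__ returns None

with torch.profiler.profile() as b:
    pass
print(type(b))  # torch.profiler.profile -- __enter__ returns self
```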

This patch adds back a `ProfilerContextVariable`, which addresses the
aforementioned semantic discrepancy.

Fixes #125021.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145537
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-01-28 00:03:36 +00:00
db3685a35c Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies on the previous PR in this stack, where the storage order was changed to be non-lexicographical. A `.format_version` entry was added to the zipfile, and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, the offset of each storage record (other than the 0th storage) will be calculated instead of relying on `miniz` APIs to determine it.

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)
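
A minimal sketch of turning the option on, using the config name from the description above (the checkpoint path is a placeholder):

```python
import torch
from torch.utils.serialization import config as serialization_config

# Compute storage offsets instead of issuing extra random reads per record.
serialization_config.load.calculate_storage_offsets = True
state_dict = torch.load("checkpoint.pt", mmap=True, weights_only=True)  # placeholder path
```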

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-27 23:57:30 +00:00
7db0afabaa Remove lexicographical sorting of storage keys in torch.save (#143879)
Currently the order is lexicographical (i.e. 0, 10, 11, ..., 19, 2, ...) instead of 0, 1, 2, 3, 4, 5 (the order that storage metadata is actually pickled in). Since PyTorch will never be used with Python < 3.7, we can be assured that the keys will be read in insertion order (i.e. numerically sorted).

This makes the order storages are written in the same as the pickling/unpickling order, so we can calculate their offsets with fewer random reads.
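
A tiny illustration of the ordering difference described above:

```python
# Lexicographical (string) order vs. numeric insertion order for storage keys.
keys = [str(i) for i in range(12)]
print(sorted(keys))  # ['0', '1', '10', '11', '2', '3', ...]
print(keys)          # ['0', '1', '2', ..., '10', '11'] -- insertion (numeric) order
```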

Differential Revision: [D67673025](https://our.internmc.facebook.com/intern/diff/D67673025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143879
Approved by: https://github.com/albanD
2025-01-27 23:57:30 +00:00
c1161957a4 inductor_config_logging: Don't drop keys (#144700)
This bit me while I was trying to debug some trace issues.
In general this config is already quite large when dumping, so adding
more fields doesn't make it significantly worse.

Also, a number of the items we are type checking for (except the test
configs) don't even show up. Primarily this will help us when debugging
rocm, halide, and trace configs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144700
Approved by: https://github.com/ezyang
2025-01-27 23:47:25 +00:00
7d01f6e6f2 Add ignorable commits on run_test.py to git blame ignore (#145787)
Chanced upon it while searching through cpp_extension related code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145787
Approved by: https://github.com/malfet
2025-01-27 23:24:48 +00:00
3ce68dc61e [c10d] Flush file in file recorder (#145458)
Summary:
Flushing file to hopefully prevent file corruptions as reported in
https://github.com/pytorch/pytorch/pull/145125

Test Plan:
Couldn't get file corruption to occur in my tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458
Approved by: https://github.com/kwen2501
2025-01-27 23:15:52 +00:00
5534c270db [chore] fix new linter (#145756)
Summary:
Fix new linter that's complaining when I made changes to this file:
class 'LibUVStoreDaemon' defines a non-default destructor but does not
define a copy constructor, a copy assignment operator, a move
constructor or a move assignment operator

Test Plan:
make lint passes

Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756
Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-27 22:48:12 +00:00
2de53b3b65 Revert "pickler for GraphModule (#141659)"
This reverts commit c6ad08357bf8e766b5220bfb5cbbfdb2a4ec0ca5.

Reverted https://github.com/pytorch/pytorch/pull/141659 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, please take a look at D68694181 for more details. ([comment](https://github.com/pytorch/pytorch/pull/141659#issuecomment-2617045120))
2025-01-27 22:39:30 +00:00
006397fac3 Remove FBGEMM sccache hack (#145664)
Testing https://github.com/pytorch/pytorch/actions/runs/12959358756, sccache is working correctly now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145664
Approved by: https://github.com/wdvr
2025-01-27 22:00:06 +00:00
69e82d02d3 [inductor][3/N] triton support post-#5512, tt.divisibility format (#145575)
1. Fix the tt.divisibility format in hints.py. Previously, it was `{((0,), (1,)): [["tt.divisibility", 16]]}`. Now it is `{(0,): [["tt.divisibility", 16]], (1,): [["tt.divisibility", 16]]}`. This was an oversight in the first PR I added. I've verified that we now get `{ tt.divisibility = 16 }` in the generated TTGIR.
2. Update the test_codegen_triton.py test to work with multiple triton versions (and test this divisibility format in the new triton version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145575
Approved by: https://github.com/SamGinzburg
2025-01-27 21:48:58 +00:00
993b229665 [dynamo][dicts] Fix dict.__new__ bug (#145723)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145723
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547, #145558
2025-01-27 21:42:43 +00:00
7e1c7253e9 [dynamo][builtin-skipfile-cleanup] Support tuple.__new__ (#145558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145558
Approved by: https://github.com/jansel, https://github.com/StrongerXi
ghstack dependencies: #145519, #145547
2025-01-27 21:42:43 +00:00
1ba1b7b597 Support remaining *_like factory functions for NJT (#144889)
Fixes #144761

This PR adds NJT impls for those *_like functions that were previously missing:
* `full_like()`
* `rand_like()`
* `randint_like()`

It also fixes a bug in the existing *_like functions when a new device is specified: the fix is to also transfer `offsets` / `lengths` to the new device.
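
A minimal sketch of the newly covered functions and the device-transfer case, assuming a jagged-layout nested tensor (illustrative, not the PR's test code):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)

full = torch.full_like(nt, 1.0)         # newly supported for NJT
rand = torch.rand_like(nt)              # newly supported for NJT
ints = torch.randint_like(nt, high=10)  # newly supported for NJT

# The bug fix: when a new device is requested, offsets/lengths must move too.
if torch.cuda.is_available():
    moved = torch.zeros_like(nt, device="cuda")
    assert moved.offsets().device.type == "cuda"
```
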
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144889
Approved by: https://github.com/soulitzer
2025-01-27 21:33:51 +00:00
3a23d75b37 [MPS] Fix c10::metal::log_gamma correctness on M4 (#145740)
To work around a bug where the `abs` call seems to be ignored before calling log, which can be reproduced by running the following code (submitted as FB16415011):
```swift
import Metal

func run_shader<T: BinaryFloatingPoint> (library: MTLLibrary, kernel_name: String, type: T.Type, nelem: Int = 16) {
  guard let mfunc = library.makeFunction(name: kernel_name) else { fatalError("Can't find function") }
  let device = library.device
  guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") }
  guard let cmdBuffer = queue.makeCommandBuffer() else { fatalError("Can't make command buffer") }
  guard let computeEncoder = cmdBuffer.makeComputeCommandEncoder() else { fatalError("Can't make compute encoder") }
  guard let ibuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let ibuf_data = ibuf.contents().assumingMemoryBound(to: T.self)
  for i in 0..<nelem {
    ibuf_data[i] = T(sin(Float(2 + i)))
  }
  guard let obuf = device.makeBuffer(length:nelem * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") }
  let obuf_data = obuf.contents().assumingMemoryBound(to: T.self)

  computeEncoder.setComputePipelineState(try! device.makeComputePipelineState(function: mfunc))
  computeEncoder.setBuffer(obuf, offset:0, index: 0)
  computeEncoder.setBuffer(ibuf, offset:0, index: 1)
  computeEncoder.dispatchThreads(MTLSizeMake(nelem, 1, 1), threadsPerThreadgroup:MTLSizeMake(nelem, 1, 1))
  computeEncoder.endEncoding()
  cmdBuffer.commit()
  cmdBuffer.waitUntilCompleted()

  print("Results for \(String(describing: T.self)):", terminator: " ")
  for i in 0..<nelem {
    print(obuf_data[i], terminator: " ")
  }
  print()
}

let shader_source = """
#include <metal_stdlib>

template<typename T>
float foo(T x) {
  const auto abs_x = ::metal::abs(static_cast<float>(x));
  auto rc = ::metal::log(abs_x);

  return rc - ::metal::log(::metal::abs(abs_x * ::metal::sinpi(abs_x)));
}

kernel void half_kernel(
    device half* out_ptr0,
    constant half* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<half>(out);
}

kernel void float_kernel(
    device float* out_ptr0,
    constant float* in_ptr0,
    uint xindex [[thread_position_in_grid]]
) {
  auto inp = in_ptr0[xindex];
  auto out = foo(inp);
  out_ptr0[xindex] = static_cast<float>(out);
}
"""
let options = MTLCompileOptions()
options.mathMode = .safe
options.mathFloatingPointFunctions = .precise

guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") }
let library = try! device.makeLibrary(source:shader_source, options:options)
run_shader(library:library, kernel_name:"half_kernel", type: Float16.self)
run_shader(library:library, kernel_name:"float_kernel", type: Float.self)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145740
Approved by: https://github.com/dcci
2025-01-27 21:24:22 +00:00
60f98262f1 PEP585: .github (#145707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145707
Approved by: https://github.com/huydhn
2025-01-27 21:21:01 +00:00
bfaf76bfc6 [dynamo] clear out traced frames at the start of test_log_traced_frames (#145640)
The test was being flaky in CI, and this patch fixes it.

Fixes #137461.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145640
Approved by: https://github.com/williamwen42
2025-01-27 20:49:59 +00:00
93dd6bc4d8 Add CUDA 12.8 installation and manylinux-cuda12.8 (#145567)
Breaking https://github.com/pytorch/pytorch/pull/145557 into two parts.
Need to have manylinux-cuda12.8 in order to build magma.

Issue: https://github.com/pytorch/pytorch/issues/145570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145567
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2025-01-27 20:49:07 +00:00
64cd81712d torch.distributions: replace numbers.Number with torch.types.Number. (#145086)
Fixes #144788 (partial)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145086
Approved by: https://github.com/malfet
2025-01-27 20:24:55 +00:00
2f8ad8f4b9 Run inductor perf benchmark on ROCm (#145763)
This requires https://github.com/pytorch/pytorch/pull/144594.  The test run on PT2 dashboard is at https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2020%20Jan%202025%2019%3A46%3A14%20GMT&stopTime=Mon%2C%2027%20Jan%202025%2019%3A46%3A14%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=rocm&lBranch=144594&lCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b&rBranch=144594&rCommit=9f5cb037965aa2990b2e4593610bca92526ebb3b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145763
Approved by: https://github.com/jeffdaily
2025-01-27 20:19:03 +00:00
66631bc84b [dynamo] Fix read/write conflicts in a cuda test (#145658)
Prior to this patch, `test_cuda_event_created_outside_of_graph` was
flaky in CI because we read and write the same `foo` tensor buffer from
2 different streams. This patch eliminates that by adding a
synchronization that waits for the read to finish before starting the
write.

Fixes #133837, #133828.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145658
Approved by: https://github.com/yifuwang
2025-01-27 19:55:57 +00:00
c986eba560 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit abf28982a8cb43342e7669d859de9543fd804cc9.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See  D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))
2025-01-27 19:38:26 +00:00
9728e900dc [Inductor][CPP] fix torch logit decomposition (#145576)
**Summary**

Fix issue https://github.com/pytorch/pytorch/issues/145379: the current decomposition uses `self = torch.clamp(self, lo, hi)`, which gives a wrong result when `lo` is larger than `hi` compared to the eager implementation: cd68d54911/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp (L165)
Align their behavior in this PR.
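
A small eager-mode illustration of the edge case (the eager output is the reference the decomposition is aligned to):

```python
import torch

x = torch.rand(4)
# eps=0.8 clamps the input to [0.8, 1 - 0.8] = [0.8, 0.2], i.e. lo > hi.
print(torch.logit(x, eps=0.8))
# Before this fix, the decomposition's torch.clamp(x, 0.8, 0.2) could disagree
# with this eager result when compiled.
```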

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_torch_logit
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145576
Approved by: https://github.com/jgong5, https://github.com/eellison
2025-01-27 19:37:51 +00:00
635b98fa08 Add nitpick warning that aoti_torch/c/shim.h is ABI stable (#145745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145745
Approved by: https://github.com/albanD
2025-01-27 19:25:37 +00:00
bc377c503e [Custom Ops] Fix f-strings in custom ops error message (#145673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145673
Approved by: https://github.com/zou3519
ghstack dependencies: #145588
2025-01-27 19:22:43 +00:00
ec91b7720f [Custom Ops] Add a new API to allow users to register an autocast for the custom op (#145588)
Fixes #137033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145588
Approved by: https://github.com/zou3519
2025-01-27 19:22:43 +00:00
f951d216e0 [autocast][pytorch] Support autocast for MTIA (policy) (#145666)
Summary: Add autocast support for MTIA (policy)

Reviewed By: egienvalue

Differential Revision: D68604796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145666
Approved by: https://github.com/chaos5958
2025-01-27 18:26:04 +00:00
1835e1eb98 [BE] Remove test_ops from FIXME_inductor_dont_reset_dynamo (#145307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145307
Approved by: https://github.com/zou3519, https://github.com/FindHao
2025-01-27 18:12:39 +00:00
835e770bad Use typing.IO[bytes] instead of io.BytesIO in annotations (#144994)
Fixes #144976

Using approach ① `IO[bytes]`, but could also try with a protocol.

## Notes:

- moved `torch.serialization.FILE_LIKE` to `torch.types.FileLike`
- Use `FileLike` annotation where it makes sense
- made sure those functions also support `os.PathLike`
- Replaced `isinstance(x, io.BytesIO)` with `isinstance(x, (io.IOBase, IO))` where appropriate.
- Replaced `BinaryIO` with `IO[bytes]` (the two ABCs are almost identical, the only difference is that `BinaryIO` allows `bytearray` input to `write`, whereas `IO[bytes]` only `bytes`)
- needed to make `torch.serialization._opener` generic to avoid LSP violations.
- skipped `torch/onnx/verification` for now (functions use `BytesIO.getvalue` which is not part of the `IO[bytes]` ABC, but it kind of seems that this is redundant, as e.g. `onnx.load` supports `str | PathLike[str] | IO[bytes]` directly...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144994
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-01-27 18:08:07 +00:00
abf28982a8 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-27 18:05:23 +00:00
30dea8429d [MPS][BE] Use convenience methods to set args (#145736)
It's better to call `mtl_setArgs` rather than setting arguments one by one, with the risk of making a typo

Also, all interactions with MTLCommandBuffer must be serialized, which is commonly done using dispatch queues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145736
Approved by: https://github.com/Skylion007
2025-01-27 17:42:01 +00:00
7db20ffd68 Remove public_allowlist from TestPublicBindings.test_correct_module_names and ensure private_allowlist-ed things are actually private (#145620)
This passes locally, also sanity checked importing these modules on [colab](https://colab.research.google.com/drive/1edynWX1mlQNZIBxtb3g81_ZeTpAqWi19?usp=sharing)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145620
Approved by: https://github.com/albanD
2025-01-27 17:30:02 +00:00
5d01a2874f Increase the number of perf benchmark shards (#145534)
Per the discussion on https://github.com/pytorch/pytorch/issues/140332#issuecomment-2610805551, this adds 2 more shards for HF, 2 more for TorchBench, and 1 more for TIMM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145534
Approved by: https://github.com/jeanschmidt
2025-01-27 16:20:42 +00:00
639dd54ef7 [BE] Use copy_method to import all tests (#145718)
Fewer chances of a typo when doing the imports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145718
Approved by: https://github.com/dcci
2025-01-27 16:01:12 +00:00
2e80093306 setitem node shouldn't be deadcode eliminated (#145714)
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/145697. `operator.setitem` was being eliminated as dead code, causing a correctness issue. Mark it as impure in this PR to avoid this.
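
A minimal FX sketch of why `operator.setitem` must survive dead-code elimination (illustrative, not the PR's test):

```python
import operator
import torch
from torch import fx

# Build a tiny graph by hand; the setitem node exists only for its side effect.
graph = fx.Graph()
x = graph.placeholder("x")
buf = graph.call_function(torch.zeros, (3,))
graph.call_function(operator.setitem, (buf, 0, x))  # result unused -> must be kept as impure
graph.output(buf)

graph.eliminate_dead_code()  # with this fix, the setitem node is not pruned
gm = fx.GraphModule(torch.nn.Module(), graph)
print(gm.code)
```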

**TestPlan**
```
python -u -m pytest -s -v test/fx/test_dce_pass.py -k test_keep_setitem
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145714
Approved by: https://github.com/ezyang
2025-01-27 15:08:21 +00:00
0674ab7e33 solve apl dependency issue (#145215)
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries.

When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls.

However, the current APL implementation uses the libomp.dll (LLVM) variant.

As a result, there are unexpected behaviors at runtime.

---

For Example:

```python
import torch

# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
#  [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])

sparse_tensor = torch.sparse_coo_tensor(indices, values, size)

# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()

# Expected Output (Dense Tensor):
# [[0, 1],
#  [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```

However, it prints unexpected outputs such as:

```python
# [[0, 11],
#  [10, 0]]
```

The issue arises because the following code does not function as expected at runtime:

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30

```c++
// returns 1 , however since OpenMP is enabled it should return total number of threads
int64_t num_threads = omp_get_num_threads();
```

---

At runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) causes unexpected behaviour.

So we've changed the libraries from the `_mp` to the non-`_mp` versions, and we use `vcomp` for OpenMP calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-27 13:02:16 +00:00
7b6029dcc2 Update slow tests (#145206)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145206
Approved by: https://github.com/pytorchbot
2025-01-27 11:40:39 +00:00
e6c1e6e20e simplify torch.utils.cpp_extension.include_paths; use it in cpp_builder (#145480)
While working on conda-forge integration, I needed to look at the way the include paths are calculated, and noticed an avoidable duplication between `torch/utils/cpp_extension.py` and `torch/_inductor/cpp_builder.py`. The latter already imports the former anyway, so simply reuse the same function.

Furthermore, remove long-obsolete include-paths. AFAICT, the `/TH` headers have not existed since pytorch 1.11.
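
For reference, a quick illustration of the shared helper that is now reused (illustrative):

```python
from torch.utils.cpp_extension import include_paths

# Single source of truth for extension include directories,
# now also consumed by torch._inductor.cpp_builder.
print(include_paths())
```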

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145480
Approved by: https://github.com/ezyang
2025-01-27 07:19:42 +00:00
e90cf4abcf [inductor] Add some typing to common.py (#145691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145691
Approved by: https://github.com/malfet
ghstack dependencies: #145690
2025-01-27 06:27:13 +00:00
ddae87f792 [inductor] Add some typing to simd.py (#145690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145690
Approved by: https://github.com/malfet
2025-01-27 06:27:13 +00:00
71caac2b30 [MPSInductor] Add rand support (#145705)
Using Philox4 as PRNG

Test plan (other than CI)
Run
```python
import torch
from torch._inductor.utils import run_and_get_code
from contextlib import nullcontext

def foo(x):
   return x * torch.randn_like(x)

foo_c = torch.compile(foo)

x = torch.ones(100, 100, device="mps")

y = foo_c(x)

print(y.mean().item(), y.std().item())
for i in range(25):
  print(y[i].mean(), y[i].std())
```
And observe that printed values are close to 0 and 1

TODO: Better `randint` algorithm for large ranges

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145705
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-27 06:07:36 +00:00
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
87fdadde1d Remove FFT from stride incorrect ops (#145080)
I gotta say, the FFT implementation is completely insane, there's gotta be a better way to do this than repeatedly inplace restriding the output tensor. Anyway, this is a faithful translation of both the MKL and cuFFT paths to Python.

Fixes https://github.com/pytorch/pytorch/issues/135087

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145080
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #145530
2025-01-27 04:26:04 +00:00
b75afa2e2e [MPS] cholesky implementation (#145701)
Requested in #77764

Closed #144193  due to a lot of conflicts when rebasing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145701
Approved by: https://github.com/malfet
2025-01-27 01:53:03 +00:00
c6ad08357b pickler for GraphModule (#141659)
Pickling GraphModule needs some special handling for wrapping things that normally can't be pickled - but async compile needs to pass them across a wire so we need to be able to serialize it - add some helpers to enable that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141659
Approved by: https://github.com/jamesjwu
2025-01-26 19:29:13 +00:00
f3ddc08ddc Additional operators in operator benchmark (#145625)
The list of added operators:
add_, addcmul, arange, baddbmm…, bmm, clamp, div, div_, gelu, index_add, logical_and, mul_, sub_, topk, where

This pull request is the same as a previous one: https://github.com/pytorch/pytorch/pull/145121 which inadvertently got deleted while merging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145625
Approved by: https://github.com/jeffdaily
2025-01-26 19:20:02 +00:00
6a4fb4b615 Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit cb814c0b961369a7ab154c58856c730cafaa2307.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/malfet due to It broke ROCM tests again, see 5cd2b34e82/1 ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2614523822))
2025-01-26 17:49:05 +00:00
5cd2b34e82 [inductor] Adjust test_log_fp64 to only run when float64 is supported. (#145686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145686
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-26 15:58:19 +00:00
ed015143ef Set RUNPATH on CUDA and XPU tests (#144305)
#136627 has almost fixed the issue that test binaries' runpath has not been set correctly, with few cases left.

This PR fixes the rest.

The binaries are found by `auditwheel repair` a wheel built with `BUILD_TEST=1`.

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144305
Approved by: https://github.com/malfet
2025-01-26 08:40:22 +00:00
c4523999a1 Fix incorrect type comparison (#145449)
Summary: This change was incorrectly made as part of #145166

Differential Revision: D68536221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145449
Approved by: https://github.com/bobrenjc93
2025-01-26 04:40:26 +00:00
09ae69a364 Revert "Fix type annotation of Linear.bias (#142326)"
This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a.

Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see 73622fc5fa/1 ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))
2025-01-26 03:41:00 +00:00
73622fc5fa Fix Throughputbenchmark issue (#144669)
Fixes [144461](https://github.com/pytorch/pytorch/issues/144461)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669
Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel
2025-01-26 03:37:20 +00:00
cb814c0b96 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-26 01:56:40 +00:00
90448f0128 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-26 01:07:22 +00:00
76bec878da Remove unnecessary HPUHooksInterface method (#145272)
getDefaultHPUGenerator is no longer necessary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145272
Approved by: https://github.com/ezyang
2025-01-26 01:06:34 +00:00
3cf7874ebe [MPS][BE] Implement bilineard2d as shader (#145581)
This significantly improves performance and addresses a correctness problem (to the extent permitted by reducing the precision of the scale-factor computation to float32). The uint8 scaling algorithm mimics the CPU/Pillow implementation
569b785371/src/libImaging/Resample.c (L306-L309)
i.e. using fixed-precision integral arithmetic and rounding the results of horizontal interpolation back to integers before performing the vertical one, which is technically less accurate.

But even with those changes, `atol`/`rtol` must be tweaked to `1, 0` when the scale factor is `1/3` or `2/3` because of the difference in representation of those values as floats vs. doubles.

Changes in the performance could be measured using the following script
```python
import torch
import time
import subprocess

def benchmark(device, dtype):
    # Create example inputs
    x = torch.testing.make_tensor(1, 1, 2048, 2048, device=device, dtype=dtype)
    sf = .5

    # Check output
    y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    z = torch.nn.functional.interpolate(x.cpu(), scale_factor=sf, mode="bilinear")
    outputs_match = torch.allclose(y.cpu(), z)
    if not outputs_match:
       atol = (y.cpu() - z).abs().max()
       rtol = ((y.cpu() - z)[z!=0]/z[z!=0]).abs().max()
       print(f"atol={atol} rtol={rtol}")

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        y = torch.nn.functional.interpolate(x, scale_factor=sf, mode="bilinear")
    torch.mps.synchronize()
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return "True " if outputs_match else "False", average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16, torch.uint8]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

brand_string = subprocess.check_output(['sysctl', '-n', 'machdep.cpu.brand_string']).decode("utf-8").strip()
print(f"\nBenchmarking Results (collected on {brand_string}):")
print("-"*40)
print("Device            :                MPS                 |               CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8")
print(f"Outputs Match     :  ", " |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))
```

Benchmark results before
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  False |  True  |  True  |  True  |  True
Average Time (us) :  277.3  | 197.2  | 188.0  | 163.5  | 302.8  | 248.1  | 308.7  | 650.9
```
After(almost **100x** perf gain):
```
Benchmarking Results (collected on Apple M4 Pro):
----------------------------------------
Device            :                MPS                 |               CPU
Dtype             :   FP32  |  FP16  |  BF16  |   U8   |  FP32  |  FP16  |  BF16  |  U8
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    1.7  |   1.5  |   1.7  |   1.5  | 296.5  | 236.0  | 310.8  | 642.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145581
Approved by: https://github.com/Skylion007
ghstack dependencies: #145578
2025-01-25 21:09:46 +00:00
0afdee4c39 [dynamo] raise IndexError when inserting into a full deque (#139379)
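
A plain-Python illustration of the behavior Dynamo now matches:

```python
from collections import deque

d = deque(maxlen=2)
d.extend([1, 2])
d.insert(0, 3)  # raises IndexError: deque already at its maximum size
```
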
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139379
Approved by: https://github.com/jansel
2025-01-25 18:04:49 +00:00
513f889a36 [Rocm][Inductor][CK] silence ck package not installed warning when CK backend is not used to autotune bmm (#145626)
As titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145626
Approved by: https://github.com/coconutruben
2025-01-25 08:44:35 +00:00
c5216d2b6c [ca] add test_reset for 2.6 release validation (#145549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145549
Approved by: https://github.com/atalman
2025-01-25 06:28:58 +00:00
bbe7f53218 Save integral tensor data for ET (#144508)
Summary:
et_replay uses random data to run operators; however, operators that use an index tensor to access memory won't work with random data. This usually ran into two exceptions: 1. illegal memory access, since the index is out of range; this has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE, which records the min/max value of index tensors. 2. unaligned memory access; FBGEMM ops have special requirements for the memory layout.

To fix the second exception, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added to allow the user to specify node names, separated by commas, so that ET will save the integral tensor data for those nodes. The saved data will be used in et_replay.

Be careful when turning on this option, since it will use more space to save the extra data.
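
A minimal sketch of opting specific nodes in via the environment variable named above (the node names here are hypothetical placeholders):

```python
import os

# Hypothetical node names; use the nodes you actually need from your trace.
os.environ["ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR"] = (
    "aten::embedding_bag,aten::index_select"
)
```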

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda

Differential Revision: D67989856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508
Approved by: https://github.com/briancoutinho
2025-01-25 05:38:10 +00:00
3d506491b9 [inductor] Fix duplicate detection in _dynamic_scale_rblock (#145577)
Before this the code was doing nothing because Config doesn't define `__hash__` or `__eq__` (so it was based on object id).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145577
Approved by: https://github.com/shunting314
ghstack dependencies: #142026
2025-01-25 04:58:54 +00:00
9007eb5f8e [inductor] Kernel memory analysis for use in heuristics (#142026)
This computes statistics about each kernel's memory usage that should allow us to write more precise heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142026
Approved by: https://github.com/eellison
2025-01-25 04:58:54 +00:00
cc1ecead07 [Dynamo] Allow format() to handle int (#144956)
Fixes #144830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144956
Approved by: https://github.com/jansel
2025-01-25 04:12:45 +00:00
b2a0feac85 Update OSS nested tensor docs to focus on NJT (#145402)
Updated nested tensor docs to be NJT-centric (instead of NST-centric). They now include:
* High-level description of NST vs. NJT + a recommendation to use NJT
* General NJT construction / usage
* torch.compile() integration w/ dynamic shapes
* Common errors and how to fix them
* Contribution guide
* Data layout / shape information (with diagram)
* Links to more extensive tutorials involving Transformers / SDPA / FlexAttention

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145402
Approved by: https://github.com/soulitzer
2025-01-25 04:08:19 +00:00
392dc177a9 OpenReg: Refactor impl_registry (#145465)
Refactor impl_registry to use `driver.exec` as fallback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145465
Approved by: https://github.com/albanD
2025-01-25 03:31:49 +00:00
6939a56e13 [autocast][pytorch] Support autocast for MTIA (#145627)
Summary: Add autocast support to MTIA

Reviewed By: egienvalue

Differential Revision: D68572548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145627
Approved by: https://github.com/egienvalue
2025-01-25 03:24:59 +00:00
ef60de07a0 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145509
2025-01-25 03:01:18 +00:00
42b8e233d9 serde unbacked bindings (#144894)
Adds unbacked bindings during deserialization. These are carried by a node's metadata, and map pending fresh unbacked symbols to paths to such symbols inside the corresponding example value carried by the node's metadata.

Since it is awkward to serialize paths, we only serialize the names of these symbols and reconstruct the paths on deserialization, using a shape env util. We also need to bump counters for unbacked symbols here, because the shape env util we use to create these symbols (when deserializing example values) doesn't do so, and not doing so makes later passes (like `run_decompositions`) crash because new unbacked symbols don't get new names.

This is enough for non-strict. For strict, the unbacked bindings and example values in node metadata can get out of sync, because of running AOTAutograd as an additional step after Dynamo. So we have to sync those back.

Differential Revision: [D68232274](https://our.internmc.facebook.com/intern/diff/D68232274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144894
Approved by: https://github.com/pianpwk
2025-01-25 02:34:27 +00:00
5725462cd8 Update NJT linear_backward to return non-aliased tensor bias grad (#145399)
Fixes https://github.com/pytorch/pytorch/issues/141292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145399
Approved by: https://github.com/jbschlosser
ghstack dependencies: #145520, #145531, #145533
2025-01-25 00:58:04 +00:00
3a3e2cf90a Remove det_singular OpInfo (#145533)
Fixes https://github.com/pytorch/pytorch/issues/93045 https://github.com/pytorch/pytorch/issues/93044

From previous discussion https://github.com/pytorch/pytorch/issues/93045#issuecomment-1477674083 the resolution is that we're okay with removing this.

Some older attempts:
- https://github.com/pytorch/pytorch/pull/102581
- https://github.com/pytorch/pytorch/pull/109249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145533
Approved by: https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #145520, #145531
2025-01-25 00:58:03 +00:00
c7ca1df37e Disable slow gradcheck for nn.Transformer ModuleInfo (#145531)
Fixes https://github.com/pytorch/pytorch/issues/117140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145531
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #145520
2025-01-25 00:58:03 +00:00
9e0ee152e5 Fix allow_mutation_on_saved_tensors for inplace foreach (#145520)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145520
Approved by: https://github.com/albanD
2025-01-25 00:58:03 +00:00
clr
b4fe3c159d inductor: Explicitly test that torch.compile(option=...) does something (#145321)
This would have prevented https://github.com/pytorch/pytorch/pull/139833 from dropping the triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145321
Approved by: https://github.com/jansel
2025-01-25 00:48:26 +00:00
efebec5ef5 [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr, https://github.com/albanD
ghstack dependencies: #145528
2025-01-25 00:14:07 +00:00
f2ad2cdf1c [utils] add try_import method for importing optional modules (#145528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145528
Approved by: https://github.com/albanD
2025-01-25 00:14:07 +00:00
f3304571fc [BE][Ez]: FURB148 - remove useless enumerate calls (#145619)
Remove useless enumerate calls
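
For illustration, a minimal hypothetical example of the pattern FURB148 flags (not code from this PR):

```python
items = ["conv", "relu", "pool"]

# flagged: enumerate() is called but the index is thrown away
for _, item in enumerate(items):
    print(item)

# preferred: iterate over the sequence directly
for item in items:
    print(item)
```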

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145619
Approved by: https://github.com/drisspg
2025-01-24 23:37:15 +00:00
0741963e01 [CI][CUDA][Blackwell] sm_\d\d no longer matches sm_100. (#145641)
Therefore, make the pattern sm_\d+.
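
A small sketch of the regex difference (illustrative only, not the test's actual code):

```python
import re

# the two-digit pattern no longer covers Blackwell's three-digit arch names
assert re.fullmatch(r"sm_\d\d", "sm_90") is not None
assert re.fullmatch(r"sm_\d\d", "sm_100") is None

# \d+ matches both the old and the new arch names
assert re.fullmatch(r"sm_\d+", "sm_90") is not None
assert re.fullmatch(r"sm_\d+", "sm_100") is not None
```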

Fixes this unit test failure: python test/test_cpp_extensions_jit.py -k TestCppExtensionJIT.test_jit_cuda_archflags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145641
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-24 23:20:22 +00:00
4cc5e880f9 Add accuracy issue support in AOTI Minifier (#145539)
Summary:

Add three more repro levels for AOTI minifier (level 2 already exists). They are the same as the existing dynamo minifier repro levels.

Now AOTI minifier can minify and repro programs that have numerical accuracy issues as well.

1: Dumps the original graph out to repro.py if compilation fails
2: Dumps a minifier_launcher.py if aoti fails.
3: Always dumps a minifier_launcher.py. Good for segfaults.
4: Dumps a minifier_launcher.py if the accuracy fails.

Refactor AOTI minifier unit tests to be cleaner and better reuse the existing minifier testing code. We no longer need to manually patch `{"aot_inductor.dump_aoti_minifier": True}` into each test; this config is generated in the test code.

Differential Revision: D68294638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145539
Approved by: https://github.com/desertfire
2025-01-24 23:07:19 +00:00
5b988ac4fa [Easy] Replace paper description with link to make a concise description. (#145031)
The descriptions on the [Transformer](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), and [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain author and paper details that seem redundant for users who just want to know how to use these modules. Replace them with a link to the paper; users can follow it if they want to learn more.

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490)
![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f)
![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed)

**After**
![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968)
![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744)
![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031
Approved by: https://github.com/mikaylagawarecki
2025-01-24 23:01:02 +00:00
57591edca1 [mps/inductor] Add support for erfinv. (#145643)
After several rounds of refactoring, this seems to be done now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145643
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 22:55:44 +00:00
46e06e1d09 Avoid data-dependent errors in NJT tests via capture_scalar_outputs=True (#144588)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

There are several xfails related to data-dependent errors in torch.compile. This PR sets `torch._dynamo.config.capture_scalar_outputs=True` to avoid these, which tends to exercise unbacked SymInt logic and will require `torch._check()`-related fixes.
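
A minimal sketch of what the flag enables (illustrative, not one of the NJT tests themselves):

```python
import torch

# capture .item() as an unbacked SymInt instead of graph-breaking on it
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def fn(lengths):
    n = lengths.max().item()  # unbacked SymInt under this config
    torch._check(n >= 1)      # torch._check() hints resolve data-dependent guards
    return torch.arange(n)

print(fn(torch.tensor([2, 5, 3])))
```
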
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144588
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586, #144587
2025-01-24 22:45:01 +00:00
81e370fc6b Fix type annotation of Linear.bias (#142326)
Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__` which types it as `Tensor | Module`. This has two issues:

- It hides the fact that `bias` is optional and can be `None`, which in turn can hide actual bugs on the user side.
- It blurs the type due to having `Module` in the union, which can require unnecessary `isinstance(linear.bias, Tensor)` checks on the user side.

This PR types the `bias` attribute explicitly to fix these issues.
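
A minimal user-side sketch of the pattern the explicit `Optional[Tensor]` annotation encourages:

```python
import torch
from torch import nn

layer = nn.Linear(4, 2, bias=False)

# bias is Optional[Tensor], so the None case must be handled explicitly
if layer.bias is not None:
    print(layer.bias.shape)
else:
    print("layer has no bias")
```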

CC @ezyang @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326
Approved by: https://github.com/ezyang
2025-01-24 22:43:52 +00:00
70577d335e [ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602)
This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines.

Special thanks to @Fuzzkatt for co-authoring these changes!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>
2025-01-24 22:27:39 +00:00
5bf5ce0e15 Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478)
Summary:
Since `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler during compile time to loading it dynamically), we change the logic for enabling the profiler to use `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for NCCL Collective Communication Profiler.

For HCCL, we still keep the same logic

Test Plan: See  https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler

Differential Revision: D68515945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478
Approved by: https://github.com/sraikund16
2025-01-24 22:21:09 +00:00
d4171b724e Let tensor_a.new_tensor() be on tensor_a.device by default (#144958)
Fixes #144957
Closes #73838 cc @albanD @ezyang

Currently, `tensor_a.new_tensor()` will return an on-CPU tensor no matter where `tensor_a` is. This differs from the documentation and is a side effect of https://github.com/pytorch/pytorch/pull/41984.

See #144957 for how the current logic breaks dynamo.

This PR restores the documented behavior and adds tests for `new_tensor`.
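
A short sketch of the documented device behavior this PR restores:

```python
import torch

a = torch.ones(3, device="cuda") if torch.cuda.is_available() else torch.ones(3)

b = a.new_tensor([1.0, 2.0])                # inherits a's device by default
assert b.device == a.device

c = a.new_tensor([1.0, 2.0], device="cpu")  # an explicit device still wins
assert c.device.type == "cpu"
```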

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958
Approved by: https://github.com/ezyang
2025-01-24 22:12:31 +00:00
2a70de7e92 [CUDA] Change slim-wheel libraries load order (#145638)
There is no libnvjitlink in  CUDA-11.x , so attempts to load it first will abort the execution and prevent the script from preloading nvrtc

Fixes issues reported in https://github.com/pytorch/pytorch/pull/145614#issuecomment-2613107072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 22:00:56 +00:00
FEI
615bdd9c81 Improve the caching allocator test for raw alloc (#145269)
1. Prevent blocks allocated by torch._C._cuda_cudaCachingAllocator_raw_alloc from affecting torch.cuda.empty_cache() in other unit tests
2. Additionally, test the changes to raw_delete in https://github.com/pytorch/pytorch/pull/131114

@jeffdaily @albanD @houseroad @eqy @aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145269
Approved by: https://github.com/albanD, https://github.com/eqy, https://github.com/jeffdaily
2025-01-24 21:07:17 +00:00
d79c6f4946 Improve torchrun documentation (#144354)
Fixes #142042:
- #142042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144354
Approved by: https://github.com/c-p-i-o, https://github.com/H-Huang
2025-01-24 20:40:05 +00:00
caf60395f4 [torchbench] Increase tolerance for amp only poolformer_m36 (#145375)
https://github.com/pytorch/pytorch/issues/144893

```
python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validation --training --amp --device cuda --backend inductor
```

`--float32` and `--bfloat16` pass the accuracy check.
`--disable-cudagraphs` does not change the result.

The accuracy failure occurs only for `--amp` and gives a `0.048` res_error on a 1-element result tensor.

This fails with the `0.01` tolerance.

If the tolerance is increased to 0.04, it passes. I have not reproduced "eager_two_runs_differ" on H100.
I think this is the true distribution of results with `--amp`, so increasing the tolerance to 0.04 for the amp case only makes it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375
Approved by: https://github.com/desertfire
2025-01-24 19:56:21 +00:00
457facf7e2 [caffe2] Use the manifold cache backend as the default (#144773)
Test Plan: CI

D68155591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144773
Approved by: https://github.com/izaitsevfb
2025-01-24 19:48:34 +00:00
c16866a582 [BE] mv test/inductor_skips/* to test/inductor_expected_failures/ (#145572)
Summary: I think skipping these tests is suboptimal. If we categorize them as expected failures, then we'll see test failures when they start passing, which means they're more likely to be removed. As skips, they quietly continue to be skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145572
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-24 19:41:38 +00:00
cf063d41f8 Spruce up docs for emulate_precision_casts (#145579)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145579
Approved by: https://github.com/gchanan
2025-01-24 19:28:37 +00:00
96149a201a [Inductor] be able to disable cache for test (#141195)
Let TORCHINDUCTOR_FX_GRAPH_CACHE=0 be respected in unit tests. This is helpful when I want compilation to actually happen for testing. Setting INDUCTOR_TEST_DISABLE_FRESH_CACHE to 1 is not the same, since that causes the generated wrapper file to be deleted, and we may want to inspect those files after running a test.
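
A rough usage sketch (the compiled function below is a placeholder):

```python
import os

# disable the FX graph cache before torch is imported so compilation really runs
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "0"

import torch

@torch.compile
def f(x):
    return x.sin() + 1

f(torch.randn(8))  # generated wrapper files can now be inspected afterwards
```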

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141195
Approved by: https://github.com/masnesral, https://github.com/desertfire
2025-01-24 19:15:55 +00:00
2fd2a950e6 [torchbench] Add meta function for _cudnn_rnn_flatten_weight (#145488)
https://github.com/pytorch/pytorch/issues/144989

This fixes tts_angular model on torchbench for `--export-aot-inductor`

I put the meta function in C++, as the shape calculation requires cuDNN API calls.
I've extracted the shape calculation so it can be reused in the implementation, since this logic has some non-trivial steps and comments.

```
└─ $ python benchmarks/dynamo/torchbench.py --only tts_angular --accuracy --no-translation-validation --inference --bfloat16 --export-aot-inductor --disable-cudagraphs --device cuda
loading model: 0it [00:00, ?it/s]WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
loading model: 0it [00:01, ?it/s]
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
cuda eval  tts_angular
WARNING:common:Model tts_angular does not support bfloat16, running with amp instead
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145488
Approved by: https://github.com/eqy, https://github.com/zou3519
2025-01-24 19:08:14 +00:00
ad36f4f42c Revert "Add generator parameter to rand*_like functions (#136780)"
This reverts commit c7b2f7dd142fc97c8ce4ad7ad591687cf295fcda.

Reverted https://github.com/pytorch/pytorch/pull/136780 on behalf of https://github.com/izaitsevfb due to internal regression ([comment](https://github.com/pytorch/pytorch/pull/136780#issuecomment-2613191933))
2025-01-24 19:00:21 +00:00
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
6cda572c98 [mps] Hoist erfinv logic out of the kernel in preparation for moving. (#145568)
Will be used in inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145568
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-24 18:51:09 +00:00
8eea554332 [Dynamo] Fix names collisions with foreach decomps (#145479)
Fixes https://github.com/pytorch/pytorch/issues/138698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145479
Approved by: https://github.com/yanboliang
2025-01-24 18:46:58 +00:00
e57cdb8402 [ROCm] trunk.yml only runs pre-merge via ciflow/trunk label (#145629)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145629
Approved by: https://github.com/jeffdaily
2025-01-24 18:31:33 +00:00
b8087747f5 [inductor][BE] Enable test_cpu_cpp_wrapper in fbcode (#145373)
Differential Revision: D68278174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145373
Approved by: https://github.com/Skylion007
2025-01-24 17:59:13 +00:00
74cfb4f364 [dynamo][refactor] Move collections.namedtuple out of SkipFunctionVariable (#145547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145547
Approved by: https://github.com/zou3519
ghstack dependencies: #145519
2025-01-24 17:39:33 +00:00
97c0b7cb0a Add unique identifer to bmm thread_mm functions (#145303)
Summary:
The bmm template generates code like this

```
template<bool accum>
void cpp_fused_bmm_66_micro_gemm(...) {
    ...
}

void single_thread_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void threaded_mm() {
    ...
    cpp_fused_bmm_66_micro_gemm(...)
    ...
}

void cpp_fused_bmm_66(...)
{
    ...
    single_thread_mm(...);
    ...
    threaded_mm(...);
    ...
}
```

The generated  `fused_bmm` and `fused_bmm_microgemm` functions both have unique identifiers added to their names, but the `single_threaded_mm` and `threaded_mm` do not.

This diff adds unique identifies to those generated functions as well. The identifier is based on the kernel name. So for the example above we would generate a bmm template name like `cpp_fused_bmm_66_single_thread_mm()`.

Differential Revision: D68364772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145303
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
2025-01-24 17:35:50 +00:00
547c18ee9f Add Torchao docs link to Pytorch libraries (#145412)
Add Torchao docs link to the libraries section in torch docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145412
Approved by: https://github.com/svekars
2025-01-24 17:11:20 +00:00
ce371ab4c6 [ROCm] Create inductor-rocm-mi300 (#145621)
- Adds an mi300 inductor workflow to main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145621
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-24 17:04:17 +00:00
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the
TCPStore check to the monitoring thread.

If unhealthy, the user should be able to get the type of errors, e.g.,
timeout, NCCL error, or remote error.

This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective

Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a
local error) to all other ranks in the PG; the broadcast is currently
done through TCPStore.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
9132f4b7ce [dynamo][guards] Log guard latency to tlparse (#145509)
Example
![image](https://github.com/user-attachments/assets/1503ee59-ff35-46d9-9b61-16352a4a30e2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145509
Approved by: https://github.com/ezyang
2025-01-24 16:33:29 +00:00
1335882b2a If mypy fails it should report the error back to lintrunner (#145550)
This happened to me because I had a bad LD_LIBRARY_PATH and mypy was failing to run (.so load error) - but lintrunner was silent about the underlying problem.

Differential Revision: [D68593081](https://our.internmc.facebook.com/intern/diff/D68593081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145550
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-01-24 15:40:30 +00:00
7c314bfed4 [Intel GPU] Add TORCH_API macro to export symbol NestedTensor_to_mask for libtorch_xpu (#145467)
Part of https://github.com/intel/torch-xpu-ops/issues/1141.

The `TORCH_API` macro is added to export the symbol `NestedTensor_to_mask`, which is needed by libtroch_xpu for `NestedTensor_softmax_dropout_xpu`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145467
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-01-24 15:38:46 +00:00
5d24a9a274 Advance docker release latest verison to cuda 12.4 (#145566)
Fixed the latest tag in ghcr.io to be the CUDA 12.4 docker image. TODO: this needs to be added to https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD

Will need to check whether we can automate this by introducing a cuda_stable variable or something similar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145566
Approved by: https://github.com/nWEIdia, https://github.com/kit1980, https://github.com/malfet
2025-01-24 15:27:25 +00:00
5c64aaea40 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has been landed to the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-24 14:45:12 +00:00
bc62930765 Work around buggy use_const_ref_for_mutable_tensors (#145530)
See https://github.com/pytorch/pytorch/issues/145522 for context

This doesn't fix the problem with use_const_ref_for_mutable_tensors and the boxed wrapper, instead it just gets all of our out kernels off of this flag so that the mutable matching pattern works correctly. I also add a check in torchgen to prevent people from making this mistake in the future.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145530
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2025-01-24 14:38:49 +00:00
9d6927715f Revert "Fix triton masked loading for non-block tl.loads (#144782)"
This reverts commit 31c2f36989e35ccf023a8e35c4bc21aca077d344.

Reverted https://github.com/pytorch/pytorch/pull/144782 on behalf of https://github.com/ezyang due to This regresses compile time for one of our internal models by 20%, internal xref https://fb.workplace.com/groups/1075192433118967/posts/1591490218155850 ([comment](https://github.com/pytorch/pytorch/pull/144782#issuecomment-2612660287))
2025-01-24 14:28:48 +00:00
cyy
6a35d9aaa4 Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-24 12:22:13 +00:00
f08b9bc7e4 [WIP] Move XNNPACKQuantizer from PyTorch to ExecuTorch (#144940)
Summary:
This replicates XNNPACKQuantizer from PyTorch to ExecuTorch.

Rationale:
Main motivation is to avoid pytorch pin update in OSS after updating XNNPACKQuantizer, which can be rather frequent.
Other impact and considerations:
The PT2e flow (which lives in PyTorch) relies heavily on XNNPACKQuantizer for an "example" quantizer implementation and, more importantly, tests. For now, we will keep torch.ao.quantization.xnnpack_quantizer as is but mark it as non-BC and deprecated to discourage future new dependencies on it.
Other OSS repositories using XNNPACKQuantizer from PyTorch now have to take an additional dependency on ExecuTorch.

Differential Revision: D68191752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144940
Approved by: https://github.com/jerryzh168, https://github.com/mcr229
2025-01-24 10:06:07 +00:00
d3989ca636 Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-24 10:04:24 +00:00
10bdd0a1cc [BE][export] Fix hop tests with flaky memory leak (#145391)
Summary:
As title. Added `torch._dynamo.reset()` for each test

This should fix several flaky tests in `test_hop.py` such as https://github.com/pytorch/pytorch/issues/139073

Test Plan:
```
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/export/test_hop.py TestHOPCUDA.test_serialize_export_scan_simple_cuda_float32
```

Differential Revision: D68506280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145391
Approved by: https://github.com/ydwu4
2025-01-24 09:53:21 +00:00
72da0a8a42 [Submodule] Add flash as third-party submodule [Prep for later PRs] (#145502)
# Context

Prototyped here: https://github.com/pytorch/pytorch/pull/144120. We are going to make flash-attention a third-party submodule, then take its C++ sources and include them in our build of libtorch.so.

This requires various external and internal changes to work. Since internal changes are involved we need to co-dev, and in the co-dev environment I haven't found a way to sync submodule changes together with internal-only changes.

This is unused for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145502
Approved by: https://github.com/Skylion007
2025-01-24 09:21:41 +00:00
d62e900d8c [CI][CUDA][MultiGPU][Regression] Skip a failure due to https://github.com/pytorch/pytorch/issues/139520 (#145318)
Related: https://github.com/pytorch/pytorch/issues/139520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145318
Approved by: https://github.com/eqy
2025-01-24 06:58:05 +00:00
0e98b26b28 [CI][CUDA][Dynamic Shape] xfail: DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda (#145204)
python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

failed to generate triton kernels, causing assert failures on 2x H100 systems (and 2x Grace H100 systems).

Failures like below:

F
inline_call []
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_miss', 1), ('autograd_cache_saved', 1), ('ok', 1)]

FAIL: test_linspace4_dynamic_shapes_cuda (__main__.DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 12212, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/testing.py", line 420, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 2603, in test_linspace4
    self.common(fn, (torch.Tensor([]),))
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 424, in common
    return check_codegen(
           ^^^^^^^^^^^^^^
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor_codegen_dynamic_shapes.py", line 82, in check_codegen
    self.assertTrue("def triton" in code, f"Failed to find triton kernel\n{code}")
AssertionError: False is not true : Failed to find triton kernel

# AOT ID: ['0_inference']
from ctypes import c_void_p, c_long, c_int
import torch
import math
import random
import os
import tempfile
from math import inf, nan
from torch._inductor.hooks import run_intermediate_hooks
from torch._inductor.utils import maybe_profile
from torch._inductor.codegen.memory_planning import _align as align
from torch import device, empty_strided
from torch._inductor.async_compile import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels
from torch._inductor.codegen.multi_kernel import MultiKernelCall

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
_quantized = torch.ops._quantized
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool
async_compile = AsyncCompile()
empty_strided_p2p = torch._C._distributed_c10d._SymmetricMemory.empty_strided_p2p

async_compile.wait(globals())
del async_compile

def call(args):
    with torch.cuda._DeviceGuard(1):
        torch.cuda.set_device(1)
        buf0 = empty_strided_cuda((0, ), (1, ), torch.float32)
    return (buf0, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    fn = lambda: call([])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor_codegen_dynamic_shapes.py DynamicShapesCodegenGPUTests.test_linspace4_dynamic_shapes_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145204
Approved by: https://github.com/eellison
2025-01-24 06:57:35 +00:00
817fd14714 [BE] Type annotation for _inductor/dependencies.py (#145311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145311
Approved by: https://github.com/eellison
2025-01-24 06:32:48 +00:00
2ce70da96c [cp] override compute_log_sumexp to True for aten._scaled_dot_product_efficient_attention.default if False (#145421)
## Description
Our current CP doesn't support efficient attention when `compute_log_sumexp=False`. `compute_log_sumexp` is only `False` when `requires_grad=False`, and since PP's [shape inference](d95a6babcc/torch/distributed/pipelining/stage.py (L1387)) happens under the `torch.no_grad()` context, we need to override `compute_log_sumexp` to `True` in our CP attention implementation.

## Test
- Test PP+FSDP+CP w/ `mixed_precision = "float32"` in torchtitan

- `pytest test/distributed/tensor/test_attention.py -s -k test_ring_attention_sdpa`

Before:
<img width="1880" alt="image" src="https://github.com/user-attachments/assets/872ff583-295e-4751-a280-cf7f2d41c61a" />

After:
<img width="2988" alt="image" src="https://github.com/user-attachments/assets/4bdcc2e5-22a5-427a-91a5-82206d5bd78f" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145421
Approved by: https://github.com/H-Huang, https://github.com/tianyu-l
2025-01-24 06:17:54 +00:00
53fc921ce2 [dynamo][trace-rules-cleanup] Remove functools from the Builtins skiplist (#145519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145519
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2025-01-24 06:02:03 +00:00
9752c7c1c8 [CD] Fix slim-wheel cuda_nvrtc import problem (#145582)
Similar fix as: https://github.com/pytorch/pytorch/pull/144816

Fixes: https://github.com/pytorch/pytorch/issues/145580

Found during testing of https://github.com/pytorch/pytorch/issues/138340

Please note both nvrtc and nvjitlink exist for cuda 11.8, 12.4 and 12.6 hence we can safely remove if statement. Preloading can apply to all supporting cuda versions.

CUDA 11.8 path:
```
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/cuda_nvrtc/lib
__init__.py  __pycache__  libnvrtc-builtins.so.11.8  libnvrtc-builtins.so.12.4  libnvrtc.so.11.2  libnvrtc.so.12
(.venv) root@b4ffe5c8ac8c:/pytorch/.ci/pytorch/smoke_test# ls /.venv/lib/python3.12/site-packages/torch/lib/../../nvidia/nvjitlink/lib
__init__.py  __pycache__  libnvJitLink.so.12
```

Test with rc 2.6 and CUDA 11.8:
```
python cudnn_test.py
2.6.0+cu118
---------------------------------------------SDPA-Flash---------------------------------------------
ALL GOOD
---------------------------------------------SDPA-CuDNN---------------------------------------------
ALL GOOD
```

Thank you @nWEIdia for discovering this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145582
Approved by: https://github.com/nWEIdia, https://github.com/eqy, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-24 04:47:57 +00:00
732c4998f3 [NVIDIA] Full Family Blackwell Support codegen (#145436)
More references:
https://github.com/NVIDIA/nccl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145436
Approved by: https://github.com/ezyang, https://github.com/drisspg
2025-01-24 04:36:00 +00:00
c184055743 [BE] Use value_or in layer_norm.cpp (#145417)
Now that we have a proper optional, there is no need to do `if (has_value) value else default_value;`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145417
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-24 04:02:23 +00:00
4799ebf326 [MPS][BE] Turn bicubic2d into generic metal template (#145578)
In preparation for more metal shaders to come
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145578
Approved by: https://github.com/Skylion007
2025-01-24 04:01:23 +00:00
68a1505985 serde and_ operator (#145506)
Differential Revision: [D68565887](https://our.internmc.facebook.com/intern/diff/D68565887/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145506
Approved by: https://github.com/zhxchen17, https://github.com/Skylion007
2025-01-24 03:48:03 +00:00
29ddf9a63e Document dispatch trace build flag (#145517)
Ok, the build flag seems to have been broken for a while since the function it calls doesn't exist anymore.
Repurposed it to enable dispatcher printing (which requires a full (and slow) debug build otherwise).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145517
Approved by: https://github.com/bdhirsh
2025-01-24 03:19:39 +00:00
a40ead1fd6 Don't fail if fresh_inductor_cache fails to clean up its tmp dir. (#145513)
Summary: I see we have a test failure due to an error removing the tmp dir: https://github.com/pytorch/pytorch/issues/141761. Seems like we should not raise an exception for this case in general. Also, let's clean up the exception handling related to windows. The comment makes it sound like we want to specifically ignore failures cleaning up, but the current impl is swallowing all exceptions.

Fixes #141761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145513
Approved by: https://github.com/eellison
2025-01-24 03:17:03 +00:00
36fcf98db6 [cutlass backend tests] Manually clear cache, test more tests in fbcode and limit configs in some tests (#145545)
Summary:
Manually clear cache:
You want to clear the cache in most tests. Otherwise the link command won't work: you end up with multiple .o files and get something like `ld.lld: error: duplicate symbol: cuda_fused_0`.

test more tests in fbcode:
A few tests have been skipped in fbcode. Unskip them.

limit configs in some tests:
to reduce time spent on each test

Differential Revision: D68584071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145545
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-01-24 03:06:59 +00:00
386650353b [ARM] Fix bf32 and tf32 precision for tensordot unit test (#141136)
Fixes unit test failure on aarch64 ( neoverse-v1 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141136
Approved by: https://github.com/malfet
2025-01-24 02:59:45 +00:00
d6bea398ac Only include RMSNorm.h in layer_norm.cpp for MPS (#145524)
Test Plan: CI

Differential Revision: D68578213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145524
Approved by: https://github.com/malfet
2025-01-24 02:08:49 +00:00
d5629889f1 cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910)
Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910
Approved by: https://github.com/desertfire
2025-01-24 02:06:35 +00:00
47e65077b1 OpenReg: Remove REGISTER_GENERATOR_PRIVATEUSE1 (#144841)
Replace REGISTER_GENERATOR_PRIVATEUSE1 with new API in AcceleratorHooksInterface.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144841
Approved by: https://github.com/albanD
2025-01-24 01:52:10 +00:00
cd68d54911 Inductor cache: Revamp how we handle frozen params (#143808)
Summary: In https://github.com/pytorch/pytorch/pull/143563 we have a report of a problem with the treatment of frozen params in the inductor cache implementation. There seems to be a path where new constants are added in the `GraphLowering`. On a cache hit when we try to find those constant names in the `torch.fx.GraphModule`, they do not exist. The current approach treats all constants differently if the GM has any frozen params. This PR changes the approach to only treat the _frozen_ params specially, but store all other constants in the cache entry (as we do without freezing):
1) When creating a cache entry, store the names of any frozen params, but the values of any other constants.
2) On a cache hit, restore the values of the frozen params by looking up in the current GM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143808
Approved by: https://github.com/leslie-fang-intel, https://github.com/eellison
2025-01-24 01:20:07 +00:00
54e2f4b201 Fix lerp weight type promotion (#141117)
Fixes #140601

Enable `promote_inputs_to_common_dtype` when the tensors do not share a dtype when invoking the `lerp` function.

For `lerp_Tensor`
- Check whether the tensors have the same `dtype`; enable promotion if not
- Remove the type check assert

For `lerp_Scalar`
- It already enables `promote_inputs_to_common_dtype` by default, so just remove the type check. Make sure the promotion behavior is consistent with `lerp_Tensor`

`lerp_Scalar` gets its TensorIteratorConfig from here
c37185c76a/aten/src/ATen/TensorIterator.cpp (L979-L985)

**Test Result**
Test case in issue passed

```python
>>> import torch
>>>
>>> x = torch.ones(2, 2, dtype=torch.float64)
>>> w = torch.ones(2, 2, dtype=torch.float64)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)

>>> x = torch.ones(2, 2, dtype=torch.float16)
>>> w = torch.ones(2, 2, dtype=torch.float16)
>>> s = torch.tensor(2.2)
>>> x.lerp_(w, s)
tensor([[1., 1.],
        [1., 1.]], dtype=torch.float16)

```

```bash
$ pytest test/test_binary_ufuncs.py -k 'test_lerp_tensor_type_promotion or test_lerp_scalar_type_promotion'
```
![image](https://github.com/user-attachments/assets/288a5294-a9ee-47f3-bbf7-d4ff986f3ba8)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/d469836f-5c49-4d89-a2fd-379cad4db3af)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141117
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-24 01:18:20 +00:00
b2c89bc115 [inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits.

What this PR fixes:
* in triton_kernel_wrap.py, AST->TTIR parsing was to be updated for the new triton API
* ir.py - don't remove None args when using newer triton versions
* wrapper.py - update signature & constant handling

What this doesn't fix:
* correct None handling - I want to do a closer look at constant handling (including None, equal_to_1, and other constants).
* cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels)

test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: 1374074098)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348
Approved by: https://github.com/jansel
ghstack dependencies: #145051
2025-01-24 00:34:01 +00:00
b963ab5325 [inductor][1/N] triton support post-#5512, main components (#145051)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This is an initial PR to add support for Triton versions after commit 5512 landed.

The main changes in 5220 and 5512 that need to be supported:
* AttrsDescriptor() gets replaced with a raw dict. The raw dict has the format `{(TUPLES): [["tt.divisibility", 16]]}`, where `(TUPLES)` is a tuple of indices, e.g. `((0,), (1,), (3,))` to indicate that args 0, 1, and 3 are divisible by 16. These indices are, themselves, represented as tuples to support nested inputs (e.g. an argument that's a tuple), but support for tuples is not implemented right now.
* "signature" changes: the signature now contains _all_ args, including constexpr and constant args.
* ASTSource now takes "constexprs" instead of "constants" - for example, equal-to-1 args are constants but not constexprs so we don't need to pass these args as "constants".

What this PR supports:
* Triton versions before Dec 9, 2024, and (partial support for) Triton versions after Jan 1, 2025
* (triton jan 1+) typical inductor-generated triton: updated AttrsDescriptor, signatures, constexpr/constant handling.

What this PR doesn't support (TODO in follow-up PRs):
* Triton versions between Dec 9, 2024 and before Jan 1, 2025
* (triton jan 1+) user-defined triton kernel support (this is implemented already in @anmyachev's patch)
* (triton jan 1+) triton_helper support (failing in triton codegen - needs investigation)
* (triton jan 1+) AOTI / cpp wrapper

thanks to @anmyachev for patches in https://github.com/intel/intel-xpu-backend-for-triton/blob/main/scripts/pytorch.patch, which contains most of these changes already

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145051
Approved by: https://github.com/jansel
2025-01-24 00:34:01 +00:00
714f64329b Revert "Add multi env variable support to configs (#145288)"
This reverts commit a8b7cb6a2ddbba4924b6b2531f1ecd2f5ed6d512.

Reverted https://github.com/pytorch/pytorch/pull/145288 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint from a landrace with some recent PEP585 changes ([comment](https://github.com/pytorch/pytorch/pull/145288#issuecomment-2611278428))
2025-01-24 00:20:00 +00:00
6a2b4db0a1 Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)"
This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79.

Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))
2025-01-24 00:17:34 +00:00
6f60c65a3a Revert "[dynamo] Log guard latency (#145132)"
This reverts commit 0a310d738819ae000f49b32298305724117634c2.

Reverted https://github.com/pytorch/pytorch/pull/145132 on behalf of https://github.com/anijain2305 due to CI failures observed after PR was merged ([comment](https://github.com/pytorch/pytorch/pull/145132#issuecomment-2611268421))
2025-01-24 00:11:50 +00:00
f0e9f87a9b [BE/mps] Mark input args as constant to prevent incorrect usage. (#145535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145535
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-24 00:11:44 +00:00
6aaae9d78f Make torchelastic etcd rendezvous publicly importable (#145396)
Make torchelastic publicly importable by raising the etcd import error lazily. [BE task, row 7](https://docs.google.com/spreadsheets/d/1TtATnLJf1rVXaBQd3X3yYqm9xNN9BIWG7QqRgrFiRRI/edit?gid=1748512924#gid=1748512924)
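
A generic sketch of the lazy-import pattern described here (illustrative, not the actual torchelastic code):

```python
# defer the ImportError for an optional dependency until it is actually needed
try:
    import etcd  # optional dependency
except ModuleNotFoundError:
    etcd = None

def _require_etcd():
    if etcd is None:
        raise ModuleNotFoundError(
            "etcd is required for the etcd rendezvous backend; install python-etcd"
        )
    return etcd
```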

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145396
Approved by: https://github.com/albanD
ghstack dependencies: #145387
2025-01-23 23:56:45 +00:00
f8a4f16634 [c10d] fix memory leak on shutdown (#145507)
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

on window 1:
```
vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```

without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks
```
With fix, no leak

Differential Revision: D68566104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507
Approved by: https://github.com/XilunWu, https://github.com/d4l3k
2025-01-23 23:36:15 +00:00
6dd8283381 Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)"
This reverts commit 5531fafffefc45cd894040b2b07b0d5227430082.

Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
c3fadacf84 Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)"
This reverts commit 8c7c5f7bfcbc55638a0e4aed6eaa27f6194dbebe.

Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
9553301ade Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)"
This reverts commit 784bb2127ca9729c646f1650ecc2cf946a583da8.

Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
16c4f8c395 Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)"
This reverts commit ec820fe57c2d6a2847569a107856e7fcff87dc5c.

Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
3f6cfd0156 Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417)"
This reverts commit 99dd1bf1b93bc26080e611af54497a73a618e02a.

Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
ab082863a1 Revert "[compiled autograd] support Tensor Subclasses in AOTBackward (#144115)"
This reverts commit 082c28c3c655984ce65c13336cff822db95ee470.

Reverted https://github.com/pytorch/pytorch/pull/144115 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
0a310d7388 [dynamo] Log guard latency (#145132)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145132
Approved by: https://github.com/ezyang
ghstack dependencies: #145351, #145420
2025-01-23 23:30:07 +00:00
bf62222d81 Revert "[compiled_autograd] Rename interface to pyinterface (#145495)"
This reverts commit e1407f5aeb658c8c959d33158f465e975799a3d0.

Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))
2025-01-23 23:07:17 +00:00
a8b7cb6a2d Add multi env variable support to configs (#145288)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145288
Approved by: https://github.com/c00w
2025-01-23 23:00:23 +00:00
dad9bc3461 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit de945d78da9198e58df7c19c53b737d0f987ddff.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))
2025-01-23 22:59:25 +00:00
cyy
42f4fda2eb Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-23 22:47:18 +00:00
6f07847efe Bail on checking internal overlap when dealing with unbacked symints (#145385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145385
Approved by: https://github.com/ezyang
2025-01-23 22:31:31 +00:00
e1407f5aeb [compiled_autograd] Rename interface to pyinterface (#145495)
Summary: interface is a reserved word in some MSVC variants.

Test Plan: build

Differential Revision: D68561379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495
Approved by: https://github.com/xmfan
2025-01-23 21:40:59 +00:00
302b07f166 Implement deepcopy for AOTICompiledModel (#145423)
Summary:

Fix https://github.com/pytorch/pytorch/issues/145411

Support deepcopying AOTICompiledModel. The `loader` is shallow copied.
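
A rough usage sketch, assuming "model.pt2" is a package previously produced by AOTInductor packaging (the path is a placeholder):

```python
import copy

import torch
import torch._inductor

# load an AOTInductor package and deepcopy the resulting compiled model
compiled = torch._inductor.aoti_load_package("model.pt2")

# per this PR, deepcopy now works; the underlying loader is shared (shallow-copied)
compiled_copy = copy.deepcopy(compiled)
```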

Test Plan:
```
buck2 run fbcode//mode/opt //caffe2/test/inductor:aot_inductor_package -- -r deepcopy
```

Differential Revision: D68524673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145423
Approved by: https://github.com/desertfire
2025-01-23 21:05:30 +00:00
e924ddbef1 [BE] [mps] Refactor UnaryConstants to be its own kernel. (#145230)
In preparation for using this file for inductor (for erfinv).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145230
Approved by: https://github.com/malfet
2025-01-23 20:58:43 +00:00
881eb86692 Fix staging for CPU tensors in OSS DCP async_save (#145408)
Fix staging for CPU tensors in OSS DCP async_save (#145408)

Summary:

As found in
https://github.com/pytorch/pytorch/issues/144657
for CPU tensors we accidentally skip copying during staging because we use the offload-to-CPU helper, which is a no-op for tensors already on CPU. This means that if the trainer changes the original source CPU tensor value after launching the async save but before the actual writing/uploading to the destination commences, the writing/uploading logic will accidentally pick up the latest state of the tensor, when it should have dealt with its own dedicated copy saved earlier. Dropping _offload_state_dict_to_cpu in favor of _copy_state_dict fixes this bug.

Test Plan:
Running the user script from the linked GitHub issue verifies the fix:
```
import os

import torch

import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        return self.layer(x)

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"

dist.init_process_group()

model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")

try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)

        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )

    future.result()

    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
passes all asserts with this fix. If the script is run in trunk, confirmed that it fails the first assert.

Differential Revision: D68518689
2025-01-23 12:49:26 -08:00
6a44a61514 [BE] Bump TIMM pin (#145320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145320
Approved by: https://github.com/Skylion007
2025-01-23 20:44:26 +00:00
99367ecbed [draft export] count how many times a data-dep error shows up (#145030)
Summary: maybe this is helpful?

Test Plan: draft_export

Differential Revision: D68303934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145030
Approved by: https://github.com/angelayi
2025-01-23 20:27:31 +00:00
5ebca3015d [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
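
A tiny hypothetical before/after of the pattern being simplified:

```python
names = ("relu", "gelu", "silu")

# before: element-by-element insertion
seen = set()
for name in names:
    seen.add(name)

# after: one bulk update, shorter and slightly faster
seen = set()
seen.update(names)
```
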
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-23 20:18:13 +00:00
d7b6746470 Revert "Fix deprecated pytorch_sphinx_theme editable installation (#145347)"
This reverts commit c27dd9cf72265161f85a18c0b19f365097f7a1ac.

Reverted https://github.com/pytorch/pytorch/pull/145347 on behalf of https://github.com/huydhn due to Remove -e breaks the theme somehow ([comment](https://github.com/pytorch/pytorch/pull/145347#issuecomment-2610911258))
2025-01-23 20:06:07 +00:00
d53f2067fe [BE][export] add "+export" logging to de/serialization (#145283)
adds de/serialization debug logging to `TORCH_LOGS="+dynamic"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145283
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2025-01-23 19:47:48 +00:00
ce4a097bf7 Revert "Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)"
This reverts commit 55084443cabbaf6c28d8c546d8988cf3ed0f3d1c.

Reverted https://github.com/pytorch/pytorch/pull/144829 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/144829#issuecomment-2610855579))
2025-01-23 19:37:54 +00:00
527101fa95 Move Windows arm64 scripts from pytorch/builder (#144317)
This PR moves the Windows Arm64 scripts from the builder repository to the main repository. The corresponding PR to pytorch/builder that removes them is here: https://github.com/pytorch/builder/pull/2058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144317
Approved by: https://github.com/Skylion007, https://github.com/seemethere

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:29:29 +00:00
66bf7da446 Enable sleef for Win Arm64 (#144876)
Sleef module was disabled for Windows Arm64 on b021486405
This PR enables it again since the issue is no longer valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144876
Approved by: https://github.com/albanD, https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2025-01-23 19:22:58 +00:00
991a4b5925 [dynamo] Add --profile-details and --export-perfdoctor option (#144751)
Summary:
Add `--profile-details` option to add shapes and other details to the Kineto profile.

Add `--export-perfdoctor` to directly dump trace to perfdoctor for webview.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench_internal -- --only mrs_video_watch_over --performance --training --amp --export-profiler-trace --backend=inductor --profile-details --export-perfdoctor
```

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/pyper_traces/tree/traces/test/inductor_mrs_video_watch_over_rank_0_20250113_173817_6535183793.json.gz

Differential Revision: D68134547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144751
Approved by: https://github.com/drisspg
2025-01-23 19:09:40 +00:00
5b37249259 Enable fp16 linear layers in PyTorch via ACL (#144992)
This pull request aims to enable the use of linear layers with the fp16 data type through the ACL.

On a Graviton3 instance running with 16 threads, a linear layer applied to `torch.randn(2048, 4096, dtype=torch.half)` takes 50+% less time to complete compared with the same layer on `torch.float32` inputs.
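
A minimal timing sketch of the comparison (the shapes and iteration count are assumptions, not taken from the PR):

```python
import timeit

import torch
import torch.nn.functional as F

# fp16 inputs/weights vs. their fp32 equivalents
x16 = torch.randn(2048, 4096, dtype=torch.half)
w16 = torch.randn(1024, 4096, dtype=torch.half)
x32, w32 = x16.float(), w16.float()

t16 = timeit.timeit(lambda: F.linear(x16, w16), number=10)
t32 = timeit.timeit(lambda: F.linear(x32, w32), number=10)
print(f"fp16: {t16:.3f}s  fp32: {t32:.3f}s")
```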

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144992
Approved by: https://github.com/ng-05, https://github.com/digantdesai, https://github.com/malfet
2025-01-23 19:07:54 +00:00
6d4f5f7688 [Utilization][Usage Log] Add data model for record (#145114)
Add a data model for consistency and to accommodate future data model changes.

The data model will be used during the post-test-processing pipeline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145114
Approved by: https://github.com/huydhn
2025-01-23 19:04:41 +00:00
2f317bbdbc Missing autorelease in lstm_mps caused a ton of leaked memory (#145503)
The dictionary held onto the new MPSGraphTensorData objects and MPSNDArrays.  Regression caused by https://github.com/pytorch/pytorch/pull/95137

Fixes #145374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145503
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-23 18:54:30 +00:00
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue.

1. This reverts commit 0940eb6d44f3cf69dd840db990245cbe1f78e770 (https://github.com/pytorch/pytorch/pull/145392) and fixes the KleidiAI mirror issue.
2. KleidiAI is now cloned from the GitHub mirror instead of Arm's GitLab.

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
34b8d8b0c0 update compile time benchmarks to dump compile times to stdout and csv (#145447)
```python
# inductor.csv
dev,name,batch_size,accuracy,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks,autograd_captures,autograd_compiles,cudagraph_skips,compilation_latency
cuda,cait_m36_384,8,pass,2510,1,0,0,0,0,0,87.705186
```

```
loading model: 0it [01:27, ?it/s]
cuda eval  cait_m36_384
Compilation time (from dynamo_timed): 87.705186276  # <----------------
pass
TIMING: _recursive_pre_grad_passes:0.11023 pad_mm_benchmark:0.50341 _recursive_joint_graph_passes:3.88557 _recursive_post_grad_passes:6.71182 async_compile.wait:4.16914 code_gen:17.57586 inductor_compile:42.55769 backend_compile:72.47122 entire_frame_compile:87.70519 gc:0.00112 total_wall_time:87.70519
STATS: call_* op count: 2510 | FakeTensorMode.__torch_dispatch__:101743 | FakeTensor.__torch_dispatch__:12959 | ProxyTorchDispatchMode.__torch_dispatch__:41079
Dynamo produced 1 graphs covering 2510 ops with 0 graph breaks (0 unique)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145447
Approved by: https://github.com/ezyang
2025-01-23 18:49:19 +00:00
629fb1590c [BE] Type annotate pad_mm.py (#145409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145409
Approved by: https://github.com/Skylion007
2025-01-23 18:34:24 +00:00
015c6d6fdb [dynamo][guards] Turn on profiling of guard manager (#145420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420
Approved by: https://github.com/ezyang
ghstack dependencies: #145351
2025-01-23 18:17:43 +00:00
fef92c9447 Fix IndentationError of code example (#145251)
I found there is an IndentationError when trying to copy-paste the example of inference with torch.compile.
This PR fixes the formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145251
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-23 18:17:11 +00:00
9a5bc7b6dd [BE] Type annotate metrics.py (#145418)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145418
Approved by: https://github.com/Skylion007
2025-01-23 18:13:59 +00:00
bdc2c2a237 [be] fix flaky test aot_export_ cond caused by free symbol lifting and automatic dynamic shape (#145330)
Fixes https://github.com/pytorch/pytorch/issues/139998#issuecomment-2605908426.

It seems to be an issue caused by the interaction between a dynamo-traced HOP, automatic dynamic shapes, and auto-lifting of free symbols. The immediate error is that the assertExpectedInline of the graph can sometimes differ, e.g. see https://hud.pytorch.org/flakytest?name=test_aot_export_with_torch_cond&suite=TestAOTExport&limit=100, where sometimes the shapes are lifted as inputs to the cond and sometimes they're not.

The root cause of the flakiness is that the two invocations of torch.cond trigger two torch.compile runs on the same code object ([code](https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/cond.py#L192)), and this triggers automatic dynamic shapes because in test_aot_export_with_torch_cond, x has shape (3, 4) while the pre_dispatch one has shape (2, 2). Because we auto-lift free symbols for dynamic-shaped inputs, cond sometimes has the shapes as arguments and sometimes not.

This PR adds a simple fix: run _dynamo.reset before each torch.cond test. This avoids triggering automatic dynamic shapes and fixes the error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145330
Approved by: https://github.com/zou3519
2025-01-23 18:12:58 +00:00
3c247ee8c4 [hop][be] add utils for more comprehensive input alias and mutation (#145298)
This PR implements the idea from @zou3519 of checking input mutations through tensor versions and checking aliasing via storage. Previously, we relied on whether there's an in-place op that takes a placeholder input, which doesn't take views into account.

While writing the PR, I also noticed a bug in the previous input-mutation checking logic: we were checking for mutating operators on functionalized_f, where all the mutating ops have already been replaced, so we wouldn't be able to detect anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298
Approved by: https://github.com/zou3519
2025-01-23 18:12:28 +00:00
b0f3597133 Add fused rms_norm implementation for MPS backend (#145301)
Adding a fused rms_norm implementation for MPS backend. This eliminates most of the current CPU overhead, making this computation GPU bound and improving latency of rms_norm by **30x-40x** on MPS backend
The metal shader was adapted from MLX: e6a7ab9675/mlx/backend/metal/kernels/rms_norm.metal

The numbers below are averages over 1000 runs of RMSNorm, obtained on an M1 Pro.

Benchmarking Results (Before):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :  140.5  | 171.0  | 170.4  |  10.9  |  13.3  |  13.5
```

Benchmarking Results (After):
```
Device            :            MPS            |           CPU
Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16
Outputs Match     :   True  |  True  |  True  |  True  |  True  |  True
Average Time (us) :    4.0  |   3.9  |   3.9  |  10.0  |  12.4  |  13.0
```

Profiling Results (Before):
```
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
               aten::rms_norm         2.35%       3.284ms       100.00%     140.038ms     140.038us          1000
                    aten::mul        33.61%      47.068ms        33.61%      47.068ms      23.534us          2000
                    aten::pow        17.04%      23.868ms        17.43%      24.402ms      24.402us          1000
                   aten::add_        16.52%      23.130ms        16.78%      23.497ms      23.497us          1000
                   aten::mean        15.82%      22.151ms        15.82%      22.151ms      22.151us          1000
                  aten::rsqrt        13.63%      19.085ms        13.71%      19.198ms      19.198us          1000
                   aten::item         0.46%     639.370us         0.56%     788.376us       0.394us          2000
                aten::type_as         0.21%     295.507us         0.27%     371.291us       0.371us          1000
                     aten::to         0.13%     177.742us         0.13%     177.742us       0.059us          3000
    aten::_local_scalar_dense         0.11%     149.006us         0.11%     149.006us       0.075us          2000
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 140.038ms
```

Profiling Results (After):
```
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
         aten::rms_norm        63.21%     832.875us       100.00%       1.318ms       1.318us          1000
       aten::empty_like        16.06%     211.631us        36.79%     484.681us       0.485us          1000
    aten::empty_strided        20.72%     273.050us        20.72%     273.050us       0.273us          1000
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 1.318ms
```

Benchmarking and profiling script:
```python
import torch
import torch.nn as nn
from torch.profiler import profile
import time

def benchmark(device, dtype):
    model = nn.RMSNorm(2048, device=device)

    # Create example inputs
    x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)
    w = torch.randn(2048, requires_grad=False, device=device, dtype=dtype)
    eps = 1e-5

    # Check output
    y = torch.ops.aten.rms_norm(x, [2048], w, eps)
    z = torch.ops.aten.rms_norm(x.cpu(), [2048], w.cpu(), eps)
    outputs_match = torch.allclose(y.cpu(), z)

    # Measure time manually
    start_time = time.time() * 1000
    for _ in range(1000):
        with torch.no_grad():
            y = model(x)
            torch.mps.synchronize()  # actually wait for the MPS kernels to finish
    end_time = time.time() * 1000
    manual_delta = (end_time - start_time)
    average_time = f"{manual_delta:6.1f}"

    return outputs_match, average_time

outputs_match_list = []
average_time_list = []
for device in ["mps", "cpu"]:
    for dtype in [torch.float32, torch.float16, torch.bfloat16]:
        outputs_match, average_time = benchmark(device, dtype)
        outputs_match_list.append(str(outputs_match))
        average_time_list.append(average_time)

print("\nBenchmarking Results:")
print("---------------------")
print("Device            :            MPS            |           CPU")
print("Dtype             :   FP32  |  FP16  |  BF16  |  FP32  |  FP16  |  BF16")
print(f"Outputs Match     :  ", "  |  ".join(outputs_match_list))
print(f"Average Time (us) :", "  |".join(average_time_list))

device = "mps"
dtype = torch.float32
model = nn.RMSNorm(2048, device=device)
x = torch.randn(1, 1, 2048, requires_grad=False, device=device, dtype=dtype)

# Run and profile the model
with profile() as prof:
    with torch.no_grad():
        for _ in range(1000):
            y = model(x)
            torch.mps.synchronize()  # actually wait for the MPS kernels to finish

# Print profiling results
print("\n\nProfiling Results (MPS/FP32):")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145301
Approved by: https://github.com/malfet
2025-01-23 18:07:10 +00:00
a86fa779ce [BE] Fix edge case in translation validation bisector (#145414)
This patch fixes a small bug for the binary-search algorithm in
translation validation bisector. Fixes #131303.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145414
Approved by: https://github.com/ysiraichi, https://github.com/zou3519
2025-01-23 17:35:28 +00:00
045698653a [BE] Remove test_ops_gradients from FIXME_inductor_dont_reset_dynamo (#145308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145308
Approved by: https://github.com/zou3519
ghstack dependencies: #145306
2025-01-23 17:25:04 +00:00
3a8d3785f7 [ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482)
PR #141153 exposed the option to collect sizes as dynamic. After this
change, the function set_autograd_compiler returns a PyTuple object which
is populated using the PyTuple_SET_ITEM function. Yet that function steals
a reference to the object and doesn't INCREF it. So currently we are
missing an INCREF on prior_compiler when it is Py_None and an INCREF on
prior_dynamic, which is either Py_False or Py_True. This bug may lead to
memory corruption.

@xmfan @jansel @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
2025-01-23 17:13:56 +00:00
c6707734de Enable non power of 2 head_dim for FlexAttention (#133495)
# Summary
- Adds support for non-power-of-2 head_dim by launching blocks with head_dim rounded up to the next valid power of two (see the sketch after this list).
- The other option I considered was building up the final dot products with smaller blocks; this would probably work, but for the sake of code complexity I'm going with the first option for now.
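
A minimal sketch of what the change enables (assumes a CUDA device and torch.compile; the shapes are illustrative, not from the PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# head_dim = 96 is not a power of two; with this change the compiled kernel
# rounds its block size up to 128 internally, so no manual padding is needed.
compiled_flex = torch.compile(flex_attention)
B, H, S, D = 2, 4, 1024, 96
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))
out = compiled_flex(q, k, v)
print(out.shape)  # torch.Size([2, 4, 1024, 96])
```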

### Corollary
We had a bug in our backward kernel where we were using index_k instead of index_v. This should have shown up for the qk_head_dim != v_head_dim cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495
Approved by: https://github.com/Chillee
2025-01-23 17:05:38 +00:00
bf4f8919df Fix test_modules_can_be_imported (#145387)
`test_modules_can_be_imported` test is currently failing due to a few missing private modules and this PR gets it working before I start to clean up the public allow list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145387
Approved by: https://github.com/albanD
2025-01-23 16:03:00 +00:00
768ad0886f Revert "Binary upload checksum (#144887)"
This reverts commit 2efa98d69d362e4ee6f15938ec8ded30bf5c40fd.

Reverted https://github.com/pytorch/pytorch/pull/144887 on behalf of https://github.com/atalman due to Broke nightly index ([comment](https://github.com/pytorch/pytorch/pull/144887#issuecomment-2610066277))
2025-01-23 15:10:42 +00:00
0802e78315 [CD] Disable Kineto for XPU Windows CD (#145255)
Due to issue #145155, disable Kineto for XPU Windows CD temporarily.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145255
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-23 14:09:52 +00:00
629840e038 Backout PEP585 use of Iterable (#145438)
Summary:
Importing Iterable from collections.abc here causes an internal product to fail
MRO discovery causing a collision between Iterable and Generic.

This fixes the failure on D68461304

Differential Revision: D68531443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438
Approved by: https://github.com/izaitsevfb
2025-01-23 11:45:37 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
b6941d4e42 [inductor] fix autotuning memory usage (#145410)
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` will need to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:

6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)

Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB of peak memory usage.
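
A minimal sketch of the effect described above (assumes a CUDA device; the sizes are illustrative):

```python
import torch

# Copying a *non-contiguous* GPU tensor into a CPU tensor first materializes a
# contiguous temporary on the GPU, so peak GPU memory grows by the size of the
# source tensor during the copy.
src = torch.randn(8192, 8192, device="cuda")   # ~256 MiB fp32
noncontig = src.t()                            # transposed view, non-contiguous
dst = torch.empty(noncontig.shape, device="cpu")

torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()
dst.copy_(noncontig)                           # allocates a contiguous GPU temp
torch.cuda.synchronize()
print(f"current: {torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
      f"peak during copy_: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```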

Note that, with all the following efforts
1. donated buffer
2. inplace padding
3. this PR

we save 3GB of peak memory (18.6GB -> 15.5GB) for the GPT2 model with torch.compile.

The peak memory curve of GPT2 looks like a '...\_M\_...' shape: there are 2 places where we reach the peak. Donated buffers remove the first peak by computing grad_softmax in place, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding.

Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)

With 1 & 2 & 3, we save 3GB of peak memory (18.6GB -> 15.5GB) whether or not autotuning happens.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
2025-01-23 09:34:23 +00:00
638903aeee Adapt Dynamo tests to HPUs using instantiate_device_type_tests (#144387)
**MOTIVATION**

We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.

Other accelerators (e.g. xpu) can also extend the functionality by adding their device to the devices list.

**CHANGES**

- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes
- Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices

Previously we had submitted some changes in https://github.com/pytorch/pytorch/pull/140131 . However, deleted that PR due to merge conflicts and other issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387
Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey
2025-01-23 09:24:42 +00:00
d3f196909d [inductor] let inplace-padding support cpp-wrapper (#145325)
Some context: inplace padding is an optimization to do padding in place. E.g., if a tensor has size [2048, 2047] and stride [2048, 1], and we need to pad one extra element onto the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place. This saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and cause OOB memory access (although I have never seen this happen, maybe due to over-allocation in the CUDACachingAllocator, it should be fixed).

The fix is when we allocate the tensor, instead of doing something like:
```
  buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do some small overallocation
```
  buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```

cpp_wrapper needs special handling since memory allocation goes through a different code path than the Python wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
2025-01-23 09:22:38 +00:00
f52901a0a7 [ONNX] Remove LegacyDynamoStrategy (#145442)
It's legacy. So remove. Shouldn't affect anything and will facilitate cleaning up our legacy code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145442
Approved by: https://github.com/titaiwangms
2025-01-23 07:56:04 +00:00
28c251dd0b [BE] Remove test_modules from FIXME_inductor_dont_reset_dynamo (#145306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145306
Approved by: https://github.com/zou3519
2025-01-23 06:37:21 +00:00
f56c638849 [c10/metal] Add a vectype variant for short/int/long (#145430)
Some of the kernels (exp_complex/atan_complex) need the specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145430
Approved by: https://github.com/malfet, https://github.com/jansel
2025-01-23 04:52:56 +00:00
c58198184b [dynamo][dicts] Insert LENGTH guard on an if condition on dict (#145432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145432
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-23 04:40:56 +00:00
faa10faa2c [ROCm] CK SDPA - Move arch check to CK patch (#144777)
__gfxXXX__ should only be visible to device code. Move the check into the CK kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144777
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell, https://github.com/jianyuh
2025-01-23 04:12:25 +00:00
5e6451ea78 [c10] catch c10 error and log message (#145413)
Summary:
Explicitly catch c10 error and log the error message only.

The standard exception's `e.what()` below ends up logging a stack trace that confuses users.
See S477887 for details.

Test Plan:
tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB  Down: 4.7GiB  (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets.   Remaining      0/3024                                                                                                                                 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining      0/31483                                                                                                                                1481904 actions, 1719048 artifacts declared
Executing actions. Remaining      0/250391                                                                                                                               77:11:29.7s exec time total
Command: test.     Finished 2031 local, 45445 remote, 51473 cache (52% hit)                                                                                              20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D68516080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
2025-01-23 03:45:47 +00:00
719938c77f Generalize pin memory logic for accelerator when non blocking copy happened (#143783)
# Motivation
fix https://github.com/pytorch/pytorch/issues/143641
Generalize the pin-memory logic for accelerators when a non-blocking copy happens. Each accelerator has its own implementation of `empty_strided`. An accelerator that doesn't have a pin-memory mechanism can ignore it, or mimic the behavior, when pin_out is True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143783
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #144959
2025-01-23 03:43:05 +00:00
28b6430823 Introduce a new API isAcceleratorExcluded (#144959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144959
Approved by: https://github.com/albanD
2025-01-23 03:43:05 +00:00
5a18f1e1eb [dynamo] Support fx map_aggregate (#145351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145351
Approved by: https://github.com/zou3519
2025-01-23 03:19:30 +00:00
d95a6babcc Revert "Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)"
This reverts commit 0bff37788043626ee5e472389f88cbbbf7add997.

Reverted https://github.com/pytorch/pytorch/pull/142859 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the XLA failures look legit ([comment](https://github.com/pytorch/pytorch/pull/142859#issuecomment-2608631019))
2025-01-23 01:10:31 +00:00
0d28188cc8 Move privateuse1 test out of test_utils and make them serial (#145380)
Fixes https://github.com/pytorch/pytorch/issues/132720

The reason is that changing the privateuse1 module is global and so can race when other tests happen to check if it is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145380
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-01-23 00:31:39 +00:00
c9e12d6a3b [ROCm] Update rocm.yml and add rocm-mi300.yml (#145398)
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- Required to get an idea of how PRs are landing on our mi200s and mi300s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-23 00:07:50 +00:00
1e32842324 Improve softmax's perf in cuda (#144679)
Fixes #144645

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144679
Approved by: https://github.com/eqy
2025-01-23 00:02:57 +00:00
d0a2e11284 [BE][export] Change custom_op registration style (#145315)
Summary:
`test_unbacked_bindings_for_divisible_u_symint` has been flaky for a while due to

```
Tried to register an operator (mylib::foo(Tensor a, Tensor b) -> Tensor) with the same name and overload name multiple times.
```

This likely happens when all variants of this test (non-strict, retrace, serdes) are run together: in the later tests, the operator has already been registered.

In this diff, we change the registration style.
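
For illustration, one registration style commonly used in the test suite that avoids the duplicate-registration error is to scope the library to the test body (a sketch; the message doesn't show the exact style adopted in this diff):

```python
import torch

def foo_impl(a, b):
    return a + b

# The library is torn down when the block exits, so "mylib::foo" is not left
# registered for the next test variant.
with torch.library._scoped_library("mylib", "FRAGMENT") as lib:
    lib.define("foo(Tensor a, Tensor b) -> Tensor")
    lib.impl("foo", foo_impl, "CompositeExplicitAutograd")
    out = torch.ops.mylib.foo(torch.ones(2), torch.ones(2))
    print(out)  # tensor([2., 2.])
```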

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test:test_export -- -r test_unbacked_bindings_for_divisible_u_symint
```

Differential Revision: D68465258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145315
Approved by: https://github.com/zou3519
2025-01-22 23:46:51 +00:00
4803e20bc7 [S481486] Move MTIA dynamic library loading from __init__.py to a separate module (#145322)
Summary: As titled

Test Plan:
- Passed CI tests

buck2 test 'fbcode//mode/opt' fbcode//ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu -- --exact 'ai_infra/distributed_ai/pyper_local_run/tests/integration_tests:test_icvr_e2e_gpu - test_icvr_e2e_gpu (ai_infra.distributed_ai.pyper_local_run.tests.integration_tests.test_icvr_e2e_gpu.TestIcvrE2EGpu)' --run-disabled
```

https://www.internalfb.com/intern/testinfra/testconsole/testrun/9007199320480497/

Differential Revision: D68463242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145322
Approved by: https://github.com/yuhc, https://github.com/albanD
2025-01-22 23:39:43 +00:00
35c8c31f11 Fix for failure in D68425364 (#145304)
Summary: Back out change from #145166 which causes an internal model to fail.

Differential Revision: D68459095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145304
Approved by: https://github.com/izaitsevfb
2025-01-22 23:33:02 +00:00
e6a84be3d3 [PyTorch] Add backend aot_eager_decomp_partition_with_mode (#143250)
Summary:
## Why
To make it possible to run torch dispatch mode inside compiled modules. This is to enable running MemoryTrackerMode (in next diff) to collect memory usage of compiled modules.

## What
Add a backend aot_eager_decomp_partition_with_mode.
Add an enable_log to the backend to control the compilation logging (which can be very verbose and slow the run of mode)

Test Plan:
unittest

E2e tested in the next diff which shows the memory read from the mode passed to this backend is very close to the actual job's memory snapshot.

Differential Revision: D67227144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143250
Approved by: https://github.com/bdhirsh
2025-01-22 23:20:59 +00:00
f0a210bf5d Revert "Output of nonzero is transposed, fix fake tensor (#144695)"
This reverts commit 693d8c7e945cc494bd31ad694ae4f4b6ff13b82a.

Reverted https://github.com/pytorch/pytorch/pull/144695 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68461259 ([comment](https://github.com/pytorch/pytorch/pull/144695#issuecomment-2608443589))
2025-01-22 23:04:50 +00:00
de945d78da [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-22 22:42:48 +00:00
6e53588789 Revert "[BE]: Simplify set add with set update (#145152)"
This reverts commit 0cb9b2284a31fa497d684dbc2f56398c1d1e3114.

Reverted https://github.com/pytorch/pytorch/pull/145152 on behalf of https://github.com/davidberard98 due to land race with https://github.com/pytorch/pytorch/pull/145165 broke lint ([comment](https://github.com/pytorch/pytorch/pull/145152#issuecomment-2608378172))
2025-01-22 22:14:26 +00:00
dddf52b1b9 Revert "Enable grep_linter to use -a (#144589)"
This reverts commit 3c55669b8814237e018a613a494564da5bea9f15.

Reverted https://github.com/pytorch/pytorch/pull/144589 on behalf of https://github.com/clee2000 due to the line parameter is kind of important and -a is not as important as I thought it was so I'm going to revert this ([comment](https://github.com/pytorch/pytorch/pull/144589#issuecomment-2608349155))
2025-01-22 21:55:27 +00:00
082c28c3c6 [compiled autograd] support Tensor Subclasses in AOTBackward (#144115)
Compiled autograd's initial trace traces through the AOTBackward
epilogue. The Tensor Subclass code is not traceable. This PR changes it
so that when we see Tensor Subclass constructors, we proxy nodes for
their construction into the graph.

Test Plan:
- New basic test with TwoTensor
- Existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144115
Approved by: https://github.com/jansel, https://github.com/xmfan, https://github.com/bdhirsh
ghstack dependencies: #143296, #143304, #143387, #143405, #143417
2025-01-22 21:51:07 +00:00
99dd1bf1b9 [compiled autograd] stop specializing on metadata during initial trace (#143417)
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.

While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).

This involved fixing various sites where we do specialize on the
metadata, like:
- we change CopySlices's apply_with_saved to proxy some calls
  into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
  can make use of it.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
2025-01-22 21:51:07 +00:00
ec820fe57c [compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)
We will always proxy autograd.Function nodes in compiled autograd's
initial graph capture (previously there was an
option to proxy vs trace into the autograd.Function)

We have some requirements for the AOTBackward. Compiled Autograd runs
accumulate grad reordering passes on the AOTBackward graph directly
after the initial graph capture, so we can't just proxy a single node for it.

Instead, we:
- proxy the AOTBackward prologue function into the CA graph
- copy-paste the AOTBackward graph into the CA graph
- trace directly through the epilogue (the traced nodes go into the CA
  graph).

Tracing through the epilogue is safe (assuming no Tensor subclasses)
because the only thing the epilogue does is drop some outputs. The
Tensor subclass situation was already broken so this doesn't regress
anything but this PR sets it up to be fixed (in a followup, where we
will proxy "make_subclass" calls into the graph from the epilogue).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387
2025-01-22 21:50:56 +00:00
784bb2127c [compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)
We define a functional version of a C++ torch::autograd::Function. The
functional version reconstructs the ctx object and then calls
backward with it.

Some more details:
- we define how to pack/unpack ctx.saved_data into an IValue. It's a
  Dict[str, IValue], so it wasn't difficult.
- every call to CppNode::apply_with_saved binds a new function to
  Python. This is because we're unable to reuse a previously bound
  function for reasons (the schema may change depending on what the user
  actually puts into their Dict[str, IValue]).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304
2025-01-22 21:50:47 +00:00
8c7c5f7bfc [compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)
CopyBackwards is a manual C++ torch::autograd::Node; we update its
apply_with_saved to proxy a functional version of it into the graph instead
of inlining into it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304
Approved by: https://github.com/xmfan, https://github.com/jansel
ghstack dependencies: #143296
2025-01-22 21:50:37 +00:00
5531fafffe [compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.

This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.

We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.

Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to
  call into Python via libtorch_python. There is an indirection
  (PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
  our lives easier, every codegen'ed apply_with_saved passes a C++
  function with the same signature
  `(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
  IValuePacker struct and codegen functional variants of each builtin
  C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).

MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de

MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
2025-01-22 21:50:29 +00:00
0cb9b2284a [BE]: Simplify set add with set update (#145152)
Simplifies the set update slightly to be more readable and efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145152
Approved by: https://github.com/XuehaiPan, https://github.com/albanD

Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
2025-01-22 21:31:13 +00:00
9f150786bb [dynamo] Fix numpy test accuracy error induced by randomness divergence (#145293)
Previously `TestGradient.test_second_order_accurate` was failing because
of a small tolerance error (0.03... which is above the 0.03 tolerance).

Upon investigating, `np.random.random` caused some divergence between
eager and compiled randomness because in compiled we are not using
`np.random`'s random seed, rather we end up using `torch`'s. This in
turn caused numerical divergence and aforementioned accuracy issue.

This patch fixes the failure by patching the test case with
`use_numpy_random_stream=True`, which forces a graph break on
`np.random.random()` and thereby falls back to eager, ensuring the
numpy randomness stays consistent.
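
A minimal sketch of what the flag changes (the config name comes from the message above; the snippet itself is illustrative, not the patched test):

```python
import numpy as np
import torch
import torch._dynamo

def f(x):
    # Under use_numpy_random_stream=True this call graph-breaks and runs in
    # eager, so compiled and eager code consume the same numpy RNG stream.
    return x + np.random.random()

with torch._dynamo.config.patch(use_numpy_random_stream=True):
    np.random.seed(0)
    eager = f(torch.zeros(1))
    np.random.seed(0)
    compiled = torch.compile(f)(torch.zeros(1))
    print(torch.allclose(eager, compiled))  # expected: True
```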

Fixes #116746.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145293
Approved by: https://github.com/lezcano
2025-01-22 20:53:02 +00:00
2efa98d69d Binary upload checksum (#144887)
Equivalent to https://github.com/pytorch/test-infra/pull/6172 but for pytorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144887
Approved by: https://github.com/atalman
2025-01-22 20:46:04 +00:00
a57133e3c7 [NVIDIA] Jetson Thor Blackwell Support codegen (#145395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145395
Approved by: https://github.com/eqy, https://github.com/malfet
2025-01-22 20:13:19 +00:00
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
95ff9f0340 [Doc] Add period at the end of the sentence (#145384)
Test plan: https://docs-preview.pytorch.org/pytorch/pytorch/145384/generated/torch.compiler.disable.html#torch-compiler-disable
Fixes https://github.com/pytorch/pytorch/issues/145365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145384
Approved by: https://github.com/huydhn, https://github.com/svekars, https://github.com/kit1980
2025-01-22 19:56:31 +00:00
3917053f63 [audio hash update] update the pinned audio hash (#145328)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145328
Approved by: https://github.com/pytorchbot
2025-01-22 19:39:03 +00:00
70ccbade83 [MPSInductor] Add gamma op (#145341)
By moving `gamma` and `log_gamma` implementation from `Gamma.metal` to `c10/metal/special_math.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145341
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #145309
2025-01-22 19:37:45 +00:00
b81209557b Fix tests broken by #145176 (#145393)
#145176 broke
test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_graph_break_on_jit_isinstance_dynamic_shapes
test/dynamo/test_repros.py::ReproTests::test_graph_break_on_jit_isinstance

this backs out the offending change until it can be fixed properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145393
Approved by: https://github.com/ZainRizvi
2025-01-22 19:33:16 +00:00
e8e3c03f96 [Test][Inductor] Fix test_tma_graph_breaks (#145271)
Per title. Before these changes, below tests:
```
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
```

fail with the following message:
```
__________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test
    test(self, **param_kwargs)
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks
    eager_out = f(a, b)
                ^^^^^^^
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f
    t.element_size(),
    ^
UnboundLocalError: cannot access local variable 't' where it is not associated with a value

To execute this test, run the following from the base repo dir:
    python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271
Approved by: https://github.com/jansel
2025-01-22 19:18:59 +00:00
ac8ddf1150 [export][be] Clean up local imports from export [1/n] (#145287)
Summary: as title

Test Plan: CI

Differential Revision: D68449844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145287
Approved by: https://github.com/pianpwk
2025-01-22 19:09:17 +00:00
30717d25fe Move Dynamo test to skip from expected_failures (#145390)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/116105
This test is consistently failing. It shouldn't be marked as a flaky
test in the CI using the disabled tests mechanism. I'm skipping the test for now.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145390
Approved by: https://github.com/williamwen42
2025-01-22 19:06:39 +00:00
0bff377880 Align CPU behavior with CUDA for ConvTranspose when out_channels=0 (#142859)
Fixes https://github.com/pytorch/pytorch/issues/142466.
Remove the `weight.numel() != 0` check to align the behavior with CUDA for `ConvTranspose` when `out_channels=0`. After removing this check, the existing code is already able to give an empty output in such case.

Test plan:
```
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cpu_float32
python -u test/nn/test_convolution.py -k test_ConvTranspose_output_channels_0_cuda_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142859
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-01-22 17:52:53 +00:00
698106951e [dynamo] Re-enable test_fs family for dynamo (#145302)
Fixes #91467.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145302
Approved by: https://github.com/zou3519
2025-01-22 17:50:05 +00:00
057d9aff39 [S481486] [MTIA] Correct mtia.device_count() API (#145338)
Summary:
Prev: Count the number of "general" accelerators

Curr: Count the number of MTIA devices by using the MTIA runtime API

Test Plan:
```
buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r  test_get_device_count
```

https://www.internalfb.com/intern/testinfra/testrun/8162774572631995

Reviewed By: BoyueZheng

Differential Revision: D68472668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338
Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue
2025-01-22 17:45:15 +00:00
c27dd9cf72 Fix deprecated pytorch_sphinx_theme editable installation (#145347)
Fixes https://github.com/pytorch/pytorch/issues/145221

Pip editable install is going to be deprecated soon https://github.com/pypa/pip/issues/11457.  The fix here is just to remove it and install `pytorch_sphinx_theme` normally.

### Testing

Doc build is working with the change:

* PR https://github.com/pytorch/pytorch/actions/runs/12901499736/job/35975042345?pr=145347
* Nightly https://github.com/pytorch/pytorch/actions/runs/12901500521/job/35975046289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145347
Approved by: https://github.com/ZainRizvi
2025-01-22 17:28:16 +00:00
288f21cc11 [MPS][BE] Prepare Gamma funcs to be moved ot headers (#145309)
----
- Use `float y = 1.0 + metal::frac(x)` instead of the more complex:
```metal
float y = x;
int n = 0;
bool less_than_one = (y < 1.0);
// Add or subtract integers as necessary to bring y into (1,2)
if (less_than_one) {
  y += 1.0;
} else {
  n = static_cast<int>(floor(y)) - 1;
  y -= n;
}
```
- Declare them all as templates, to avoid instantiation
- Move global arrays to be local to the specific functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145309
Approved by: https://github.com/dcci
2025-01-22 16:14:06 +00:00
c2b401933f [torchbench] Fix mobilenetv2 inductor freezing fail_accuracy (#145296)
Issue: https://github.com/pytorch/pytorch/issues/144891

inductor freezing effectively enables inductor conv-batchnorm fusion. This fusion increases the accuracy error.

More context about this: https://github.com/pytorch/pytorch/issues/120545
For Timm models that are run through benchmarks/dynamo/timm_models.py with TimmRunner, the tolerance was increased here:
https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/timm_models.py#L367

If the conv-batchnorm fusion is commented out, as Elias suggested in the context issue, the accuracy is back.

=>
Increasing the tolerance for mobilenetv2 to the same value by introducing a special tolerance configuration used for freezing only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145296
Approved by: https://github.com/eellison, https://github.com/zou3519
2025-01-22 15:54:09 +00:00
0dbff7e4be Add MKLDNN support for Half GELU (#145339)
Add MKLDNN support for Half GELU to align with BFloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145339
Approved by: https://github.com/yanbing-j, https://github.com/leslie-fang-intel, https://github.com/Skylion007
2025-01-22 15:14:51 +00:00
0efa843392 Dynamic shape guards in C++ (#139899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #143385, #143164
2025-01-22 14:58:35 +00:00
fbaef0ac03 Add a language option for symbolic shape guards (#143164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143164
Approved by: https://github.com/ezyang
ghstack dependencies: #143385
2025-01-22 14:58:35 +00:00
4b77ff9784 Fix PythonMod printing for C++ (#143385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143385
Approved by: https://github.com/leslie-fang-intel, https://github.com/anijain2305
2025-01-22 14:58:35 +00:00
079a3e0f75 [BE] Add type annotations to cudagraph_utils.py and test_cases.py (#145291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145291
Approved by: https://github.com/Skylion007
2025-01-22 14:54:45 +00:00
31c2f36989 Fix triton masked loading for non-block tl.loads (#144782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144782
Approved by: https://github.com/eellison
2025-01-22 14:30:56 +00:00
3cbc8c54fd [BE][export] Remove disabled floordiv test in export (#145292)
Summary:
Removing `test_slice_with_floordiv` as it doesn't raise the Runtime Error as expected and it has been disabled since the time it was added https://github.com/pytorch/pytorch/issues/131101

For the case that we expect to fail, it actually returns an empty tensor. This is consistent with the following snippet which prints an empty tensor

```python
import torch

a = torch.ones(4)
print(a[5:])  # prints tensor([]) instead of raising
```

Test Plan: CI

Differential Revision: D68450650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145292
Approved by: https://github.com/pianpwk
2025-01-22 05:17:56 +00:00
99dbc5b0e2 PEP585 update - test (#145176)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145176
Approved by: https://github.com/bobrenjc93
2025-01-22 04:48:28 +00:00
40e27fbcf2 Refactor CPUReproTests to be more vector-length agnostic (#141245)
This changes the hardcoded assumptions of a `256-bit` vector length to querying from `cpu_vec_isa` and changes relevant tests to share the logic.

Also refactored the `config.cpp.simdlen != 1` into the assertion so we stop duplicating it throughout the test cases.

Fixes issues on `128-bit` machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141245
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-01-22 04:24:45 +00:00
dcd9de79e7 [dynamo] Re-enable a AOT-Dispatch test with Dynamo (#145299)
Fixes #124590.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145299
Approved by: https://github.com/zou3519
2025-01-22 03:47:05 +00:00
3a58512613 [Inductor] inplace padding (#140249)
https://github.com/pytorch/pytorch/issues/139865

This PR may change the semantic of constant_pad_nd from 'clone' to 'view'. I tried a few tests to do inplace update. Looks like thanks to functionalization, this works fine.

Perf for `test_linear_and_cel`:
```
# TORCHINDUCTOR_INPLACE_PADDING=0 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=False ms=83.311

# TORCHINDUCTOR_INPLACE_PADDING=1 DO_PERF_TEST=1 python test/inductor/test_inplace_padding.py -k test_linear_and_cel
inductor_config.inplace_padding=True ms=79.827
```

The saving is about 4ms (slightly less since we need fill 0 for the padding area). Similar savings for llm.c.
- Without the feature: 182.151ms per batch, 180.9K tokens/s
- With the feature: 178.278ms per batch, 183.9K tokens/s (a 3K tokens/s increase).

Perf tests show a compilation time regression. I'm not sure if that's real; will debug more. But a good thing is, there is no accuracy failure: [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2004%20Nov%202024%2020%3A23%3A22%20GMT&stopTime=Mon%2C%2011%20Nov%202024%2020%3A23%3A22%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=03fd924ff382958daf5055dc8425d279e4e10a1e&rBranch=main&rCommit=c03324de2dfbbf0006818c86b88c92a3378f46b7).

UPDATE: Perf test regression seems to be not real. Here is a rerun [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2007%20Nov%202024%2001%3A29%3A55%20GMT&stopTime=Thu%2C%2021%20Nov%202024%2001%3A29%3A55%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&deviceName=cuda%20(a100)&lBranch=gh/shunting314/186/head&lCommit=7e2c8e5d9256ac06205e7cd5e740c9e20ce804d0&rBranch=main&rCommit=565a7942eee1ddc23067cdbae597443d0f2290a0). Our dashboard is not that reliable recently due to AWS migration.

Differential Revision: [D68340248](https://our.internmc.facebook.com/intern/diff/D68340248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140249
Approved by: https://github.com/jansel, https://github.com/eellison
2025-01-22 03:37:06 +00:00
46851022ff [Inductor][CPU] Add auto-tuning support for da8w8 sym act sym wgt GEMM (#143187)
## Summary

Templated `int8xint8->int32` GEMM that uses AMX ISA (present on Intel Xeon Gen 4 & above). Any epilogues such as weight scale, activation scale, and bias are applied per output block in a fused manner.
Performs well for large values of `M` dimension (assuming canonical dimensions [`M, K`] and [`K, N`] for the activation & weight matrices'/tensors' sizes) when the activation is quantized per-token.
Also supports SmoothQuant GEMM pattern when activation is quantized per-tensor (scalar scale) or per-token (vector scale is applied as an epilogue in this case).

Also increased coverage of GEMM template for uint8 activation, int8 weight GEMM UTs for when the activation zero point is a 1D tensor (the existing implementation only accepted 0D tensors). However, some of such UTs would have to be explicitly enabled with `max-autotune` Inductor config.

## Performance data

The templated codegened fused GEMM with M=32, K=4096, N=14336 used in LLaMA3 exhibits more than 2x perf-gain compared to oneDNN qlinear + mul (for activation's scale) with 48 cores of one socket of Xeon SP 4th gen Platinum 8468 when per-token quantization is used.

For M=1, K=4096, N=14336, regardless of whether per-tensor quantization was used for activation or per-token, the perf gain was more than 3x.

Intel OpenMP & libtcmalloc had been preloaded. All cores used by the workload corresponded to distinct physical cores.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143187
Approved by: https://github.com/jansel, https://github.com/leslie-fang-intel, https://github.com/jgong5

Co-authored-by: Leslie Fang <leslie.fang@intel.com>
2025-01-22 02:27:53 +00:00
27598cd154 [fx] move DCE rand check to import time (#145118)
Mitigates the deterministic benchmark regression: https://github.com/pytorch/pytorch/issues/144775#issuecomment-2593411844. and maybe the dashboard issue.

fx.Node.is_impure is unexpectedly a hot spot. It gets called for every node in the graph whenever we invoke DCE, which should be okay, EXCEPT we invoke DCE on the full graph ~10 times at various stages of torch.compile, and an insane number of times (>O(parameters)) for the subgraphs traced by the pattern matcher.
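
For context, a small illustration of the DCE path that hits `Node.is_impure` (a sketch, not code from this PR):

```python
import torch
import torch.fx

def f(x):
    unused = x + 1   # dead: the result is never used
    return x * 2

gm = torch.fx.symbolic_trace(f)
# eliminate_dead_code() calls Node.is_impure() on every node to decide what is
# safe to drop, which is why is_impure() becomes a hot spot when DCE runs many
# times per torch.compile invocation.
gm.graph.eliminate_dead_code()
gm.recompile()
print(gm.code)  # the `x + 1` node is gone
```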

I considered addressing this problem by reducing the amount of times DCE is called, but I think we can only trim the ones from the pattern matcher, which will require some refactor/caching solution that I leave out of this PR.

torch.Tag.nondeterministic_seeded is provided by native_functions.yml and is implemented as a list. Most of the time, it has <=2 elements, so it's not really worth it to turn it into a set for fast lookup.

Using the deterministic instruction count benchmarks
```python
# before
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8914894946
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8866669058
# after
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8770562314
aotdispatcher_partitioner_cpu,compile_time_instruction_count,8779547794
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145118
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-01-22 02:23:02 +00:00
f2cfe8b59f PEP585 update - mostly toplevels (#145178)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145178
Approved by: https://github.com/bobrenjc93
2025-01-22 02:21:14 +00:00
1ce533867f Teach dynamo to handle GenericAlias without a graph break (#145240)
Dynamo wasn't handling the new PEP585 type annotations:
```
x = list[Foo]
```
Although this worked in py3.9, it caused an `unimplemented` (Unexpected type in sourceless builder) in py3.12.

This fixes it to treat them as a BuiltinVariable.
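
A minimal sketch of the kind of code that used to hit the graph break (with the fix, `fullgraph=True` is expected to succeed on py3.12; `Foo` here is just an illustrative placeholder):

```python
import torch

class Foo:
    pass

@torch.compile(fullgraph=True)
def f(x):
    alias = list[Foo]  # PEP 585 GenericAlias built inside the traced frame (unused; it only exercises the path)
    return x * 2

print(f(torch.ones(3)))
```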

Fixes #145226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145240
Approved by: https://github.com/anijain2305
2025-01-22 01:55:51 +00:00
2d1649bc2a Revert "[triton] Update triton pin to include warp specialization support (#145120)"
This reverts commit e261629dc85c061ee35f539ee8bd35aec9971215.

Reverted https://github.com/pytorch/pytorch/pull/145120 on behalf of https://github.com/ZainRizvi due to Reverting since the test failures are about not being able to find a version of triton to install, and this is breaking trunk as well ([comment](https://github.com/pytorch/pytorch/pull/145120#issuecomment-2606107792))
2025-01-22 01:52:36 +00:00
f2d7fe12d8 [BE][MPS] Mark gamma inputs as const (#145295)
Doubt it will change the perf, but it's good to correctly mark const inputs as const
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145295
Approved by: https://github.com/manuelcandales
ghstack dependencies: #145289
2025-01-22 01:00:53 +00:00
c106e9b4c6 [BE][MPS] Move Gamma kernels to its own file (#145289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145289
Approved by: https://github.com/manuelcandales, https://github.com/dcci
2025-01-22 01:00:53 +00:00
1908116ace [MPS][BE] Move vectypes from Quantized to utils (#145312)
That allows one to get appropriate vectorized types for templates using `c10::metal::vec2type_t<>` or `c10::metal::vec4type_t<>`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145312
Approved by: https://github.com/dcci
2025-01-22 00:37:28 +00:00
266fd35c58 Fix ExecuTorch, XLA, Triton hash updates (#145314)
Fix some stale hash updates https://github.com/pytorch/pytorch/pulls/pytorchupdatebot reported by @izaitsevfb

* XLA and ExecuTorch now wait for all jobs in pull instead of hardcoding the job names which are not correct anymore and the bot waits forever there
* The Triton commit hash hasn't been updated automatically since 2023 and people have been updating the pin manually with their own testing from time to time, so I doubt that it would be a useful thing to keep.

The vision update failures looks more complex though and I would need to take a closer look.  So, I will keep it in another PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145314
Approved by: https://github.com/izaitsevfb
2025-01-21 23:24:21 +00:00
1e8d6d6f0e [SkipFiles] New modules added to torch.* are inlined by default (#145279)
This PR:
- makes it so that new modules added to torch are inlined by default
- adds a list of the previously "skipped by default" modules to avoid
  regressing anything. This is a new MOD_SKIPLIST list that is consulted
  in trace_rules.check_file.
- Follow-up work will go through this list, one-by-one, and try to delete
  modules. I think we should be able to delete almost everything,
  except for torch._dynamo.

Test Plan
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145279
Approved by: https://github.com/yanboliang
2025-01-21 23:24:12 +00:00
e261629dc8 [triton] Update triton pin to include warp specialization support (#145120)
The warp specialization work has been landed to the triton rc/3.2.x branch as b2684bf3b0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145120
Approved by: https://github.com/bertmaher
2025-01-21 22:14:13 +00:00
19c3ba44a2 Use TORCH_CHECK instead of std::runtime_error in stack.h and ivalue.h (#145280)
TORCH_CHECK preserves the stack trace when TORCH_CPP_SHOW_STACKTRACES=1 is set, whereas std::runtime_error does not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145280
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-21 21:58:59 +00:00
7dd9d1f243 Update clickhouse-connect to 0.8.14 (#144915)
Corresponds to https://github.com/pytorch/test-infra/pull/6177

I only tested the slow test script but I also did testing on the new version with scripts in https://github.com/pytorch/test-infra/pull/6177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144915
Approved by: https://github.com/huydhn
2025-01-21 21:43:18 +00:00
35f5668f7e [NVIDIA] RTX50 Blackwell Support codegen (#145270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145270
Approved by: https://github.com/ezyang
2025-01-21 21:10:05 +00:00
895659cb41 Revert "Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)"
This reverts commit 07e23653cd9ef8cfda01773d94d9f76e5072528d.

Reverted https://github.com/pytorch/pytorch/pull/142848 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see D68355212 ([comment](https://github.com/pytorch/pytorch/pull/142848#issuecomment-2605734067))
2025-01-21 21:04:45 +00:00
bac62341eb PEP585 update - torch/_inductor (#145198)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145198
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:33 +00:00
2f9d378f7b PEP585 update - torch/utils (#145201)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:10 +00:00
693d8c7e94 Output of nonzero is transposed, fix fake tensor (#144695)
Needs this companion executorch PR: https://github.com/pytorch/executorch/pull/7657

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144695
Approved by: https://github.com/bobrenjc93, https://github.com/albanD
2025-01-21 20:50:09 +00:00
323fb4dad0 Unconditionally exclude upper bound in all size oblivious tests (#144867)
I was thinking about https://github.com/pytorch/pytorch/pull/144471 some more and I thought, "Hmm, why not just always exclude the constant upper bound." So here it is.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144867
Approved by: https://github.com/bobrenjc93
2025-01-21 20:44:09 +00:00
df67ac4c86 [CI][CUDA][Distributed][FSDP] Remove hardcoded world size of 2 (#145195)
as these unit tests would fail if run on a single GPU (i.e. `skip_if_lt_x_gpu(2)` seems to view the world size as 2 even on platforms with 1 GPU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145195
Approved by: https://github.com/Skylion007, https://github.com/atalman
2025-01-21 20:25:52 +00:00
505ade7471 [inductor] Simplify mode options, only apply CompilerBisector changes once (#145232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145232
Approved by: https://github.com/yanboliang
2025-01-21 19:25:46 +00:00
85811631d7 [Intel CPU] Fix issue #143489. (#145062)
Fix issue in https://github.com/pytorch/pytorch/issues/143489.
Dividing by kernel_height * kernel_width can cause a floating point exception, so we divide by them one at a time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145062
Approved by: https://github.com/soulitzer
2025-01-21 18:38:33 +00:00
128f3627b1 Implement backward for NJT matmul (#144587)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements missing backward support for NJT matmul. Notably, for dense tensors, matmul dispatches to bmm. However, due to historical reasons related to NST, NJT handles matmul directly, and thus can't rely on the CompositeImplicit impl of matmul to get the derivative formula.
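For reference, the derivative formula the NJT path has to supply explicitly, sketched here on dense tensors:
```python
import torch

a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(3, 4, requires_grad=True)
out = a @ b
grad_out = torch.ones_like(out)
out.backward(grad_out)

# the formula a matmul backward implements
assert torch.allclose(a.grad, grad_out @ b.transpose(-2, -1))
assert torch.allclose(b.grad, a.transpose(-2, -1) @ grad_out)
```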

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144587
Approved by: https://github.com/soulitzer
ghstack dependencies: #144586
2025-01-21 18:27:50 +00:00
af204135d8 Fix NJT fill.Scalar for contiguous inputs (#144586)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

This PR implements the missing `fill.Scalar` support, which works fine for contiguous inputs, but there is still some AOTAutograd debugging required to handle non-contiguous transposed NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144586
Approved by: https://github.com/soulitzer
2025-01-21 18:22:08 +00:00
efa88e04e1 Don't overspecialize float when propagating cache guards to ShapeEnv (#145078)
Fixes https://github.com/pytorch/pytorch/issues/142507

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145078
Approved by: https://github.com/Skylion007
2025-01-21 18:05:43 +00:00
b3e90c8c33 Add support for torch function on dtype arguments (#145085)
Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085
Approved by: https://github.com/vmoens, https://github.com/Skylion007
2025-01-21 17:44:47 +00:00
eb553ae3cf Fix broken gpt_fast micro benchmark after #144315 (#145235)
The benchmark is failing with the following error

```
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 333, in <module>
    main(output_file=args.output, only_model=args.only)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 308, in main
    lst = func(device)
  File "/var/lib/jenkins/workspace/benchmarks/gpt_fast/benchmark.py", line 66, in run_mlp_layer_norm_gelu
    us_per_iter = benchmarker.benchmark(compiled_mod, (x,)) * 1000
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_inductor/runtime/benchmarking.py", line 39, in wrapper
    return fn(self, *args, **kwargs)
TypeError: benchmark() missing 1 required positional argument: 'fn_kwargs'
```

An example error is https://github.com/pytorch/pytorch/actions/runs/12862761823/job/35858912555

I also assign `oncall: pt2` as the owner of this job going forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145235
Approved by: https://github.com/nmacchioni
2025-01-21 17:42:24 +00:00
2cffbff7da Add 3.13t Windows and MacOS binary builds (#141806)
Related to: https://github.com/pytorch/pytorch/issues/130249

For conda, use the approach described here:
https://conda-forge.org/blog/2024/09/26/python-313/

Create Python 3.13t conda env like so:
```
conda create -n py313 python=3.13 python-freethreading  -c conda-forge
```

For the Windows executable installation we need to pass an additional parameter to enable 3.13t:
```
Include_freethreaded=1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141806
Approved by: https://github.com/albanD
2025-01-21 17:16:19 +00:00
0afd335174 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-21 16:57:27 +00:00
803017f3cb [inductor] fix MA on poor gpu (#145133)
Found this bug when debugging a MA issue in CI that can not be repro-ed on devgpu.

On GPUs with fewer than 68 SMs (like the NVIDIA L4 used in CI), running torch.compile in max-autotune mode may result in the following confusing error https://gist.github.com/shunting314/370f42f547e3367a3773237942725a86 complaining about layout:
```
torch._inductor.exc.InductorError: LoweringException: AssertionError: convert FlexibleLayout to FixedLayout first
```
The reason is that even if we don't pick the Triton template, Inductor still returns a MultiTemplateBuffer for the tuned addmm. MultiTemplateBuffer.get_reads, called from Reduction.num_splits, may index a FlexibleLayout, which results in the aforementioned error.

The issue does not appear on devgpu because we freeze the layout of addmm inputs when rendering triton templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145133
Approved by: https://github.com/jansel
2025-01-21 09:31:34 +00:00
b5655d9821 PEP585 update - .ci android aten (#145177)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145177
Approved by: https://github.com/Skylion007
2025-01-21 06:31:26 +00:00
00ffeca1b1 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-21 04:23:29 +00:00
c6986ca2e1 Revert "[dcp] Add ZStandard transformer (#143360)"
This reverts commit 7b56b039afe2b4a4038c09d8b6cb1597823f3d5f.

Reverted https://github.com/pytorch/pytorch/pull/143360 on behalf of https://github.com/atalman due to Broke 3.13t builds please test with ciflow/binaries label attached ([comment](https://github.com/pytorch/pytorch/pull/143360#issuecomment-2603433066))
2025-01-21 01:10:16 +00:00
5fd881a5b6 Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)"
This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6.

Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))
2025-01-21 00:49:55 +00:00
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
805c4b597a PEP585 update - torch/_higher_order_ops torch/_subclasses torch/backends torch/compiler torch/cuda torch/masked torch/mtia torch/nested (#145202)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145202
Approved by: https://github.com/bobrenjc93
2025-01-20 22:37:26 +00:00
54a00af2c6 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:59 +00:00
bd97ce0b45 PEP585 update - torch/ao (#145199)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145199
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:35 +00:00
cf05f6a134 [BE]: Improve typing for torch/fx/_pytree.py and torch/utils/_pytree.py (#145173)
Improve type inference in _pytree.py utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145173
Approved by: https://github.com/bobrenjc93
2025-01-20 22:18:19 +00:00
225a10febe [CI] Add xpu linux build into pull workflow (#145084)
To mitigate the XPU build failure risk introduced by non-XPU-specific PRs. Refer to #144967 & #143803.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145084
Approved by: https://github.com/huydhn, https://github.com/atalman
2025-01-20 19:31:48 +00:00
d0100050dd [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [2/n] (#145091)
Summary: Following up D68122536 to remove configurable aot_mode for inner_compile

Test Plan: CI

Reviewed By: desertfire

Differential Revision: D68158512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145091
Approved by: https://github.com/ydwu4
2025-01-20 19:09:10 +00:00
0b2a3687b9 PEP585 update - torch/fx (#145166)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145166
Approved by: https://github.com/bobrenjc93
2025-01-20 18:11:54 +00:00
6374332d33 Revert "PEP585 update - torch/distributed (#145164)"
This reverts commit 6cb186e279bc179a6bb63f0226e24ab42a07b394.

Reverted https://github.com/pytorch/pytorch/pull/145164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing an inductor test ([comment](https://github.com/pytorch/pytorch/pull/145164#issuecomment-2602875679))
2025-01-20 16:46:46 +00:00
57b2b64acf Fix always true scaled_mm test (#143912)
Looks like `out_fp8` should use matmul without scales and `out_fp8_s` with scales.
Scales were optional arguments before PR https://github.com/pytorch/pytorch/pull/128683.
After that, test_float8_scale started comparing two identical results and lost its meaning.
Reason for making scales required: https://github.com/pytorch/pytorch/pull/128683#issuecomment-2169146402

This PR uses scale=1.0 to compare the result with scaled matmul

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143912
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/pruthvistony
2025-01-20 16:17:46 +00:00
53e2408015 Improve cleanup of cancelled jobs on s390x for tests too (#144968)
Follow up to https://github.com/pytorch/pytorch/pull/144149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144968
Approved by: https://github.com/huydhn
2025-01-20 12:56:07 +00:00
92b9da1fc2 fix torch.atan for torch.complex datatypes on CPU (#144749)
Fix https://github.com/pytorch/pytorch/issues/141487.
This issue is caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `atan`. For correctness, I temporarily fallback the implementation of `atan` to scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:45:03 +00:00
ed669a9db7 fix torch.div for torch.complex datatypes on CPU (#140375)
Fix https://github.com/pytorch/pytorch/issues/135428.
Fix https://github.com/pytorch/pytorch/issues/106845.
These two issues are caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `div`. For correctness, I temporarily fallback the implementation of `div` to scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140375
Approved by: https://github.com/mingfeima
2025-01-20 08:34:29 +00:00
c922ccb7c4 fix sigmoid for torch.complex datatypes on CPU (#140391)
Fix https://github.com/pytorch/pytorch/issues/135777.
This issue is caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `reciprocal`. For correctness, I temporarily fallback the implementation of `reciprocal` to scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
ghstack dependencies: #140358
2025-01-20 08:23:58 +00:00
507bf65c6a fix torch.exp for torch.complex datatypes on CPU (#140358)
Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063.
These two issues are caused by the lack of special handling of the case where the real number/imag number is 0/Inf/NaN in the vectorized implementation of `exp`. For correctness, I temporarily fallback the implementation of `exp` to scalar implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:03:17 +00:00
972d4a154d Add facility to run dynamo UTs for non-cuda devices (#140929)
This is in line with changes introduced with https://github.com/pytorch/pytorch/pull/130714, additional files are included to support non-cuda devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140929
Approved by: https://github.com/kwen2501, https://github.com/EikanWang, https://github.com/guangyey
2025-01-20 05:56:38 +00:00
2b809e58ad PEP585 update - torch/onnx (#145174)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145174
Approved by: https://github.com/justinchuby
2025-01-20 05:48:52 +00:00
19584b28fd [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-20 04:42:06 +00:00
980c75fe6e [MPSInductor] Add TrueDiv and Round[Int|Decimal] (#145160)
That fixes `test_builtins_round_float_ndigits_neg` and `test_builtins_round`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145160
Approved by: https://github.com/jansel, https://github.com/dcci
2025-01-20 04:29:42 +00:00
6cb186e279 PEP585 update - torch/distributed (#145164)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145164
Approved by: https://github.com/bobrenjc93
2025-01-20 00:19:01 +00:00
b6c5562c1f PEP585 update - torch/export (#145165)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145165
Approved by: https://github.com/bobrenjc93
2025-01-19 20:56:55 +00:00
316808e4e9 PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145163
Approved by: https://github.com/Skylion007
2025-01-19 20:55:59 +00:00
c64e657632 PEP585 update - torch/distributed/fsdp (#145162)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162
Approved by: https://github.com/bobrenjc93
2025-01-19 20:04:05 +00:00
371a361db9 Enable bfloat16 testing on MacOS14+ (#145159)
As Metal-3.1 supports this dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145159
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #145157
2025-01-19 19:35:31 +00:00
97d4d3c40a PEP585 update - torch/_export (#145138)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145138
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145154
2025-01-19 18:48:35 +00:00
cd8d0fa20c Tweak schema_check to handle annotated builtin types (#145154)
As of Python 3.9, annotated lists can be written as `list[T]`, and `List[T]` has been deprecated. However, schema_check was converting `list[T]` to simply `list`. This change teaches it to handle `list[T]` the same as `List[T]`.
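A hedged sketch of the normalization idea (a hypothetical helper, not the actual schema_check code):
```python
import typing

def list_annotation_name(ann) -> str:
    # treat list[T] (PEP 585) and typing.List[T] identically
    if typing.get_origin(ann) is list and typing.get_args(ann):
        return f"List[{typing.get_args(ann)[0].__name__}]"
    return str(ann)

print(list_annotation_name(list[int]))         # List[int]
print(list_annotation_name(typing.List[int]))  # List[int]
```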

A couple small drive-by changes I noticed as well:
- Path concatenation should use `os.path.join`, not `+`
- Spelling in error message

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145154
Approved by: https://github.com/bobrenjc93
2025-01-19 18:48:35 +00:00
9e0437a04a PEP585 update - torch/ao/quantization (#145140)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145140
Approved by: https://github.com/bobrenjc93
2025-01-19 10:20:00 +00:00
78bff1e8c1 PEP585 update - torch/_functorch (#145139)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145139
Approved by: https://github.com/bobrenjc93
2025-01-19 07:06:10 +00:00
10e4d3aebb [DCP] Fix fsspec fsync bug on .finish() (#144753)
Fixes #144752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144753
Approved by: https://github.com/Skylion007, https://github.com/saumishr
2025-01-19 03:21:00 +00:00
8cc415774f [mps/inductor] Introduce a metal approx for erf() and use it. (#145161)
Probably we can do better, but this is a start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145161
Approved by: https://github.com/malfet
2025-01-19 02:29:05 +00:00
893ca1dfe1 PEP585 update - torch/_inductor/[_-i]* (#145137)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145137
Approved by: https://github.com/bobrenjc93
2025-01-19 01:22:47 +00:00
cede43e06b [MPSInductor][BE] NaN-propagating min/max to header (#145157)
Maybe to be reused later from the eager op as well.

Also, I didn't know that Metal already has type_traits.
And use `metal::isunordered(a, b)` instead of `metal::isnan(a + b)`, as it is defined as a function equivalent to `a != a || b != b`, but I suspect it might have a better native implementation for the specific architecture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145157
Approved by: https://github.com/dcci
2025-01-18 22:52:44 +00:00
5b5766665d PEP585 update - torch/_C torch/_decomp torch/_lazy torch/_library torch/_numpy torch/_prims torch/_refs torch/_strobelight (#145102)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145102
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #145105
2025-01-18 20:47:12 +00:00
a79100ab11 PEP585 update - torch/_dynamo (#145105)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145105
Approved by: https://github.com/bobrenjc93
2025-01-18 20:47:11 +00:00
c95efc37ba PEP585 update - torch/distributed/tensor (#145141)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145141
Approved by: https://github.com/bobrenjc93
2025-01-18 20:01:59 +00:00
4f8237dbad [mps/inductor] Skip "double" tests as 64-bits FP is not supported. (#145123)
257 tests failed (before) -> 242 tests failed (after)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145123
Approved by: https://github.com/malfet, https://github.com/jansel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-18 19:13:34 +00:00
5802be698e Revert "parametrized test name handles class arguments (#133546)"
This reverts commit 4e4b8592a32f701b4974679ab1381ba7cccd4844.

Reverted https://github.com/pytorch/pytorch/pull/133546 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but trying to disable the new tests does not seem to fully cover all the cases and some are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/133546#issuecomment-2599814339))
2025-01-18 18:12:18 +00:00
b63b81410c Fix NJT frexp() to handle both outputs (#144585)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Before this PR, `frexp()` for NJT was handled via the unary pointwise fallback. The op returns a tuple, however, and the fallback doesn't handle that. This PR defines an explicit impl for `frexp()` that wraps both returned `(mantissa, exponent)` as NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144585
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583, #144584
2025-01-18 15:59:56 +00:00
3ee531f8b9 Support NJT chunk() backward on batch dim (#144584)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583
2025-01-18 15:58:24 +00:00
8a57234033 [MPSInductor] Implement i0 and i1 ops (#145092)
Using shared definitions with eager op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145092
Approved by: https://github.com/dcci, https://github.com/jansel
ghstack dependencies: #145023, #145087
2025-01-18 15:41:02 +00:00
1d9fc9df38 Downgrade ignored guard to info level (#145075)
Fixes https://github.com/pytorch/pytorch/issues/101265

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145075
Approved by: https://github.com/Skylion007
2025-01-18 15:30:01 +00:00
5e4cf3e6ad Moved .all() checks for distributions to _is_all_true (#145029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145029
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-18 07:55:48 +00:00
2bf772d1ba PEP585 update - torch/_inductor/codegen (#145106)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145106
Approved by: https://github.com/bobrenjc93
2025-01-18 06:56:03 +00:00
4bf29f44b7 [aoti] Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass (#145028)
Summary:
Remove torch.ops.aten._assert_tensor_metadata.default in post_grad_pass because this op is blocking fusion.

This should not have any effect on the result, because the op would not show up in the final AOTI-compiled model anyway (the assertion has no effect).

An real example where this improves performance:

In the example below, the post grad graph would contain `torch.ops.aten._assert_tensor_metadata.default`, because of PR  https://github.com/pytorch/pytorch/pull/142420. This op is added when functionalizing aten.to.

We want the `add` node from `linear` to be fused with the rest of the pointwise ops, instead of fused with the `mm` from `linear`.

```

import torch
import torch.nn as nn

class Model(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Model, self).__init__()
        self.linear = nn.Linear(input_dim, hidden_dim).half()
        self.rms_norm = nn.RMSNorm(hidden_dim)

    def forward(self, x):
        linear_458 = self.linear(x)  # Linear layer with weights'
        # mimic the torchtune rms norm: /torchtune/torchtune/modules/rms_norm.py
        linear_458 = linear_458.to(torch.float32)
        rms_norm_34 = self.rms_norm(linear_458)  # RMS Normalization
        sigmoid_168 = torch.sigmoid(rms_norm_34)  # Sigmoid activation function
        mul_168 = sigmoid_168 * rms_norm_34  # Element-wise multiplication

        return mul_168

def main():
    with torch.no_grad():
        input_dim = 512
        hidden_dim = 256
        batch_size = 32
        model = Model(input_dim, hidden_dim).to("cuda")
        example_inputs = (
            torch.randn(batch_size, input_dim).to("cuda").to(torch.float16),
        )
        ep = torch.export.export(model, example_inputs)
        package_path = torch._inductor.aoti_compile_and_package(ep)
```

Test Plan:
CI

Differential Revision: D68303114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145028
Approved by: https://github.com/angelayi
2025-01-18 06:06:25 +00:00
dc9b77cc55 [MPS] Support includes in metal objects (#145087)
Useful for code reuse when building Metal shaders for both eager mode and MPSInductor, but it requires implementing a `_cpp_embed_headers` tool that, as the name suggests, preprocesses and embeds the headers for the shader to be used in dynamic compilation.
Test using:
 -  `TestMetalLibrary.test_metal_include`
 - Moving the `i0`/`i1` implementation to `c10/util/metal_special_math.h` and calling it from the `SpecialOps.metal` shader, which now looks much more compact:
 ```metal
template <typename T, typename Tout = T>
void kernel
i0(constant T* input,
   device Tout* output,
   uint index [[thread_position_in_grid]]) {
  output[index] = c10::i0(static_cast<Tout>(input[index]));
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145087
Approved by: https://github.com/dcci
ghstack dependencies: #145023
2025-01-18 05:35:22 +00:00
2859b11bdb [pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045)
Summary:
PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided
by NCCLX and RCCL) assumes that a single contiguous buffer is used.
These are fundamentally mismatched.  The list of tensors might not be
contiguous or even ordered (buffer addresses might not be in
increasing order).

This patch removes the ncclAllToAllv specialization for PTD
all_to_all, and instead lets it directly call ncclSend/ncclRecv.

Co-authored by @pavanbalaji
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
2025-01-18 05:26:55 +00:00
07669ed960 PEP585 update - benchmarks tools torchgen (#145101)
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc).  Most of the PRs were completely automated with RUFF as follows:

Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:

```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
                     "ruff",
                     "check",
                     "--fix-only",
+                    "--unsafe-fixes",
                     "--exit-zero",
                     *([f"--config={config}"] if config else []),
                     "--stdin-filename",
```

Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):

```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 line-length = 88
 src = ["caffe2", "torch", "torchgen", "functorch", "test"]

@@ -87,7 +87,6 @@
     "SIM116", # Disable Use a dictionary instead of consecutive `if` statements
     "SIM117",
     "SIM118",
-    "UP006", # keep-runtime-typing
     "UP007", # keep-runtime-typing
 ]
 select = [
```

Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
2025-01-18 05:05:07 +00:00
2c4281d7da Make MultiProcContinuousTest timeout configurable (#145099)
Allows test classes using MPCT to set their own timeout as a class
property, which is good enough since the processgroup is shared across
test instances and the timeout is set at processgroup init.

Also sets a default timeout of 2 minutes, which is probably (?) long
enough for reasonable tests, but can be changed if it causes flakiness.
It's preferable to have as short a default timeout as possible, since
getting a timeout quickly helps when debugging tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099
Approved by: https://github.com/d4l3k, https://github.com/fduwjj
ghstack dependencies: #145010, #145011
2025-01-18 04:37:12 +00:00
bdfeda5c9a composability test cleanup (#145011)
Minor changes to test the public PP API instead of the internal/private one, and
also to save a few lines of code for microbatch splitting in the process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145011
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
ghstack dependencies: #145010
2025-01-18 04:37:12 +00:00
4eea2f7496 [inductor] Fix ignored options for torch.compile (#145131)
#139833 broke `torch.compile(options=...)` so that many (all?) options passed in get completely ignored.  @alexreinking pointed this out when `options={"cpu_backend":"halide"}` did nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145131
Approved by: https://github.com/exclamaforte
2025-01-18 03:39:49 +00:00
668fb7dfba [ca] Use aot_eager on flex attention test (#145097)
FIXES https://github.com/pytorch/pytorch/issues/144912

The flex attention lowering incompatibilities are covered by https://github.com/pytorch/pytorch/blob/main/test/inductor/test_flex_attention.py. For the CA + flex integration, we don't actually need to test the lowering, only the frontend graph capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145097
Approved by: https://github.com/drisspg
2025-01-18 02:47:13 +00:00
55084443ca Added swizzle searching, disabled fp16 accum, and enabled ping-pong for cutlass (#144829)
Summary:

Test Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144829
Approved by: https://github.com/Chillee
2025-01-18 02:39:22 +00:00
2f51d06210 basic InductorBenchmarker (#133058)
This PR adds the most basic custom benchmarker (i.e. a benchmarker that is not provided by Triton), which we call `InductorBenchmarker`. This new benchmarker is very basic in principle and closely follows Triton's `do_bench` implementation, with slight changes such as flushing the exact L2 cache size (Triton defaults to 256MB), using a buffer zero for warmup (Triton uses the benchmarked kernel itself; I found that buffer zeroes are more consistent), and returning the min runtime (Triton can return min, among other things; currently Inductor picks the median).
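A rough sketch of that recipe (not the actual `InductorBenchmarker` code), assuming CUDA events for timing:
```python
import torch

def bench_min_ms(fn, iters=100):
    props = torch.cuda.get_device_properties(0)
    l2_bytes = getattr(props, "L2_cache_size", 256 * 2**20)  # fall back to Triton's 256MB default
    cache = torch.empty(l2_bytes, dtype=torch.int8, device="cuda")
    cache.zero_()  # warm up with a buffer zero rather than the benchmarked kernel
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(iters)]
    for i in range(iters):
        cache.zero_()  # flush the (exact-size) L2 cache between timed runs
        starts[i].record()
        fn()
        ends[i].record()
    torch.cuda.synchronize()
    return min(s.elapsed_time(e) for s, e in zip(starts, ends))  # min, not median
```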

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133058
Approved by: https://github.com/eellison
ghstack dependencies: #144315
2025-01-18 02:35:00 +00:00
ee3e89190a refactor benchmarking to use dynamo_timed (#144315)
use dynamo_timed for all our wrapped calls, instead of our custom timer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144315
Approved by: https://github.com/eellison
2025-01-18 02:35:00 +00:00
17c3a10cbd PEP585 update - torch/_inductor/fx_passes (#145107)
See #145101 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145107
Approved by: https://github.com/oulgen, https://github.com/bobrenjc93
2025-01-18 02:04:29 +00:00
8e4539245e Update ci_expected_accuracy for TIMM levit_128 for further investigation (#145112)
TSIA, it looks like an upstream change, but I'm not sure from where yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145112
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2025-01-18 01:55:34 +00:00
0b151f260f [AOTI] Add an option to skip optimizing generated wrapper code (#144866)
Summary: In some cases, generated wrapper code faces a long cpp compilation time. As an alleviation, this PR adds an option to skip cpp compiler optimizers for the generated main wrapper function body.

D68174038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144866
Approved by: https://github.com/chenyang78, https://github.com/hl475
2025-01-18 01:44:21 +00:00
7c1fb9b1ae [inductor] Refactor CachingAutotuner so that it can pickle (#144044)
These are refactors needed for #144288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144044
Approved by: https://github.com/eellison
2025-01-18 01:44:16 +00:00
02385ed625 [Break XPU][Inductor UT] Fix broken XPU CI introduced by community changes (#145058)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145058
Approved by: https://github.com/jansel
2025-01-18 01:30:24 +00:00
c434a64f31 Delete torch._library.register_functional_op (#145110)
Fixes #117816, #117834, #117871

This has been superceded by auto_functionalized_v2. There are no
internal usages and this is private API so it is safe to delete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145110
Approved by: https://github.com/williamwen42
ghstack dependencies: #145109
2025-01-18 00:58:25 +00:00
712d9882d2 Skip test responsible for causing flakiness (#145109)
Investigation is a separate issue. For now I want to get the CI back up
and running on the other tests. The problem seems to be that
IncludeDispatchKeyGuard doesn't actually reset the state, which seems
very, very wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145109
Approved by: https://github.com/williamwen42
2025-01-18 00:58:25 +00:00
c338dda6be fix test_rng bisector test (#143662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143662
Approved by: https://github.com/zou3519
2025-01-18 00:15:38 +00:00
d02c396fbb add fp8 support to index_cuda (#144747)
Fixes #133605

**Summary**

This PR adds support for FP8 data types to the `index_cuda` op.

It uses `AT_DISPATCH_V2` which is a new macro that can handle arbitrary number of dtypes, as opposed to the old implementations which had a separate macro for each possible number of dtype arguments (e.g. `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3,4,5...}`).

**Test plan**

Updated test `index_cuda_with_cpu` in `test/test_fake_tensor.py` to have cases for all dtypes handled by `index_cuda`, including fp8 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144747
Approved by: https://github.com/vkuzo
2025-01-17 22:53:23 +00:00
4e4b8592a3 parametrized test name handles class arguments (#133546)
Previously, parametrized tests with class arguments, for example

```
@parametrize("this_cls", (Foo, Bar))
```

would create parametrized tests with names `test_foo_this_cls0` and `test_foo_this_cls1`. With this change, we instead should get `test_foo_this_cls_Foo` and `test_foo_this_cls_Bar`
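A hedged sketch of the naming rule (the helper is hypothetical, not the decorator's internals):
```python
def param_suffix(name, value, index):
    if isinstance(value, type):
        return f"{name}_{value.__name__}"  # new: test_foo_this_cls_Foo
    return f"{name}{index}"                # old: test_foo_this_cls0
```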

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133546
Approved by: https://github.com/eellison
2025-01-17 22:48:38 +00:00
64e54d5af6 [Pipelining] Relax scale_grads assert (#145010)
The assert felt morally valid: if no gradients are scaled, then something
is definitely wrong with the setup.  In one instance, PP +
optimizer-in-backward (in torchtitan) resulted in grad=None after
running .backward() and before scaling grads.

On the other hand, the existing assert is too restrictive.  It's
possible that a model used with pipelining would have some parameters
that do not receive gradients, and we shouldn't hard-error in these
cases.  (E.g. if the parameter is literally not used, or is frozen).
In the extreme case, the whole stage could be frozen.  So we do not
complain if no grads are scaled.
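A hedged sketch of the relaxed behavior (illustrative names, not the actual pipelining code):
```python
def scale_grads(params, factor):
    scaled_any = False
    for p in params:
        if p.grad is not None:
            p.grad.mul_(factor)
            scaled_any = True
    # previously: assert scaled_any; now frozen or unused stages are tolerated
    return scaled_any
```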

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145010
Approved by: https://github.com/mori360, https://github.com/tianyu-l
2025-01-17 21:33:28 +00:00
07e23653cd Fix RMSNorm epsilon value type for BF16 or FP16 (#142848)
Fixes #140092

Here's what this PR does:

Before this PR, we create a `scalar_t eps_val;` variable, and `eps` is usually a double scalar passed from the Python frontend, like 1e-6.

When we do `eps_val = std::numeric_limits<at::scalar_value_type<scalar_t>::type>::epsilon();` or `eps_val = eps.value();`, we downcast this epsilon to match the input tensor dtype (`scalar_t`); in the case of BFloat16, the 1e-6 double would be cast to `1.00136e-05`.

However, when we compute `auto rqrst_input = rsqrt(at::pow(upcasted_input, 2).mean(dims_to_reduce_ref, /*keepdim=*/true).add_(eps_val));`, we upcast `eps_val` to match `opmath_t`. The conversion between these two dtypes is UNNECESSARY, so we can just make `eps_val` an `opmath_t` instead of a `scalar_t`.
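A quick numeric illustration of the precision point (a sketch, not the kernel code):
```python
import torch

eps = 1e-6
# old path: eps is downcast to the input dtype (e.g. bfloat16) before the math
rounded = torch.tensor(eps, dtype=torch.bfloat16).to(torch.float32).item()
print(rounded)  # visibly rounded away from 1e-6
# keeping eps in opmath_t (float) avoids that extra lossy round-trip
print(torch.tensor(eps, dtype=torch.float32).item())
```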

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142848
Approved by: https://github.com/mikaylagawarecki
2025-01-17 21:30:54 +00:00
a8ef423fed Fix NJT min / max backward() for non-ragged reductions (#144583)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

`value_selecting_reduction_backward()` is used in the backward for min / max, so this PR implements it for NJT. Notably, this isn't enough for reducing over the ragged dim, since that results in a dense tensor and thus NJT's torch_dispatch will not be called for this op. We need factory function support for nested ints to fix that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144583
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582
2025-01-17 20:57:11 +00:00
cac10b8190 Fix NJT OpInfo entry for nn.functional.prelu (#144582)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

The OpInfo entry for prelu was wrong before this PR; `weight` needs to be passed as well. The op isn't fully implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144582
Approved by: https://github.com/soulitzer
2025-01-17 20:36:15 +00:00
eaef613688 Fix issue with test/nn/test_convolution:TestConvolutionNNDeviceTypeCUDA.test_conv_large_batch_1_cuda (#145067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145067
Approved by: https://github.com/Skylion007, https://github.com/nWEIdia

Co-authored-by: Wei Wang <143543872+nWEIdia@users.noreply.github.com>
2025-01-17 20:31:25 +00:00
0eda02a94c Prevent legacy_load when weights_only=True (correctly) (#145020)
Only prevent `legacy_load` (.tar format removed in https://github.com/pytorch/pytorch/pull/713), not the whole of `_legacy_load` (.tar format + _use_new_zipfile_serialization=False)

Differential Revision: [D68301405](https://our.internmc.facebook.com/intern/diff/D68301405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145020
Approved by: https://github.com/kit1980, https://github.com/albanD
2025-01-17 20:10:22 +00:00
2ef7b68666 [inductor] fix TORCH_LOGS="benchmarking" (#144997)
Saw this error with TORCH_LOGS="benchmarking"
```
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 37, in wrapper
    result = fn(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/runtime/benchmarking.py", line 66, in wrapper
    return fn(self, *args, **kwargs)
torch._inductor.exc.InductorError: TypeError: Benchmarker.benchmark() missing 1 required positional argument: 'fn_kwargs'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144997
Approved by: https://github.com/eellison, https://github.com/nmacchioni
2025-01-17 19:41:18 +00:00
d996d7ec13 upgrade to sccache 0.9.1 - dealing with nvcc -E correctly (#145012)
sccache 0.9.1 should be dealing with `nvcc -E` correctly

see https://github.com/mozilla/sccache/pull/2300

If this works as expected, we can get rid of this code:
https://github.com/pytorch/pytorch/pull/142813/files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145012
Approved by: https://github.com/malfet
2025-01-17 19:26:01 +00:00
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
18638b91fe Adding more compile time logging in pad_mm (#144884)
Summary: As title

Test Plan:
[midin@6262.od /data/sandcastle/boxes/fbsource/fbcode (99e64d2e4)]$ tlp buck run mode/opt caffe2/test/inductor:pad_mm -- -r test_exclude_padding

https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2F.tmpiJLgXX%2Fchromium_events.json&local_cache_key

 {F1974355662}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144884
Approved by: https://github.com/oulgen
2025-01-17 17:35:55 +00:00
567552b98b fix typo in doc and import for torch._library.triton (#144882)
Previously, the doc suggested `from torch._library.triton import wrap_triton, triton_op`, which doesn't work because wrap_triton is not imported in torch/_library/__init__.py, whereas `from torch.library import wrap_triton` works. This PR imports wrap_triton and fixes the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144882
Approved by: https://github.com/zou3519
2025-01-17 17:32:12 +00:00
18eba9575f [Accelerator] Use uniform GetAllocator for devices in new_qtensor function (#144849)
Fixes #144848
This PR is intended to use a uniform `GetAllocator()` call for all the accelerators for `new_qtensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144849
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-01-17 16:37:37 +00:00
a215e174a1 [BE] Remove conda from scripts and build files Part 2 (#145015)
Continuation of https://github.com/pytorch/pytorch/pull/144870

Remove conda logic from scripts:

1. Remove conda build from triton build script
2. Remove conda checks from setup.py
3. Remove conda from release scripts
4. Script read_conda_versions.sh is not used (checked via git grep)

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145015
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-01-17 16:26:24 +00:00
b7af210d8d Add SM89 support for f8f8bf16_rowwise() (#144348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144348
Approved by: https://github.com/drisspg
2025-01-17 15:12:35 +00:00
f522502b97 Revert "patch for block-wise quantization + pt2e (#144492)"
This reverts commit 1d43b8150852cdfcbe754edcf027d6e25f33ac63.

Reverted https://github.com/pytorch/pytorch/pull/144492 on behalf of https://github.com/albanD due to Broke a few things in trunk ([comment](https://github.com/pytorch/pytorch/pull/144492#issuecomment-2598485291))
2025-01-17 14:27:53 +00:00
dbed747aae Add Intel GPU specific CMake files to merge rules (#135110)
As the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135110
Approved by: https://github.com/atalman
2025-01-17 09:44:13 +00:00
a0d2c09115 Add flop formula for _scaled_mm (#144973)
This will make it work correctly with the partitioner's AutoAC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144973
Approved by: https://github.com/jeffdaily
2025-01-17 09:38:30 +00:00
96c0dbbe97 Enhance running pr time benchmarks locally experience. (#144838)
Summary: title

Test Plan: NA

Differential Revision: D68195894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144838
Approved by: https://github.com/huydhn
2025-01-17 07:57:40 +00:00
465a1cfe2e update get start xpu (#143183)
- Support new Intel client GPU on Windows [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) and [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html)
- Support vision/audio prebuilt wheels on Windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143183
Approved by: https://github.com/EikanWang, https://github.com/leslie-fang-intel, https://github.com/atalman, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-17 06:31:40 +00:00
fd8e0e3e10 [mps/inductor] Introduce is_mps_backend/skip_if_mps decorators. (#145035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145035
Approved by: https://github.com/jansel
2025-01-17 05:36:38 +00:00
cfd9cc19a3 [executorch hash update] update the pinned executorch hash (#145022)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145022
Approved by: https://github.com/pytorchbot
2025-01-17 04:51:56 +00:00
f13c864eda Fuzzer Improvements (#144952)
Added more tests and cleaned up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144952
Approved by: https://github.com/masnesral
2025-01-17 04:46:58 +00:00
1d43b81508 patch for block-wise quantization + pt2e (#144492)
Summary: As title; needed to enable the qcom block-wise quantization kernel

Test Plan: local test

Differential Revision: D67985303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144492
Approved by: https://github.com/angelayi, https://github.com/billmguo
2025-01-17 04:10:49 +00:00
adbbcd87d9 OpenReg: Split Allocator (#144843)
Split the Allocator into HostAllocator and DeviceAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144843
Approved by: https://github.com/albanD
2025-01-17 03:38:15 +00:00
43a00d73b3 [Trace Python Dispatcher] Support FuncTorchInterpreter (#144444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144444
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #144439
2025-01-17 02:26:37 +00:00
5d02575aa1 [Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439
Approved by: https://github.com/zou3519
2025-01-17 02:26:36 +00:00
3a50aba7d3 [dynamo] add option to not skip on empty graph (#144885)
Temporary fix to https://github.com/pytorch/pytorch/issues/144360.

Turning the config on globally will cause a bunch of tests to fail, which needs to be addressed in followups.

I had a previous attempt at https://github.com/pytorch/pytorch/pull/144712, but this is a more complicated change and will likely be absorbed into work to refactor Dynamo's exception handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144885
Approved by: https://github.com/jansel
2025-01-17 02:12:20 +00:00
7b56b039af [dcp] Add ZStandard transformer (#143360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143360
Approved by: https://github.com/saumishr
ghstack dependencies: #143358, #143359
2025-01-17 01:51:37 +00:00
9c909bf3bb [dcp] Integrate stream extensions into DCP impl (#143359)
Summary: Updates FileSystemReader/Writer, Planner, DefaultLoad/SavePlanner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143359
Approved by: https://github.com/saumishr
ghstack dependencies: #143358
2025-01-17 01:51:37 +00:00
ba3f1c29ee [dcp] Add extension mechanism (#143358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143358
Approved by: https://github.com/saumishr
2025-01-17 01:51:37 +00:00
176cde6240 Use torch with statement in torch distributed module (#144951)
# Motivation
In https://github.com/pytorch/pytorch/pull/137678, we used the device-agnostic APIs to help generalize the distributed module. As this [comment](https://github.com/pytorch/pytorch/pull/137678#discussion_r1828645683) said, we will use the `torch.Stream` with statement once https://github.com/pytorch/pytorch/pull/140138 is landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144951
Approved by: https://github.com/kwen2501, https://github.com/albanD
2025-01-17 01:49:28 +00:00
a61a65ff82 [MPSInductor] Add Worker.current_device method (#145023)
That just returns 0, as multi-gpu is not currently supported by MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145023
Approved by: https://github.com/dcci
2025-01-17 01:41:01 +00:00
55b0819bee Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit d2a77f48c9dc6df056051de270ce5875d8d2edd0.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/kit1980 due to breaking internal builds, max autotune tests a failing, see D68297606 ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2597250605))
2025-01-17 01:36:14 +00:00
45e6647268 [FSDP2] Make post-backward condition more robust (#144781)
Fixes https://github.com/pytorch/pytorch/issues/144755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144781
Approved by: https://github.com/fegin
2025-01-17 01:28:56 +00:00
6077102415 [DSD][BE] Rewrite some tests to remove with_comms (#143241)
Summary:
This saves ~ 1 minute test time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143241
Approved by: https://github.com/mori360, https://github.com/XilunWu
ghstack dependencies: #143240
2025-01-17 01:15:55 +00:00
5d54e7b812 [Pipelining] move scale_grads to base class, add docs (#144833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144833
Approved by: https://github.com/H-Huang
2025-01-17 01:07:12 +00:00
3afc5170d4 [Submodule] Upgrade to Cutlass 3.6 part deux (#144911)
# Summary
Take 2 of [D67866269](https://www.internalfb.com/diff/D67866269)
Main change is that we identified and fixed the FA2 regression. More details can be found here https://github.com/pytorch/pytorch/issues/144729 and have landed that before this here: [D68194635](https://www.internalfb.com/diff/D68194635)

Differential Revision: D68194470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144911
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-17 00:53:42 +00:00
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286fd161af48af57a675827f4f849914d.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
42c64bd35c [MPSInductor] More is_dtype_supported gating (#144981)
This makes `GPUTest.test_scalar_cpu_tensor_arg_mps` pass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144981
Approved by: https://github.com/dcci
ghstack dependencies: #144971
2025-01-17 00:48:02 +00:00
94c0f15302 Revert "cpp_wrapper: Move #includes to per-device header files (#143909)"
This reverts commit d62b3979dadfa4928ec1c76e850f874d49803125.

Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch/_inductor/codegen/aoti_runtime/implementation.cpp ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))
2025-01-17 00:36:38 +00:00
5e6e6200bf Revert "[dynamo][dicts] Consolidate dict(..) construction (#144342)"
This reverts commit a54a784b8207617d2b99fbded9bb34c94fb6dd23.

Reverted https://github.com/pytorch/pytorch/pull/144342 on behalf of https://github.com/kit1980 due to breaking internal builds, see D68125388 ([comment](https://github.com/pytorch/pytorch/pull/144342#issuecomment-2597184167))
2025-01-17 00:32:09 +00:00
cyy
2ea394ba29 Modernize C++ code (#144603)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603
Approved by: https://github.com/malfet
2025-01-17 00:25:18 +00:00
c3fcb3606d Profile compile_inner instead of _compile_inner (#144930)
Summary: title

Test Plan: NA

Reviewed By: jamesjwu

Differential Revision: D67990492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144930
Approved by: https://github.com/jamesjwu
2025-01-16 23:59:27 +00:00
573fc42f25 [BE][CP] Use run_subtests instead of parametrize (#143240)
Summary:
This provides a 15X increase in test performance speed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143240
Approved by: https://github.com/XilunWu
2025-01-16 23:55:05 +00:00
fea9d18d5a [Utilization Log] Concurrently collect aggregate data during the output interval (#143235)
# Overview
Add a worker to collect metrics in short intervals:
1. Worker: add a worker to collect usage metrics, by default every 500ms; note this is configurable.
2. Calculate avg and max as a data point, by default every 5 seconds.

# Other
Clean up the log format to what's actually needed; currently we do not need to track GPU processors etc., or all pids from psutil.
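A minimal sketch of that sampling/aggregation scheme (illustrative only; the real worker is part of the CI tooling and may collect different stats), assuming psutil as the metric source:
```python
import threading
import time

import psutil  # assumed metric source

def collect(stop_event, sample_s=0.5, emit_s=5.0):
    samples, last_emit = [], time.monotonic()
    while not stop_event.is_set():
        samples.append(psutil.cpu_percent(interval=None))  # one short-interval sample
        time.sleep(sample_s)
        if time.monotonic() - last_emit >= emit_s and samples:
            print({"avg": sum(samples) / len(samples), "max": max(samples)})  # one data point
            samples, last_emit = [], time.monotonic()

stop = threading.Event()
threading.Thread(target=collect, args=(stop,), daemon=True).start()
time.sleep(6)  # let one aggregation window elapse
stop.set()
```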
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143235
Approved by: https://github.com/huydhn
2025-01-16 23:52:43 +00:00
288d67d6c2 [inductor] [bug fix] align avg_pool with eager when handling uint (#144313)
Fixes #144310

~~We just need to add a check in lowering~~

updated: we add the error checking in `meta registration`

### UT
```
 pytest -s -v test/inductor/test_torchinductor.py -k test_avg_pool_errors_with_uint
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144313
Approved by: https://github.com/jansel, https://github.com/jgong5
2025-01-16 23:37:51 +00:00
d2a77f48c9 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 23:04:56 +00:00
clr
171fb7f358 easy: Fix missing tab in test/dynamo/test_compile.py (#145013)
It turns out that if you request a merge on a pytorch PR, then push a fix for a bad rebase, and the test is
relatively new, the merge will go through with the previous commit and not notice the test break.

Explicitly running the test now passes vs failing, and this is just the last missing commit from https://github.com/pytorch/pytorch/pull/144817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145013
Approved by: https://github.com/masnesral, https://github.com/jansel
2025-01-16 22:51:51 +00:00
181d93b4f2 [BE] Move is_device_supported to helper function (#144971)
And extend `test_inf` to check half (explicitly instead of check_lowp) and bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144971
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/jansel
2025-01-16 22:44:03 +00:00
a33e02cb26 [executorch hash update] update the pinned executorch hash (#144813)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144813
Approved by: https://github.com/pytorchbot, https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-01-16 22:39:00 +00:00
7c7bcb1e33 update IS_JETSON check (#144725)
update IS_JETSON check to include the latest SM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144725
Approved by: https://github.com/eqy
2025-01-16 22:34:48 +00:00
95c363cc9b dynamo: Don't crash with internal error if getattr on a tensor fails (#144817)
This prevents crashes when getattr is called on a tensor for something
which doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144817
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 22:04:06 +00:00
0e6d44df3f Add heuristic to fail block pointer match early (#144681)
This PR adds a heuristic to potentially fail the block pointer match early. Expressions like below take a long time to match using sympy (e.g. > 100 seconds)
```python
# torch._inductor.config.triton.use_block_ptr = True
# torch._inductor.config.triton.prefer_nd_tiling = True
# Expression from pytest -k test_max_pool2d1_dynamic_shapes_cuda:
 ((xindex//ps1))*((s2 - 3//2))**2 + 2*((xindex//ps1))*((s2 - 3//2)) + ((xindex//ps1)) + ((s2 - 3//2))*(ModularIndexing(xindex, ps0, ps0)) + (ModularIndexing(xindex, 1, ps0)) + (ModularIndexing(xindex, ps0, ps0))
```
Additionally, the heuristic for the number of dimensions based on the indexing expression is refined to only add dimensions for FloorDiv(index, denom) and ModularIndexing(index, denom, modulo) instead of including FloorDiv/ModularIndexing expressions that don't involve the index.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144681
Approved by: https://github.com/jansel
2025-01-16 21:57:30 +00:00
46b92c025d Revert "Cholesky mps implementation (#144193)"
This reverts commit 727ae1331820bb3d83d70e9cd3c9d3cd4c79ff56.

Reverted https://github.com/pytorch/pytorch/pull/144193 on behalf of https://github.com/malfet due to Alas, inductor changes broke inductor tests, see aa4a1ff027/1 ([comment](https://github.com/pytorch/pytorch/pull/144193#issuecomment-2596938163))
2025-01-16 21:37:32 +00:00
aa4a1ff027 Revert "Prevent _legacy_load with weights_only=True (#144914)"
This reverts commit 7c3aa1da1c97812af54d41f3f0eff2ef922c0f32.

Reverted https://github.com/pytorch/pytorch/pull/144914 on behalf of https://github.com/izaitsevfb due to breaking inductor on trunk ([comment](https://github.com/pytorch/pytorch/pull/144914#issuecomment-2596922781))
2025-01-16 21:29:50 +00:00
4ea189422d Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit a6763b7b81cd1a55c8316dfdb5bca19819a1429a.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))
2025-01-16 21:12:41 +00:00
3a5bf0bc36 expose extra torch_python apis (#144746)
Fixes #144302
After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746
Approved by: https://github.com/albanD
2025-01-16 20:50:31 +00:00
577708e6de Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-16 20:46:06 +00:00
a9bfc5f70c Fix boundary conditions for hardswish backward (#143899)
Fixes #136345.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899
Approved by: https://github.com/jgong5, https://github.com/ezyang
2025-01-16 20:26:27 +00:00
aad5f600ff [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 19:51:45 +00:00
3908be676c Fix loading older state_dict into AdamW after refactor (#144972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144972
Approved by: https://github.com/albanD
2025-01-16 19:50:31 +00:00
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes the `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instantiated
`ViewMeta` should now create a new specialized class that inherits from
`ViewMeta`. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generates the Python bindings for
  them.
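
The design is easiest to see with a small Python analogy (the real classes are C++ and generated by the codegen; the class and attribute names below are illustrative, not the generated ones):

```python
from abc import ABC, abstractmethod
import pickle

class ViewMeta(ABC):
    @abstractmethod
    def forward(self, base):
        """Recompute the view from the base tensor."""

    @abstractmethod
    def reverse(self, base, mutated_view):
        """Propagate a mutation on the view back to the base."""

class TransposeViewMeta(ViewMeta):
    def __init__(self, dim0, dim1):
        self.dim0, self.dim1 = dim0, dim1       # plain, picklable state

    def forward(self, base):
        return base.transpose(self.dim0, self.dim1)

    def reverse(self, base, mutated_view):
        return mutated_view.transpose(self.dim0, self.dim1)

# Each specialization stores only plain data (no lambdas/closures), so a
# whole ViewMeta sequence can round-trip through pickle:
seq = [TransposeViewMeta(0, 1)]
restored = pickle.loads(pickle.dumps(seq))
assert restored[0].dim0 == 0 and restored[0].dim1 == 1
```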

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
7c3aa1da1c Prevent _legacy_load with weights_only=True (#144914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144914
Approved by: https://github.com/malfet, https://github.com/albanD
2025-01-16 19:33:46 +00:00
cf28d613f1 Allow ROCm runner to upload benchmark results if found (#144710)
https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database. This will unblock AMD when they try to run benchmark MI300 benchmarks on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144710
Approved by: https://github.com/kit1980
2025-01-16 19:31:45 +00:00
31a73eb712 fix acquire pattern in topk (#144945)
Similar to #128455, topk needs another threadfence to complete acquire pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144945
Approved by: https://github.com/Skylion007
2025-01-16 19:20:43 +00:00
3004b657f0 [Inductor][FlexAttention] Supports dynamic shapes with custom kernel options (#144938)
Fixes #144815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144938
Approved by: https://github.com/drisspg
2025-01-16 19:02:35 +00:00
e32d2bf853 Document decoupled_weight_decay for Adam for consistency with N/RAdam (#144984)
Followup from #144972 and #143710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144984
Approved by: https://github.com/albanD
2025-01-16 18:58:29 +00:00
ad15436db6 Fix pt2-bug-report.yml formatting (#144987)
This is a 2nd regression caused by https://github.com/pytorch/pytorch/pull/144574

Test plan: `python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"`
Before it printed
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': ''}}
```
After
```
% python3 -c "import yaml; foo=yaml.safe_load(open('pt2-bug-report.yml'));print(foo['body'][0])"
{'type': 'markdown', 'attributes': {'value': '#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.\n'}}
```

Fixes https://github.com/pytorch/pytorch/issues/144970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144987
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2025-01-16 18:58:07 +00:00
829c4570ca Revert "[mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)"
This reverts commit 1b34665767fcc35ae4a8f211945a24701c79df79.

Reverted https://github.com/pytorch/pytorch/pull/144877 on behalf of https://github.com/malfet due to Actually no, lint is red ([comment](https://github.com/pytorch/pytorch/pull/144877#issuecomment-2596385712))
2025-01-16 18:10:37 +00:00
13d35ea67a [BE] Add missing throw of std::runtime_error in scrc/cuda/utils.cpp (#144962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962
Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 17:35:39 +00:00
53256edff9 [export] Support module inputs for non strict mode. (#143925)
Summary:
Add experimental support for torch.nn.Module as input types.

Before this change, we didn't support module inputs, but recently we saw some interesting use cases like gpt-fast https://github.com/pytorch-labs/gpt-fast/blob/main/generate.py#L68 where a module is passed in directly as an input for different variants of the same model.

Since we don't really care about non-param or non-buffer states in non-strict mode, we pretend module inputs are plain constants during tracing. We treat any module input like a nested container of tensors, and each time we automatically register a pytree handler for these module types to flatten their state dict into a group of tensors. We just inline any module method call during tracing, like we do for the `self` module in export_for_training. This makes input modules behave very similarly to the training module in the typical case, except that we don't record the inputs as parameters or buffers but rather as plain user inputs (see the sketch below).
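
A sketch of the kind of call this enables (module names here are hypothetical; this is experimental and non-strict only, per the summary above):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def forward(self, x):
        return x * 2

class Wrapper(nn.Module):
    def forward(self, decoder, x):
        # the decoder module is a plain input, not part of the exported state
        return decoder(x) + 1

ep = torch.export.export(Wrapper(), (Decoder(), torch.randn(4)), strict=False)
print(ep)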

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_module_input

Differential Revision: D67680827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143925
Approved by: https://github.com/tugsbayasgalan
2025-01-16 17:30:36 +00:00
519269a415 [BE] - Remove conda test and upload scripts and env variables from Workflows Part 1 (#144870)
Remove conda test and upload scripts and env variables from Workflows

Related to: https://github.com/pytorch/pytorch/issues/138506
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144870
Approved by: https://github.com/malfet
2025-01-16 17:20:14 +00:00
727ae13318 Cholesky mps implementation (#144193)
Requested in #77764

PR is still in draft because it needs some cleanups and optimizations to at least reach CPU performance. Tasks:
- [x] Make `upper=True` work, only `upper=False` works now
- [x] Code cleanup
- [x] Optimizations (though I might need some help on this; tried my best, maybe there is still some more to squeeze out)
- [x] Checks for positive definite input
- [x] Support for (*, N, N) input, currently only supports (B, N, N) input
- [x] Support other dtypes (float16, bfloat16)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144193
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-16 16:26:46 +00:00
1b34665767 [mps] Massage test_full_truncation to work only on the supported dtypes. (#144877)
Converted a first one to make sure the pattern was the one we wanted -- if we're OK with this, I'll probably adjust all the other failing ones in a batch or two. Let me know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144877
Approved by: https://github.com/jansel, https://github.com/malfet
2025-01-16 16:22:06 +00:00
3d29de3ac8 [aoti] Deduplicate "V.aot_compilation" and "V.graph.aot_mode" flags. [1/n] (#144709)
Summary:
According to angelayi, these two flags indicated different things when we had two-pass codegen, but since we now basically keep the two flags the same, we should merge them.

This can prevent bugs (e.g. changing the value of aot_mode in a way that doesn't cover branches like `if V.aot_compilation is True`) from happening when we try to add different code paths that tweak the value of aot_mode in the future.

Test Plan: CI

Differential Revision: D68122536

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144709
Approved by: https://github.com/angelayi, https://github.com/desertfire
2025-01-16 16:02:18 +00:00
241a8a101b Fix erroneous at_vreinterpretq_u16_bf16 call (#144883)
Here, `mask` is definitely a `uint16x8_t`, not an `at_bfloat16x8_t`, so we shouldn't be reinterpreting it. Candidate fix for #144818.

Differential Revision: [D68224128](https://our.internmc.facebook.com/intern/diff/D68224128/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144883
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 15:16:28 +00:00
6559374494 Revert "Add flop formula for _scaled_mm (#144872)"
This reverts commit f31452268bf9f7e395f263cd8a9d693633ea75ce.

Reverted https://github.com/pytorch/pytorch/pull/144872 on behalf of https://github.com/lw due to Breaks ROCm jobs on main ([comment](https://github.com/pytorch/pytorch/pull/144872#issuecomment-2595994134))
2025-01-16 15:16:18 +00:00
6470b0ea6f Update torch-xpu-ops commit pin (#144739)
Update the torch-xpu-ops commit to [22cc419e4e60f469341712a5a103fa309a7dfd48](22cc419e4e), includes:

- Fix building issue https://github.com/intel/torch-xpu-ops/issues/1279
- Aten operator coverage improvement

Note: the new torch-xpu-ops commit doesn't support bundle 0.5.3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144739
Approved by: https://github.com/EikanWang, https://github.com/malfet
2025-01-16 15:12:37 +00:00
f31452268b Add flop formula for _scaled_mm (#144872)
This will make it work correctly with the partitioner's AutoAC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144872
Approved by: https://github.com/vkuzo
2025-01-16 13:57:54 +00:00
1c290912e4 Revert "Add tests for different dtypes with max autotune (#144721)"
This reverts commit 9e568cbaa22df89b77e112f1a373d82acb2e6219.

Reverted https://github.com/pytorch/pytorch/pull/144721 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/144721#issuecomment-2595210355))
2025-01-16 10:59:30 +00:00
0c0583254e [inductor] fix index.Tensor fallback (#144736)
The original issue is that we see an accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/). The debugging is hard, but the root cause is relatively simple: the model has mixed-device inputs for index.Tensor, which causes Inductor to fall back, and the meta kernel for index.Tensor returns a tensor whose strides are inconsistent with the eager kernel.

The following code snippet
```
import torch
from torch._subclasses import FakeTensorMode

device = "cuda"

x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last)
x = x.view(2, 12, 16, 32, 32)

i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]

print(f"Eager stride: {x[i1, i2].stride()}")

mode = FakeTensorMode()
with mode:
    f_x = mode.from_tensor(x)
    f_i1 = mode.from_tensor(i1)
    f_i2 = mode.from_tensor(i2)
    f_out = f_x[f_i1, f_i2]
    print(f"Meta stride: {f_out.stride()}")
```

would output:
```
Eager stride: (49152, 16384, 1, 512, 16)
Meta stride: (49152, 16384, 1024, 32, 1)
```

In this PR, I fix the problem by running the eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change the meta/eager kernel implementations so that their output layouts match, but I'm not sure how to properly do that.
In the index.Tensor meta kernel, we always produce dense output: 6d56277682/torch/_meta_registrations.py (L3184). The eager kernel, on the other hand, seems to leverage TensorIteratorBase to decide some dimension permutation: 6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308). We can duplicate this logic in the meta kernel implementation if we really want meta to match eager. I can follow up on this if people have a strong opinion about it.

And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. With that, the issue debugged here would be much easier to root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736
Approved by: https://github.com/jansel
2025-01-16 09:38:29 +00:00
57d5659c3b XFAIL test_save_load_checkpoint (#144927)
Fixes https://github.com/pytorch/pytorch/issues/137771

The issue keeps showing up, and rerunning the disabled tests couldn't reproduce it. So, XFAIL it while waiting for the distributed team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144927
Approved by: https://github.com/kit1980, https://github.com/malfet
2025-01-16 07:31:56 +00:00
7d8c087e24 [Pipelining] Improve shape inference debug logging (#144929)
Remove log that just said "running forward" since that is not so useful
in itself, replace with somewhat equivalent log that reports both input
and output shapes after running forward.

Note: enabled by `TORCH_LOGS=+pp`

Example:
```
[rank0]:V0115 13:28:58.282000 3908366 torch/distributed/pipelining/stage.py:1400] Shape inference: stage 0 inputs (tensor(..., device='meta', size=(1, 64), dtype=torch.int64),), outputs (tensor(..., device='meta', size=(1, 64, 256), dtype=torch.bfloat16),)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144929
Approved by: https://github.com/H-Huang
2025-01-16 07:30:11 +00:00
0b17c09893 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-16 06:46:26 +00:00
49bdc418be Add strict kwarg to nn.Module.set_submodule and fix bug for non dot delineated strings (#143455)
Before the fix, set_submodule used to create leaf modules when the target was not a dot-delimited string. After the fix, it will not create a new attribute if the target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`.

Fixes https://github.com/pytorch/pytorch/issues/143441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455
Approved by: https://github.com/mikaylagawarecki
2025-01-16 05:06:33 +00:00
e3c4d1b7d6 [c10d][fr] Fix the bug when we still mark mismatch when there are match case (#144916)
When we introduced partial match, we accidentally introduced the mismatch mark for the full-match case. This is wrong, and this PR fixes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144916
Approved by: https://github.com/c-p-i-o
2025-01-16 04:36:30 +00:00
9e568cbaa2 Add tests for different dtypes with max autotune (#144721)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144721
Approved by: https://github.com/cpuhrsch, https://github.com/etaf
2025-01-16 04:29:44 +00:00
52a620845b OpenReg: Use device agnostic API (#144840)
Use `torch.accelerator.device_count()` to get the number of devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144840
Approved by: https://github.com/albanD
2025-01-16 03:31:52 +00:00
1230de4c1b [Quant][Inductor][X86] Separate binary post op fusion and lowering for qconv (#144318)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`'s to handle the post op fusion. Pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224, #144312
2025-01-16 03:30:36 +00:00
cyy
843627b7b1 Remove unnecessary once flag usage (#143255)
Static local variables in C++11 are guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration)
```
If multiple threads attempt to initialize the same static local variable concurrently,
the initialization occurs exactly once
(similar behavior can be obtained for arbitrary functions with std::call_once.
Usual implementations of this feature use variants
of the double-checked locking pattern,
which reduces runtime overhead for already-initialized local statics
 to a single non-atomic boolean comparison.
```
Given that a static c10::once_flag was used before, why not just use the associated function to initialise the related static variables? That is the motivation behind this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255
Approved by: https://github.com/albanD
2025-01-16 02:36:11 +00:00
41ec2e8d3e [MPSInductor] Fix codegen regression (#144924)
Caused by https://github.com/pytorch/pytorch/pull/144649

Do not try to insert anything into the header if wrapper is not ready yet

Fixes `test_sort_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144924
Approved by: https://github.com/dcci
ghstack dependencies: #144827, #144917
2025-01-16 02:12:42 +00:00
05505771a0 [MPSInductor] Properly convert index (#144917)
By calling `self.index_to_str` from `load`,`store` and `check_bounds` in order to properly handle sizevars variables renames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144917
Approved by: https://github.com/dcci
ghstack dependencies: #144827
2025-01-16 02:12:41 +00:00
d595b96059 Revert "restore rng generation for fbcode (#144819)"
This reverts commit 2bc18a905544f4e25cfbd354351418b36a0f5fc1.

Reverted https://github.com/pytorch/pytorch/pull/144819 on behalf of https://github.com/ngimel due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/144819#issuecomment-2594298941))
2025-01-16 01:52:29 +00:00
6492851125 symbolic_convert: Don't fail when we hit a undefined name (#144784)
We're using the Python builtin NameError here
instead of throwing an Unsupported exception. This causes the
NameError to get wrapped in an InternalTorchDynamoError
instead of just causing a graph break and letting the user code fail
directly.
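
A minimal repro sketch of the behavior described above (the undefined name is illustrative, and it assumes a working torch.compile setup):

```python
import torch

@torch.compile
def f(x):
    return x + some_undefined_name  # intentionally not defined anywhere

try:
    f(torch.randn(4))
except NameError as e:
    # With this change, the plain NameError surfaces (via a graph break)
    # instead of being wrapped in an InternalTorchDynamoError.
    print(e)
```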

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144784
Approved by: https://github.com/williamwen42, https://github.com/jansel
2025-01-16 01:47:48 +00:00
c8bcb22e5f Default Copies are not vectorized in v3.6.0 of cutlass (#144837)
Summary:
FlashAttentionV2 perf tanked in v3.6.0; see https://github.com/pytorch/pytorch/issues/144729 for more details.

This PR makes it possible to land the v3.6.0 update and fixes the perf regression. See https://github.com/pytorch/pytorch/issues/144729#issuecomment-2591644076 for analysis; we also have various internal tests to verify.

Differential Revision: D68194635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144837
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-01-16 01:12:46 +00:00
926f9056a9 speculation_log: Raise a unique error for divergence issues (#144785)
This is primarily sent for discussion and to see what tests fail due to
this. The idea is that rather than capturing this as a regex on the
fail_reason, just give it a unique failure type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144785
Approved by: https://github.com/ezyang
2025-01-16 00:49:43 +00:00
b90231a189 [inductor][BE] don't try/except ImportError for AttrsDescriptor versions (#144807)
motivation: Ed's advice to avoid `except ImportError` (i.e. based on the fact that your target module/class might in fact exist, but you might run into some different ImportError whose stacktrace you now ignore).

additional motivation: I'm going to add some more cases to this list, and would like to avoid this pattern:
```
try:
   ...
except ImportError:
    try:
        ...
    except ImportError:
        try:
            ...
```

suggestions on better ways to do this would be appreciated!
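
One possible alternative, sketched here rather than taken from the PR (the module names probed below are assumptions for illustration), is to check the known locations explicitly instead of nesting `except ImportError`:

```python
import importlib
import importlib.util

def _module_exists(dotted_name):
    # Check each ancestor first so find_spec never has to import a missing parent.
    parts = dotted_name.split(".")
    for i in range(1, len(parts) + 1):
        if importlib.util.find_spec(".".join(parts[:i])) is None:
            return False
    return True

def find_attrs_descriptor():
    # Returning None means "genuinely not available" rather than a swallowed error.
    candidates = [
        "triton.backends.compiler",   # newer triton layout (assumed)
        "triton.compiler.compiler",   # older layout mentioned above
    ]
    for name in candidates:
        if not _module_exists(name):
            continue
        module = importlib.import_module(name)
        attrs = getattr(module, "AttrsDescriptor", None)
        if attrs is not None:
            return attrs
    return None
```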

test: ran with triton commit e5be006a (last working commit) and 34a6a2ff8 (in june, when AttrsDescriptor was still in triton.compiler.compiler)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144807
Approved by: https://github.com/ezyang
2025-01-16 00:32:29 +00:00
cyy
ee97d80be2 Apply Ruff fixes and pyupgrade to torch/jit (#144208)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144208
Approved by: https://github.com/davidberard98
2025-01-16 00:28:50 +00:00
774f21a370 [export] handle buffer/input mutations for joint-graph (#144806)
Summary: previous construction of GraphSignature output specs didn't consider buffer/user input mutations

Test Plan: test_experimental

Differential Revision: D68177409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144806
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2025-01-16 00:22:16 +00:00
d7f45fc575 dynamic shape support for interpolate(antialias=True) backward (#141198)
Fixes https://github.com/pytorch/pytorch/issues/141187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198
Approved by: https://github.com/ezyang, https://github.com/Chillee
ghstack dependencies: #141161
2025-01-16 00:08:25 +00:00
4831f89790 support numbers as tensors for aten.copy(Tensor, Tensor) (#141161)
Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`.
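
A sketch of the calls in question (whether the functional overload accepts a plain number depends on having this change):

```python
import torch

t = torch.zeros(3)
t.copy_(1.0)                                    # in-place copy already accepts a plain number
out = torch.ops.aten.copy(torch.zeros(3), 2.0)  # the functional variant gets the same treatment
print(t, out)
```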

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161
Approved by: https://github.com/ezyang
2025-01-16 00:08:25 +00:00
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
fb4b5a9299 [ONNX] Use python_dispatcher in type promotion (#144801)
Fix #143118

Use python_dispatcher in the type promotion pass to preserve symbolic shapes according to @angelayi 's suggestions. (Thanks!)

Tested locally. I wasn't able to create a minimal repro except for using the full model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144801
Approved by: https://github.com/titaiwangms
2025-01-15 23:25:19 +00:00
7265dc0622 Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/huydhn
2025-01-15 23:20:10 +00:00
4e1834f5f3 use cooperative schedule in scaled_mm for fast_accum=false (#144809)
This improves perf for large matrices by more than 2x, more detailed benchmark coming.
On master
![image](https://github.com/user-attachments/assets/fc6a0987-5b82-475d-a2ff-b46641bb17dc)
On this branch
<img width="601" alt="image" src="https://github.com/user-attachments/assets/7f55152b-1110-45e4-b2ea-6f274d543869" />
A plot similar to https://github.com/pytorch/ao/pull/1325#discussion_r1868193786
<details>
  <summary>Benchmarking code:</summary>

```python
import torch
from triton.testing import do_bench
import itertools

def fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale_a.view(-1, 1), scale_b.view(1, -1), use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

def fn_aten(a, b, scale, use_fast_accum=False):
    return torch._scaled_mm(a, b.t(), scale, scale, use_fast_accum=use_fast_accum, out_dtype=torch.bfloat16)

for i,j,k in itertools.product(range(9, 15), range(9, 15), range(9, 15)):
    m = 2**i
    n = 2**j
    k = 2**k

    a=torch.randn(m, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    b=torch.randn(n, k, device="cuda").to(dtype=torch.float8_e4m3fn)
    scale_a = torch.randint(1, 11, (a.shape[0],), device="cuda", dtype=torch.float32)
    scale_b = torch.randint(1, 11, (b.shape[0],), device="cuda", dtype=torch.float32)
    scale_0 = torch.randn((), device="cuda", dtype=torch.float32)

    ms_rowwise_fast = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=True), warmup=25, rep=50)
    ms_rowwise_slow = do_bench(lambda: fn_aten_scales(a, b, scale_a, scale_b, use_fast_accum=False), warmup=25, rep=50)

    ms_tensor_fast = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=True), warmup=25, rep=50)
    ms_tensor_slow = do_bench(lambda: fn_aten(a, b, scale_0, use_fast_accum=False), warmup=25, rep=50)

    print(f"m={m}, n={n}, k={k}, fast={ms_rowwise_fast}, slow={ms_rowwise_slow}, ratio_tw={ms_tensor_slow /ms_tensor_fast}, ratio_rw={ms_rowwise_slow / ms_rowwise_fast}")

```
</details>

Higher N/K values still have about 40% penalty, perhaps some additional heuristics tweaks would be useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144809
Approved by: https://github.com/drisspg
2025-01-15 23:04:14 +00:00
0f051eaf66 Revert "Fix global namespace pollution in ATen/Dispatch.h (#138626)"
This reverts commit 326c7cae28783f29c577b5a5d3ac38a3b61188bd.

Reverted https://github.com/pytorch/pytorch/pull/138626 on behalf of https://github.com/malfet due to This broke inductor tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_torchbench%2C%202%2C%202 ([comment](https://github.com/pytorch/pytorch/pull/138626#issuecomment-2594021436))
2025-01-15 21:59:04 +00:00
Sam
c7b2f7dd14 Add generator parameter to rand*_like functions (#136780)
Fixes #128786
Fixes #101974
Fixes #27072
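
A minimal usage sketch of the new keyword (assuming it is exposed as `generator`, per the title):

```python
import torch

x = torch.empty(3, 4)

g = torch.Generator().manual_seed(0)
a = torch.rand_like(x, generator=g)                                   # uniform [0, 1)
b = torch.randn_like(x, generator=torch.Generator().manual_seed(0))   # standard normal
print(a[0, 0].item(), b[0, 0].item())
```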

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136780
Approved by: https://github.com/Chillee, https://github.com/ezyang
2025-01-15 21:16:52 +00:00
d62b3979da cpp_wrapper: Move #includes to per-device header files (#143909)
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909
Approved by: https://github.com/desertfire
2025-01-15 21:14:02 +00:00
05095a45f2 Fix the wrong artifact in remaining workflows (#144812)
I missed them in https://github.com/pytorch/pytorch/pull/144694 as they weren't run often.  But they are still failing nonetheless, i.e. https://github.com/pytorch/pytorch/actions/runs/12762640334/job/35578870178

The issue was from https://github.com/pytorch/pytorch/pull/125401 where it added `use-gha: ${{ inputs.use-gha }}` to linux_test workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144812
Approved by: https://github.com/clee2000
2025-01-15 20:36:40 +00:00
b88dcb4835 dynamo: Don't crash when tracing a missing attr on a constant. (#144593)
dynamo: Don't crash when tracing a missing attr on a constant.

This throws an InternalTorchDynamoError: AttributeError: 'NoneType' object has no attribute 'max'
instead of just skipping the bad call when tracing and throwing a
normal AttributeError instead.
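
A minimal repro sketch of the scenario above (assumes a working torch.compile setup):

```python
import torch

@torch.compile
def f(x):
    y = None
    return x + y.max   # attribute lookup on a constant that has no such attribute

try:
    f(torch.randn(4))
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'max'
```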

There are two questions that I would love reviewer comment on.
1) Is throwing unimplemented the right thing here? or should I throw
   something like ObservedAttributeError
2) Do we need to worry about performance with this code? In particular,
   should we just catch the exception? Or maybe cache the lookup result?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144593
Approved by: https://github.com/jansel
2025-01-15 20:23:43 +00:00
d812fdd490 fix as_bool serde (#144791)
Differential Revision: [D68167701](https://our.internmc.facebook.com/intern/diff/D68167701/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144791
Approved by: https://github.com/pianpwk
2025-01-15 20:22:26 +00:00
904641769e [MPSInductor] Implement pow() (#144827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144827
Approved by: https://github.com/dcci, https://github.com/jansel
2025-01-15 20:11:34 +00:00
b410378d93 Register nonzero for meta device for FBLSim (#144727)
Summary:
Fix `nonzero is not registered to meta` issue:
```
"NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered".
```

Reviewed By: ezyang

Differential Revision: D66525640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727
Approved by: https://github.com/ezyang
2025-01-15 19:40:42 +00:00
834086c023 [export] Load side info about pos/kw argument kind for serialization. (#144686)
Summary:
Fixing issue of nodes like
```
torch.ops.aten.linear.default(x, w, b)
```
being deserialized as
```
torch.ops.aten.linear.default(x, w, bias=b)
```
which breaks roundtripping.

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r TestDeserialize
buck test mode/opt caffe2/test:test_export -- -r TestSerialize

Differential Revision: D67991410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686
Approved by: https://github.com/angelayi
2025-01-15 19:08:38 +00:00
898a90c6bb [dynamo][hop] Introduce FlexAttentionBackwardHighOrderVariable (#144533)
FIXES https://github.com/pytorch/pytorch/issues/143180

This PR adds a new variable mapping to SourcelessBuilder to represent the flex attention intermediates. The variable proxies a call to the HOP and carries over the graph state (subgraphs represented as UnspecializedNNModuleVariable) to the dynamo output graph. This is safe to do because the nn modules used in flex attention have either been speculated on before or are outputs of make_fx of the forward.

tlparse of `TestCompiledAutograd.test_flex_attention`: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpiWendk/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

```python
class GraphModule(torch.nn.Module):
    def forward(self, L_inputs_ : list):
         ...
         # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
        ...
        fw_graph0_0 = self.fw_graph0_0
        joint_graph0_0 = self.joint_graph0_0
        mask_graph0_0 = self.mask_graph0_0
        flex_attention_backward = torch.ops.higher_order.flex_attention_backward(aot0_primals_1, aot0_primals_1, aot0_primals_1, aot0_detach_3, aot0_detach_5, aot0_expand_5, aot0_zeros_1, fw_graph0_0, joint_graph0_0, (1, 1, aot0_ones, aot0_zeros, None, None, aot0__to_copy_1, aot0__to_copy_2, None, None, 1073741824, 1073741824, mask_graph0_0), 0.125, {'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'WRITE_DQ': True, 'OUTPUT_LOGSUMEXP': True}, (), ());  aot0_primals_1 = aot0_detach_3 = aot0_detach_5 = aot0_expand_5 = aot0_zeros_1 = fw_graph0_0 = joint_graph0_0 = aot0_ones = aot0_zeros = aot0__to_copy_1 = aot0__to_copy_2 = mask_graph0_0 = None
        aot0_getitem_4: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[0]
        aot0_getitem_5: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[1]
        aot0_getitem_6: "bf16[1, 1, s0, s1][s0*s1, s0*s1, s1, 1]cuda:0" = flex_attention_backward[2];  flex_attention_backward = None
        ...

    class fw_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0"):
            return arg0_1

    class joint_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "bf16[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0", arg4_1: "i32[][]cuda:0", arg5_1: "bf16[][]cuda:0"):
            return [arg5_1, None, None, None, None]

    class mask_graph0_0(torch.nn.Module):
        def forward(self, arg0_1: "i32[][]cuda:0", arg1_1: "i32[][]cuda:0", arg2_1: "i32[][]cuda:0", arg3_1: "i32[][]cuda:0"):
             # File: /data/users/xmfan/core/b/pytorch/torch/_dynamo/compiled_autograd.py:832 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 1)
            new_ones: "b8[][]cuda:0" = torch.ops.aten.new_ones.default(arg0_1, [], dtype = torch.bool, device = device(type='cuda', index=0), pin_memory = False);  arg0_1 = None
            return new_ones

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144533
Approved by: https://github.com/zou3519
2025-01-15 18:40:57 +00:00
eqy
a6763b7b81 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-15 18:37:55 +00:00
6ac0616504 [ROCm] hipblaslt rowwise f8 gemm (#144432)
hipblaslt added rowwise f8 gemm support.  Integrate with scaled_mm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144432
Approved by: https://github.com/drisspg
2025-01-15 18:23:44 +00:00
069419569d [PagedAttention] Support different input position for each batch index (#144693)
In LLM inference, each request usually has a different prefill length, leading to a different input position for each batch index. This PR adds such support for paged attention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144693
Approved by: https://github.com/drisspg
2025-01-15 18:03:52 +00:00
7e80758efc [CUDAGraph][Docs] add cuda to torch.randn (#144793)
The previous doc example created the `torch.randn` tensor on CPU, so CUDAGraph was skipped.
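
A sketch of the corrected pattern (assumes a CUDA device; warmup before capture is recommended and omitted here for brevity):

```python
import torch

x = torch.randn(4, device="cuda")       # static input must live on CUDA
g = torch.cuda.CUDAGraph()

with torch.cuda.graph(g):
    y = x * 2                            # captured work; y is a static output

x.copy_(torch.randn(4, device="cuda"))   # refresh the static input in place
g.replay()                               # re-runs the captured kernels, updating y
print(y)
```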

Fixes #144386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144793
Approved by: https://github.com/eellison
2025-01-15 18:02:10 +00:00
ee8f833d13 Undo leading underscore on ctx for breakpoint (#144864)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144864
Approved by: https://github.com/Skylion007
2025-01-15 18:00:58 +00:00
443de667b1 Revert "Enable s8s8s8 for qlinear with mkl-dnn (#139887)"
This reverts commit dc8692b0eb093d5af150ae0f3a29a0957c3e4c0d.

Reverted https://github.com/pytorch/pytorch/pull/139887 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to have broken trunk. See here for more details: [GH job link](https://github.com/pytorch/pytorch/actions/runs/12788709683/job/35651699934) [HUD commit link](dc8692b0eb) ([comment](https://github.com/pytorch/pytorch/pull/139887#issuecomment-2593597977))
2025-01-15 17:58:33 +00:00
d065e8a9de [ez] add lint commits to .git-blame-ignore-revs (#144576)
Test Plan: Ran git blame on .lintrunner.toml and github's linter (+ manual testing) shows all commits exist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144576
Approved by: https://github.com/janeyx99
2025-01-15 17:39:29 +00:00
c07dc64017 Update pin memory related APIs to not pass 'device' argument (#131858)
Based on https://github.com/pytorch/pytorch/pull/126376, this PR updates all PT callers (e.g., `Tensor.is_pinned()`, `Tensor.pin_memory()`) to not pass the `device` argument.
As for `storage/untyped_storage.is_pinned()/pin_memory()`, we keep the `device` argument, but passing `device` is discouraged. If not given, the default `device` is still 'cuda' for BC.
Additionally, based on device-agnostic pin_memory, the `pin_memory_device` argument of `torch.utils.data.DataLoader` is now discouraged. For BC, explicitly passing this argument is still effective. If not given, the default `device` will be the current accelerator (see the sketch below).
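
A sketch of the device-agnostic usage this PR encourages (pinning only takes effect when an accelerator is present):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

t = torch.randn(8)
t_pinned = t.pin_memory()       # no device argument; resolved via the current accelerator
print(t_pinned.is_pinned())     # queried without a device argument as well

ds = TensorDataset(torch.randn(16, 4))
loader = DataLoader(ds, batch_size=4, pin_memory=True)  # pin_memory_device left unset
```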

Fixes #124908
Relates https://github.com/pytorch/pytorch/pull/126376

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131858
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-15 17:23:35 +00:00
0dca756832 Revert "Upload METADATA file with whl binaries (#143677)" (#144706)
This reverts commit 3eb3f4ed5580010a7961d996ccc6ee19c7ccbb5e.

Also reverts https://github.com/pytorch/pytorch/pull/144164

Manual revert because the above causes merge conflicts

Reverting in favor of https://github.com/pytorch/test-infra/pull/6159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144706
Approved by: https://github.com/janeyx99, https://github.com/atalman, https://github.com/malfet
2025-01-15 17:20:21 +00:00
d782e46a36 [BE] typing for decorators - library (#138969)
Test Plan: unit tests

Differential Revision: D62302678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138969
Approved by: https://github.com/zou3519
2025-01-15 17:08:55 +00:00
c7a9599100 Handle meta tensors in FX quantization (#144726)
Summary:
D66895899 got reverted in D67565250 because of pytorch OSS linter failure.
Adding back with the format the linter suggested
https://github.com/pytorch/pytorch/actions/runs/12443655335/job/34743090791

Test Plan: buck run fbcode//mode/dev-nosan fbcode//torchrec/fb/quant/tests:test_embedding_modules

Reviewed By: emlin

Differential Revision: D68132568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144726
Approved by: https://github.com/iamzainhuda, https://github.com/janeyx99
2025-01-15 16:49:43 +00:00
2bc18a9055 restore rng generation for fbcode (#144819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144819
Approved by: https://github.com/malfet, https://github.com/kit1980
2025-01-15 16:34:25 +00:00
154185dcd0 Revert "Removed unused _RequiredParameter (#144771)"
This reverts commit 6a5f895e549665a6895c84881a35736677071048.

Reverted https://github.com/pytorch/pytorch/pull/144771 on behalf of https://github.com/malfet due to It broke number of cpuinductor tests ([comment](https://github.com/pytorch/pytorch/pull/144771#issuecomment-2593293542))
2025-01-15 15:51:33 +00:00
7c52c97a65 Expose several APIs to public (torch python APIs) (#144525)
Fixes #144302
Try to expose several APIs to the public for the privateuse1 scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-15 14:34:45 +00:00
dc8692b0eb Enable s8s8s8 for qlinear with mkl-dnn (#139887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139887
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168, https://github.com/ng-05, https://github.com/digantdesai
2025-01-15 12:51:21 +00:00
7e1c1e65eb Graph freezing preparation for non-Inductor backends (#139902)
Enable preparing module named parameters and buffers in the tracing context so that non-Inductor backends can implement graph freezing.

Fixes #139272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139902
Approved by: https://github.com/eellison, https://github.com/masnesral, https://github.com/gujinghui
2025-01-15 11:25:04 +00:00
62ce3e6e84 refresh benchmark results after recent regressions (#143075)
refresh data after !5 regression by https://github.com/pytorch/pytorch/pull/144319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143075
Approved by: https://github.com/bobrenjc93, https://github.com/huydhn
2025-01-15 09:11:57 +00:00
e263f0af23 [BE] Make a SymbolInfo NamedTuple (#144745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144745
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
2025-01-15 08:59:27 +00:00
d9d7cca009 make eval_frame safe (#141357)
Fixes #108942

This PR converts eval_frame.c's static extension types to heap types, making it thread- and sub-interpreter-safe.

The current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR.

todo / suggestions:

1. uplift `eval_frame_callback_key` to module state
2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init`
3. define configurations for module allowing sub-interpreters or not

```c
static int module_exec(PyObject *m) {}

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_exec, module_exec},
    {0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT,
     ....
    .m_slots = module_slots
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357
Approved by: https://github.com/jansel

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2025-01-15 07:37:50 +00:00
6ba53a5f1c [AMD] De-noise tf32 warnings (#144797)
Summary: This is way too noisy especially during unit tests. So just log once.

Test Plan: OSS CI. Tested on a unit test and now I only see one line (hard to notice :) ).

Differential Revision: D68167633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144797
Approved by: https://github.com/jianyuh, https://github.com/leitian, https://github.com/yoyoyocmu
2025-01-15 07:10:38 +00:00
69b883d7ac Remove C10_EMBEDDED (#144808)
I added this to support code sharing with ExecuTorch, but the operator<< overrides are load-bearing for builds -- we have other code that attempts to pretty-print Half/BFloat16, and implicit conversions can't be used to make that work because there are *multiple* implicit conversions from Half/BFloat16 to primitive types, so which one to select is ambiguous. Also, we don't actually seem to need it now in ExecuTorch core because we have `include <ostream>` in there at the moment anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144808
Approved by: https://github.com/janeyx99, https://github.com/malfet
2025-01-15 06:08:53 +00:00
b801210035 Restore support for other types of async_compile pools (spawn, fork) (#144491)
Summary: https://github.com/pytorch/pytorch/pull/142001 removed support for process pools other than "subprocess", but some OSS users still find it useful; put it back.

Test Plan: New unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144491
Approved by: https://github.com/jansel, https://github.com/haifeng-jin
2025-01-15 06:04:49 +00:00
326c7cae28 Fix global namespace pollution in ATen/Dispatch.h (#138626)
Summary:

Was it a typo? Since we already have `at::detail::record_kernel_function_dtype()` in `ATen/Dispatch.h`

Test Plan: just build

Differential Revision: D64642080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138626
Approved by: https://github.com/malfet
2025-01-15 05:43:54 +00:00
7d71ddbe5d Add non_c_binding torch functions to allowlist for AOTAutogradCache, confirm no special handlers for them (#144802)
Differential Revision: [D68173093](https://our.internmc.facebook.com/intern/diff/D68173093/)

This diff allows any function in torch_non_c_binding_in_graph_functions to be safe to cache. These functions should be safe to cache because they are part of the torch API, and do not save global state (or if they do, dynamo creates unique guards around the constants they return).
A function that's allowed in a dynamo graph is safe to cache for AOTAutograd purposes as long as:
- It's functional (i.e. does not access global state);
- or its value is constant folded away (and guarded against by dynamo)

The tricky cases are functions that dynamo uses special handlers to track. These special handlers can sometimes close over stuff that's safe for dynamo locally, but isn't encoded anywhere when cached across processes. An example of this is `DTensor.from_local`, where various DeviceMesh information doesn't change in the same dynamo process, but can change across multiple processes. The handler for `DTensor.from_local` closes over these and dynamo creates a proxy for the function call. This is not safe to cache.

That said, most special handlers are in fact functional and safe. So I add a unit test to test_trace_rules.py that confirms that any function with special handlers in dynamo added to this list needs to be audited to be safe to cache.

The list of safe handlers there either:
- Don't access global state;
- Guard on global state; or
- Always returns a constant that never changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144802
Approved by: https://github.com/bdhirsh
2025-01-15 05:41:36 +00:00
79312ddb73 [PP] Don't allow for num_microbatches > num_stages for single stage schedules (#144702)
There is an edge case where `Schedule1F1B` will hang when num_microbatches=1 (https://github.com/pytorch/torchtitan/issues/775). For validation, it makes sense to check that the number of stages should be >= the number of microbatches; otherwise there will be an even larger bubble.

This can be removed when we have the single stage schedules to use an IR and updated to run with schedule runtime (issue tracker https://github.com/pytorch/pytorch/issues/144701)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144702
Approved by: https://github.com/kwen2501
2025-01-15 05:35:29 +00:00
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
As reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling reference in `CudaEventCache`, so we want to fix it.
1. We add a unit test to repro the issue mentioned in the report.
2. Instead of converting variables to shared pointers as suggested in the issue, we make the cache itself a shared pointer. So if the thread that creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent gets deleted. (Thanks for the suggestion from @kwen2501.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
9cd6f46130 [ca] raise error message on AOT Autograd caching (#144595)
FIXES https://github.com/pytorch/pytorch/issues/144175, bandaid

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144595
Approved by: https://github.com/bdhirsh
2025-01-15 05:09:42 +00:00
e0bbff6019 [c10d][ez] Add comments to the end of Macro for better readability (#144789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789
Approved by: https://github.com/c-p-i-o
2025-01-15 05:06:41 +00:00
d2ca8163c0 [MPSInductor] Support abs in MetalPrintExpr (#144826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144826
Approved by: https://github.com/dcci
ghstack dependencies: #144509, #144798, #144795, #144796
2025-01-15 05:01:25 +00:00
9610a22e94 Fix FakeTensor device creation for MPS (#144796)
By promoting torch.device("mps") to `torch.device("mps:0")`, but skipping the `is_initialized` check, as MPS does not really support multi-GPU right now.

This fixes `GPUTests.test_remove_no_ops_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144796
Approved by: https://github.com/ezyang
ghstack dependencies: #144509, #144798, #144795
2025-01-15 05:01:25 +00:00
18786c65e5 [BE] Extend test_remove_no_ops (#144795)
----

- Use `is_dtype_supported` to skip dtype promotions portion of the test on unsupported device
- Extend it to use `torch.float16` so promotions could be checked there
- Implement `CpuInterface.is_bfloat16_supported` that returns true (which looks like the case, even if it's supported via emulation)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144795
Approved by: https://github.com/Skylion007
ghstack dependencies: #144509, #144798
2025-01-15 05:00:26 +00:00
48f7e7c378 [torch][ao][EASY] Change print to log in numeric debugger to avoid large output (#144790)
Summary:
This print statement was spewing a bunch of data in logs by default, but it should
be silenceable.
Use `log.debug` instead.

Differential Revision: D68166823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144790
Approved by: https://github.com/tarun292
2025-01-15 04:58:56 +00:00
6a5f895e54 Removed unused _RequiredParameter (#144771)
As per this [discussion](https://discuss.pytorch.org/t/a-question-about-requiredparameter/137977), I figured that `_RequiredParameter` is no longer used.

The `required` object was initially introduced in this [PR](4db6667923) as the `SGD` optimizer did not offer a default value for the learning rate. However there isn't a single place in the code base using `_RequiredParameter`, nor `required`. I am therefore removing unused `_RequiredParameter` and `required`.

Everything not included in this PR is Not a Contribution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144771
Approved by: https://github.com/janeyx99
2025-01-15 04:11:17 +00:00
cyy
d87aad6877 [5/N] Apply Ruff fixes and pyupgrade to Python 3.9 (#144205)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144205
Approved by: https://github.com/albanD
2025-01-15 04:00:47 +00:00
db787181b5 Back out "[Submodule] Upgrade to Cutlass 3.6" (#144738)
Summary: Revert due to perf regressions see: https://github.com/pytorch/pytorch/issues/144729

Test Plan: sand castle

Differential Revision: D68137326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144738
Approved by: https://github.com/huydhn
2025-01-15 02:57:14 +00:00
e2251fffbb [MPSInductor] Add min/max to MetalExprPrinter (#144798)
After that `GPUTests::test_avg_pool2d8_mps` and `GPUTests::test_avg_pool2d5_mps` passes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144798
Approved by: https://github.com/dcci
ghstack dependencies: #144509
2025-01-15 01:43:42 +00:00
9199c79a9c [Quant][Inductor][X86] Separate unary post op fusion and lowering for qconv (#144312)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves unary post op fusion of qconv out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - conv` patterns are replaced by `onednn.qconv2d_pointwise`
2. Fuse `onednn.qconv2d_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`es to handle the post op fusion; the pattern matchers used for fusion are reused.
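
For orientation, here is a rough sketch of the PT2E-quantization-plus-Inductor flow these fusion passes serve; exact module paths may vary between PyTorch versions:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())

exported = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)              # calibration
converted = convert_pt2e(prepared)

# Weight prepack, post-op fusion (e.g. qconv + ReLU) and cpp lowering happen inside compile.
compiled = torch.compile(converted)
compiled(*example_inputs)
```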

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144312
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
ghstack dependencies: #144224
2025-01-15 00:50:54 +00:00
825fe15024 EZ fix to make sure local pytest run succeeds in export (#144764)
Previously, run_tests() was protected under the IS_FBCODE flag, so that the following works:
```
python test/export/test_export_legacy.py
```

But it fails on:
```
pytest test/export/test_export_legacy.py
```

This is because pytest doesn't seem to get triggered through run_tests().
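
For reference, the usual pattern in PyTorch test files that keeps both invocations working is roughly as follows (shown with a placeholder test case, not the actual file contents):

```python
from torch.testing._internal.common_utils import TestCase, run_tests

class ExportSmokeTest(TestCase):
    def test_something(self):
        self.assertEqual(1 + 1, 2)

if __name__ == "__main__":
    # `python test/export/foo.py` goes through run_tests(); `pytest test/export/foo.py`
    # discovers the TestCase classes directly, so run_tests() must not be the only entry point.
    run_tests()
```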

Differential Revision: [D68152737](https://our.internmc.facebook.com/intern/diff/D68152737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144764
Approved by: https://github.com/avikchaudhuri
2025-01-15 00:43:40 +00:00
8c2aa0c533 [cutlass backend] cexpr the arg before writing to cpp file (#144714)
Summary: The problem is that for certain shapes (see the unit test), one of the dimensions is a symbolic expression like `s0 // 2`. With the cutlass backend, that expression would be written as-is to the generated C++ file, which would lead to a C++ compilation error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144714
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78, https://github.com/desertfire
2025-01-14 23:09:44 +00:00
8ad37ed710 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-14 22:32:51 +00:00
ea3395e4f2 [ROCm] Improvements for vectorized elementwise kernels (#143269)
*  Compute io_size as the minimum of the input and output sizes, rather than the summation of all sizes (see the sketch below)
   * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
   * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16 respectively
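
A back-of-the-envelope sketch of the numbers quoted above; `elems_per_thread_for` is an illustrative mapping, not the actual kernel heuristic:

```python
# torch.add on fp16/bf16: two 2-byte inputs, one 2-byte output
input_bytes = [2, 2]
output_bytes = [2]

old_io_size = sum(input_bytes) + sum(output_bytes)      # 6 (summation of all sizes)
new_io_size = min(sum(input_bytes), sum(output_bytes))  # 2 (min of input vs output size)

def elems_per_thread_for(io_size: int) -> int:
    # illustrative only: a smaller io_size lets each thread handle more elements
    return 8 if io_size <= 4 else 4

print(old_io_size, elems_per_thread_for(old_io_size))  # 6 -> 4
print(new_io_size, elems_per_thread_for(new_io_size))  # 2 -> 8, faster on AMD GPUs
```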

Co-author: @akadutta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
2025-01-14 22:09:21 +00:00
c000214826 Allow GradientEdge as torch.autograd.backward outputs (#144744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144744
Approved by: https://github.com/albanD
2025-01-14 21:31:44 +00:00
64829b356a [PrivateUse1] Support parseDispatchKey with modified PrivateUse1 (#144325)
PyTorch now supports many PrivateUse1-derived dispatch key names like `AutogradPrivateUse1` or `QuantizedPrivateUse1`, not to mention the original `PrivateUse1` backend.

However, users that implement `PrivateUse1` functionality typically rename the backend by calling `torch.utils.rename_privateuse1_backend("my_backend")`. In that case, the `PrivateUse1` backend string is no longer found when we call other functions that rely on it. For example, when we use `torch.library` to register custom functions for our new backend, we would use "my_backend" as the backend name instead of "PrivateUse1", and the following error is thrown:
```
could not parse dispatch key 'my_backend'
```

So, this PR changes the function `c10::DispatchKey parseDispatchKey(const std::string& k)` to double-check whether `PrivateUse1` has been renamed; if so, it maps `k` from the new backend name back to the corresponding key and looks it up again.
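
A minimal sketch of the scenario (the backend name and kernel body are illustrative):

```python
import torch

torch.utils.rename_privateuse1_backend("my_backend")

def my_empty_memory_format(size, dtype=None, layout=None, device=None,
                           pin_memory=False, memory_format=None):
    ...  # would forward to the custom device's allocator

lib = torch.library.Library("aten", "IMPL")
# Before this PR only the literal "PrivateUse1" string parsed here;
# now the renamed backend string resolves to the same dispatch key.
lib.impl("empty.memory_format", my_empty_memory_format, "my_backend")
```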

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144325
Approved by: https://github.com/albanD
2025-01-14 21:21:29 +00:00
130452dad6 [Pipelining] fix test_schedule.py (missing destroy_process_group) (#144734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144734
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352, #144596
2025-01-14 21:16:09 +00:00
aa57f0c663 [Pipelining] Refactor common utils from test_pp_dp (#144596)
Split test_pp_dp into pp_ddp and pp_fsdp so it's a bit more
concise and easier to add CP to the FSDP one.

Realized that the 'use_new_runtime' parametrization was not even being used;
removing it saves a bunch of test time. We should migrate schedules to
the new runtime and have them be covered that way. (test_schedule*.py
already tests the new runtime too.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144596
Approved by: https://github.com/H-Huang
ghstack dependencies: #144352
2025-01-14 20:13:17 +00:00
6f5dce3035 [Pipelining] Fix PP grad scaling (#144352)
Adds a grad-scaling method `perform_pp_grad_scaling()` which divides grads by num_microbatches.

Enables grad scaling by default, unless disabled due to using a loss function that sums instead of averaging losses.
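
A minimal sketch of the scaling step (not the actual schedule code), assuming gradients were accumulated over `num_microbatches` microbatches with a mean-reduced loss:

```python
import torch.nn as nn

def scale_grads_for_pp(module: nn.Module, num_microbatches: int) -> None:
    # Each microbatch contributed a full-size gradient, so divide the
    # accumulated gradients to recover the average over the whole batch.
    for p in module.parameters():
        if p.grad is not None:
            p.grad.div_(num_microbatches)
```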

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144352
Approved by: https://github.com/H-Huang
2025-01-14 20:13:17 +00:00
9157a748a6 [MPSInductor] Add dummy properties (#144509)
For compute capability, return an empty string (same as CPU).
For the multicore count, return 8, as this is the smallest number of GPU cores on Apple silicon.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144509
Approved by: https://github.com/jansel
2025-01-14 20:12:38 +00:00
bdd942efd7 Revert "Increase C10_COMPILE_TIME_MAX_GPUS to 128 (#144138)"
This reverts commit 6cfc08167595e27ee9a5701c6426a7a8a7e387ef.

Reverted https://github.com/pytorch/pytorch/pull/144138 on behalf of https://github.com/albanD due to This seems to impact the caffe2 code ([comment](https://github.com/pytorch/pytorch/pull/144138#issuecomment-2590891200))
2025-01-14 19:04:12 +00:00
b4b4e57469 [CD] Enable profiling for XPU Windows nightly wheels (#144316)
PR https://github.com/pytorch/pytorch/pull/144034 added profiling support for the torch XPU Windows binary; this PR enables it in PyTorch XPU Windows CD
Works for https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144316
Approved by: https://github.com/xuhancn, https://github.com/atalman
2025-01-14 19:01:27 +00:00
2683691237 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-14 18:47:42 +00:00
e2891d43a8 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +1 (#144783)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, its associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144783
Approved by: https://github.com/albanD, https://github.com/malfet
2025-01-14 18:34:54 +00:00
ec1c3ab3b2 [inductor][triton] skip test_data_type_propagation if triton (#142054)
Non-cpp inductor backends don't have a `DataTypePropagation` pass on the scheduler nodes, so skip the test. CUDA only passes because the device is currently not changed to "cuda" in the test body.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142054
Approved by: https://github.com/eellison
2025-01-14 18:03:00 +00:00
e666807653 [Fix]: Enable Arm Neon & SVE support for FP32 Gemm Wrapper (#144327)
**Performance Improvements**:
Linear Layer [ 1x512 * 512x512 ] ->  2x - 4x
Linear Layer [ 3x512 * 512x512 ] -> 2x - 4x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144327
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/cfRod, https://github.com/malfet

Co-authored-by: Crefeda Rodrigues <crefeda.Rodrigues@arm.com>
2025-01-14 17:52:00 +00:00
eee7a47e94 Support FunctionalTensor subclass in is_fake and maybe_get_fake_mode (#144719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144719
Approved by: https://github.com/bdhirsh
2025-01-14 17:49:11 +00:00
d21738f24a Revert "Fix torch.normal ignores default_device (#144070)"
This reverts commit 184549b2d7e59acfc6e47d121e9ebb50648945b3.

Reverted https://github.com/pytorch/pytorch/pull/144070 on behalf of https://github.com/ezyang due to broken a specific use case ([comment](https://github.com/pytorch/pytorch/pull/144070#issuecomment-2590681953))
2025-01-14 17:41:58 +00:00
7977a3638e [executorch hash update] update the pinned executorch hash (#140769)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140769
Approved by: https://github.com/pytorchbot
2025-01-14 17:38:07 +00:00
f2975717f3 [CD] Fix slim-wheel nvjit-link import problem (#141063)
When other toolkit (say CUDA-12.3)  is installed and `LD_LIBRARY_PATH` points to there, import torch will fail with
```
ImportError: /usr/local/lib/python3.10/dist-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```
It could not be worked around by tweaking rpath, as it also depends on the library load order, which is not guaranteed by any linker. Instead, solve this by preloading `nvjitlink` right after global deps are loaded, by running something along the lines of the following:
```python
        if version.cuda in ["12.4", "12.6"]:
            with open("/proc/self/maps") as f:
                _maps = f.read()
            # libtorch_global_deps.so always depends in cudart, check if its installed via wheel
            if "nvidia/cuda_runtime/lib/libcudart.so" in _maps:
                # If all abovementioned conditions are met, preload nvjitlink
                _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```

Fixes https://github.com/pytorch/pytorch/issues/140797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141063
Approved by: https://github.com/kit1980

Co-authored-by: Sergii Dymchenko <sdym@meta.com>
2025-01-14 17:33:07 +00:00
5c727d5679 [minifier] Fix config generator for callables (#144518)
Summary:
When config contains callables, the current configs generated cannot be run:

```
torch._dynamo.config.reorderable_logging_functions = {<built-in function print>, <function warning at 0x7f774c595630>, <function log at 0x7f774c595870>, <function error at 0x7f774c595510>, <function info at 0x7f774c595750>, <built-in function warn>, <function exception at 0x7f774c5955a0>, <function debug at 0x7f774c5957e0>, <function critical at 0x7f774c5953f0>}
```

We fix the config to generate the right string, so the config is runnable, like below

```
import logging
import warnings
torch._dynamo.config.reorderable_logging_functions = { warnings.warn, logging.warn, print }
```

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:utils -- -r test_codegen_config
```

Differential Revision: D67998703

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144518
Approved by: https://github.com/desertfire
2025-01-14 17:18:13 +00:00
cbb1ed2966 [1/N] OpenReg: Replace open_registration_extension.cpp with openreg (#141815)
As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg.

The current `open_registration_extension.cpp` contains two parts:
1. Implementations to support the `PrivateUse1` backend.
2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`.

For the first part, we'll replace it with openreg. For the second part, we'll migrate the helpers to UT files step by step.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815
Approved by: https://github.com/albanD
2025-01-14 15:59:00 +00:00
347a74b8f5 Mark CUDA-12.6 as experimental for 2.6 release (#144769)
Because that's the first time we are trying to release it, and it also is the first release to use manylinux2_28
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144769
Approved by: https://github.com/atalman
2025-01-14 15:30:00 +00:00
60d2e32fa4 [BE] Remove lambda from str (#144743)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144743
Approved by: https://github.com/avikchaudhuri, https://github.com/Skylion007
ghstack dependencies: #144471
2025-01-14 15:10:57 +00:00
ffb3f32693 Add max kwarg to torch._check with alternate size oblivious semantics (#144471)
Fixes https://github.com/pytorch/pytorch/issues/120288 for the static bound case

I had been tying myself in knots in the original issue about the fact that we can't really do symbolic bounds like u0 < s0. But then I realized, "Wait, but the static bounds are easy!" So this makes it so you can also exclude a specific upper bound when doing size oblivious tests, which is enough to solve https://github.com/pytorch/pytorch/issues/123592#issuecomment-2574556708

It's written very dirtily, maybe there's some cleanup. Bikeshed on the public API name also welcome.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144471
Approved by: https://github.com/avikchaudhuri
2025-01-14 15:10:57 +00:00
95b41d2aa4 Tests Generalization for multiple accelerator devices (#139749)
Motivation: Generalize unit tests so that they can be executed on cuda and non-cuda devices.
Changes: There are general changes in the common_dtensor module for device type generalization so that tests can be executed on non-cuda devices too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139749
Approved by: https://github.com/kwen2501
2025-01-14 08:52:46 +00:00
1800f5f461 Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)
**Motivation:**

- Enable coalescing path on XPU for `batch_isend_irecv`.
- If XCCL backend is specified, then construct a XPU tensor to ensure `barrier` dispatch to XCCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735
Approved by: https://github.com/kwen2501
2025-01-14 08:37:48 +00:00
21cbee5d9b Drop unused num_elements variable (#144723)
Summary:
With the recent enforcement of unused variable as an error in D67329035, certain tests like
https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0
can't build citing:
```
Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic))
Remote command returned non-zero exit code 1
Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145`
Stdout: <empty>
Stderr:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable]
 3757 |         size_t num_elements = output.numel();
      |                ^~~~~~~~~~~~
1 error generated.
```
This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock.

Test Plan:
```
buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test
```

https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33

Reviewed By: c-p-i-o

Differential Revision: D68126236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723
Approved by: https://github.com/fduwjj, https://github.com/Skylion007

Co-authored-by: Daulet Askarov <dauleta@meta.com>
2025-01-14 08:29:01 +00:00
80eff6e720 [MPS] fix triangular for >3D tensors (#144545)
The old implementation produced incorrect output because it did not handle batch shapes other than 3D tensors (B, M, N).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144545
Approved by: https://github.com/malfet
2025-01-14 08:25:01 +00:00
8436a5c2cb [Quant][Inductor][X86] Separate binary post op fusion and lowering for qlinear (#144224)
**Summary**
The current implementation fuses quantized ops and their post ops and lowers the fused op to the cpp backend in the same pass. It is better to separate post op fusion and lowering because
- it looks better in terms of design
- we need the post op fusion pass for PT2E quantization eager mode

As one of a series of PRs which do the separation, this PR moves binary post op fusion of qlinear out of the lowering pass to after the weight-prepack pass. The workflow is
1. Weight prepack for qlinear so that `dq - linear` patterns are replaced by `onednn.qlinear_pointwise`
2. Fuse `onednn.qlinear_pointwise` and post ops
3. Lower to cpp backend

This PR adds additional `PatternMatcherPass`es to handle the post op fusion; the pattern matchers used for fusion are reused.

**Test plan**
It is covered by existing UTs in `test_mkldnn_pattern_matcher.py` for post op fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144224
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-01-14 06:46:38 +00:00
c031defe0b [RELAND] Generalize at::manual_seed for all accelerators (#144370)
# Additional Context
This is a reland PR originated from eeb57394f93d720bca498c3fa9d167fc7b9cca46

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144370
Approved by: https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui
2025-01-14 06:09:36 +00:00
9d98b66e7b [Inductor][CPP] Enable Epilogue Fusion for Grouped GEMM Template (#143897)
**Summary**
In this PR, we enable the epilogues fusion and code generation for Grouped GEMM. Here are the high-level description of how we implement it.

**Fusion**

- The Grouped GEMM Template produces a `Template Buffer` with a `MultiOutputLayout` and a set of `MultiOutput Buffers`, where each buffer corresponds to a specific GEMM.
- During the initial round of fusion, the `Template Buffer` and all associated `MultiOutput Buffers` are fused into a `FusedSchedulerNode` by extending the existing fusion design.
- In subsequent fusion rounds, this `FusedSchedulerNode` can further fuse with its epilogues, following the original fusion design principles.

**Code Gen**
We maintain a list of epilogues and codegen it one by one.

- If any of the GEMMs has a bias, we create an extra `bias_add` epilogue and prepend it to the front of the epilogue list.
- If any of the GEMMs has no epilogue, we create a `to_bf16` copy epilogue and append it to the end of the epilogue list.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_epilogue
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143897
Approved by: https://github.com/jansel, https://github.com/jgong5
ghstack dependencies: #143796
2025-01-14 06:07:50 +00:00
25de671ea8 [Inductor][CPP] Enable Grouped GEMM Template (#143796)
**Summary**
Enable the CPP Grouped GEMM Fusion, lowering and Grouped GEMM Template following the RFC: https://github.com/pytorch/pytorch/issues/144012

- Support flexible number of GEMMs
- Share activation across GEMMs
  - The Grouped GEMM Template supports independent activations
  - However, the pattern matcher requires an anchor node, which serves as the shared activation across GEMMs
- Each GEMM can have a unique weight but same sizes
- Each GEMM can have a unique bias or None
  - Current PR does not yet support biases; this will be addressed in a follow-up epilogue fusion PR
- Each GEMM have its own epilogues
  - Epilogue fusion is not yet supported in this PR and will be enabled in an upcoming follow-up epilogue fusion PR

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_grouped_linear_invalid
python -u -m pytest -s -v test/inductor/test_cpu_cpp_wrapper.py -k test_grouped_linear
```

**Example**
Here is the example and generated code
```
import torch

batch_size = 4
in_features = 512
out_features = 1024
bias = False
dtype = torch.bfloat16

class M(torch.nn.Module):
    def __init__(self, bias):
        super().__init__()
        self.linear0 = torch.nn.Linear(in_features, out_features, bias=False)
        self.linear1 = torch.nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.linear0(x), self.linear1(x)

if __name__ == "__main__":
    with torch.no_grad():
        input = torch.randn(batch_size, in_features, dtype=dtype)
        m = M(bias=bias).to(dtype=dtype).eval()
        cm = torch.compile(m)
        act_res = cm(input)
```

Generated Code:  https://gist.github.com/leslie-fang-intel/ed2e8d23aeb3586eb504feeace692e16#file-grouped-gemm-generated-code-py

**Next Step**

- Support Epilogue fusion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143796
Approved by: https://github.com/jgong5, https://github.com/jansel
2025-01-14 05:59:07 +00:00
35b46a75f1 [mps/inductor] Add support for round() (#144731)
With this change, inductor/test_view_on_aliased passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144731
Approved by: https://github.com/malfet
2025-01-14 05:56:13 +00:00
17e05cde0c ROCm: Skip tests in elastic/utils/distributed_test (#144692)
The tests are failing on ROCm machines due to the error below: "The client socket has timed out after 1000ms while trying to connect to (gpu4f67.jax.cs.cpe.ice.amd.com, 0)".
Disabling the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144692
Approved by: https://github.com/jeffdaily
2025-01-14 03:49:06 +00:00
e58c823ab8 Implement increment and add_to_set for CompileEventLogger (#143427)
This diff implements `increment` and `add_to_set`, which are features of MetricsContext but not of ChromiumEventLogger. This allows us to move a bunch of other MetricsContext call sites to use CompileEventLogger instead.

Differential Revision: [D67354867](https://our.internmc.facebook.com/intern/diff/D67354867/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143427
Approved by: https://github.com/masnesral
2025-01-14 02:42:49 +00:00
6053242890 [CD] Enable python3.13t builds for aarch64 (#144698)
But make sure that the right numpy version is picked (2.0.2 does not support 3.13)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144698
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697, #144716
2025-01-14 02:29:01 +00:00
b221f88fc1 Leave SCCACHE_S3_KEY_PREFIX empty to share the cache among all build jobs (#144704)
This is a follow-up of https://github.com/pytorch/pytorch/pull/144112#pullrequestreview-2528451214.  After leaving https://github.com/pytorch/pytorch/pull/144112 running for more than a week, all build jobs were fine, but I failed to see any improvement in build time.

So, let's try @malfet's suggestion of removing the prefix altogether to keep it simple.  After this lands, I will circle back to see if there are any improvements.  Otherwise, it's still a simple BE change I guess.

Here is the query I'm using to gather build time data for reference:

```
with jobs as (
    select
        id,
        name,
        DATE_DIFF('minute', created_at, completed_at) as duration,
        DATE_TRUNC('week', created_at) as bucket
    from
        workflow_job
    where
        name like '%/ build'
        and html_url like concat('%', {repo: String }, '%')
        and conclusion = 'success'
        and created_at >= (CURRENT_TIMESTAMP() - INTERVAL 6 MONTHS)
),
aggregated_jobs_in_bucket as (
    select
        --groupArray(duration) as durations,
        --quantiles(0.9)(duration),
        avg(duration),
        bucket
    from
        jobs
    group by
        bucket
)
select
    *
from
    aggregated_jobs_in_bucket
order by
    bucket desc
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144704
Approved by: https://github.com/clee2000
2025-01-14 02:19:38 +00:00
6d56277682 [export] Fix torchbind constant folding (#144684)
Summary: `CallTorchBind` should not be folded during constant folding

Test Plan:
```
buck2 run mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding_torchbind
```

Reviewed By: henryoier

Differential Revision: D67721272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144684
Approved by: https://github.com/zhxchen17
2025-01-14 01:58:44 +00:00
eaa8a97b39 [RelEng] Add --ami option to build_aarch64 (#144685)
Which should be mutually-exclusive with OS

For example, one can use the following to alloc one-off instance
```
./build_aarch64_wheel.py --alloc-instance  --instance-type g5.4xlarge --key-name nshulga-key --ami ami-0f51103893c02957c --ebs-size 200
```

TODO:
 - Figure out the EBS volume name depending on the AMI (for `ami-05576a079321f21f8` (al2023) it's `/dev/xvda`, but for `ami-0f51103893c02957c` (deep learning container) it's `/dev/sda1`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144685
Approved by: https://github.com/atalman
2025-01-14 01:48:27 +00:00
de9d6a25d7 [mps/inductor] Add support for ceil (#144715)
inductor/test_index_dynamic_shapes passes after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144715
Approved by: https://github.com/malfet
2025-01-14 01:16:47 +00:00
64bcf39180 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit 388b75edec09182131be0dfe1abeafc5c3b91adf.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))
2025-01-14 00:48:28 +00:00
dfe06e555d Revert "Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)"
This reverts commit dcc04e9237292de10e9cedd8213253e253b1e91c.

Reverted https://github.com/pytorch/pytorch/pull/144483 on behalf of https://github.com/kit1980 due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/144441 ([comment](https://github.com/pytorch/pytorch/pull/144483#issuecomment-2588515018))
2025-01-14 00:46:48 +00:00
58302c4eaa [BE] [CD] Remove pygit2 dep for aarch64_wheel build (#144716)
As it's incompatible with 3.13t and only used to fetch the branch name, which could be done by running
```
git rev-parse --abbrev-ref HEAD
```

Also, remove yet another reference to long gone `master` branch.

Test plan:
  Download `manywheel-py3_11-cpu-aarch64.zip` produced by this PR, install it inside a docker container and check its version
```
# pip install torch-2.7.0.dev20250113+cpu-cp311-cp311-manylinux_2_28_aarch64.whl
...
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.12.0 jinja2-3.1.5 mpmath-1.3.0 networkx-3.4.2 sympy-1.13.1 torch-2.7.0.dev20250113+cpu typing-extensions-4.12.2
root@434f2540345e:/# python
Python 3.11.9 (main, Aug  1 2024, 23:33:10) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.0.dev20250113+cpu'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144716
Approved by: https://github.com/atalman
ghstack dependencies: #144696, #144697
2025-01-14 00:43:46 +00:00
dcc04e9237 Stop ignoring mypy errors in torch/testing/_internal/common_utils.py (#144483)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144483
Approved by: https://github.com/Skylion007
2025-01-13 23:19:44 +00:00
c15d6508bd Binary builds Docker images - remove cuda 12.1 (#144575)
Remove cuda 12.1 from manylinux, libtorch and almalinux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144575
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/malfet, https://github.com/Skylion007
2025-01-13 22:44:59 +00:00
4f74864c94 Revert "[AOTI] Add a boxed_run API (#142213)"
This reverts commit 868984c3e324dedeac04cf10e2bbfbf912dac3b1.

Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))
2025-01-13 22:43:47 +00:00
a54a784b82 [dynamo][dicts] Consolidate dict(..) construction (#144342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144342
Approved by: https://github.com/StrongerXi
2025-01-13 22:24:56 +00:00
0373cd9950 remove allow-untyped-defs from torch/distributed/checkpoint/api.py (#144653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144653
Approved by: https://github.com/Skylion007
2025-01-13 21:57:19 +00:00
1dab79470d c10::string_view -> std::string_view in pytorch (#143591)
Test Plan: Sandcastle

Differential Revision: D67312322

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143591
Approved by: https://github.com/malfet
2025-01-13 21:44:05 +00:00
5129d6ef51 Fix inductor periodic smoke test wrong artifact (#144694)
I'm not entirely sure why this failure started to show up in periodic since Friday https://github.com/pytorch/pytorch/actions/runs/12716967189/job/35463656803.  The artifact was uploaded to S3, but `use-gha: anything-non-empty-to-use-gh` was set and it was working.  Maybe this is related to https://github.com/pytorch/pytorch/issues/144479

I also cleaned up the GCP/AWS A100 selection logic, as the GCP cluster doesn't exist anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144694
Approved by: https://github.com/clee2000
2025-01-13 21:42:39 +00:00
e15f91337b [inductor] Add unbacked symints binding in ShapeProp (#144605)
Summary: ShapeProp doesn't know how to propagate unbacked symints. Patch it up to propagate them the way PropagateUnbackedSymInts does.
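
For context, a plain ShapeProp run over a traced graph looks roughly like this; the unbacked-symint case additionally involves fake tensors with data-dependent sizes, which is not shown here:

```python
import torch
from torch.fx import symbolic_trace
from torch.fx.passes.shape_prop import ShapeProp

def f(x):
    return x.relu() + 1

gm = symbolic_trace(f)
ShapeProp(gm).propagate(torch.randn(4, 8))

for node in gm.graph.nodes:
    print(node.name, node.meta.get("tensor_meta"))
```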

Test Plan:
```
buck run mode/dev-nosan  fbcode//caffe2/test:fx -- -r test_shape_prop_unbacked_sym
```

Differential Revision: D68050073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144605
Approved by: https://github.com/guowentian, https://github.com/pianpwk
2025-01-13 21:30:20 +00:00
3c55669b88 Enable grep_linter to use -a (#144589)
Lintrunner can only apply changes (-a) if only one suggestion is made per file.  The grep_linter makes a suggestion for every line it finds incorrect, so it creates multiple suggestions per file if there are multiple lines that it wants to change

This sets the `line` parameter of the LintMessage to None for all of grep_linter, but I'm not sure if that entry did anything

I'm not sure if enabling -a is the best idea, since it's currently used for tabs and the tab width might differ each time?  I had one instance where running with -a caused the spacing to change.  On the other hand, -a would have already worked if only one line was bad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144589
Approved by: https://github.com/huydhn
2025-01-13 21:18:24 +00:00
91dbd7b75c [BE]: Improve typing inference with TypeIs (#144682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144682
Approved by: https://github.com/albanD

Co-authored-by: Aaron Orenstein <aorenste@meta.com>
2025-01-13 21:14:31 +00:00
4ceca4d60f [dynamo] Avoid graph break on updates to obj.__dict__ (#144419)
`obj.__dict__` is handled specially in Dynamo, and prior to this patch
we only supported reads and membership checks on that dictionary object.

This patch adds support for writes, plus some documentation.

Fixes #143756.
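
A minimal sketch of the kind of code that no longer graph-breaks (the `Counter` class is illustrative):

```python
import torch

class Counter:
    def __init__(self):
        self.calls = 0

@torch.compile(fullgraph=True)
def step(counter, x):
    counter.__dict__["calls"] = counter.calls + 1  # write to __dict__, previously unsupported
    return x * counter.calls

c = Counter()
print(step(c, torch.ones(3)))  # tensor([1., 1., 1.]), c.calls == 1
```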

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144419
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-01-13 21:04:10 +00:00
684d015c2f [AOTI] Support _int_mm (#144571)
Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015

Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571
Approved by: https://github.com/yushangdi
2025-01-13 20:32:29 +00:00
b7f95df65b [Feat]: Add Multithreading support for kleidiai groupwise GEMM kernels (#144074)
The KleidiAI groupwise GEMM kernel was not 2D blocked. This change adds support for 2D blocking of the GEMM kernel to efficiently split the workload and speed up the kernel over multiple threads.

Performance improvements:
7B model Pre-fill  speedup from 145 t/s to 175 t/s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144074
Approved by: https://github.com/digantdesai
2025-01-13 20:32:23 +00:00
5a2e8fce9d Fix block pointer test module for triton CPU and add to CI (#144474)
- Fix for BlockPointerTestBase._discontiguous_tensor. It defaults to constructing CUDA tensors, causing a failure if CUDA is not available.
- Add test module to CI to prevent errors like the above from occurring.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144474
Approved by: https://github.com/jansel
2025-01-13 20:25:05 +00:00
80c286cbec remove allow-untyped-defs from torch/_C/_dynamo/eval_frame.pyi (#144655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144655
Approved by: https://github.com/StrongerXi
2025-01-13 20:03:25 +00:00
18deff0262 remove allow-untyped-defs from torch/ao/nn/intrinsic/__init__.py (#144652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144652
Approved by: https://github.com/Skylion007
2025-01-13 19:36:08 +00:00
d44c3906b8 [EZ] [CD] Add 3.13 to FULL_PYTHON_VERSIONS (#144697)
Separation was necessary for Conda codegen, but now it's gone
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144697
Approved by: https://github.com/atalman, https://github.com/izaitsevfb
ghstack dependencies: #144696
2025-01-13 19:12:12 +00:00
d2f905760d [EZ] [CD] Eliminate stale TODO (#144696)
As 3.13 has been enabled across the board, which one can verify by running `./github/regenerate.sh` and observing that none of the configs have changed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144696
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2025-01-13 19:12:12 +00:00
cd477cdd1d remove allow-untyped-defs from torch/ao/nn/quantized/reference/modules/linear.py (#144656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144656
Approved by: https://github.com/Skylion007
2025-01-13 19:03:05 +00:00
f93d786f73 remove allow-untyped-defs from torch/nn/parameter.pyi (#144654)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144654
Approved by: https://github.com/Skylion007
2025-01-13 19:02:31 +00:00
983bf604e5 ReshapeTransform: added missing argument in docstring (#144401)
See https://github.com/pytorch/pytorch/pull/144197#discussion_r1907336339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144401
Approved by: https://github.com/janeyx99, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-01-13 17:59:59 +00:00
fe8c5c7a2d Update the Triton DeviceInterface in test/inductor/extension_backends/triton/device_interface.py (#144399)
Following the changes to how `DeviceInterface` is used in this [PR](https://github.com/pytorch/pytorch/pull/142033), the `DeviceInterface` in `extension_backend/triton/device_interface.py` should be updated to return the `DeviceProperties` instead of raising a NotImplementedError.

This PR mirrors the [changes](https://github.com/pytorch/pytorch/pull/142033/files#diff-06553e25e48e1d60f3030458bc46d52067d3d0c3eef2d5fcea29f7e8126bd7c9L112-R114) made in Dynamo when the PR landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144399
Approved by: https://github.com/jansel
2025-01-13 17:19:58 +00:00
bee84e88f8 [BE][Easy] improve submodule discovery for torch.ao type annotations (#144680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144680
Approved by: https://github.com/Skylion007
2025-01-13 17:16:19 +00:00
c40d917182 [MPSInductor] Fix maximum/minimum for int types (#144665)
`metal::isnan` is only defined for floats, so provide a generic wrapper
that is false for integral types

TODO: Figure out why type propagation is not working (or should it?)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144665
Approved by: https://github.com/dcci
2025-01-13 15:14:01 +00:00
8633845090 Support nanj in inductor (#144064)
Fixes https://github.com/pytorch/pytorch/issues/144029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144064
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-13 14:29:38 +00:00
417354d953 [mps/inductor] Add support for truncdiv(). (#144666)
Two other inductor tests pass after this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144666
Approved by: https://github.com/malfet
2025-01-13 13:39:38 +00:00
7e2239f1f0 [MPSInductor] Better error when kernel fails to compile (#144649)
Now the error message looks as follows:
```
% python ../test/inductor/test_torchinductor.py -v -k test_cat_unbacked_2d_mps
test_cat_unbacked_2d_mps (__main__.GPUTests) ... inline_call []
stats [('calls_captured', 6)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('autograd_cache_bypass', 1), ('not_ok', 1)]
ERROR

======================================================================
ERROR: test_cat_unbacked_2d_mps (__main__.GPUTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3126, in wrapper
    method(*args, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 12254, in new_test
    return value(self)
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 5885, in test_cat_unbacked_2d
    self.common(
  File "/Users/malfet/miniconda3/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 620, in check_model_gpu
    check_model(
  File "/Users/malfet/git/pytorch/pytorch/build/../test/inductor/test_torchinductor.py", line 461, in check_model
    actual = run(*example_inputs, **kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_dynamo/eval_frame.py", line 580, in _fn
    raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 704, in _compile_fx_inner
    raise InductorError(e, currentframe()).with_traceback(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 689, in _compile_fx_inner
    mb_compiled_graph = fx_codegen_and_compile(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1149, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/compile_fx.py", line 1064, in codegen_and_compile
    compiled_fn = graph.compile_to_module().call
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 1977, in compile_to_module
    return self._compile_to_module()
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/graph.py", line 2018, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/codecache.py", line 2768, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/Users/malfet/git/pytorch/pytorch/torch/_inductor/runtime/compile_tasks.py", line 51, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 40, in <module>
  File "/var/folders/sc/2thx6_x95h7_h9qs8s48yh140000gn/T/tmpmyfz2ju8/lt/cltm34ognlgcc6oxoe6bexvtbwcdtdfgnkjj5miz7vhkemitacp7.py", line 32, in _compile_mps_shader
torch._inductor.exc.InductorError: SyntaxError: failed to compile
    kernel void generated_kernel(
        device float* out_ptr0,
        constant float* in_ptr0,
        uint xindex [[thread_position_in_grid]]
    ) {
        long x1 = (xindex) / (3);
        auto tmp0 = x1;
        auto tmp1 = static_cast<long>(tmp0);
        auto tmp2 = 0;
        auto tmp3 = tmp1 >= tmp2;
        auto tmp4 = 2;
        auto tmp5 = tmp1 < tmp4;
        long x0 = (xindex) % (3);
        auto tmp6 = in_ptr0[x0 + 3*(x1)];
        auto tmp7 = tmp5 ? tmp6 : 0.0;
        auto tmp8 = tmp1 >= tmp4;
        auto tmp9 = 2 + ks0;
        auto tmp10 = static_cast<long>(tmp9);
        auto tmp11 = tmp1 < tmp10;
        auto tmp12 = 1.0;
        auto tmp13 = tmp8 ? tmp12 : 0.0;
        auto tmp14 = tmp5 ? tmp7 : tmp13;
        long x2 = xindex;
        out_ptr0[x2] = static_cast<float>(tmp14);
    }
 with program_source:18:25: error: use of undeclared identifier 'ks0'
        auto tmp9 = 2 + ks0;
                        ^

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py GPUTests.test_cat_unbacked_2d_mps

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.472s

FAILED (errors=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144649
Approved by: https://github.com/Skylion007, https://github.com/jansel, https://github.com/dcci
ghstack dependencies: #144647, #144648
2025-01-13 13:38:03 +00:00
a85d1ee106 Update slow tests (#144670)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144670
Approved by: https://github.com/pytorchbot
2025-01-13 12:06:22 +00:00
5138 changed files with 141697 additions and 312323 deletions

View File

@ -3,8 +3,11 @@ set -eux -o pipefail
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
# cuda arm build for Grace Hopper solely
export TORCH_CUDA_ARCH_LIST="9.0"
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="9.0;10.0;12.0"
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh
@ -17,7 +20,7 @@ cd /
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel
pip install auditwheel==6.2.0
if [ "$DESIRED_CUDA" = "cpu" ]; then
echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
#USE_PRIORITIZED_TEXT_FOR_LD for enable linker script optimization https://github.com/pytorch/pytorch/pull/121975/files

View File

@ -5,16 +5,14 @@ set -eux -o pipefail
# By creating symlinks from desired /opt/python to /usr/local/bin/
NUMPY_VERSION=2.0.2
PYGIT2_VERSION=1.15.1
if [[ "$DESIRED_PYTHON" == "3.13" ]]; then
if [[ "$DESIRED_PYTHON" == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then
NUMPY_VERSION=2.1.2
PYGIT2_VERSION=1.16.0
fi
SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2 pygit2==${PYGIT2_VERSION}
pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2
for tool in python python3 pip pip3 ninja scons patchelf; do
ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;

View File

@ -4,12 +4,9 @@
import os
import shutil
from subprocess import check_call, check_output
from typing import List
from pygit2 import Repository
def list_dir(path: str) -> List[str]:
def list_dir(path: str) -> list[str]:
"""'
Helper for getting paths for Python
"""
@ -42,7 +39,7 @@ def build_ArmComputeLibrary() -> None:
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v24.09",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
@ -58,7 +55,7 @@ def build_ArmComputeLibrary() -> None:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def update_wheel(wheel_path) -> None:
def update_wheel(wheel_path, desired_cuda) -> None:
"""
Update the cuda wheel libraries
"""
@ -80,7 +77,6 @@ def update_wheel(wheel_path) -> None:
"/usr/local/cuda/lib64/libnvToolsExt.so.1",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",
"/usr/local/cuda/lib64/libcudnn_cnn.so.9",
"/usr/local/cuda/lib64/libcudnn_graph.so.9",
@ -100,6 +96,18 @@ def update_wheel(wheel_path) -> None:
"/usr/local/lib/libnvpl_lapack_core.so.0",
"/usr/local/lib/libnvpl_blas_core.so.0",
]
if "126" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.6",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
elif "128" in desired_cuda:
libs_to_copy += [
"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.8",
"/usr/local/cuda/lib64/libcufile.so.0",
"/usr/local/cuda/lib64/libcufile_rdma.so.1",
]
else:
libs_to_copy += [
"/opt/OpenBLAS/lib/libopenblas.so.0",
@ -128,6 +136,9 @@ def complete_wheel(folder: str) -> str:
"""
wheel_name = list_dir(f"/{folder}/dist")[0]
# Please note for cuda we don't run auditwheel since we use custom script to package
# the cuda dependencies to the wheel file using update_wheel() method.
# However we need to make sure filename reflects the correct Manylinux platform.
if "pytorch" in folder and not enable_cuda:
print("Repairing Wheel with AuditWheel")
check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
@ -139,7 +150,14 @@ def complete_wheel(folder: str) -> str:
f"/{folder}/dist/{repaired_wheel_name}",
)
else:
repaired_wheel_name = wheel_name
repaired_wheel_name = wheel_name.replace(
"linux_aarch64", "manylinux_2_28_aarch64"
)
print(f"Renaming {wheel_name} wheel to {repaired_wheel_name}")
os.rename(
f"/{folder}/dist/{wheel_name}",
f"/{folder}/dist/{repaired_wheel_name}",
)
print(f"Copying {repaired_wheel_name} to artifacts")
shutil.copy2(
@ -171,22 +189,22 @@ if __name__ == "__main__":
args = parse_arguments()
enable_mkldnn = args.enable_mkldnn
enable_cuda = args.enable_cuda
repo = Repository("/pytorch")
branch = repo.head.name
if branch == "HEAD":
branch = "master"
branch = check_output(
["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"
).decode()
print("Building PyTorch wheel")
build_vars = "MAX_JOBS=5 CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000 "
os.system("cd /pytorch; python setup.py clean")
override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
desired_cuda = os.getenv("DESIRED_CUDA")
if override_package_version is not None:
version = override_package_version
build_vars += (
f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "
)
elif branch in ["nightly", "master"]:
elif branch in ["nightly", "main"]:
build_date = (
check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")
.decode()
@ -196,12 +214,11 @@ if __name__ == "__main__":
check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]
)
if enable_cuda:
desired_cuda = os.getenv("DESIRED_CUDA")
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "
else:
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
elif branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
@ -225,6 +242,6 @@ if __name__ == "__main__":
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")
wheel_path = f"/pytorch/dist/{filename[0]}"
update_wheel(wheel_path)
update_wheel(wheel_path, desired_cuda)
pytorch_wheel_name = complete_wheel("/pytorch/")
print(f"Build Complete. Created {pytorch_wheel_name}..")

View File

@ -12,7 +12,7 @@ import os
import subprocess
import sys
import time
from typing import Dict, List, Optional, Tuple, Union
from typing import Optional, Union
import boto3
@ -24,10 +24,12 @@ os_amis = {
"ubuntu22_04": "ami-0c6c29c5125214c77", # login_name: ubuntu
"redhat8": "ami-0698b90665a2ddcf1", # login_name: ec2-user
}
ubuntu18_04_ami = os_amis["ubuntu18_04"]
ubuntu20_04_ami = os_amis["ubuntu20_04"]
def compute_keyfile_path(key_name: Optional[str] = None) -> Tuple[str, str]:
def compute_keyfile_path(key_name: Optional[str] = None) -> tuple[str, str]:
if key_name is None:
key_name = os.getenv("AWS_KEY_NAME")
if key_name is None:
@ -57,7 +59,7 @@ def ec2_instances_by_id(instance_id):
def start_instance(
key_name, ami=ubuntu18_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50
key_name, ami=ubuntu20_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50
):
inst = ec2.create_instances(
ImageId=ami,
@ -96,7 +98,7 @@ class RemoteHost:
self.keyfile_path = keyfile_path
self.login_name = login_name
def _gen_ssh_prefix(self) -> List[str]:
def _gen_ssh_prefix(self) -> list[str]:
return [
"ssh",
"-o",
@ -108,13 +110,13 @@ class RemoteHost:
]
@staticmethod
def _split_cmd(args: Union[str, List[str]]) -> List[str]:
def _split_cmd(args: Union[str, list[str]]) -> list[str]:
return args.split() if isinstance(args, str) else args
def run_ssh_cmd(self, args: Union[str, List[str]]) -> None:
def run_ssh_cmd(self, args: Union[str, list[str]]) -> None:
subprocess.check_call(self._gen_ssh_prefix() + self._split_cmd(args))
def check_ssh_output(self, args: Union[str, List[str]]) -> str:
def check_ssh_output(self, args: Union[str, list[str]]) -> str:
return subprocess.check_output(
self._gen_ssh_prefix() + self._split_cmd(args)
).decode("utf-8")
@ -157,7 +159,7 @@ class RemoteHost:
def using_docker(self) -> bool:
return self.container_id is not None
def run_cmd(self, args: Union[str, List[str]]) -> None:
def run_cmd(self, args: Union[str, list[str]]) -> None:
if not self.using_docker():
return self.run_ssh_cmd(args)
assert self.container_id is not None
@ -178,7 +180,7 @@ class RemoteHost:
if rc != 0:
raise subprocess.CalledProcessError(rc, docker_cmd)
def check_output(self, args: Union[str, List[str]]) -> str:
def check_output(self, args: Union[str, list[str]]) -> str:
if not self.using_docker():
return self.check_ssh_output(args)
assert self.container_id is not None
@ -230,7 +232,7 @@ class RemoteHost:
)
self.download_file(remote_file, local_file)
def list_dir(self, path: str) -> List[str]:
def list_dir(self, path: str) -> list[str]:
return self.check_output(["ls", "-1", path]).split("\n")
@ -327,7 +329,7 @@ def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v24.09 {git_clone_flags}"
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
@ -358,7 +360,7 @@ def checkout_repo(
branch: str = "main",
url: str,
git_clone_flags: str,
mapping: Dict[str, Tuple[str, str]],
mapping: dict[str, tuple[str, str]],
) -> Optional[str]:
for prefix in mapping:
if not branch.startswith(prefix):
@ -681,7 +683,7 @@ def build_domains(
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> Tuple[str, str, str, str]:
) -> tuple[str, str, str, str]:
vision_wheel_name = build_torchvision(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
@ -708,7 +710,7 @@ def start_build(
pytorch_build_number: Optional[str] = None,
shallow_clone: bool = True,
enable_mkldnn: bool = False,
) -> Tuple[str, str, str, str, str]:
) -> tuple[str, str, str, str, str]:
git_clone_flags = " --depth 1 --shallow-submodules" if shallow_clone else ""
if host.using_docker() and not use_conda:
print("Auto-selecting conda option for docker images")
@ -759,7 +761,7 @@ def start_build(
version = host.check_output("cat pytorch/version.txt").strip()[:-2]
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1"
if branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1:branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
@ -932,9 +934,9 @@ def parse_arguments():
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")
parser.add_argument("--test-only", type=str)
parser.add_argument(
"--os", type=str, choices=list(os_amis.keys()), default="ubuntu20_04"
)
group = parser.add_mutually_exclusive_group()
group.add_argument("--os", type=str, choices=list(os_amis.keys()))
group.add_argument("--ami", type=str)
parser.add_argument(
"--python-version",
type=str,
@ -964,7 +966,13 @@ def parse_arguments():
if __name__ == "__main__":
args = parse_arguments()
ami = os_amis[args.os]
ami = (
args.ami
if args.ami is not None
else os_amis[args.os]
if args.os is not None
else ubuntu20_04_ami
)
keyfile_path, key_name = compute_keyfile_path(args.key_name)
if args.list_instances:

View File

@ -1,4 +1,8 @@
#!/bin/bash
# The purpose of this script is to:
# 1. Extract the set of parameters to be used for a docker build based on the provided image name.
# 2. Run docker build with the parameters found in step 1.
# 3. Run the built image and print out the expected and actual versions of packages installed.
set -ex
@ -86,30 +90,20 @@ CMAKE_VERSION=3.18.5
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
fi
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc11)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
@ -134,36 +128,6 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
@ -194,8 +158,8 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -208,8 +172,53 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
pytorch-linux-focal-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks)
CUDA_VERSION=12.6.3
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.13
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -264,7 +273,7 @@ case "$image" in
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
@ -272,10 +281,14 @@ case "$image" in
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
@ -283,6 +296,10 @@ case "$image" in
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.9
@ -368,7 +385,7 @@ case "$image" in
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
@ -376,7 +393,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
CUDA_VERSION=12.6
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes

View File

@ -1 +1 @@
a29b208a06ab378bb29ab1aa68932e412f8e09f1
01a22b6f16d117454b7d21ebdc691b0785b84a7f

View File

@ -0,0 +1 @@
v2.21.5-1

View File

@ -0,0 +1 @@
v2.26.2-1

View File

@ -1 +1 @@
ac3470188b914c5d7a5058a7e28b9eb685a62427
5d535d7a2d4b435b1b5c1177fd8f04a12b942b9a

View File

@ -1 +1 @@
e98b6fcb8df5b44eb0d0addb6767c573d37ba024
0bcc8265e677e5321606a3311bf71470f14456a8

View File

@ -1 +1 @@
0d4682f073ded4d1a8260dd4208a43d735ae3a2b
96316ce50fade7e209553aba4898cd9b82aab83b

View File

@ -1,6 +1,6 @@
set -euo pipefail
readonly version=v24.04
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary

View File

@ -32,8 +32,12 @@ install_ubuntu() {
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
# TODO: Eliminate this hack, we should not rely on apt-get installation
# See https://github.com/pytorch/pytorch/issues/144768
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
elif [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "12.4"* ]]; then
maybe_libnccl_dev="libnccl2=2.26.2-1+cuda12.4 libnccl-dev=2.26.2-1+cuda12.4 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi

View File

@ -9,7 +9,7 @@ install_ubuntu() {
# Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``
apt-get install -y cargo
echo "Checking out sccache repo"
git clone https://github.com/mozilla/sccache -b v0.9.0
git clone https://github.com/mozilla/sccache -b v0.9.1
cd sccache
echo "Building sccache"
cargo build --release

View File

@ -66,7 +66,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
if [[ $(uname -m) == "aarch64" ]]; then
conda_install "openblas==0.3.28=*openmp*"
conda_install "openblas==0.3.29=*openmp*"
else
conda_install "mkl=2021.4.0 mkl-include=2021.4.0"
fi

View File

@ -2,7 +2,7 @@
set -ex
NCCL_VERSION=v2.21.5-1
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_040 {
@ -16,17 +16,6 @@ function install_cusparselt_040 {
rm -rf tmp_cusparselt
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
@ -51,6 +40,7 @@ function install_cusparselt_063 {
function install_118 {
CUDNN_VERSION=9.1.0.70
NCCL_VERSION=v2.21.5-1
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
@ -83,39 +73,6 @@ function install_118 {
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
@ -214,37 +171,6 @@ function prune_118 {
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
@ -313,18 +239,52 @@ function prune_126 {
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
function install_128 {
CUDNN_VERSION=9.7.1.26
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
chmod +x cuda_12.8.0_570.86.10_linux.run
./cuda_12.8.0_570.86.10_linux.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_063
ldconfig
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
;;
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;
esac
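
Hypothetical invocations of the dispatcher above, assuming this is the install_cuda.sh helper referenced by the Dockerfiles later in this set:

```
bash install_cuda.sh 11.8   # install_118, then prune_118
bash install_cuda.sh 12.6   # install_126, then prune_126
bash install_cuda.sh 12.8   # install_128 only; no prune step is wired up for 12.8
bash install_cuda.sh 13.0   # "bad argument 13.0", exit 1
```
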

View File

@ -3,19 +3,8 @@
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.5.1.17
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
NCCL_VERSION=v2.26.2-1
CUDNN_VERSION=9.8.0.87
function install_cusparselt_063 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
@ -28,80 +17,15 @@ function install_cusparselt_063 {
rm -rf tmp_cusparselt
}
function install_124 {
CUDNN_VERSION=9.1.0.70
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_062
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
function install_126 {
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.6 /usr/local/cuda
# install CUDA 12.6.3 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x cuda_12.6.3_560.35.05_linux_sbsa.run
./cuda_12.6.3_560.35.05_linux_sbsa.run --toolkit --silent
rm -f cuda_12.6.3_560.35.05_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.6 /usr/local/cuda
function install_128 {
echo "Installing CUDA 12.8.0 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.3"
rm -rf /usr/local/cuda-12.8 /usr/local/cuda
# install CUDA 12.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux_sbsa.run
chmod +x cuda_12.8.0_570.86.10_linux_sbsa.run
./cuda_12.8.0_570.86.10_linux_sbsa.run --toolkit --silent
rm -f cuda_12.8.0_570.86.10_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
@ -126,47 +50,11 @@ function install_126 {
ldconfig
}
function prune_126 {
echo "Pruning CUDA 12.6"
#####################################################################################
# CUDA 12.6 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.6/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.6/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.6 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.6/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.3.2 $CUDA_BASE/nsight-systems-2024.5.1/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
12.6) install_126; prune_126
12.8) install_128;
;;
*) echo "bad argument $1"; exit 1
;;

View File

@ -4,7 +4,9 @@ if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
if [[ ${CUDA_VERSION:0:4} == "12.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.7.1.26_cuda12-archive"
elif [[ ${CUDA_VERSION:0:4} == "12.6" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.5.1.17_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"

View File

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[5-8]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.3.2-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.4" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
@ -13,17 +21,11 @@ if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
else
echo "Not sure which libcusparselt version to install for this ${CUDA_VERSION}"
fi
tar xf ${CUSPARSELT_NAME}.tar.xz
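
The version gate above combines bash substring expansion with a regex match; a minimal sketch with a hypothetical CUDA_VERSION value:

```
CUDA_VERSION=12.6.3          # hypothetical value
prefix=${CUDA_VERSION:0:4}   # first four characters -> "12.6"
if [[ $prefix =~ ^12\.[5-8]$ ]]; then
  echo "picks the 0.6.3 cuSPARSELt archive"
elif [[ $prefix == "12.4" ]]; then
  echo "picks the 0.6.2 archive"
fi
```
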

View File

@ -37,7 +37,12 @@ install_conda_dependencies() {
install_pip_dependencies() {
pushd executorch
as_jenkins bash install_requirements.sh --pybind xnnpack
as_jenkins bash install_executorch.sh
# A workaround, ExecuTorch has moved to numpy 2.0 which is not compatible with the current
# numba and scipy version used in PyTorch CI
conda_run pip uninstall -y numba scipy
popd
}
@ -48,7 +53,7 @@ setup_executorch() {
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
as_jenkins .ci/scripts/setup-linux.sh cmake || true
as_jenkins .ci/scripts/setup-linux.sh --build-tool cmake || true
popd
}

View File

@ -4,10 +4,15 @@ set -ex
[ -n "$NINJA_VERSION" ]
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
arch=$(uname -m)
if [ "$arch" == "aarch64" ]; then
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux-aarch64.zip"
else
url="https://github.com/ninja-build/ninja/releases/download/v${NINJA_VERSION}/ninja-linux.zip"
fi
pushd /tmp
wget --no-verbose --output-document=ninja-linux.zip "$url"
unzip ninja-linux.zip -d /usr/local/bin
rm -f ninja-linux.zip
popd
popd

View File

@ -31,15 +31,15 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20241124 --no-deps
pip_install onnx==1.17.0
pip_install onnxscript==0.2.2 --no-deps
# required by onnxscript
pip_install ml_dtypes
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3");' > "${IMPORT_SCRIPT_FILENAME}"
as_jenkins echo 'import transformers; transformers.GPTJForCausalLM.from_pretrained("hf-internal-testing/tiny-random-gptj");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

View File

@ -4,7 +4,7 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.28 --depth 1 --shallow-submodules
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="

View File

@ -115,7 +115,7 @@ index a5007ffc..13fa07fc 100644
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
+ //fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}

View File

@ -60,15 +60,15 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install -e .
CXX=g++-9 pip_install .
else
pip_install -e .
pip_install .
fi
if [ -n "${CONDA_CMAKE}" ]; then

View File

@ -8,6 +8,12 @@ else
with_cuda=no
fi
if [[ -d "/opt/rocm" ]]; then
with_rocm=/opt/rocm
else
with_rocm=no
fi
function install_ucx() {
set -ex
git clone --recursive https://github.com/openucx/ucx.git
@ -19,6 +25,7 @@ function install_ucx() {
./configure --prefix=$UCX_HOME \
--enable-mt \
--with-cuda=$with_cuda \
--with-rocm=$with_rocm \
--enable-profiling \
--enable-stats
time make -j
@ -36,12 +43,29 @@ function install_ucc() {
git submodule update --init --recursive
./autogen.sh
# We only run distributed tests on Tesla M60 and A10G
NVCC_GENCODE="-gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_86,code=compute_86"
if [[ -n "$ROCM_VERSION" ]]; then
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
amdgpu_targets=`echo $PYTORCH_ROCM_ARCH | sed 's/;/ /g'`
else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
HIP_OFFLOAD="$HIP_OFFLOAD --offload-arch=$arch"
done
else
HIP_OFFLOAD="all-arch-no-native"
fi
./configure --prefix=$UCC_HOME \
--with-ucx=$UCX_HOME \
--with-cuda=$with_cuda \
--with-nvcc-gencode="${NVCC_GENCODE}"
--with-nvcc-gencode="${NVCC_GENCODE}" \
--with-rocm=$with_rocm \
--with-rocm-arch="${HIP_OFFLOAD}"
time make -j
sudo make install

View File

@ -47,6 +47,9 @@ function install_ubuntu() {
# Development Packages
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl=2025.0.1-6"
fi
apt-get install -y ${XPU_PACKAGES}
# Cleanup
@ -82,6 +85,9 @@ gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.
EOF
# Install Intel Support Packages
if [[ "$XPU_VERSION" == "2025.0" ]]; then
XPU_PACKAGES="${XPU_PACKAGES} intel-oneapi-dnnl-2025.0.1-6"
fi
yum install -y ${XPU_PACKAGES}
# The xpu-smi packages
dnf install -y xpu-smi

View File

@ -56,11 +56,6 @@ RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
@ -71,6 +66,11 @@ RUN bash ./install_cuda.sh 12.6
RUN bash ./install_magma.sh 12.6
RUN ln -sf /usr/local/cuda-12.6 /usr/local/cuda
FROM cuda as cuda12.8
RUN bash ./install_cuda.sh 12.8
RUN bash ./install_magma.sh 12.8
RUN ln -sf /usr/local/cuda-12.8 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}

View File

@ -39,7 +39,7 @@ case ${GPU_ARCH_TYPE} in
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100;gfx1101;gfx942"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)

View File

@ -1,153 +0,0 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh

View File

@ -38,6 +38,12 @@ RUN yum install -y \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# (optional) Install non-default Ninja version
ARG NINJA_VERSION
COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH

View File

@ -48,7 +48,7 @@ case ${GPU_ARCH_TYPE} in
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
@ -97,7 +97,7 @@ case ${GPU_ARCH_TYPE} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
fi
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101"
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
xpu)
@ -121,7 +121,8 @@ fi
(
set -x
if [ "$(uname -m)" != "s390x" ]; then
# Only activate this if in CI
if [ "$(uname -m)" != "s390x" ] && [ -v CI ]; then
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
@ -139,7 +140,7 @@ fi
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GITHUB_REF=${GITHUB_REF:-"dev"}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
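
The tag derivation above leans on two parameter expansions; a minimal sketch with a hypothetical ref:

```
GITHUB_REF="refs/heads/release/2.7"   # hypothetical ref
echo "${GITHUB_REF##*/}"              # -> "2.7": strips everything up to the last '/'
unset GITHUB_REF
echo "${GITHUB_REF:-dev}"             # -> "dev": fallback when unset or empty
```
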

View File

@ -3,7 +3,7 @@
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
CURL_DOWNLOAD_URL=https://curl.se/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf

View File

@ -90,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.13.0
mypy==1.14.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
#Pinned versions: 1.14.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -294,7 +294,7 @@ ghstack==0.8.0
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.5
jinja2==3.1.6
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@ -329,7 +329,7 @@ lxml==5.3.0
PyGithub==2.3.0
sympy==1.13.1 ; python_version >= "3.9"
sympy==1.13.3
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
@ -339,7 +339,7 @@ onnx==1.17.0
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
onnxscript==0.2.2
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -362,6 +362,7 @@ pwlf==2.2.1 ; python_version >= "3.8"
# To build PyTorch itself
astunparse
PyYAML
pyzstd
setuptools
ninja==1.11.1 ; platform_machine == "aarch64"
@ -371,3 +372,8 @@ pulp==2.9.0 ; python_version >= "3.8"
#Description: required for testing ilp formulation under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py
dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:

View File

@ -1 +1 @@
3.2.0
3.3.0

View File

@ -14,21 +14,20 @@ ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
COPY ./common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh
# Install clang
ARG LLVMDEV
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# Install user
COPY ./common/install_user.sh install_user.sh
RUN bash ./install_user.sh && rm install_user.sh
# Install katex
ARG KATEX
COPY ./common/install_docs_reqs.sh install_docs_reqs.sh
RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ARG CONDA_CMAKE
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
@ -39,6 +38,11 @@ ARG GCC_VERSION
COPY ./common/install_gcc.sh install_gcc.sh
RUN bash ./install_gcc.sh && rm install_gcc.sh
# Install clang
ARG CLANG_VERSION
COPY ./common/install_clang.sh install_clang.sh
RUN bash ./install_clang.sh && rm install_clang.sh
# (optional) Install protobuf for ONNX
ARG PROTOBUF
COPY ./common/install_protobuf.sh install_protobuf.sh
@ -85,6 +89,32 @@ COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
# (optional) Install UCC
ARG UCX_COMMIT
ARG UCC_COMMIT
ENV UCX_COMMIT $UCX_COMMIT
ENV UCC_COMMIT $UCC_COMMIT
ENV UCX_HOME /usr
ENV UCC_HOME /usr
ADD ./common/install_ucc.sh install_ucc.sh
RUN if [ -n "${UCX_COMMIT}" ] && [ -n "${UCC_COMMIT}" ]; then bash ./install_ucc.sh; fi
RUN rm install_ucc.sh
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
@ -107,17 +137,17 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# This is needed by sccache
COPY ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
ENV OPENSSL_DIR /opt/openssl
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
RUN bash ./install_cache.sh && rm install_cache.sh
# Install Open MPI for ROCm
COPY ./common/install_openmpi.sh install_openmpi.sh
RUN if [ -n "${CUDA_VERSION}" ]; then bash install_openmpi.sh; fi
RUN rm install_openmpi.sh
# Include BUILD_ENVIRONMENT environment variable in image
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}

View File

@ -12,13 +12,13 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-e PACKAGE_NAME=${PACKAGE_NAME}${DESIRED_CUDA_SHORT} \
-e DESIRED_CUDA=${DESIRED_CUDA} \
-e CUDA_ARCH_LIST="${CUDA_ARCH_LIST}" \
"pytorch/manylinux-builder:cuda${DESIRED_CUDA}-main" \
"pytorch/manylinux2_28-builder:cuda${DESIRED_CUDA}-main" \
magma/build_magma.sh
.PHONY: all
all: magma-cuda128
all: magma-cuda126
all: magma-cuda124
all: magma-cuda121
all: magma-cuda118
.PHONY:
@ -26,6 +26,12 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-cuda128
magma-cuda128: DESIRED_CUDA := 12.8
magma-cuda128: CUDA_ARCH_LIST += -gencode arch=compute_100,code=sm_100 -gencode arch=compute_120,code=sm_120
magma-cuda128:
$(DOCKER_RUN)
.PHONY: magma-cuda126
magma-cuda126: DESIRED_CUDA := 12.6
magma-cuda126:
@ -36,11 +42,6 @@ magma-cuda124: DESIRED_CUDA := 12.4
magma-cuda124:
$(DOCKER_RUN)
.PHONY: magma-cuda121
magma-cuda121: DESIRED_CUDA := 12.1
magma-cuda121:
$(DOCKER_RUN)
.PHONY: magma-cuda118
magma-cuda118: DESIRED_CUDA := 11.8
magma-cuda118: CUDA_ARCH_LIST += -gencode arch=compute_37,code=sm_37

View File

@ -14,6 +14,7 @@ export USE_CUDA_STATIC_LINK=1
export INSTALL_TEST=0 # don't install test binaries into site-packages
export USE_CUPTI_SO=0
export USE_CUSPARSELT=${USE_CUSPARSELT:-1} # Enable if not disabled by libtorch build
export USE_CUFILE=${USE_CUFILE:-1}
# Keep an array of cmake variables to add to
if [[ -z "$CMAKE_ARGS" ]]; then
@ -52,8 +53,12 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0+PTX"
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
;;
12.4)
@ -114,7 +119,16 @@ if [[ $USE_CUSPARSELT == "1" && $CUDA_VERSION == "11.8" ]]; then
)
fi
if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
# Turn USE_CUFILE off for CUDA 11.8, 12.4 since nvidia-cufile-cu11 and 1.9.0.20 are
# not available in PYPI
if [[ $CUDA_VERSION == "11.8" || $CUDA_VERSION == "12.4" ]]; then
export USE_CUFILE=0
fi
# CUDA_VERSION 12.4, 12.6, 12.8
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
@ -155,6 +169,16 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
"libnvrtc.so.12"
"libnvrtc-builtins.so"
)
if [[ $USE_CUFILE == 1 ]]; then
DEPS_LIST+=(
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
)
DEPS_SONAME+=(
"libcufile.so.0"
"libcufile_rdma.so.1"
)
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(
@ -171,6 +195,11 @@ if [[ $CUDA_VERSION == "12.4" || $CUDA_VERSION == "12.6" ]]; then
'$ORIGIN/../../nvidia/nccl/lib'
'$ORIGIN/../../nvidia/nvtx/lib'
)
if [[ $USE_CUFILE == 1 ]]; then
CUDA_RPATHS+=(
'$ORIGIN/../../nvidia/cufile/lib'
)
fi
CUDA_RPATHS=$(IFS=: ; echo "${CUDA_RPATHS[*]}")
export C_SO_RPATH=$CUDA_RPATHS':$ORIGIN:$ORIGIN/lib'
export LIB_SO_RPATH=$CUDA_RPATHS':$ORIGIN'
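
The rpath string above is built by expanding "${array[*]}" with a colon IFS inside a subshell; a minimal sketch with two hypothetical entries:

```
rpaths=('$ORIGIN/../../nvidia/cudnn/lib' '$ORIGIN/../../nvidia/nccl/lib')
joined=$(IFS=: ; echo "${rpaths[*]}")
echo "$joined"   # -> $ORIGIN/../../nvidia/cudnn/lib:$ORIGIN/../../nvidia/nccl/lib
```
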

View File

@ -173,6 +173,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
export TORCH_XPU_ARCH_LIST=pvc
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@ -191,7 +192,7 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
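
Worked example of the formula above, assuming a 16-core runner:

```
nproc --ignore=2     # -> 14 on a 16-core machine
echo $(( 14 / 3 ))   # -> 4 (integer division), so MAX_JOBS=4
```
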
@ -377,8 +378,10 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.

View File

@ -387,7 +387,7 @@ fi
###############################################################################
# Check for C++ ABI compatibility between gcc7 and gcc9 compiled binaries
###############################################################################
if [[ "$(uname)" == 'Linux' && ("$PACKAGE_TYPE" == 'conda' || "$PACKAGE_TYPE" == 'manywheel')]]; then
if [[ "$(uname)" == 'Linux' && "$PACKAGE_TYPE" == 'manywheel' ]]; then
pushd /tmp
python -c "import torch; exit(0 if torch.compiled_with_cxx11_abi() else (0 if torch._C._PYBIND11_BUILD_ABI == '_cxxabi1011' else 1))"
popd

View File

@ -169,30 +169,40 @@ function install_torchrec_and_fbgemm() {
torchrec_commit=$(get_pinned_commit torchrec)
local fbgemm_commit
fbgemm_commit=$(get_pinned_commit fbgemm)
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
fbgemm_commit=$(get_pinned_commit fbgemm_rocm)
fi
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
# install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
pip_uninstall fbgemm-gpu-nightly
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
pip_install tabulate # needed for newer fbgemm
pip_install patchelf # needed for rocm fbgemm
git clone --recursive https://github.com/pytorch/fbgemm
pushd fbgemm/fbgemm_gpu
git checkout "${fbgemm_commit}"
python setup.py install \
--package_variant=rocm \
-DHIP_ROOT_DIR="${ROCM_PATH}" \
-DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
popd
rm -rf fbgemm
else
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
fi
}
function clone_pytorch_xla() {
if [[ ! -d ./xla ]]; then
git clone --recursive --quiet https://github.com/pytorch/xla.git
git clone --recursive -b r2.7 https://github.com/pytorch/xla.git
pushd xla
# pin the xla hash so that we don't get broken by changes to xla
git checkout "$(cat ../.github/ci_commit_pins/xla.txt)"
@ -216,6 +226,11 @@ function checkout_install_torchbench() {
# to install and test other models
python install.py --continue_on_fail
fi
# TODO (huydhn): transformers-4.44.2 added by https://github.com/pytorch/benchmark/pull/2488
# is regressing speedup metric. This needs to be investigated further
pip install transformers==4.38.1
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd

View File

@ -18,6 +18,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to

View File

@ -40,7 +40,7 @@ retry () {
if [[ "$#" != 3 ]]; then
if [[ -z "${DESIRED_PYTHON:-}" || -z "${DESIRED_CUDA:-}" || -z "${PACKAGE_TYPE:-}" ]]; then
echo "USAGE: run_tests.sh PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA"
echo "The env variable PACKAGE_TYPE must be set to 'conda' or 'manywheel' or 'libtorch'"
echo "The env variable PACKAGE_TYPE must be set to 'manywheel' or 'libtorch'"
echo "The env variable DESIRED_PYTHON must be set like '2.7mu' or '3.6m' etc"
echo "The env variable DESIRED_CUDA must be set like 'cpu' or 'cu80' etc"
exit 1

View File

@ -6,7 +6,7 @@ import itertools
import os
import re
from pathlib import Path
from typing import Any, List, Tuple
from typing import Any
# We also check that there are [not] cxx11 symbols in libtorch
@ -46,17 +46,17 @@ LIBTORCH_PRE_CXX11_PATTERNS = _apply_libtorch_symbols(PRE_CXX11_SYMBOLS)
@functools.lru_cache(100)
def get_symbols(lib: str) -> List[Tuple[str, str, str]]:
def get_symbols(lib: str) -> list[tuple[str, str, str]]:
from subprocess import check_output
lines = check_output(f'nm "{lib}"|c++filt', shell=True)
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: List[Any]) -> List[str]:
def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
def _grep_symbols(
symbols: List[Tuple[str, str, str]], patterns: List[Any]
) -> List[str]:
symbols: list[tuple[str, str, str]], patterns: list[Any]
) -> list[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
for pattern in patterns:

View File

@ -46,7 +46,9 @@ def train(args, model, device, train_loader, optimizer, epoch):
optimizer.step()
if batch_idx % args.log_interval == 0:
print(
f"Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)} ({100. * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}" # noqa: B950
f"Train Epoch: {epoch} "
f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
f"({100.0 * batch_idx / len(train_loader):.0f}%)]\tLoss: {loss.item():.6f}"
)
if args.dry_run:
break
@ -71,7 +73,9 @@ def test(model, device, test_loader):
test_loss /= len(test_loader.dataset)
print(
f"\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n" # noqa: B950
f"\nTest set: Average loss: {test_loss:.4f}, "
f"Accuracy: {correct}/{len(test_loader.dataset)} "
f"({100.0 * correct / len(test_loader.dataset):.0f}%)\n"
)

View File

@ -6,6 +6,7 @@ import re
import subprocess
import sys
from pathlib import Path
from tempfile import NamedTemporaryFile
import torch
import torch._dynamo
@ -161,6 +162,36 @@ def test_cuda_runtime_errors_captured() -> None:
raise RuntimeError("Expected CUDA RuntimeError but have not received!")
def test_cuda_gds_errors_captured() -> None:
major_version = int(torch.version.cuda.split(".")[0])
minor_version = int(torch.version.cuda.split(".")[1])
if target_os == "windows":
print(f"{target_os} is not supported for GDS smoke test")
return
if major_version < 12 or (major_version == 12 and minor_version < 6):
print("CUDA version is not supported for GDS smoke test")
return
cuda_exception_missed = True
try:
print("Testing test_cuda_gds_errors_captured")
with NamedTemporaryFile() as f:
torch.cuda.gds.GdsFile(f.name, os.O_CREAT | os.O_RDWR)
except RuntimeError as e:
expected_error = "cuFileHandleRegister failed"
if re.search(expected_error, f"{e}"):
print(f"Caught CUDA exception with success: {e}")
cuda_exception_missed = False
else:
raise e
if cuda_exception_missed:
raise RuntimeError(
"Expected cuFileHandleRegister failed RuntimeError but have not received!"
)
def smoke_test_cuda(
package: str, runtime_error_check: str, torch_compile_check: str
) -> None:
@ -381,6 +412,7 @@ def main() -> None:
test_numpy()
if is_cuda_system:
test_linalg("cuda")
test_cuda_gds_errors_captured()
if options.package == "all":
smoke_test_modules()

View File

@ -46,6 +46,9 @@ BUILD_BIN_DIR="$BUILD_DIR"/bin
SHARD_NUMBER="${SHARD_NUMBER:=1}"
NUM_TEST_SHARDS="${NUM_TEST_SHARDS:=1}"
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
export VALGRIND=ON
# export TORCH_INDUCTOR_INSTALL_GXX=ON
if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then
@ -174,6 +177,9 @@ if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# Print GPU info
rocminfo
rocminfo | grep -E 'Name:.*\sgfx|Marketing'
# for benchmarks/dynamo/check_accuracy.py, we need to put results in a rocm specific directory to avoid clashes with cuda
MAYBE_ROCM="rocm/"
fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
@ -308,6 +314,13 @@ test_python() {
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
time python test/run_test.py --include lazy/test_ts_opinfo.py --verbose
export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE
}
test_dynamo_wrapped_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
@ -411,7 +424,10 @@ test_inductor_cpp_wrapper_shard() {
# Run certain inductor unit tests with cpp wrapper. In the end state, we
# should be able to run all the inductor unit tests with cpp_wrapper.
python test/run_test.py --include inductor/test_torchinductor --verbose
python test/run_test.py \
--include inductor/test_torchinductor inductor/test_max_autotune inductor/test_cpu_repro \
--verbose
python test/run_test.py --inductor --include test_torch -k 'take' --verbose
# Run inductor benchmark tests with cpp wrapper.
# Skip benchmark tests if it's in rerun-disabled-mode.
@ -424,7 +440,7 @@ test_inductor_cpp_wrapper_shard() {
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_timm_training.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
@ -434,7 +450,7 @@ test_inductor_cpp_wrapper_shard() {
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_torchbench_inference.csv"
fi
}
@ -467,6 +483,8 @@ elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--export-aot-inductor)
elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--inductor --inductor-compile-mode max-autotune)
elif [[ "${TEST_CONFIG}" == *inductor* && "${TEST_CONFIG}" != *perf* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--inductor)
fi
@ -481,6 +499,59 @@ else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
fi
test_cachebench() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
local BENCHMARK
if [[ "${SHARD_NUMBER}" == 1 ]]; then
local BENCHMARK=torchbench
elif [[ "${SHARD_NUMBER}" == 2 ]]; then
local BENCHMARK=huggingface
else
echo "invalid SHARD_NUMBER: ${SHARD_NUMBER}"
exit 1
fi
local mode_options=("training" "inference")
for mode in "${mode_options[@]}"; do
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode "$mode" \
--device cuda \
--benchmark "$BENCHMARK" \
--repeat 3 \
--output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}.json"
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode "$mode" \
--dynamic \
--device cuda \
--benchmark "$BENCHMARK" \
--repeat 3 \
--output "$TEST_REPORTS_DIR/cachebench_${BENCHMARK}_${mode}_dynamic.json"
done
}
test_verify_cachebench() {
TMP_TEST_REPORTS_DIR=$(mktemp -d)
TEST_OUTPUT="$TMP_TEST_REPORTS_DIR/test.json"
$TASKSET python "benchmarks/dynamo/cachebench.py" \
--mode training \
--device cpu \
--model nanogpt \
--benchmark torchbench \
--output "$TEST_OUTPUT"
# -s checks that the file exists and is non-empty
if [[ ! -s "$TEST_OUTPUT" ]]; then
echo "Cachebench failed to produce an output."
echo "Run 'python benchmarks/dynamo/cachebench.py' to make sure it works"
exit 1
fi
}
test_perf_for_dashboard() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
@ -509,6 +580,10 @@ test_perf_for_dashboard() {
test_inductor_set_cpu_affinity
elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then
device=cuda_a10g
elif [[ "${TEST_CONFIG}" == *h100* ]]; then
device=cuda_h100
elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
device=rocm
fi
for mode in "${modes[@]}"; do
@ -625,16 +700,16 @@ test_single_dynamo_benchmark() {
TEST_CONFIG=${TEST_CONFIG//_avx512/}
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
--ci --accuracy --timing --explain --print-compilation-time \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
"$@" "${partition_flags[@]}" \
--output "$TEST_REPORTS_DIR/${name}_${suite}.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"
python benchmarks/dynamo/check_graph_breaks.py \
--actual "$TEST_REPORTS_DIR/${name}_$suite.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${TEST_CONFIG}_${name}.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}${TEST_CONFIG}_${name}.csv"
fi
}
@ -657,7 +732,7 @@ test_inductor_halide() {
}
test_inductor_triton_cpu() {
python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose
python test/run_test.py --include inductor/test_triton_cpu_backend.py inductor/test_torchinductor_strided_blocks.py --verbose
assert_git_not_dirty
}
@ -687,6 +762,8 @@ test_dynamo_benchmark() {
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
elif [[ "${TEST_CONFIG}" == *max_autotune_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
test_single_dynamo_benchmark "training" "$suite" "$shard_id" --training --amp "$@"
@ -721,7 +798,7 @@ test_inductor_torchbench_smoketest_perf() {
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_huggingface_training.csv"
done
}
@ -1096,8 +1173,9 @@ build_xla() {
apply_patches
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
# These functions are defined in .circleci/common.sh in pytorch/xla repo
retry install_deps_pytorch_xla $XLA_DIR $USE_CACHE
retry install_pre_deps_pytorch_xla $XLA_DIR $USE_CACHE
CMAKE_PREFIX_PATH="${SITE_PACKAGES}/torch:${CMAKE_PREFIX_PATH}" XLA_SANDBOX_BUILD=1 build_torch_xla $XLA_DIR
retry install_post_deps_pytorch_xla
assert_git_not_dirty
}
@ -1404,7 +1482,7 @@ test_executorch() {
bash examples/models/llama3_2_vision/install_requirements.sh
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
bash .ci/scripts/setup-linux.sh cmake
bash .ci/scripts/setup-linux.sh --build-tool cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
@ -1428,7 +1506,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
@ -1496,6 +1574,16 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
install_torchvision
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == cachebench ]]; then
install_torchaudio cuda
install_torchvision
checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_cachebench
elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
install_torchaudio cpu
install_torchvision
checkout_install_torchbench nanogpt
PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
@ -1551,6 +1639,7 @@ elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
test_python_shard "$SHARD_NUMBER"
test_aten
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_lazy_tensor_meta_reference_disabled
test_without_numpy
install_torchvision
test_python_shard 1


@ -0,0 +1,41 @@
r"""
It's used to check basic rnn features with cpu-only.
For example, it would throw exception if some components are missing
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class SimpleCNN(nn.Module):
def __init__(self):
super().__init__()
self.conv = nn.Conv2d(1, 1, 3)
self.pool = nn.MaxPool2d(2, 2)
def forward(self, inputs):
output = self.pool(F.relu(self.conv(inputs)))
output = output.view(1)
return output
try:
# Mock one infer
net = SimpleCNN()
net_inputs = torch.rand((1, 1, 5, 5))
outputs = net(net_inputs)
print(outputs)
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.1)
# Mock one step training
label = torch.full((1,), 1.0, dtype=torch.float)
loss = criterion(outputs, label)
loss.backward()
optimizer.step()
except Exception as e:
print(f"An error occurred: {e}")


@ -0,0 +1,13 @@
r"""
It's used to check basic rnn features with cpu-only.
For example, it would throw exception if missing some components are missing
"""
import torch
import torch.nn as nn
rnn = nn.RNN(10, 20, 2)
inputs = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)
output, hn = rnn(inputs, h0)
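For reference, with `nn.RNN(10, 20, 2)` and an input of shape (5, 3, 10) plus h0 of shape (2, 3, 20), this smoke run should produce `output` of shape (5, 3, 20) and `hn` of shape (2, 3, 20). A small illustrative check (not part of the diff):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(10, 20, 2)  # input_size=10, hidden_size=20, num_layers=2
output, hn = rnn(torch.randn(5, 3, 10), torch.randn(2, 3, 20))

assert output.shape == (5, 3, 20)  # (seq_len, batch, hidden_size)
assert hn.shape == (2, 3, 20)      # (num_layers, batch, hidden_size)
```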


@ -18,6 +18,9 @@ export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/w/build-result
PYTORCH_FINAL_PACKAGE_DIR_WIN=$(cygpath -w "${PYTORCH_FINAL_PACKAGE_DIR}")
export PYTORCH_FINAL_PACKAGE_DIR_WIN
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
mkdir -p "$TMP_DIR"/build/torch
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers


@ -0,0 +1,31 @@
@echo off
echo Dependency ARM Performance Libraries (APL) installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: Set download URL for the ARM Performance Libraries (APL)
set DOWNLOAD_URL="https://developer.arm.com/-/cdn-downloads/permalink/Arm-Performance-Libraries/Version_24.10/arm-performance-libraries_24.10_Windows.msi"
set INSTALLER_FILE=%DOWNLOADS_DIR%\arm-performance-libraries.msi
:: Download installer
echo Downloading ARM Performance Libraries (APL)...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install ARM Performance Libraries (APL)
echo Installing ARM Performance Libraries (APL)...
msiexec /i "%INSTALLER_FILE%" /qn /norestart ACCEPT_EULA=1 INSTALLFOLDER="%DEPENDENCIES_DIR%"
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install ARM Performance Libraries (APL) components. (exitcode = %errorlevel%)"
exit /b 1
)
:: Add to environment
echo ARMPL_DIR=%DEPENDENCIES_DIR%\armpl_24.10\>> %GITHUB_ENV%
echo %DEPENDENCIES_DIR%\armpl_24.10\bin\>> %GITHUB_PATH%
echo Dependency ARM Performance Libraries (APL) installation finished.


@ -0,0 +1,41 @@
@echo off
echo Dependency MSVC Build Tools with C++ with ARM64/ARM64EC components installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir "%DOWNLOADS_DIR%"
if not exist "%DEPENDENCIES_DIR%" mkdir "%DEPENDENCIES_DIR%"
:: Set download URL for the Visual Studio Installer
set DOWNLOAD_URL=https://aka.ms/vs/17/release/vs_BuildTools.exe
set INSTALLER_FILE=%DOWNLOADS_DIR%\vs_BuildTools.exe
:: Download installer
echo Downloading Visual Studio Build Tools with C++ installer...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install the Visual Studio Build Tools with C++ components
echo Installing Visual Studio Build Tools with C++ components...
echo Installing MSVC %MSVC_VERSION%
"%INSTALLER_FILE%" --norestart --quiet --wait --installPath "%DEPENDENCIES_DIR%\VSBuildTools" ^
--add Microsoft.VisualStudio.Workload.VCTools ^
--add Microsoft.VisualStudio.Component.Windows10SDK ^
--add Microsoft.VisualStudio.Component.Windows11SDK.22621 ^
--add Microsoft.VisualStudio.Component.VC.ASAN ^
--add Microsoft.VisualStudio.Component.VC.CMake.Project ^
--add Microsoft.VisualStudio.Component.VC.CoreBuildTools ^
--add Microsoft.VisualStudio.Component.VC.CoreIde ^
--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest ^
--add Microsoft.VisualStudio.Component.VC.Tools.ARM64EC ^
--add Microsoft.VisualStudio.Component.VC.Tools.ARM64 ^
--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64
echo exitcode = %errorlevel%
:: Check if installation was successful
if %errorlevel% neq 0 (
echo Failed to install Visual Studio Build Tools with C++ components.
exit /b 1
)
echo Dependency Visual Studio Build Tools with C++ installation finished.


@ -0,0 +1,37 @@
:: We need to install a newer version of Git manually, as the "-submodules" function is not supported by the default Git version on the runner.
@echo off
echo Dependency Git installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: Set download URL for the Git
set DOWNLOAD_URL="https://github.com/git-for-windows/git/releases/download/v2.46.0.windows.1/Git-2.46.0-64-bit.exe"
set INSTALLER_FILE=%DOWNLOADS_DIR%\Git-2.46.0-64-bit.exe
:: Download installer
echo Downloading Git...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install Git
echo Installing Git...
"%INSTALLER_FILE%" /VERYSILENT /DIR="%DEPENDENCIES_DIR%\git"
dir %DEPENDENCIES_DIR%\git
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install Git. (exitcode = %errorlevel%)"
exit /b 1
)
:: Enable long paths
call "%DEPENDENCIES_DIR%\git\cmd\git.exe" config --system core.longpaths true
:: Add to PATH
echo %DEPENDENCIES_DIR%\git\cmd\;%DEPENDENCIES_DIR%\git\bin\>> %GITHUB_PATH%
echo Dependency Git installation finished.


@ -0,0 +1,33 @@
@echo off
echo Dependency libuv installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
cd %DEPENDENCIES_DIR%
git clone https://github.com/libuv/libuv.git -b v1.39.0
echo Configuring libuv...
mkdir libuv\build
cd libuv\build
cmake .. -DBUILD_TESTING=OFF
echo Building libuv...
cmake --build . --config Release
echo Installing libuv...
cmake --install . --prefix ../install
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install libuv. (exitcode = %errorlevel%)"
exit /b 1
)
echo Dependency libuv installation finished.


@ -0,0 +1,46 @@
@echo off
echo Dependency OpenBLAS installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: Clone OpenBLAS
cd %DEPENDENCIES_DIR%
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.29
echo Configuring OpenBLAS...
mkdir OpenBLAS\build
cd OpenBLAS\build
cmake .. -G Ninja ^
-DBUILD_TESTING=0 ^
-DBUILD_BENCHMARKS=0 ^
-DC_LAPACK=1 ^
-DNOFORTRAN=1 ^
-DDYNAMIC_ARCH=0 ^
-DARCH=arm64 ^
-DBINARY=64 ^
-DTARGET=GENERIC ^
-DUSE_OPENMP=1 ^
-DCMAKE_SYSTEM_PROCESSOR=ARM64 ^
-DCMAKE_SYSTEM_NAME=Windows ^
-DCMAKE_BUILD_TYPE=Release
echo Building OpenBLAS...
cmake --build . --config Release
echo Installing OpenBLAS...
cmake --install . --prefix ../install
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install OpenBLAS. (exitcode = %errorlevel%)"
exit /b 1
)
echo Dependency OpenBLAS installation finished.


@ -0,0 +1,44 @@
@echo off
echo Dependency Python installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
if "%DESIRED_PYTHON%" == "3.13" (
echo Python version is set to 3.13
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.13.2/python-3.13.2-arm64.exe
) else if "%DESIRED_PYTHON%" == "3.12" (
echo Python version is set to 3.12
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe
) else if "%DESIRED_PYTHON%" == "3.11" (
echo Python version is set to 3.11
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.11.9/python-3.11.9-arm64.exe
) else (
echo DESIRED_PYTHON not defined, Python version is set to 3.12
set DOWNLOAD_URL=https://www.python.org/ftp/python/3.12.7/python-3.12.7-arm64.exe
)
set INSTALLER_FILE=%DOWNLOADS_DIR%\python-installer.exe
:: Download installer
echo Downloading Python...
curl -L -o "%INSTALLER_FILE%" "%DOWNLOAD_URL%"
:: Install Python
echo Installing Python...
"%INSTALLER_FILE%" /quiet Include_debug=1 TargetDir="%DEPENDENCIES_DIR%\Python"
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install Python. (exitcode = %errorlevel%)"
exit /b 1
)
:: Add to PATH
echo %DEPENDENCIES_DIR%\Python\>> %GITHUB_PATH%
echo %DEPENDENCIES_DIR%\Python\scripts\>> %GITHUB_PATH%
echo %DEPENDENCIES_DIR%\Python\libs\>> %GITHUB_PATH%
echo Dependency Python installation finished.


@ -0,0 +1,33 @@
@echo off
echo Dependency Rust installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
set DOWNLOAD_URL="https://static.rust-lang.org/rustup/dist/x86_64-pc-windows-msvc/rustup-init.exe"
set INSTALLER_FILE=%DOWNLOADS_DIR%\rustup-init.exe
set RUSTUP_HOME=%DEPENDENCIES_DIR%\rust
set CARGO_HOME=%DEPENDENCIES_DIR%\cargo
:: Download installer
echo Downloading Rust...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install Rust
echo Installing Rust...
"%INSTALLER_FILE%" -q -y --default-host aarch64-pc-windows-msvc --default-toolchain stable --profile default
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install Rust. (exitcode = %errorlevel%)"
exit /b 1
)
:: Add to PATH
echo %DEPENDENCIES_DIR%\cargo\bin\>> %GITHUB_PATH%
echo RUSTUP_HOME=%DEPENDENCIES_DIR%\rust>> %GITHUB_ENV%
echo CARGO_HOME=%DEPENDENCIES_DIR%\cargo>> %GITHUB_ENV%
echo Dependency Rust installation finished.


@ -0,0 +1,33 @@
@echo off
echo Dependency sccache installation started.
:: Pre-check for downloads and dependencies folders
if not exist "%DOWNLOADS_DIR%" mkdir %DOWNLOADS_DIR%
if not exist "%DEPENDENCIES_DIR%" mkdir %DEPENDENCIES_DIR%
:: Set download URL for the sccache
set DOWNLOAD_URL="https://github.com/mozilla/sccache/releases/download/v0.8.1/sccache-v0.8.1-x86_64-pc-windows-msvc.zip"
set INSTALLER_FILE=%DOWNLOADS_DIR%\sccache.zip
:: Download installer
echo Downloading sccache.zip...
curl -L -o "%INSTALLER_FILE%" %DOWNLOAD_URL%
:: Install sccache
echo Extracting sccache.zip...
tar -xf "%INSTALLER_FILE%" -C %DEPENDENCIES_DIR%
cd %DEPENDENCIES_DIR%
ren sccache-v0.8.1-x86_64-pc-windows-msvc sccache
cd ..
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed to install sccache. (exitcode = %errorlevel%)"
exit /b 1
)
:: Add to PATH
echo %DEPENDENCIES_DIR%\sccache\>> %GITHUB_PATH%
echo Dependency sccache installation finished.


@ -0,0 +1,22 @@
:: change to source directory
cd %PYTORCH_ROOT%
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: create virtual environment
python -m venv .venv
echo * > .venv\.gitignore
call .\.venv\Scripts\activate
where python
:: install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install pytest numpy protobuf expecttest hypothesis
:: find file name for pytorch wheel
for /f "delims=" %%f in ('dir /b "%PYTORCH_FINAL_PACKAGE_DIR%" ^| findstr "torch-"') do set "TORCH_WHEEL_FILENAME=%PYTORCH_FINAL_PACKAGE_DIR%\%%f"
pip install %TORCH_WHEEL_FILENAME%


@ -0,0 +1,101 @@
@echo on
:: environment variables
set CMAKE_BUILD_TYPE=%BUILD_TYPE%
set CMAKE_C_COMPILER_LAUNCHER=sccache
set CMAKE_CXX_COMPILER_LAUNCHER=sccache
set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install
set MSSdk=1
if defined PYTORCH_BUILD_VERSION (
set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%
set PYTORCH_BUILD_NUMBER=1
)
:: Set BLAS type
if %ENABLE_APL% == 1 (
set BLAS=APL
set USE_LAPACK=1
) else if %ENABLE_OPENBLAS% == 1 (
set BLAS=OpenBLAS
set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install
)
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: change to source directory
cd %PYTORCH_ROOT%
:: copy libuv.dll
copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll
:: create virtual environment
python -m venv .venv
echo * > .venv\.gitignore
call .\.venv\Scripts\activate
where python
:: python install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
:: DISTUTILS_USE_SDK should be set after psutil dependency
set DISTUTILS_USE_SDK=1
:: start sccache server and reset sccache stats
sccache --start-server
sccache --zero-stats
sccache --show-stats
:: Prepare the environment
mkdir libtorch
mkdir libtorch\bin
mkdir libtorch\cmake
mkdir libtorch\include
mkdir libtorch\lib
mkdir libtorch\share
mkdir libtorch\test
:: Call LibTorch build script
python ./tools/build_libtorch.py
:: Check if there is an error
IF ERRORLEVEL 1 exit /b 1
IF NOT ERRORLEVEL 0 exit /b 1
:: Move the files to the correct location
move /Y torch\bin\*.* libtorch\bin\
move /Y torch\cmake\*.* libtorch\cmake\
robocopy /move /e torch\include\ libtorch\include\
move /Y torch\lib\*.* libtorch\lib\
robocopy /move /e torch\share\ libtorch\share\
move /Y torch\test\*.* libtorch\test\
move /Y libtorch\bin\*.dll libtorch\lib\
:: Set version
echo %PYTORCH_BUILD_VERSION% > libtorch\build-version
git rev-parse HEAD > libtorch\build-hash
:: Set LIBTORCH_PREFIX
IF "%DEBUG%" == "" (
set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps
) ELSE (
set LIBTORCH_PREFIX=libtorch-win-arm64-shared-with-deps-debug
)
:: Create output
C:\Windows\System32\tar.exe -cvaf %LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip -C libtorch *
:: Copy output to target directory
if not exist ..\output mkdir ..\output
copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\"
copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_DIR%\%LIBTORCH_PREFIX%-latest.zip"
:: Cleanup raw data to save space
rmdir /s /q libtorch
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed on build_libtorch. (exitcode = %errorlevel%)"
exit /b 1
)


@ -0,0 +1,60 @@
@echo on
:: environment variables
set CMAKE_BUILD_TYPE=%BUILD_TYPE%
set CMAKE_C_COMPILER_LAUNCHER=sccache
set CMAKE_CXX_COMPILER_LAUNCHER=sccache
set libuv_ROOT=%DEPENDENCIES_DIR%\libuv\install
set MSSdk=1
if defined PYTORCH_BUILD_VERSION (
set PYTORCH_BUILD_VERSION=%PYTORCH_BUILD_VERSION%
set PYTORCH_BUILD_NUMBER=1
)
:: Set BLAS type
if %ENABLE_APL% == 1 (
set BLAS=APL
set USE_LAPACK=1
) else if %ENABLE_OPENBLAS% == 1 (
set BLAS=OpenBLAS
set OpenBLAS_HOME=%DEPENDENCIES_DIR%\OpenBLAS\install
)
:: activate visual studio
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
where cl.exe
:: change to source directory
cd %PYTORCH_ROOT%
:: copy libuv.dll
copy %libuv_ROOT%\lib\Release\uv.dll torch\lib\uv.dll
:: create virtual environment
python -m venv .venv
echo * > .venv\.gitignore
call .\.venv\Scripts\activate
where python
:: python install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
:: DISTUTILS_USE_SDK should be set after psutil dependency
set DISTUTILS_USE_SDK=1
:: start sccache server and reset sccache stats
sccache --start-server
sccache --zero-stats
sccache --show-stats
:: Call PyTorch build script
python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
:: show sccache stats
sccache --show-stats
:: Check if installation was successful
if %errorlevel% neq 0 (
echo "Failed on build_pytorch. (exitcode = %errorlevel%)"
exit /b 1
)


@ -0,0 +1,49 @@
@echo off
setlocal
if "%PACKAGE_TYPE%" == "wheel" goto wheel
if "%PACKAGE_TYPE%" == "libtorch" goto libtorch
echo "unknown package type"
exit /b 1
:wheel
call %PYTORCH_ROOT%\.ci\pytorch\windows\arm64\bootstrap_tests.bat
echo Running python rnn_smoke.py...
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke_win_arm64.py
if errorlevel 1 exit /b 1
echo Checking that basic CNN works...
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke_win_arm64.py
if errorlevel 1 exit /b 1
goto end
:libtorch
echo "install and test libtorch"
if not exist tmp mkdir tmp
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *-latest.zip') do C:\Windows\System32\tar.exe -xf "%%i" -C tmp
if ERRORLEVEL 1 exit /b 1
pushd tmp
set VC_VERSION_LOWER=14
set VC_VERSION_UPPER=36
call "%DEPENDENCIES_DIR%\VSBuildTools\VC\Auxiliary\Build\vcvarsall.bat" arm64
set install_root=%CD%
set INCLUDE=%INCLUDE%;%install_root%\include;%install_root%\include\torch\csrc\api\include
set LIB=%LIB%;%install_root%\lib
set PATH=%PATH%;%install_root%\lib
cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\simple-torch-test.cpp c10.lib torch_cpu.lib /EHsc /std:c++17
if ERRORLEVEL 1 exit /b 1
.\simple-torch-test.exe
if ERRORLEVEL 1 exit /b 1
:end


@ -9,12 +9,13 @@ FOR %%v IN (%DESIRED_PYTHON%) DO (
set PYTHON_VERSION_STR=%%v
set PYTHON_VERSION_STR=!PYTHON_VERSION_STR:.=!
conda remove -n py!PYTHON_VERSION_STR! --all -y || rmdir %CONDA_HOME%\envs\py!PYTHON_VERSION_STR! /s
if "%%v" == "3.8" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=1.11 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y -q numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.0.1 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -q -c=conda-forge numpy=2.1.2 pyyaml boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.9" call conda create -n py!PYTHON_VERSION_STR! -y numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.10" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.11" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.12" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.0.1 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.13" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2 boto3 cmake ninja typing_extensions setuptools=72.1.0 python=%%v
if "%%v" == "3.13t" call conda create -n py!PYTHON_VERSION_STR! -y -c=conda-forge numpy=2.1.2 boto3 cmake ninja typing_extensions setuptools=72.1.0 python-freethreading python=3.13
call conda run -n py!PYTHON_VERSION_STR! pip install pyyaml
call conda run -n py!PYTHON_VERSION_STR! pip install mkl-include
call conda run -n py!PYTHON_VERSION_STR! pip install mkl-static
)


@ -0,0 +1,59 @@
@echo off
set MODULE_NAME=pytorch
IF NOT EXIST "setup.py" IF NOT EXIST "%MODULE_NAME%" (
call internal\clone.bat
cd %~dp0
) ELSE (
call internal\clean.bat
)
IF ERRORLEVEL 1 goto :eof
call internal\check_deps.bat
IF ERRORLEVEL 1 goto :eof
REM Check for optional components
set USE_CUDA=
set CMAKE_GENERATOR=Visual Studio 15 2017 Win64
IF "%NVTOOLSEXT_PATH%"=="" (
IF EXIST "C:\Program Files\NVIDIA Corporation\NvToolsExt\lib\x64\nvToolsExt64_1.lib" (
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
) ELSE (
echo NVTX ^(Visual Studio Extension ^for CUDA^) ^not installed, failing
exit /b 1
)
)
IF "%CUDA_PATH_V128%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\bin\nvcc.exe" (
set "CUDA_PATH_V128=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
) ELSE (
echo CUDA 12.8 not found, failing
exit /b 1
)
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=5.0;6.0;6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V128%"
set "PATH=%CUDA_PATH_V128%\bin;%PATH%"
:optcheck
call internal\check_opts.bat
IF ERRORLEVEL 1 goto :eof
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy.bat
IF ERRORLEVEL 1 goto :eof
call %~dp0\internal\setup.bat
IF ERRORLEVEL 1 goto :eof


@ -9,7 +9,8 @@ if "%CUDA_VERSION%" == "xpu" (
exit /b 0
)
set SRC_DIR=%NIGHTLIES_PYTORCH_ROOT%
set SRC_DIR=%~dp0\..
if not exist "%SRC_DIR%\temp_build" mkdir "%SRC_DIR%\temp_build"
set /a CUDA_VER=%CUDA_VERSION%
@ -23,9 +24,9 @@ set CUDNN_LIB_FOLDER="lib\x64"
if exist "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION_STR%\bin\nvcc.exe" goto set_cuda_env_vars
if %CUDA_VER% EQU 118 goto cuda118
if %CUDA_VER% EQU 121 goto cuda121
if %CUDA_VER% EQU 124 goto cuda124
if %CUDA_VER% EQU 126 goto cuda126
if %CUDA_VER% EQU 128 goto cuda128
echo CUDA %CUDA_VERSION_STR% is not supported
exit /b 1
@ -111,6 +112,33 @@ xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda128
set CUDA_INSTALL_EXE=cuda_12.8.0_571.96_windows.exe
if not exist "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%" (
curl -k -L "https://ossci-windows.s3.amazonaws.com/%CUDA_INSTALL_EXE%" --output "%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
if errorlevel 1 exit /b 1
set "CUDA_SETUP_FILE=%SRC_DIR%\temp_build\%CUDA_INSTALL_EXE%"
set "ARGS=cuda_profiler_api_12.8 thrust_12.8 nvcc_12.8 cuobjdump_12.8 nvprune_12.8 nvprof_12.8 cupti_12.8 cublas_12.8 cublas_dev_12.8 cudart_12.8 cufft_12.8 cufft_dev_12.8 curand_12.8 curand_dev_12.8 cusolver_12.8 cusolver_dev_12.8 cusparse_12.8 cusparse_dev_12.8 npp_12.8 npp_dev_12.8 nvrtc_12.8 nvrtc_dev_12.8 nvml_dev_12.8 nvjitlink_12.8 nvtx_12.8"
)
set CUDNN_FOLDER=cudnn-windows-x86_64-9.7.0.66_cuda12-archive
set CUDNN_LIB_FOLDER="lib"
set "CUDNN_INSTALL_ZIP=%CUDNN_FOLDER%.zip"
if not exist "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%" (
curl -k -L "http://s3.amazonaws.com/ossci-windows/%CUDNN_INSTALL_ZIP%" --output "%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
if errorlevel 1 exit /b 1
set "CUDNN_SETUP_FILE=%SRC_DIR%\temp_build\%CUDNN_INSTALL_ZIP%"
)
@REM cuDNN 8.3+ required zlib to be installed on the path
echo Installing ZLIB dlls
curl -k -L "http://s3.amazonaws.com/ossci-windows/zlib123dllx64.zip" --output "%SRC_DIR%\temp_build\zlib123dllx64.zip"
7z x "%SRC_DIR%\temp_build\zlib123dllx64.zip" -o"%SRC_DIR%\temp_build\zlib"
xcopy /Y "%SRC_DIR%\temp_build\zlib\dll_x64\*.dll" "C:\Windows\System32"
goto cuda_common
:cuda_common
:: NOTE: We only install CUDA if we don't have it installed already.
:: With GHA runners these should be pre-installed as part of our AMI process


@ -27,7 +27,6 @@ for /F "delims=" %%i in ('wmic path win32_VideoController get name') do (
endlocal & set NVIDIA_GPU_EXISTS=%NVIDIA_GPU_EXISTS%
if "%PACKAGE_TYPE%" == "wheel" goto wheel
if "%PACKAGE_TYPE%" == "conda" goto conda
if "%PACKAGE_TYPE%" == "libtorch" goto libtorch
echo "unknown package type"
@ -37,6 +36,7 @@ exit /b 1
echo "install wheel package"
set PYTHON_INSTALLER_URL=
if "%DESIRED_PYTHON%" == "3.13t" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.13" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.13.0/python-3.13.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.12" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.12.0/python-3.12.0-amd64.exe"
if "%DESIRED_PYTHON%" == "3.11" set "PYTHON_INSTALLER_URL=https://www.python.org/ftp/python/3.11.0/python-3.11.0-amd64.exe"
@ -47,6 +47,13 @@ if "%PYTHON_INSTALLER_URL%" == "" (
echo Python %DESIRED_PYTHON% not supported yet
)
set ADDITIONAL_OPTIONS=""
set PYTHON_EXEC="python"
if "%DESIRED_PYTHON%" == "3.13t" (
set ADDITIONAL_OPTIONS="Include_freethreaded=1"
set PYTHON_EXEC="python3.13t"
)
del python-amd64.exe
curl --retry 3 -kL "%PYTHON_INSTALLER_URL%" --output python-amd64.exe
if errorlevel 1 exit /b 1
@ -55,85 +62,39 @@ if errorlevel 1 exit /b 1
:: the installed Python to PATH system-wide. Even calling set PATH=%ORIG_PATH% later on won't make
:: a change. As the builder directory will be removed after the smoke test, all subsequent non-binary
:: jobs will fail to find any Python executable there
start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 TargetDir=%CD%\Python
start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_test=0 %ADDITIONAL_OPTIONS% TargetDir=%CD%\Python
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python%PYTHON_VERSION%\Scripts;%CD%\Python;%PATH%"
if "%DESIRED_PYTHON%" == "3.13" pip install -q --pre numpy==2.1.0 protobuf
if "%DESIRED_PYTHON%" == "3.12" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.11" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.10" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.9" pip install -q --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.8" pip install -q numpy protobuf
if "%DESIRED_PYTHON%" == "3.13t" %PYTHON_EXEC% -m pip install --pre numpy==2.2.1 protobuf
if "%DESIRED_PYTHON%" == "3.13" %PYTHON_EXEC% -m pip install --pre numpy==2.1.2 protobuf
if "%DESIRED_PYTHON%" == "3.12" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.11" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.10" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf
if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install --pre numpy==2.0.2 protobuf networkx
if errorlevel 1 exit /b 1
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do pip install "%%i"
if "%PYTORCH_BUILD_VERSION:dev=%" NEQ "%PYTORCH_BUILD_VERSION%" (
set "CHANNEL=nightly"
) else (
set "CHANNEL=test"
)
set "EXTRA_INDEX= "
if "%CUDA_VERSION%" == "xpu" set "EXTRA_INDEX=--index-url https://download.pytorch.org/whl/%CHANNEL%/xpu"
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *.whl') do %PYTHON_EXEC% -m pip install "%%i" %EXTRA_INDEX%
if errorlevel 1 exit /b 1
goto smoke_test
:conda
echo "install conda package"
:: Install Miniconda3
set "CONDA_HOME=%CD%\conda"
set "tmp_conda=%CONDA_HOME%"
set "miniconda_exe=%CD%\miniconda.exe"
set "CONDA_EXTRA_ARGS=cpuonly -c pytorch-nightly"
if "%CUDA_VERSION%" == "118" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=11.8 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "121" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.1 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "124" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.4 -c nvidia -c pytorch-nightly"
)
if "%CUDA_VERSION%" == "126" (
set "CONDA_EXTRA_ARGS=pytorch-cuda=12.6 -c nvidia -c pytorch-nightly"
)
rmdir /s /q conda
del miniconda.exe
curl -k https://repo.anaconda.com/miniconda/Miniconda3-latest-Windows-x86_64.exe -o "%miniconda_exe%"
start /wait "" "%miniconda_exe%" /S /InstallationType=JustMe /RegisterPython=0 /AddToPath=0 /D=%tmp_conda%
if ERRORLEVEL 1 exit /b 1
set "PATH=%CONDA_HOME%;%CONDA_HOME%\scripts;%CONDA_HOME%\Library\bin;%PATH%"
conda create -qyn testenv python=%DESIRED_PYTHON%
if errorlevel 1 exit /b 1
call conda install -yq conda-build
if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
set "NO_ARCH_PATH=%PYTORCH_FINAL_PACKAGE_DIR:/=\%\noarch"
mkdir %NO_ARCH_PATH%
for /F "delims=" %%i in ('where /R "%PYTORCH_FINAL_PACKAGE_DIR:/=\%" *') do xcopy "%%i" %NO_ARCH_PATH% /Y
if ERRORLEVEL 1 exit /b 1
call conda index %PYTORCH_FINAL_PACKAGE_DIR%
if errorlevel 1 exit /b 1
call conda install -yq -c "file:///%PYTORCH_FINAL_PACKAGE_DIR%" pytorch==%PYTORCH_BUILD_VERSION% -c pytorch -c numba/label/dev -c nvidia
if ERRORLEVEL 1 exit /b 1
call conda install -yq numpy
if ERRORLEVEL 1 exit /b 1
set /a CUDA_VER=%CUDA_VERSION%
set CUDA_VER_MAJOR=%CUDA_VERSION:~0,-1%
set CUDA_VER_MINOR=%CUDA_VERSION:~-1,1%
set CUDA_VERSION_STR=%CUDA_VER_MAJOR%.%CUDA_VER_MINOR%
:: Install package we just build
:smoke_test
python -c "import torch"
%PYTHON_EXEC% -c "import torch"
if ERRORLEVEL 1 exit /b 1
echo Checking that MKL is available
python -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"
%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.mkl.is_available() else 1)"
if ERRORLEVEL 1 exit /b 1
if "%NVIDIA_GPU_EXISTS%" == "0" (
@ -142,24 +103,24 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (
)
echo Checking that CUDA archs are setup correctly
python -c "import torch; torch.randn([3,5]).cuda()"
%PYTHON_EXEC% -c "import torch; torch.randn([3,5]).cuda()"
if ERRORLEVEL 1 exit /b 1
echo Checking that magma is available
python -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"
%PYTHON_EXEC% -c "import torch; torch.rand(1).cuda(); exit(0 if torch.cuda.has_magma else 1)"
if ERRORLEVEL 1 exit /b 1
echo Checking that CuDNN is available
python -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"
%PYTHON_EXEC% -c "import torch; exit(0 if torch.backends.cudnn.is_available() else 1)"
if ERRORLEVEL 1 exit /b 1
echo Checking that basic RNN works
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py
%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\rnn_smoke.py
if ERRORLEVEL 1 exit /b 1
echo Checking that basic CNN works
python %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py
%PYTHON_EXEC% %PYTORCH_ROOT%\.ci\pytorch\test_example_code\cnn_smoke.py
if ERRORLEVEL 1 exit /b 1
goto end
@ -167,7 +128,6 @@ goto end
:libtorch
echo "install and test libtorch"
if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1
if ERRORLEVEL 1 exit /b 1
@ -179,10 +139,6 @@ pushd tmp\libtorch
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
IF "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (


@ -70,7 +70,6 @@ echo "install and test libtorch"
pip install cmake
echo "installing cmake"
if "%VC_YEAR%" == "2019" powershell internal\vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell internal\vs2022_install.ps1
if ERRORLEVEL 1 exit /b 1
@ -83,10 +82,6 @@ pushd tmp\libtorch
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
IF "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -legacy -products * -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (


@ -1,12 +1,8 @@
if "%VC_YEAR%" == "2019" powershell windows/internal/vs2019_install.ps1
if "%VC_YEAR%" == "2022" powershell windows/internal/vs2022_install.ps1
set VC_VERSION_LOWER=17
set VC_VERSION_UPPER=18
if "%VC_YEAR%" == "2019" (
set VC_VERSION_LOWER=16
set VC_VERSION_UPPER=17
)
for /f "usebackq tokens=*" %%i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -products Microsoft.VisualStudio.Product.BuildTools -version [%VC_VERSION_LOWER%^,%VC_VERSION_UPPER%^) -property installationPath`) do (
if exist "%%i" if exist "%%i\VC\Auxiliary\Build\vcvarsall.bat" (


@ -1,48 +0,0 @@
# https://developercommunity.visualstudio.com/t/install-specific-version-of-vs-component/1142479
# https://docs.microsoft.com/en-us/visualstudio/releases/2019/history#release-dates-and-build-numbers
# 16.8.6 BuildTools
$VS_DOWNLOAD_LINK = "https://ossci-windows.s3.us-east-1.amazonaws.com/vs16.8.6_BuildTools.exe"
$COLLECT_DOWNLOAD_LINK = "https://aka.ms/vscollect.exe"
$VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStudio.Workload.VCTools",
"--add Microsoft.Component.MSBuild",
"--add Microsoft.VisualStudio.Component.Roslyn.Compiler",
"--add Microsoft.VisualStudio.Component.TextTemplating",
"--add Microsoft.VisualStudio.Component.VC.CoreIde",
"--add Microsoft.VisualStudio.Component.VC.Redist.14.Latest",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Core",
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version 16.8.5 installer failed"
exit 1
}
if (Test-Path "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe") {
$existingPath = & "${env:ProgramFiles(x86)}\Microsoft Visual Studio\Installer\vswhere.exe" -products "Microsoft.VisualStudio.Product.BuildTools" -version "[16, 17)" -property installationPath
if ($existingPath -ne $null) {
if (!${env:CIRCLECI}) {
echo "Found correctly versioned existing BuildTools installation in $existingPath"
exit 0
}
echo "Found existing BuildTools installation in $existingPath, keeping it"
}
}
$process = Start-Process "${PWD}\vs_installer.exe" -ArgumentList $VS_INSTALL_ARGS -NoNewWindow -Wait -PassThru
Remove-Item -Path vs_installer.exe -Force
$exitCode = $process.ExitCode
if (($exitCode -ne 0) -and ($exitCode -ne 3010)) {
echo "VS 2019 installer exited with code $exitCode, which should be one of [0, 3010]."
curl.exe --retry 3 -kL $COLLECT_DOWNLOAD_LINK --output Collect.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS Collect tool failed."
exit 1
}
Start-Process "${PWD}\Collect.exe" -NoNewWindow -Wait -PassThru
New-Item -Path "C:\w\build-results" -ItemType "directory" -Force
Copy-Item -Path "C:\Users\${env:USERNAME}\AppData\Local\Temp\vslogs.zip" -Destination "C:\w\build-results\"
exit 1
}


@ -47,9 +47,9 @@ set XPU_EXTRA_INSTALLED=0
set XPU_EXTRA_UNINSTALL=0
if not [%XPU_VERSION%]==[] if [%XPU_VERSION%]==[2025.0] (
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/efc86abd-cb77-452e-a03f-a741895b8ece/intel-deep-learning-essentials-2025.0.0.336_offline.exe
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d6d6c17-ca2d-4735-9331-99447e4a1280/intel-deep-learning-essentials-2025.0.1.28_offline.exe
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.deep-learning-essentials.product
set XPU_BUNDLE_VERSION=2025.0.0+335
set XPU_BUNDLE_VERSION=2025.0.1+20
set XPU_BUNDLE_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_EXTRA_URL=NULL
@ -104,14 +104,6 @@ goto xpu_install_end
:xpu_bundle_install
:: Install Level Zero SDK
set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip
curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"
echo "Installing level zero SDK..."
7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"
set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"
:: Install Bundle
curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%
echo "XPU Bundle installing..."
start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle
@ -128,3 +120,14 @@ if errorlevel 1 exit /b 1
del xpu_extra.exe
:xpu_install_end
if not "%XPU_ENABLE_KINETO%"=="1" goto install_end
:: Install Level Zero SDK
set XPU_EXTRA_LZ_URL=https://github.com/oneapi-src/level-zero/releases/download/v1.14.0/level-zero-sdk_1.14.0.zip
curl -k -L %XPU_EXTRA_LZ_URL% --output "%SRC_DIR%\temp_build\level_zero_sdk.zip"
echo "Installing level zero SDK..."
7z x "%SRC_DIR%\temp_build\level_zero_sdk.zip" -o"%SRC_DIR%\temp_build\level_zero"
set "INCLUDE=%SRC_DIR%\temp_build\level_zero\include;%INCLUDE%"
del "%SRC_DIR%\temp_build\level_zero_sdk.zip"
:install_end


@ -28,11 +28,6 @@ call "%XPU_BUNDLE_ROOT%\compiler\latest\env\vars.bat"
call "%XPU_BUNDLE_ROOT%\ocloc\latest\env\vars.bat"
IF ERRORLEVEL 1 goto :eof
:: Workaround for https://github.com/pytorch/pytorch/issues/134989
set CMAKE_SHARED_LINKER_FLAGS=/FORCE:MULTIPLE
set CMAKE_MODULE_LINKER_FLAGS=/FORCE:MULTIPLE
set CMAKE_EXE_LINKER_FLAGS=/FORCE:MULTIPLE
if exist "%NIGHTLIES_PYTORCH_ROOT%" cd %NIGHTLIES_PYTORCH_ROOT%\..
call %~dp0\internal\copy_cpu.bat
IF ERRORLEVEL 1 goto :eof


@ -130,7 +130,19 @@ export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
SETUPTOOLS_PINNED_VERSION="=46.0.0"
PYYAML_PINNED_VERSION="=5.3"
EXTRA_CONDA_INSTALL_FLAGS=""
CONDA_ENV_CREATE_FLAGS=""
RENAME_WHEEL=true
case $desired_python in
3.13t)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
PYYAML_PINNED_VERSION=">=6.0.1"
NUMPY_PINNED_VERSION="=2.1.0"
CONDA_ENV_CREATE_FLAGS="python-freethreading"
EXTRA_CONDA_INSTALL_FLAGS="-c conda-forge"
desired_python="3.13"
RENAME_WHEEL=false
;;
3.13)
echo "Using 3.13 deps"
SETUPTOOLS_PINNED_VERSION=">=68.0.0"
@ -169,18 +181,15 @@ esac
# Install into a fresh env
tmp_env_name="wheel_py$python_nodot"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python"
conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
source activate "$tmp_env_name"
pip install -q "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests
retry pip install -qr "${pytorch_rootdir}/requirements.txt" || true
# TODO : Remove me later (but in the interim, use Anaconda cmake, to find Anaconda installed OpenMP)
retry pip uninstall -y cmake
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq llvm-openmp=14.0.6 cmake ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
pip install "numpy=${NUMPY_PINNED_VERSION}" "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
retry brew install libomp
# For USE_DISTRIBUTED=1 on macOS, need libuv and pkg-config to find libuv.
# For USE_DISTRIBUTED=1 on macOS, need libuv, which is built as part of the tensorpipe submodule
export USE_DISTRIBUTED=1
retry conda install ${EXTRA_CONDA_INSTALL_FLAGS} -yq libuv pkg-config
if [[ -n "$CROSS_COMPILE_ARM64" ]]; then
export CMAKE_OSX_ARCHITECTURES=arm64
@ -222,10 +231,13 @@ echo "The wheel is in $(find $whl_tmp_dir -name '*.whl')"
wheel_filename_gen=$(find $whl_tmp_dir -name '*.whl' | head -n1 | xargs -I {} basename {})
popd
if [[ -z "$BUILD_PYTHONLESS" ]]; then
if [[ -z "$BUILD_PYTHONLESS" && $RENAME_WHEEL == true ]]; then
# Copy the whl to a final destination before tests are run
echo "Renaming Wheel file: $wheel_filename_gen to $wheel_filename_new"
cp "$whl_tmp_dir/$wheel_filename_gen" "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_new"
elif [[ $RENAME_WHEEL == false ]]; then
echo "Copying Wheel file: $wheel_filename_gen to $PYTORCH_FINAL_PACKAGE_DIR"
cp "$whl_tmp_dir/$wheel_filename_gen" "$PYTORCH_FINAL_PACKAGE_DIR/$wheel_filename_gen"
else
pushd "$pytorch_rootdir"


@ -30,12 +30,10 @@ fi
# Pick docker image
export DOCKER_IMAGE=${DOCKER_IMAGE:-}
if [[ -z "$DOCKER_IMAGE" ]]; then
if [[ "$PACKAGE_TYPE" == conda ]]; then
export DOCKER_IMAGE="pytorch/conda-cuda"
elif [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux:cpu"
if [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux2_28:cpu"
else
export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"
export DOCKER_IMAGE="pytorch/manylinux2_28-builder:${DESIRED_CUDA:2}"
fi
fi
@ -63,7 +61,7 @@ if tagged_version >/dev/null; then
# Turns tag v1.6.0-rc1 -> v1.6.0
BASE_BUILD_VERSION="$(tagged_version | sed -e 's/^v//' -e 's/-.*$//')"
fi
if [[ "$(uname)" == 'Darwin' ]] || [[ "$PACKAGE_TYPE" == conda ]]; then
if [[ "$(uname)" == 'Darwin' ]]; then
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}"
else
export PYTORCH_BUILD_VERSION="${BASE_BUILD_VERSION}+$DESIRED_CUDA"
@ -76,6 +74,12 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.8 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == cu128 ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
@ -100,11 +104,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-8 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+git${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+git${TRITON_SHORTHASH}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
@ -149,8 +153,6 @@ export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:
# TODO: We don't need this anymore IIUC
export TORCH_PACKAGE_NAME='torch'
export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'
export ANACONDA_USER='pytorch'
export USE_FBGEMM=1
export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"


@ -2,7 +2,7 @@
set -euo pipefail
PACKAGE_TYPE=${PACKAGE_TYPE:-conda}
PACKAGE_TYPE=${PACKAGE_TYPE:-wheel}
PKG_DIR=${PKG_DIR:-/tmp/workspace/final_pkgs}
@ -18,10 +18,8 @@ BUILD_NAME=${BUILD_NAME:-}
DRY_RUN=${DRY_RUN:-enabled}
# Don't actually do work unless explicit
ANACONDA="true anaconda"
AWS_S3_CP="aws s3 cp --dryrun"
if [[ "${DRY_RUN}" = "disabled" ]]; then
ANACONDA="anaconda"
AWS_S3_CP="aws s3 cp"
fi
@ -34,10 +32,6 @@ if [[ ${BUILD_NAME} == *-full* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"
fi
# Sleep 2 minutes between retries for conda upload
retry () {
"$@" || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")
}
do_backup() {
local backup_dir
@ -49,72 +43,33 @@ do_backup() {
)
}
conda_upload() {
(
set -x
retry \
${ANACONDA} \
upload \
${PKG_DIR}/*.tar.bz2 \
-u "pytorch-${UPLOAD_CHANNEL}" \
--label main \
--no-progress \
--force
)
}
s3_upload() {
local extension
local pkg_type
extension="$1"
pkg_type="$2"
s3_key_prefix="${pkg_type}/${UPLOAD_CHANNEL}"
s3_root_dir="${UPLOAD_BUCKET}/${pkg_type}/${UPLOAD_CHANNEL}"
if [[ -z ${UPLOAD_SUBFOLDER:-} ]]; then
s3_upload_dir="${UPLOAD_BUCKET}/${s3_key_prefix}/"
s3_upload_dir="${s3_root_dir}/"
else
s3_key_prefix="${s3_key_prefix}/${UPLOAD_SUBFOLDER}"
s3_upload_dir="${UPLOAD_BUCKET}/${s3_key_prefix}/"
s3_upload_dir="${s3_root_dir}/${UPLOAD_SUBFOLDER}/"
fi
(
for pkg in ${PKG_DIR}/*.${extension}; do
(
set -x
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}"
if [[ ${pkg_type} == "whl" ]]; then
dry_run_arg="--dry-run"
if [[ "${DRY_RUN}" = "disabled" ]]; then
dry_run_arg=""
fi
uv run scripts/release/upload_metadata_file.py \
--package "${pkg}" \
--bucket "${UPLOAD_BUCKET}" \
--key-prefix "${s3_key_prefix}" \
${dry_run_arg}
fi
shm_id=$(sha256sum "${pkg}" | awk '{print $1}')
${AWS_S3_CP} --no-progress --acl public-read "${pkg}" "${s3_upload_dir}" \
--metadata "checksum-sha256=${shm_id}"
)
done
)
}
# Install dependencies (should be a no-op if previously installed)
conda install -yq anaconda-client
pip install -q awscli uv
case "${PACKAGE_TYPE}" in
conda)
conda_upload
for conda_archive in ${PKG_DIR}/*.tar.bz2; do
# Fetch platform (eg. win-64, linux-64, etc.) from index file because
# there's no actual conda command to read this
subdir=$(\
tar -xOf "${conda_archive}" info/index.json \
| grep subdir \
| cut -d ':' -f2 \
| sed -e 's/[[:space:]]//' -e 's/"//g' -e 's/,//' \
)
BACKUP_DIR="conda/${subdir}"
done
;;
libtorch)
s3_upload "zip" "libtorch"
BACKUP_DIR="libtorch/${UPLOAD_CHANNEL}/${UPLOAD_SUBFOLDER}"


@ -0,0 +1,22 @@
#!/bin/bash
set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR"
export USE_SCCACHE=1
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
echo "Free space on filesystem before build:"
df -h
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
if [[ "$PACKAGE_TYPE" == 'libtorch' ]]; then
pytorch/.ci/pytorch/windows/arm64/build_libtorch.bat
elif [[ "$PACKAGE_TYPE" == 'wheel' ]]; then
pytorch/.ci/pytorch/windows/arm64/build_pytorch.bat
fi
echo "Free space on filesystem after build:"
df -h


@ -0,0 +1,6 @@
#!/bin/bash
set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
pytorch/.ci/pytorch/windows/arm64/smoke_test.bat


@ -8,12 +8,12 @@ export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2019
export VC_YEAR=2022
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export USE_SCCACHE=0
export XPU_VERSION=2025.0
export XPU_ENABLE_KINETO=1
fi
echo "Free space on filesystem before build:"


@ -4,10 +4,9 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2019
export VC_YEAR=2022
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export XPU_VERSION=2025.0
fi


@ -12,6 +12,7 @@ bugprone-*,
-bugprone-macro-parentheses,
-bugprone-lambda-function-name,
-bugprone-reserved-identifier,
-bugprone-return-const-ref-from-parameter,
-bugprone-swapped-arguments,
clang-analyzer-core.*,
clang-analyzer-cplusplus.*,
@ -24,6 +25,7 @@ cppcoreguidelines-*,
-cppcoreguidelines-avoid-non-const-global-variables,
-cppcoreguidelines-interfaces-global-init,
-cppcoreguidelines-macro-usage,
-cppcoreguidelines-macro-to-enum,
-cppcoreguidelines-owning-memory,
-cppcoreguidelines-pro-bounds-array-to-pointer-decay,
-cppcoreguidelines-pro-bounds-constant-array-index,
@ -46,6 +48,7 @@ misc-*,
-misc-no-recursion,
-misc-non-private-member-variables-in-classes,
-misc-unused-using-decls,
-misc-use-internal-linkage,
modernize-*,
-modernize-macro-to-enum,
-modernize-return-braced-init-list,
@ -55,6 +58,7 @@ modernize-*,
-modernize-use-trailing-return-type,
-modernize-use-nodiscard,
performance-*,
-performance-enum-size,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include


@ -38,6 +38,7 @@ per-file-ignores =
torchgen/api/types/__init__.py: F401,F403
torchgen/executorch/api/types/__init__.py: F401,F403
test/dynamo/test_higher_order_ops.py: B950
test/dynamo/test_error_messages.py: B950
torch/testing/_internal/dynamo_test_failures.py: B950
# TOR901 is only for test, we want to ignore it for everything else.
# It's not easy to configure this without affecting other per-file-ignores,


@ -24,6 +24,10 @@ e3900d2ba5c9f91a24a9ce34520794c8366d5c54
2e26976ad3b06ce95dd6afccfdbe124802edf28f
# 2021-06-07 Strictly typed everything in `.github` and `tools`
737d920b21db9b4292d056ee1329945990656304
# 2021-08-12 [codemod][lint][fbcode/c*] Enable BLACK by default
b0043072529b81276a69df29e00555333117646c
# 2021-08-25 Reformat run_test.py
67d8e7b659b19e1ee68208b28bfa7dba73375dbc
# 2022-06-09 Apply clang-format to ATen headers
95b15c266baaf989ef7b6bbd7c23a2d90bacf687
# 2022-06-11 [lint] autoformat test/cpp and torch/csrc
@ -44,3 +48,57 @@ a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
d80939e5e9337e8078f11489afefec59fd42f93b
# 2024-06-28 enable UFMT in `torch.utils.data`
7cf0b90e49689d45be91aa539fdf54cf2ea8a9a3
# 2024-07-03 Enable UFMT on test/test_public_bindings.py (#128389)
fe5424d0f8604f6e66d827ae9f94b05cb7119d55
# 2024-07-03 Enable UFMT on test/test_public_bindings.py (#128389)
c686304277f7cd72331f685605325498cff94a0b
# 2024-07-15 Enable UFMT on all of torch/sparse (#130545)
535016967ae65a6027f83d6b935a985996223d49
# 2024-07-15 [BE][Easy][1/19] enforce style for empty lines in import segments (#129752)
a3abfa5cb57203b6a8ba7dff763f4057db8282a8
# 2024-07-15 [BE][Easy][2/19] enforce style for empty lines in import segments in `.ci/` and `.github/` (#129753)
ba48cf653541e9160dfdefa7bfea885c22e48608
# 2024-07-16 [BE][Easy][5/19] enforce style for empty lines in import segments in `tools/` and `torchgen/` (#129756)
f6838d521a243dbedc50ae96575720bf2cc8a8ad
# 2024-07-17 [BE][Easy][9/19] enforce style for empty lines in import segments in `test/[e-h]*/` (#129760)
76169cf69184bd462b9add40f893f57675f8a057
# 2024-07-16 [BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754)
c0ed38e644aed812d76b0ec85fae2f6019bf462b
# 2024-07-16 [BE][Easy][4/19] enforce style for empty lines in import segments in `functorch/` (#129755)
740fb229660f388feddc288c127ab12c82e67d36
# 2024-07-17 [BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)
aecc746fccc4495313167e3a7f94210daf457e1d
# 2024-07-18 Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)"
b732b52f1e4378f8486ceb5e7026be3321c2651c
# 2024-07-18 [BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763)
134bc4fc34bb02795aa694e66b132dcea5dde1e1
# 2024-07-26 [BE][Easy][8/19] enforce style for empty lines in import segments in `test/[k-p]*/` (#129759)
fbe6f42dcf1834213e0baa87b87529161df3c4d7
# 2024-07-31 [BE][Easy][14/19] enforce style for empty lines in import segments in `torch/_[a-c]*/` and `torch/_[e-h]*/` and `torch/_[j-z]*/` (#129765)
e7eeee473c6cb45942e87de5a616b0eb635513d6
# 2024-07-31 Fix lint after PR #130572 (#132316)
d72e863b3ecd3de4c8ea00518e110da964583f4f
# 2024-07-31 [BE][Easy][15/19] enforce style for empty lines in import segments in `torch/_d*/` (#129767)
e74ba1b34a476b46e76b4e32afe2d481f97e9a47
# 2024-07-31 [BE][Easy][18/19] enforce style for empty lines in import segments in `torch/d*/` (#129770)
b25ef91bf158ce459d8654e33c50c8e6ed8db716
# 2024-07-20 [BE][Easy][13/19] enforce style for empty lines in import segments in `test/j*/` (#129764)
6ff1e43a416c43cd82b210e22ac47384494c172e
# 2024-11-01 [Lint] Clang-format all metal kernels (#139530)
b3ad45733bd908b7358959ca1e1f8d026f4507eb
# 2024-11-17 [BE][MPS] Apply clang-format to mps headers (#140906)
99014a297c179862af38ee86bac2051434d3db41
# 2024-11-27 Apply clang-format for ATen/core/boxing headers (#141105)
19d01a1ef0c0d65768eb0a5c97a25328eec57fbd
# 2024-12-05 fix the lint from D66795414 (#142122)
65c2086d452ae6966ce9d7fb3cb2eef2fd0d2add
# 2024-12-20 Apply clang-format for ATen/core/dispatch headers (#143620)
cee06e74eeb54994b97000a02b715a4e63a97951
# 2024-12-22 Better fix for f-strings in set_linter for py3.12 (#143725)
eebc93d41eeffb936cbf20c9052e1e813d0cc052
# 2025-01-04 [mps/BE] Fix linter warning/advice. (#144199)
0dc1e6be192b260f1c072d70e1b06a3ac8e109fa
# 2025-01-07 Fix lint in `test_provenance_tracing.py` (#144296)
61c0a3d1cbaf6420e40ab0f9c9019daa21145e69
# 2025-01-09 [BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
dcc3cf7066b4d8cab63ecb73daf1e36b01220a4e
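
The hashes appended above feed git's blame-ignore mechanism, so bulk reformatting and lint commits do not show up as the last author of every line they touched. A minimal sketch of consuming the file (assuming a local checkout and git 2.23 or newer):

```python
# Minimal sketch: run `git blame` while skipping the bulk-formatting commits
# listed in .git-blame-ignore-revs. Assumes a local checkout and git >= 2.23;
# "setup.py" is just an example target file.
import subprocess

subprocess.run(
    ["git", "blame", "--ignore-revs-file", ".git-blame-ignore-revs", "setup.py"],
    check=True,
)
```

Running `git config blame.ignoreRevsFile .git-blame-ignore-revs` once makes plain `git blame` pick the file up automatically.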


@ -5,7 +5,7 @@ body:
- type: markdown
attributes:
value: >
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/pytorch/pytorch/issues?q=is%3Aissue+sort%3Acreated-desc+). Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.
#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/pytorch/pytorch/issues?q=is%3Aissue+sort%3Acreated-desc+). Note: Please write your bug report in English to ensure it can be understood and addressed by the development team. If you are filing a bug for torch.compile, please use the [torch.compile issue template](https://github.com/pytorch/pytorch/issues/new?q=sort%3Aupdated-desc+is%3Aissue+is%3Aopen&template=pt2-bug-report.yml).
- type: textarea
attributes:
label: 🐛 Describe the bug


@ -5,7 +5,7 @@ title: "DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]"
labels: "module: ci"
---
> For example, DISABLED pull / win-vs2019-cpu-py3 / test (default). Once
> For example, DISABLED pull / win-vs2022-cpu-py3 / test (default). Once
> created, the job will be disabled within 15 minutes. You can check the
> list of disabled jobs at https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json


@ -6,7 +6,7 @@ body:
- type: markdown
attributes:
value: >
#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.
#### Note: Please write your bug report in English to ensure it can be understood and addressed by the development team.
- type: markdown
attributes:
value: >
@ -22,6 +22,8 @@ body:
- If comparing eager and torch.compile at fp16/bf16, you should use fp32 as baseline
- Ensure rng state used to compare results is equivalent. Use `torch._inductor.config.fallback_random=True` and reset the torch rng seed between comparisons
If the above requirements are met, add the label "topic: fuzzer" to your issue.
- type: textarea
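
The comparison requirements added above roughly translate into a check like the following; a minimal sketch, with the model and input as placeholders:

```python
# Minimal sketch of an eager vs. torch.compile accuracy comparison following
# the guidance above; the model and input are placeholders.
import torch
from torch._inductor import config as inductor_config

inductor_config.fallback_random = True  # keep RNG behaviour comparable across backends

model = torch.nn.Linear(16, 16)  # placeholder model; keep the baseline in fp32
x = torch.randn(4, 16)           # placeholder input

torch.manual_seed(0)             # seed before the eager (baseline) run
baseline = model(x)

torch.manual_seed(0)             # reset the seed before the compiled run
compiled_out = torch.compile(model)(x)

print(torch.allclose(baseline, compiled_out, rtol=1e-5, atol=1e-5))
```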


@ -1,8 +1,10 @@
self-hosted-runner:
labels:
# GitHub hosted runner that actionlint doesn't recognize because actionlint version (1.6.21) is too old
- ubuntu-24.04
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
- linux.24_04.4x
- linux.24_04.16x
# Organization-wide AWS Linux Runners
- linux.large
- linux.2xlarge
@ -10,7 +12,6 @@ self-hosted-runner:
- linux.9xlarge.ephemeral
- am2.linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.arm64.2xlarge
@ -42,6 +43,8 @@ self-hosted-runner:
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral
- windows.g5.4xlarge.nvidia.gpu
# Windows ARM64 runners
- windows-11-arm64
# Organization-wide AMD hosted runners
- linux.rocm.gpu
- linux.rocm.gpu.2


@ -40,6 +40,11 @@ runs:
fi
mkdir "${GITHUB_WORKSPACE}"
# Use all available CPUs for fetching
cd "${GITHUB_WORKSPACE}"
git config --global fetch.parallel 0
git config --global submodule.fetchJobs 0
- name: Checkout PyTorch
uses: actions/checkout@v4
with:


@ -99,8 +99,14 @@ runs:
# All GPUs are visible to the runner; visibility, if needed, will be set by run_test.py.
# Add render group for container creation.
render_gid=`cat /etc/group | grep render | cut -d: -f3`
# Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
if [ -f "/etc/podinfo/gha-render-devices" ]; then
DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
else
DEVICE_FLAG="--device /dev/dri"
fi
# The --group-add daemon and --group-add bin are needed in the Ubuntu 24.04 and Almalinux OSs respectively.
# This is due to the device files (/dev/kfd & /dev/dri) being owned by video group on bare metal.
# This video group ID maps to subgid 1 inside the docker image due to the /etc/subgid entries.
# The group name corresponding to group ID 1 can change depending on the OS, so both are necessary.
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device /dev/dri --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}"
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}"


@ -0,0 +1,56 @@
name: upload-utilization-stats
description: Upload utilization stats to artifacts
inputs:
workflow_run_id:
type: string
description: 'workflow (run) id of the workflow the test is running'
required: True
workflow_attempt:
type: string
description: 'the workflow (run) attempt'
required: True
workflow_name:
description: 'name of the workflow'
type: string
required: True
job_id:
type: string
description: 'the job (run) id for the test'
required: True
job_name:
type: string
description: 'the job name of the test'
required: True
runs:
using: composite
steps:
- name: Print Inputs
shell: bash
run: |
echo "workflow_id: ${{inputs.workflow_run_id}}"
echo "workflow_attempt: ${{inputs.workflow_attempt}}"
echo "workflow_Name: ${{inputs.workflow_name}}"
echo "job_id: ${{inputs.job_id}}"
echo "job_name: ${{inputs.job_name}}"
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash
timeout_minutes: 5
max_attempts: 5
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install python-dateutil==2.8.2 boto3==1.35.42 pandas==2.1.3
- name: Upload utilization stats to s3
shell: bash
run: |
python3 -m tools.stats.upload_utilization_stats.upload_utilization_stats \
--workflow-run-id "${{inputs.workflow_run_id}}" \
--workflow-name "${{inputs.workflow_name}}" \
--workflow-run-attempt "${{inputs.workflow_attempt}}" \
--job-id "${{inputs.job_id}}" \
--job-name "${{inputs.job_name}}"
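
The action above ultimately runs a single module; for local debugging the same invocation can presumably be reproduced from a checkout, as sketched below (all IDs and names are placeholder values):

```python
# Placeholder values throughout; the module also expects python-dateutil,
# boto3 and pandas to be installed, as in the action's setup step.
import subprocess

subprocess.run(
    [
        "python3", "-m",
        "tools.stats.upload_utilization_stats.upload_utilization_stats",
        "--workflow-run-id", "1234567890",
        "--workflow-name", "pull",
        "--workflow-run-attempt", "1",
        "--job-id", "9876543210",
        "--job-name", "test (default, 1, 1)",
    ],
    check=True,
)
```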


@ -1 +1 @@
b6d4675c7aedc53ba04f3f55786aac1de32be6b4
c670ad81fda266b6598aeeef434583eb98197ae8


@ -0,0 +1 @@
5fb5024118e9bb9decf96c2b0b1a8f0010bf56be


@ -1 +1 @@
766a5e3a189384659fd35a68c3b17b88c761aaac
373ffb19dc470f4423a3176a4133f8f4b3cdb5bd


@ -1 +1 @@
b2b890e962f5fb6f481e5da2eb4a43bb990d0f1b
r2.7

.github/labeler.yml

@ -98,7 +98,7 @@
- test/distributed/**
- torch/testing/_internal/distributed/**
"module: distributed_checkpoint":
"release notes: distributed (checkpoint)":
- torch/distributed/checkpoint/**
- test/distributed/checkpoint/**
@ -107,3 +107,8 @@
- torch/csrc/dynamo/compiled_autograd.h
- torch/_dynamo/compiled_autograd.py
- torch/inductor/test_compiled_autograd.py
"ciflow/xpu":
- torch/csrc/inductor/aoti_include/xpu.h
- torch/csrc/inductor/cpp_wrapper/device_internal/xpu.h
- torch/csrc/inductor/cpp_wrapper/xpu.h


@ -79,7 +79,6 @@
- .ci/docker/ci_commit_pins/triton.txt
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
@ -91,7 +90,6 @@
- test/slow_tests.json
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
@ -103,12 +101,10 @@
- .ci/docker/ci_commit_pins/executorch.txt
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-jammy-py3-clang12-executorch / build
- pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge)
- pull
- name: OSS CI / pytorchbot / XLA
patterns:
@ -119,8 +115,7 @@
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-focal-py3_9-clang9-xla / build
- pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- pull
- name: Documentation
patterns:
@ -247,25 +242,6 @@
- Lint
- pull
- name: XPU ATen
patterns:
- aten/src/ATen/xpu/**
- c10/xpu/**
- torch/csrc/xpu/**
- torch/xpu/**
- test/xpu/**
- test/test_xpu.py
- third_party/xpu.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
approved_by:
- EikanWang
- jgong5
- gujinghui
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: Distributions
patterns:
- torch/distributions/**
@ -358,6 +334,7 @@
- XiaobingSuper
- jgong5
- mingfeima
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint
@ -390,6 +367,7 @@
- jgong5
- vfdev-5
- leslie-fang-intel
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint
@ -403,6 +381,7 @@
approved_by:
- leslie-fang-intel
- jgong5
- EikanWang
mandatory_checks_name:
- EasyCLA
- Lint
@ -519,6 +498,19 @@
- Lint
- pull
- name: XPU
patterns:
- '**xpu**'
- '**sycl**'
approved_by:
- EikanWang
- jgong5
- gujinghui
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: superuser
patterns:
- '*'
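
The new `XPU` rule above matches by broad globs rather than an explicit path list. As a rough illustration only (the merge bot's real matcher may differ in detail), `fnmatch`-style patterns such as `**xpu**` behave like a substring test:

```python
# Rough illustration of how broad globs such as '**xpu**' behave; the merge
# bot's actual pattern matching may differ in detail.
from fnmatch import fnmatch

patterns = ["**xpu**", "**sycl**"]
paths = [
    "aten/src/ATen/xpu/XPUContext.cpp",        # contains "xpu" -> matches
    "torch/csrc/inductor/cpp_wrapper/xpu.h",   # contains "xpu" -> matches
    "aten/src/ATen/native/cpu/Loops.h",        # no "xpu"/"sycl" -> no match
]

for path in paths:
    hit = any(fnmatch(path, p) for p in patterns)
    print(f"{path}: {'XPU rule' if hit else 'no match'}")
```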


@ -3,3 +3,10 @@
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
pathFilter:
- 'aten/src/ATen/native/native_functions.yaml'
- markdown: |
## Attention! One of PyTorch's C-stable API files was changed
You MUST NOT change existing function declarations in these headers, as they define a stable C ABI. If you need to change a function's signature, introduce a new v2 version of the function and modify code generation to target the new version.
pathFilter:
- 'torch/csrc/inductor/aoti_torch/c/*'
- 'torch/csrc/inductor/aoti_torch/generated/*'


@ -7,6 +7,7 @@ ciflow_push_tags:
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
@ -16,6 +17,7 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390
- ciflow/slow
- ciflow/trunk


@ -5,7 +5,7 @@
# functorch/docs/requirements.txt
# .ci/docker/requirements-ci.txt
boto3==1.35.42
jinja2==3.1.5
jinja2==3.1.6
lintrunner==0.10.7
ninja==1.10.0.post1
nvidia-ml-py==11.525.84


@ -19,8 +19,7 @@ pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
pytest-subtests==0.13.1
scipy==1.10.1
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
sympy==1.13.3
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
filelock==3.6.0


@ -52,7 +52,6 @@ def build_triton(
*,
version: str,
commit_hash: str,
build_conda: bool = False,
device: str = "cuda",
py_version: Optional[str] = None,
release: bool = False,
@ -83,55 +82,6 @@ def build_triton(
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(
f"package:\n name: torchtriton\n version: {version}\n",
file=meta,
)
print("source:\n path: .\n", file=meta)
print(
"build:\n string: py{{py}}\n number: 1\n script: cd python; "
"python setup.py install --record=record.txt\n",
" script_env:\n - MAX_JOBS\n",
file=meta,
)
print(
"requirements:\n host:\n - python\n - setuptools\n - pybind11\n"
" run:\n - python\n - filelock\n - pytorch\n",
file=meta,
)
print(
"about:\n home: https://github.com/openai/triton\n license: MIT\n summary:"
" 'A language and compiler for custom Deep Learning operation'",
file=meta,
)
patch_init_py(
triton_pythondir / "triton" / "__init__.py",
version=f"{version}",
)
if py_version is None:
py_version = f"{sys.version_info.major}.{sys.version_info.minor}"
check_call(
[
"conda",
"build",
"--python",
py_version,
"-c",
"pytorch-nightly",
"--output-folder",
tmpdir,
".",
],
cwd=triton_basedir,
env=env,
)
conda_path = next(iter(Path(tmpdir).glob("linux-64/torchtriton*.bz2")))
shutil.copy(conda_path, Path.cwd())
return Path.cwd() / conda_path.name
# change built wheel name and version
env["TRITON_WHEEL_NAME"] = triton_pkg_name
if with_clang_ldd:
@ -172,9 +122,8 @@ def main() -> None:
parser = ArgumentParser("Build Triton binaries")
parser.add_argument("--release", action="store_true")
parser.add_argument("--build-conda", action="store_true")
parser.add_argument(
"--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu"]
"--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu", "aarch64"]
)
parser.add_argument("--py-version", type=str)
parser.add_argument("--commit-hash", type=str)
@ -188,7 +137,6 @@ def main() -> None:
args.commit_hash if args.commit_hash else read_triton_pin(args.device)
),
version=args.triton_version,
build_conda=args.build_conda,
py_version=args.py_version,
release=args.release,
with_clang_ldd=args.with_clang_ldd,
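
For the `--device` change above, a standalone sketch of the resulting argument handling (mirrored from the hunk; not the script itself):

```python
# Mirrors the updated --device choices shown in the hunk above; "aarch64" is
# now accepted alongside cuda/rocm/xpu.
from argparse import ArgumentParser

parser = ArgumentParser("Build Triton binaries")
parser.add_argument(
    "--device", type=str, default="cuda", choices=["cuda", "rocm", "xpu", "aarch64"]
)

print(parser.parse_args(["--device", "aarch64"]).device)  # -> aarch64
print(parser.parse_args([]).device)                       # -> cuda
```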


@ -3,7 +3,7 @@
import json
import os
import re
from typing import Any, cast, Dict, List, Optional
from typing import Any, cast, Optional
from urllib.error import HTTPError
from github_utils import gh_fetch_url, gh_post_pr_comment, gh_query_issues_by_labels
@ -67,7 +67,7 @@ def get_release_version(onto_branch: str) -> Optional[str]:
def get_tracker_issues(
org: str, project: str, onto_branch: str
) -> List[Dict[str, Any]]:
) -> list[dict[str, Any]]:
"""
Find the tracker issue from the repo. The tracker issue needs to have the title
like [VERSION] Release Tracker following the convention on PyTorch
@ -117,7 +117,7 @@ def cherry_pick(
continue
res = cast(
Dict[str, Any],
dict[str, Any],
post_tracker_issue_comment(
org,
project,
@ -220,7 +220,7 @@ def submit_pr(
def post_pr_comment(
org: str, project: str, pr_num: int, msg: str, dry_run: bool = False
) -> List[Dict[str, Any]]:
) -> list[dict[str, Any]]:
"""
Post a comment on the PR itself to point to the cherry picking PR when success
or print the error when failure
@ -255,7 +255,7 @@ def post_tracker_issue_comment(
classification: str,
fixes: str,
dry_run: bool = False,
) -> List[Dict[str, Any]]:
) -> list[dict[str, Any]]:
"""
Post a comment on the tracker issue (if any) to record the cherry pick
"""

Some files were not shown because too many files have changed in this diff.