Commit Graph

1517 Commits

Author SHA1 Message Date
cyy
01f66d08d9 Remove outdated CMAKE_CUDA_COMPILER_VERSION branch (#160075)
Remove the condition `if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.0)` in cmake/Codegen.cmake, because we are now default to CUDA >=12.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160075
Approved by: https://github.com/Skylion007
2025-08-09 14:23:17 +00:00
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK Gemms and SDPA works.

- Adds USE_ROCM_CK_GEMM compile flag for enabling CK gemms.
- USE_ROCM_CK_GEMM is set to True by default on Linux
- Updates USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release)
- Prevents these CK libraries from being used unless pytorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- the getters for these library backends will also do some validity checking in case the user used an environment variable to change the backend. If invalid, (i.e. one of the cases mentioned above is false) the backend will be set as the current non-CK default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
93da9952a7 gloo: fix building system gloo with CUDA/HIP (#146637)
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library were linked. However, Gloo had changed such that all users of Gloo must link the native library, and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.

Fixes: #146239

Reported-by: Adam J Stewart <ajstewart426@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
2025-08-06 22:56:31 +00:00
e9d27aa8fd [CUDA 13] CMake/Dependencies: no need to call find_package(CUB) (#159854)
CUB library is the part of CCCL of the CUDA Toolkit 13. If CUDA Found, CUB is found as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159854
Approved by: https://github.com/eqy
2025-08-06 06:03:58 +00:00
9b953bb3fb [BE] Update TensorPipe pin (#159834)
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x plus by including `cstddef` in few headers
-  Avoid using deprecated enums for MacOS builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
2025-08-05 20:45:09 +00:00
85d931f29e Use uppercase OR when checking for system XNNPACK (#159527)
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.

---

For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:

```
CMake Error at cmake/Dependencies.cmake:566 (if):
  if given arguments:

    "NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"

  Unknown arguments specified
Call Stack (most recent call first):
  CMakeLists.txt:868 (include)
```

Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
2025-08-05 02:10:53 +00:00
e136a9175b [BE] Fix dev warning in Dependencies.cmake (#159702)
Namely
```
CMake Warning (dev) in cmake/Dependencies.cmake:
  A logical block opening on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:261 (if)

  closes on the line

    /Users/nshulga/git/pytorch/pytorch/cmake/Dependencies.cmake:263 (endif)

  with mis-matching arguments.
```

Introduced by https://github.com/pytorch/pytorch/pull/143846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159702
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-08-03 18:45:07 +00:00
0df78f0c11 Remove /d2implyavx512upperregs- flag (#159431)
And reopen https://github.com/pytorch/pytorch/issues/145702

As this flag is not documented anywhere, slows down sccache accelerated build and  per https://developercommunity.visualstudio.com/t/Invalid-code-gen-when-using-AVX2-and-SSE/10527298#T-N10562579 it does not workaround a compiler bug, but rather disables some optimizations of AVX512 instructions which are being invoked in AVX2 codepath

Fixes https://github.com/pytorch/pytorch/issues/159082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159431
Approved by: https://github.com/clee2000
2025-07-30 18:47:03 +00:00
cyy
65c1109ca2 Remove CUDA 11 CMake code (#156795)
CUDA 11 is no longer supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156795
Approved by: https://github.com/atalman, https://github.com/malfet
2025-07-24 08:00:41 +00:00
cyy
d352c28dd1 [2/N] Remove FindPackageHandleStandardArgs.cmake (#156559)
Following #157188, this PR removes FindPackageHandleStandardArgs.cmake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156559
Approved by: https://github.com/albanD
2025-07-24 02:34:10 +00:00
5619bf9971 Enable MI355X PyTorch CI testing. (#158889)
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.

- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION` as aotriton depnds on hip. Hip loosely track the rocm major version, but the two are not actually synchronized as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
2025-07-23 21:50:31 +00:00
abe0c9538a [BE] Fix extra-semi warnings (#158730)
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what was thereason behind adding the warning but not erroring on on it when building with -Werror introduced by https://github.com/pytorch/pytorch/pull/140236 )

300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that needs updates:
 - TensorPipe
 - LLVM
 - FBGEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
2025-07-22 01:05:03 +00:00
cbe1cb7018 [CMake] Move xpu flag to xpu.cmake (#158542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158542
Approved by: https://github.com/gujinghui, https://github.com/ezyang
2025-07-21 17:19:59 +00:00
09db3a22e8 [BE] Get rid of final mentions of BUILD_SPLIT_CUDA (#158453)
BUILD_SPLIT_CUDA logic has been removed for a while

Differential Revision: [D78418191](https://our.internmc.facebook.com/intern/diff/D78418191/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158453
Approved by: https://github.com/albanD
ghstack dependencies: #158358, #158365
2025-07-17 06:47:10 +00:00
a23f4471b9 [ROCm][Windows] Fix finding ROCm/HIP version (#156486)
This commit fixes Windows build issue related to trying to use rocm-core (rocm-core doesn't exist on HIP SDK)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156486
Approved by: https://github.com/jeffdaily, https://github.com/stellaraccident
2025-07-16 15:31:43 +00:00
06a67a8948 Fix sha256 for aotriton ROCm7.0 tarball (#158420)
Fixes following issue of building PyTorch with ROCm7.0:
```
-- verifying file...
       file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
    /var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
  does not match expected value
    expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
      actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
  Each download failed!
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
2025-07-16 15:24:20 +00:00
4ff9b7fa31 Fix diagnostic message for CUDA version mismatch in cuda.cmake (#157370)
This PR fixes  #157354

It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.

The problem was caused by this line:

set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})

This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:

set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})

Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
2025-07-11 20:58:35 +00:00
eqy
9963845a4e [CUDA] Support family-conditional compute capabilies in TORCH_CUDA_ARCH_LIST (#157999)
Similar to arch-conditionals, such as 9.0a  and 10.0a, family conditionals such as 10.0f enable features specific to a family of architectures, such as between sm100 and sm103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157999
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-11 20:34:59 +00:00
cyy
7381c77724 Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-07-08 12:20:29 +00:00
cyy
3ee8828c87 [1/N] Don't use CUDA.cmake module (#157188)
Small changes before removing CUDA.cmake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157188
Approved by: https://github.com/ezyang
2025-07-08 03:05:35 +00:00
c5589074e6 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-08 01:21:59 +00:00
19a01382bc Revert "[SymmMem] find_path does not search /usr/local/lib (#157695)"
This reverts commit 3effe0c293219b00a0eae7e139fe2d9aed84bc03.

Reverted https://github.com/pytorch/pytorch/pull/157695 on behalf of https://github.com/kwen2501 due to Changing it to be landable on 2.8 branch ([comment](https://github.com/pytorch/pytorch/pull/157695#issuecomment-3047020152))
2025-07-08 01:12:01 +00:00
3effe0c293 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-07 23:16:45 +00:00
cyy
7c1f627828 Fix 'dllimport attribute ignored on inline function' (#157670)
There are lots of warnings in builds:
```
 2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z  5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z       |                             ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z  5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR has fixed them and turned the warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
99c1a6bdd9 [SymmMem] Find NVSHMEM from system installation (#157513)
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157513
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-04 03:34:44 +00:00
8f0998aafe Check F2C BLAS for OpenBLAS and other vendors (#143846)
This issue came from https://github.com/conda-forge/pytorch-cpu-feedstock/issues/180. MKL follows the F2C convention for returning single precision floats as doubles and uses the G77 convention for returning complex valued scalars. OpenBLAS does the opposite. There is a check for this already, but it's done only when the Generic BLAS vendor code path is used and this PR moves that code to `Dependencies.cmake` to make it work when the BLAS vendor is OpenBLAS and others

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143846
Approved by: https://github.com/rgommers, https://github.com/atalman
2025-07-01 05:56:24 +00:00
5e18bc3331 [PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
Pytorch build is failing on power system from this commit ec24f8f58a74502c5a2488f5d9e85a817616dda0

***Build Failure Logs***

**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
  302 |     if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
  303 |         (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~

```

**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
   19 | class Vectorized<ComplexFlt> {
      |       ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’

aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  633 |   auto abs_a = a.abs_2_();
      |                  ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  634 |   auto abs_b = b.abs_2_();
      |                  ^~~~~~

/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  666 |       vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  673 |       vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
      |                 ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  680 |       vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```

***With  this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
  -- USE_TENSORPIPE=True
  -- USE_GLOO=True
  -- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc

```

This patch will fix the pytorch build issue on power, and i am able to build successfully.

Hi @malfet  @albanD

Please review this PR for pytorch build issue that we are observing on power.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-30 17:54:37 +00:00
772d590415 [CUTLASS] [CUDA] SM100 GroupMM (#156203)
Closes https://github.com/pytorch/pytorch/issues/156202

PR adds blackwell support for GroupMM

Most of the code that is used for SM90 can be reused, kernel schedule has to be changed in accordance with https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html

Did some preliminary benchmarking of H200 vs B200

Script
```py
import torch
print(torch.__file__)
device = torch.device("cuda")
dtype = torch.bfloat16

shapes = [
    (16, 128000, 7168, 7168),
    (128, 1, 2048, 7168)
]

for batch, M, N, K in shapes:
    a = torch.randn(batch, M, K, device=device, dtype=dtype)
    b = torch.randn(batch, N, K, device=device, dtype=dtype)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for i in range(5): c = torch._grouped_mm(a, b)

    num_iter = 50
    start_event.record()

    for i in range(num_iter): c = torch._grouped_mm(a, b)
    end_event.record()

    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iter
    print(f"batch: {batch}\tM: {M}\tN: {N}\tK: {K}")
    print(f"Time per Iteration:\t {avg_time_ms:.4f} ms")
```

On H200
```
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 298.6668 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 4.1462 ms
```

B200
```
batch: 16       M: 128000       N: 7168 K: 7168
Time per Iteration:      190.7458 ms
batch: 128      M: 1    N: 2048 K: 7168
Time per Iteration:      3.0680 ms
```
nsys nvprof
```
root@16930b42ffc6:/workspace/pytorch# nsys nvprof python gemm_test.py
WARNING: python and any of its children processes will be profiled.

Collecting data...
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 192.6420 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 1.2255 ms
Generating '/tmp/nsys-report-6a53.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /workspace/pytorch/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)                 Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ---------------------------------
     98.9      10586895744          2  5293447872.0  5293447872.0  73786464  10513109280  7381715954.2  cudaDeviceSynchronize
      1.0        104084608          5    20816921.6    33552480.0    100800     34786208    18048125.3  cudaMalloc
      0.1          5694304          4     1423576.0     1416656.0   1258560      1602432      181668.1  cudaGetDeviceProperties_v2_v12000
      0.1          5430496        130       41773.0        4560.0      2496      3854368      345761.8  cudaLaunchKernel
      0.0           587584        110        5341.7        4992.0      4224        16992        1482.0  cudaLaunchKernelExC_v11060
      0.0           119200        660         180.6         128.0        96         4128         206.7  cudaGetDriverEntryPoint_v11030
      0.0            68352        660         103.6          64.0        32         4928         224.6  cuTensorMapEncodeTiled
      0.0            34976         49         713.8         224.0       160         6720        1343.4  cudaStreamIsCapturing_v10000
      0.0            32992          4        8248.0        7456.0      4128        13952        4804.4  cudaEventRecord
      0.0            16928          4        4232.0        3600.0      1728         8000        2764.7  cudaEventQuery
      0.0            16288          4        4072.0        3568.0      1952         7200        2396.1  cudaEventCreateWithFlags
      0.0            13632          4        3408.0        2672.0       544         7744        3408.7  cudaEventDestroy
      0.0             1056          1        1056.0        1056.0      1056         1056           0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     99.0      10549232845         55  191804233.5  192944479.0  165746368  203645313    5353204.3  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.6         67327135         55    1224129.7    1330656.0     924320    1364928     182180.4  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.3         34854783         20    1742739.1    1597856.0      10080    3899616     818421.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.0           354880        110       3226.2       3296.0       1920       4160        554.4  void at::cuda::detail::prepare_grouped_gemm_data<cutlass::bfloat16_t, cutlass::bfloat16_t, cutlass:…
```

The kernel names are too long to be shown via nvprof, I pasted this from nsight systems
```
small kernel 1SM
100.0%	1.286 ms	1	1.286 ms	1.286 ms	1.286 ms	1.286 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)3, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)128, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)

large kernel 2SM
100.0%	194.178 ms	1	194.178 ms	194.178 ms	194.178 ms	194.178 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)5, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)256>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_2x1SM_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)256, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156203
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg
2025-06-28 23:02:00 +00:00
cdb144fcf0 Display a warning when overwriting CMAKE_CUDA_ARCHITECTURES (#156123)
Really, pytorch shoudn't be messing with basic _global_ cmake configuration like this, but without a careful analysis what all depends on this behaviour, I'm not confident to propose a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-06-28 11:22:09 +00:00
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
19f851ce10 Revert "Simplify nvtx3 CMake handling, always use nvtx3 (#153784)"
This reverts commit 099d0d6121125062ebc05771c8330cb7cd8d053a.

Reverted https://github.com/pytorch/pytorch/pull/153784 on behalf of https://github.com/Camyll due to breaking internal tests and cuda 12.4 builds still used in CI ([comment](https://github.com/pytorch/pytorch/pull/153784#issuecomment-3001702310))
2025-06-24 20:02:07 +00:00
cyy
ce1a07570d Fix TORCH_CUDA_ARCH_LIST (#156667)
Before the fix, `TORCH_CUDA_ARCH_LIST` variable contains string `TORCH_CUDA_ARCH_LIST`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156667
Approved by: https://github.com/ngimel
2025-06-24 07:27:53 +00:00
b2d473c8f8 [ROCm][Windows] Fix rocsolver undefined symbol error (#156591)
Fix undefined symbol error while using `rocsolver_ssyevd_strided_batched` call in `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156591
Approved by: https://github.com/jeffdaily
2025-06-24 03:28:45 +00:00
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae380d80efbaf140a95abfc5d96e5100.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
4f70fbbd16 Revert "Use CMake wholearchive group (#156393)"
This reverts commit d1b4e0fa9a5feb22fc6de1d36dc4c9dac685caed.

Reverted https://github.com/pytorch/pytorch/pull/156393 on behalf of https://github.com/etaf due to This PR is breaking XPU windows build. ([comment](https://github.com/pytorch/pytorch/pull/156393#issuecomment-2995576362))
2025-06-23 09:03:19 +00:00
cyy
d1b4e0fa9a Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-06-23 06:22:34 +00:00
cyy
099d0d6121 Simplify nvtx3 CMake handling, always use nvtx3 (#153784)
Fall back to third-party NVTX3 if system NVTX3 doesn't exist. We also reuse the `CUDA::nvtx3` target for better interoperability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153784
Approved by: https://github.com/ezyang
2025-06-23 06:12:46 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
ee56e9f8a8 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
2025-06-21 03:02:02 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out Cutlass 3.x API and it's impressive (rated specs is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

Accodring to [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?

Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than default configs, but still far away from the config above for compute-bound

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
1cce73b5f4 [build] Change --cmake{,-only} arguments to envvars to support modern Python build frontend (#156045)
See also:

- #156029
- #156027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156045
Approved by: https://github.com/ezyang
ghstack dependencies: #156040, #156041
2025-06-17 11:40:24 +00:00
cyy
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as CUDA::nvperf_host so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
ab56e5add9 [CUDA][BUILD] Add back the capability to use env TORCH_CUDA_ARCH_LIST (#155314)
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.

Follow up to: https://github.com/pytorch/pytorch/pull/152715

Before this PR,
On a CPU only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case there would be no GPU detected and an excessive list of cuda architectures may be used. This also means that there is no way to build a binary that would be targeting a different SM on the current machine a developer is using.

After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.

p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
2025-06-07 15:52:39 +00:00