This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.
---
For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:
```
CMake Error at cmake/Dependencies.cmake:566 (if):
if given arguments:
"NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"
Unknown arguments specified
Call Stack (most recent call first):
CMakeLists.txt:868 (include)
```
Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION` as aotriton depnds on hip. Hip loosely track the rocm major version, but the two are not actually synchronized as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what was thereason behind adding the warning but not erroring on on it when building with -Werror introduced by https://github.com/pytorch/pytorch/pull/140236 )
300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that needs updates:
- TensorPipe
- LLVM
- FBGEMM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
Fixes following issue of building PyTorch with ROCm7.0:
```
-- verifying file...
file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
does not match expected value
expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
Each download failed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
This PR fixes #157354
It fixes the issue in 'cmake/public/cuda.cmake' where a diagnostic message incorrectly showed an empty CUDA version when 'FindCUDA' and header-reported versions differed.
The problem was caused by this line:
set(${cuda_version_from_findcuda} ${CUDA_VERSION_STRING})
This incorrectly used the value of cuda_version_from_findcuda as a variable name. As a result the version string wasn't assigned and the error message omitted the version. This has been corrected to:
set(cuda_version_from_findcuda ${CUDA_VERSION_STRING})
Now the diagnostic message properly displays the CUDA version reported by FindCUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157370
Approved by: https://github.com/soulitzer
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.
Tested against system install location: /usr/local/lib and /usr/local/include.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
Pytorch build is failing on power system from this commit ec24f8f58a74502c5a2488f5d9e85a817616dda0
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***With this changes build logs***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch will fix the pytorch build issue on power, and i am able to build successfully.
Hi @malfet @albanD
Please review this PR for pytorch build issue that we are observing on power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
Really, pytorch shoudn't be messing with basic _global_ cmake configuration like this, but without a careful analysis what all depends on this behaviour, I'm not confident to propose a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix#154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.
Remove eigen submodule and replace it with eigen_pin.txt
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen ever and gitlab can be down when github is up, which causes spurious CI failures in the past, for example.
Remove eigen submodule and replace it with eigen_pin.txt
Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:
* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
+ Without this optimization the binary size of `libaotriton.so` could be
over 100MiB due to 2x more supported architectures compared with 0.9b.
Now it is only about 11MiB.
* Support sliding window attention (SWA) in
`_flash_attention_forward/backward`. Should fix#154582
See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.
Notable changes to SDPA backend:
* `std::optional<int64_t>` `window_size_left/right` are directly passed to
ROCM's SDPA backend, because the default value `-1` is meaningful to
AOTriton's backend and bottom-right aligned causal mask is implemented with
negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out Cutlass 3.x API and it's impressive (rated specs is 419 TFLOPS)
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` due to default kernel schedule does not support `<64,128,128>`. I will work a bit on improve perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64
Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
scale_A = torch.ones(M, 1).cuda()
scale_B = torch.ones(1, N).cuda()
out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
torch.testing.assert_close(out, out_ref)
latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
tflops = (2 * M * N * K) / latency_us / 1e9
print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS
Accodring to [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS
Small M slightly improves, but large M is still bad.
If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.
To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than current configs, but still very far away from hardware limit.
@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?
Update: Using triton codegen-ed from inductor `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than default configs, but still far away from the config above for compute-bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Reland of #153153, which was incidentally closed.
Update the minimum CMake version to 3.27 because of it provides more CUDA targets such as CUDA::nvperf_host so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
It's also possible to facilitate future third-party updates such as FBGEMM (its current shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
Add back the capability to use env TORCH_CUDA_ARCH_LIST to control how downstream projects (which uses find_package (torch)) build.
Follow up to: https://github.com/pytorch/pytorch/pull/152715
Before this PR,
On a CPU only machine, building a downstream project would ignore the TORCH_CUDA_ARCH_LIST setting (if set) and go straight to the auto GPU detection mode, in which case there would be no GPU detected and an excessive list of cuda architectures may be used. This also means that there is no way to build a binary that would be targeting a different SM on the current machine a developer is using.
After this PR,
TORCH_CUDA_ARCH_LIST is effective for developers to control explicitly which SM architectures to build.
p.s. I think this PR might have been the original intent of https://github.com/pytorch/pytorch/pull/152715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155314
Approved by: https://github.com/janeyx99, https://github.com/eqy, https://github.com/atalman
* The `rocm-core` CMake package only started appearing in ROCm 6.4, so rework the version probing to work if it is not present. Also collapses the unneeded operating system conditioning in favor of feature probing.
* Make `hipsparselt` optional: it only started appearing in ROCm 6.4 and it is not in all recent distribution channels yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155305
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
In OneDNN v3.7, SDPA has below defects:
1. The dtype of intermediate value is the same as QKV, while Pytorch uses FP32 dtype for intermediate value to make sure better accuracy.
2. Only support headdim size <= 256.
3. Don't support implict causal mask when QKV is FP32. We need to build an attention mask explicitly with aten ops.
In OneDNN v3.8, they have update for these defects. Since these are tiny changes, I decided to put them in single PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg