43390d8b13
ROCm Sparsity through HipSparseLT ( #150578 )
...
TLDR:
- This pull request introduces support for hipSPARSELt in ROCm; the current use case is semi-structured sparsity.
- Requires **ROCm 6.4** and **gfx942/gfx950**.
- The average performance uplift (compared to the dense operation) is ~20% on ROCm 6.4, with further gains expected along the way.
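For context, a minimal sketch of how semi-structured (2:4) sparsity is driven from Python via the existing `torch.sparse.to_sparse_semi_structured` API, assuming a supported ROCm 6.4 / gfx942+ build; exact shape and dtype constraints depend on the backend:

```py
import torch
from torch.sparse import to_sparse_semi_structured

# build a tensor with a 2:4 pattern: two zeros in every group of four elements
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()  # 128 x 128
B = torch.rand(128, 64).half().cuda()

A_sparse = to_sparse_semi_structured(A)  # compressed values + metadata
out = torch.mm(A_sparse, B)  # dispatches to hipSPARSELt on supported ROCm builds
torch.testing.assert_close(out, torch.mm(A, B), rtol=1e-3, atol=1e-3)
```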
### Dense vs. Sparse Performance Comparison
#### **NT (Row-major)**
**Average Uplift**: `1.20`
| M | N | K | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8 | 4096 | 20.05 | 25.3 | 1.26 |
| 4096 | 8 | 14336 | 21.07 | 25.28 | 1.20 |
| 3072 | 3072 | 10240 | 299.05 | 351.82 | 1.18 |
| 3072 | 1536 | 768 | 18.56 | 20.05 | 1.08 |
| 3072 | 17664 | 768 | 163.13 | 173.91 | 1.07 |
| 3072 | 196608 | 768 | 1717.30 | 1949.63 | 1.14 |
| 3072 | 24576 | 768 | 206.84 | 242.98 | 1.17 |
| 3072 | 6144 | 768 | 53.90 | 56.88 | 1.06 |
| 3072 | 98304 | 768 | 833.77 | 962.28 | 1.15 |
| 768 | 1536 | 768 | 8.53 | 19.65 | 2.30 |
| 768 | 17664 | 768 | 46.02 | 46.84 | 1.02 |
| 768 | 196608 | 768 | 463.15 | 540.46 | 1.17 |
| 768 | 24576 | 768 | 54.32 | 59.55 | 1.10 |
| 768 | 6144 | 768 | 19.47 | 20.15 | 1.03 |
| 768 | 98304 | 768 | 231.88 | 258.73 | 1.12 |
---
#### **NN (Row-major)**
**Average Uplift**: `1.13`
| M | N | K | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536 | 3072 | 27.50 | 28.78 | 1.05 |
| 768 | 17664 | 3072 | 125.06 | 158.94 | 1.27 |
| 768 | 196608 | 3072 | 1568.38 | 1767.12 | 1.13 |
| 768 | 24576 | 3072 | 171.05 | 203.49 | 1.19 |
| 768 | 6144 | 3072 | 58.72 | 60.39 | 1.03 |
| 768 | 98304 | 3072 | 787.15 | 887.60 | 1.13 |
-------------------------
This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.
### ROCm and hipSPARSELt Support:
* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)
### Codebase Updates:
* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)
### Test Suite Updates:
* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)
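A hedged sketch of the kind of availability gating the tests add (this assumes the `torch.backends.cusparselt` probe, which the cuSPARSELt/hipSPARSELt backends share; the exact helper names in the test file may differ):

```py
import unittest
import torch

# hypothetical gate mirroring what the tests check: hipSPARSELt needs a ROCm
# build and backend support
HAS_HIPSPARSELT = torch.version.hip is not None and torch.backends.cusparselt.is_available()

@unittest.skipUnless(HAS_HIPSPARSELT, "hipSPARSELt not available")
class TestSemiStructuredSparseROCm(unittest.TestCase):
    def test_backend_available(self):
        self.assertTrue(HAS_HIPSPARSELT)
```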
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
ad26ec6abe
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
108422ac26
Revert "Use 3.27 as the minimum CMake version ( #153153 )"
...
This reverts commit 78624679a876a21acb14bf075ba6beccff21b9a0.
Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799 ))
2025-05-31 00:28:03 +00:00
78624679a8
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
22641f42b6
[Binary-builds]Use System NCCL by default in CI/CD. ( #152835 )
...
Use system NCCL by default. The correct NCCL version is already built into the Manylinux Docker image.
Will follow up with a PR on detecting whether the user has NCCL installed, and enabling USE_SYSTEM_NCCL by default in that case.
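A quick way to confirm which NCCL a given build ended up with, using the standard `torch.cuda.nccl` API:

```py
import torch

# prints the NCCL version tuple linked into this build, e.g. (2, 21, 5)
print(torch.cuda.nccl.version())
```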
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
2025-05-30 18:51:48 +00:00
7e8532077f
Revert "Use 3.27 as the minimum CMake version ( #153153 )"
...
This reverts commit 1ece53b157db4425ad12cae31fb570c591dc19e7.
Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830 ))
2025-05-30 13:16:33 +00:00
1ece53b157
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
66f53889d5
[nativert] port semaphore to c10 util ( #153504 )
...
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md
To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.
This diff adds a simple semaphore interface to c10 until we move to C++20, which provides `std::counting_semaphore`.
We're going to need an OSS build export to take a look at this...
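For readers unfamiliar with the interface being added, a minimal Python analogue of counting-semaphore semantics (illustrative only; the actual c10 class is C++):

```py
import threading

sem = threading.Semaphore(value=0)  # counting semaphore with initial count 0

def worker():
    sem.acquire()  # blocks until a permit is available
    print("permit acquired")

t = threading.Thread(target=worker)
t.start()
sem.release()  # add one permit, unblocking the worker
t.join()
```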
Test Plan: CI
Differential Revision: D73882656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
1bebe0424e
Fix platform detection in MKLDNN CMake file ( #142067 )
...
When building PyTorch with `USE_XPU=True` and Clang, the user sees misleading errors caused by incorrect platform detection, which assumed that all users not using the GNU compilers are on Windows. We can fix this by simply using CMake's built-in platform detection variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang , https://github.com/min-jean-cho , https://github.com/guangyey
2025-05-26 06:09:37 +00:00
8fe7ec6721
Add /Zc:preprocessor for torch libraries in MSVC builds ( #147825 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
7421c21b5e
remove unused code. ( #153979 )
...
Remove the unused cmake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
0e5f2339d0
[ROCm][Windows] Run hipcc with compatibility flags. ( #153986 )
...
See also https://github.com/ROCm/TheRock/issues/590 . Including the `-Wno-ignored-attributes` flag here avoids 700MB of warning spam in the logs while compiling, and the `-fms-extensions` flag seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html .
Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com >
Co-authored-by: Scott Todd <scott.todd0@gmail.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007 , https://github.com/jeffdaily
Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com >
2025-05-21 20:26:52 +00:00
daa68e7a93
Update USE_XCCL option if USE_XPU is OFF ( #153936 )
...
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
In particular, ensure that `USE_XCCL` is correctly set to `OFF` when `caffe2_update_option(USE_XPU OFF)` is invoked (cmake/Dependencies.cmake, L97 at commit 1075bb37d3).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
179e7d8624
Fix vs2022 caused AVX512 illegal instruction issue. ( #153480 )
...
Fixes #145702
Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX512 registers to be used on AVX2 machines.
Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459
Local test passed (screenshot in the PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex , https://github.com/cyyever , https://github.com/atalman
2025-05-20 20:37:00 +00:00
05bc78e64f
[submodule] Update fbgemm pinned version ( #153950 )
...
Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751
Included changes:
- Update the fbgemm external dependencies directory in setup.py
- Add the DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec
Test Plan: PyTorch OSS CI
Differential Revision: D75073516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007 , https://github.com/ngimel
2025-05-20 20:24:27 +00:00
2ade886412
[XPU] [Windows] Auto turn on kineto XPU build when compiler version support. ( #153681 )
...
Since SYCL compiler 20250101, the dependency on the Level Zero header has been removed, so we can turn on Kineto XPU by default.
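With Kineto XPU enabled, profiling XPU kernels goes through the standard profiler API; a sketch, assuming an XPU device and a build that exposes `ProfilerActivity.XPU`:

```py
import torch
from torch.profiler import profile, ProfilerActivity

# assumes an XPU device and a Kineto-XPU-enabled build
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    x = torch.randn(1024, 1024, device="xpu")
    (x @ x).sum().item()
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```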
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153681
Approved by: https://github.com/chuanqi129 , https://github.com/cyyever , https://github.com/EikanWang
2025-05-19 03:07:14 +00:00
7ae7324ac4
[submodule] Update google benchmark to v1.9.3 ( #153676 )
...
And remove `include_directories`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153676
Approved by: https://github.com/Skylion007
2025-05-16 23:31:53 +00:00
084c4aa614
Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch ( #153633 )" ( #153656 )"
...
This reverts commit 7ed377f5776578aec4a6a9bc4eeef221a6b80a77.
Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403 ))
2025-05-16 21:00:11 +00:00
7ed377f577
Reapply "Delete TorchScript based Android demo app and point to ExecuTorch ( #153633 )" ( #153656 )
...
This reverts commit ae0e8f0c7316addab3f415dc767a9d34f58b0dae.
Keep android/libs/fbjni because it's being used by other components of
PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
9d3b6ee4c1
[submodule] Update gtest to v1.17.0 ( #153618 )
...
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
d1dd2c1fc8
gloo: cuda ( #153406 )
...
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.
This requires some changes to Gloo, which are in https://github.com/pytorch/gloo/pull/441
Since we're now depending on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (libtorch_cpu) and one with the CUDA kernels (libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes.
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)
    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    start = time.perf_counter()
    for i in range(iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    dur = (time.perf_counter() - start)
    qps = iters/dur
    bandwidth_gb = t.nbytes * iters / dur / 1e9
    gb = t.nbytes / 1e9
    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
e5e06d9cab
[submodule] Update kleidiai to v1.8.0 ( #153592 )
...
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
0ca91af6b8
Define USE_C10D_XCCL and USE_XCCL in pytorch ( #147593 )
...
### Motivation:
Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
By default, `USE_XCCL` is OFF and may be set to ON explicitly.
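Once built with `USE_XCCL=ON`, usage mirrors the NCCL backend; a hedged sketch, assuming Intel GPUs are visible and that the backend registers under the name "xccl":

```py
import torch
import torch.distributed as dist

# run under torchrun; assumes a USE_XCCL=ON build with XPU devices available
dist.init_process_group(backend="xccl")
rank = dist.get_rank()
t = torch.ones(8, device=f"xpu:{rank}")
dist.all_reduce(t)  # each element becomes world_size
```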
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey , https://github.com/malfet , https://github.com/albanD , https://github.com/cyyever
2025-05-15 05:39:00 +00:00
7d39e73c57
Fix more URLs ( #153277 )
...
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
9c3cef437c
gloo: support ibverbs in cmake ( #153425 )
...
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend.
Test plan:
```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```
```py
"""
run with:
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""
import os
os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
if rank == 0:
    device = "cpu"
else:
    device = "cuda"
print(device)
t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)
dist.all_reduce(t)
torch.testing.assert_close(t, target)
t = torch.full((10, 100), fill_value=(rank+1), device=device)
if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
torch.testing.assert_close(t, torch.full_like(t, 1))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
15e08f9571
[submodule] Update ONNX to 1.18 ( #152200 )
...
Update ONNX to 1.18.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby , https://github.com/malfet
2025-05-13 04:18:45 +00:00
f11d7a5978
[ROCm] Update spack includes ( #152569 )
...
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath , https://github.com/jeffdaily
Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com >
2025-05-09 21:36:38 +00:00
642e9305eb
Fixes detection of ArmPL on Linux platform ( #150031 )
...
On Linux, detection failed because the check wasn't looking for `armpl-info`, the only file in ArmPL's `bin` directory on Linux. This change also adds a link to the math library, which is required when checking for LAPACK functions.
Fixes #149610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150031
Approved by: https://github.com/fadara01 , https://github.com/malfet
2025-05-07 19:47:21 +00:00
2f09e79142
Fix Codegen.cmake warning ( #153023 )
...
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
A logical block opening on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)
closes on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)
with mis-matching arguments.
```
by removing the condition in `endif`.
We could instead fix the condition, but that is not best practice: cmake_lint warns about it, and the CMake documentation says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01 , https://github.com/Skylion007
2025-05-07 12:45:20 +00:00
ac792a0dca
[submodule] Bump ITTAPI to 3.25.5 ( #150263 )
...
It hasn't been updated for 3 years. This also removes the CMake 4 workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
2107d87dc9
[BE] remove outdated warning about TORCH_CUDA_ARCH_LIST ( #152715 )
...
I saw this warning when compiling a 3rd-party lib and did not agree with it. I'm not sure of the original reason why we would want to force people to pass TORCH_CUDA_ARCH_LIST to cmake rather than set it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning dates from before 2018!!! 7 years ago! And there are no plans to actually enforce it (nor should there be), so let's remove this misleading warning.
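For reference, the env-var workflow the warning discouraged, plus the call to inspect what an installed build was compiled for:

```py
import os
import torch

# set before building CUDA extensions; picked up by torch.utils.cpp_extension
# and the PyTorch build itself
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;9.0"

# inspect what the installed torch binary was compiled for
print(torch.cuda.get_arch_list())  # e.g. ['sm_80', 'sm_90']
```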
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet , https://github.com/zou3519
2025-05-02 23:00:51 +00:00
07290bdcdc
Skip search for MKL on ARM cpus ( #145850 )
...
It will not find it anyway, and this makes it a bit easier to parse the CMake log on non-x86 systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145850
Approved by: https://github.com/atalman
2025-05-02 18:39:49 +00:00
e9e1aacef8
Enable -Wunused on torch targets ( #150077 )
...
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.
-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.
-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument ),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable ),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function ),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label ), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture ),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef ),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field ),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar ),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value ), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable ).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519 , https://github.com/wdvr
2025-05-02 07:14:19 +00:00
6dadfc4457
Revert "Enable -Wunused on torch targets ( #150077 )"
...
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.
Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828 ))
2025-05-02 06:53:20 +00:00
688adc9941
Enable -Wunused on torch targets ( #150077 )
...
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.
-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.
-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument ),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable ),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function ),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label ), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture ),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef ),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field ),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar ),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value ), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable ).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
e872bf8f88
Avoid linking multiple OMP runtimes in libtorch_cpu.so if BLAS used is OpenBLAS. ( #147725 )
...
When PyTorch is built with OpenBLAS support and libopenblas is directly linked against libgomp.so, libtorch_cpu.so ends up with multiple OpenMP runtimes linked into it. This may result in unexpected runtime behaviour/regressions. This patch fixes the issue by avoiding linking against libomp.so when OpenBLAS is linked against libgomp.so.
Fixes #146603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
2025-04-29 23:39:48 +00:00
c039cb1a06
submodules: point gloo to new home in pytorch/ ( #152438 )
...
Gloo moved to the PyTorch GitHub org. This updates PyTorch to point to the new location.
https://github.com/pytorch/gloo
Test plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152438
Approved by: https://github.com/fduwjj
2025-04-29 20:42:24 +00:00
fcbbb03d48
Extend vec backend with BF16 SVE intrinsics ( #143666 )
...
- Following the work in https://github.com/pytorch/pytorch/pull/119571 , BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors
@aditew01 @maajidkhann please have a look
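The ops covered are exercised straightforwardly from Python; a small timing sketch (CPU, bf16; the stated speedup only applies on SVE-capable AArch64 builds):

```py
import time
import torch

x = torch.randn(4096, 4096, dtype=torch.bfloat16)

start = time.perf_counter()
for _ in range(50):
    torch.nn.functional.silu(x)
    torch.softmax(x, dim=-1)
print(f"{time.perf_counter() - start:.3f}s")  # compare SVE vs. non-SVE builds
```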
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet , https://github.com/aditew01 , https://github.com/nikhil-arm
Co-authored-by: Aditya Tewari <aditya.tewari@arm.com >
2025-04-28 18:25:44 +00:00
7cae7902a2
Add scripts to check xrefs and urls ( #151844 )
...
Traverses the docs and code to find any broken links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151844
Approved by: https://github.com/huydhn
2025-04-28 09:30:07 +00:00
e2f9759bd0
Fix broken URLs ( #152237 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn , https://github.com/malfet
2025-04-27 09:56:42 +00:00
65b845f82b
Remove useless options for third-party ONNX build ( #147616 )
...
Treat ONNX CMake targets properly and remove unneeded options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616
Approved by: https://github.com/malfet
2025-04-26 02:34:08 +00:00
b8f4dc5a9f
[ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs ( #146264 )
...
In this approach, we catch any lanes within a wave that are doing fast atomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
scatter_add performance on MI300x:
| dtype | Baseline (before optimizations) | Opportunistic fastatomics |
|-------|---------------------------------|---------------------------|
| f32 | 1.389425039 | 0.430447996 |
| fp16 | 2.195472956 | 0.779729486 |
| bf16 | 2.194051027 | 0.784599513 |
Using the following reproducer
```
import torch
import triton
def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)
    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```
co-authored by: @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily , https://github.com/mxz297
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com >
2025-04-22 21:55:40 +00:00
b74be52454
[CUDA][NVTX] Move nvtx3 code from cmake/public/cuda.cmake to cmake/Dependencies.cmake ( #151583 )
...
Fixes [#147220 ]
Context: In the CUDA NVTX world, there are NVTX v2 and NVTX v3. As announced in CUDA release notes, e.g. [CUDA 12.8 Update 1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems): "NVTX v2 is deprecated. To migrate to NVTX v3, change your code from `#include <nvtoolsext.h>` to `#include "nvtx3/nvtoolsext.h"`. This header is included in the toolkit."
On the PyTorch side, the TORCH_CUDA_USE_NVTX3 compile-time macro is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) with USE_SYSTEM_NVTX=0 (the default), the torch build process automatically looks for nvtx3 in pytorch/third_party/nvtx; this is the most common and default case. 2) with USE_SYSTEM_NVTX=1, nvtx3 is found in the installed CUDA toolkit (e.g. CUDA 12.8 and even some earlier CUDA versions).
As described in #147220 , the reason it can find pytorch/third_party/nvtx is the `PROJECT_SOURCE_DIR` usage in pytorch/cmake/public/cuda.cmake (L176 at commit 6f035d8462).
Before this PR:
The PyTorch build would succeed in finding nvtx3 through the process described above; everything is good. But downstream projects like torchvision *can* fail, and by default would fail, because the following happen:
1) USE_SYSTEM_NVTX=0 is used (and most likely it is this case because it is the default)
2) NVTX v2 can no longer be found (e.g. future CUDA versions because deprecation would eventually become removal)
3) TorchVision cannot find NVTX3 either, because torchvision invokes pytorch/cmake/public/cuda.cmake but PROJECT_SOURCE_DIR now points at the torchvision source rather than the pytorch source!
4) One workaround is to set USE_SYSTEM_NVTX=1, but users have to set this explicitly and do the plumbing work.
After this PR:
PyTorch can still find nvtx3 because the part of the code that finds nvtx3 is just moved to a new place. The CI logs are showing it being able to find nvtx3. e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05 ), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`"
Torchvision still invokes pytorch/cmake/public/cuda.cmake, but that file no longer tries to find nvtx3, since torchvision does not use nvtx3 (if it does in the future, it can set USE_SYSTEM_NVTX=1 by default). This avoids the error reported in [#147220 ].
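From the Python side, the NVTX annotations that ultimately compile against these headers use the standard `torch.cuda.nvtx` API:

```py
import torch

# NVTX ranges show up as named regions in Nsight/Kineto timelines
torch.cuda.nvtx.range_push("forward")
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.nvtx.range_pop()
```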
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151583
Approved by: https://github.com/eqy , https://github.com/atalman , https://github.com/malfet
2025-04-18 21:18:09 +00:00
331423e5c2
Fix tensorpipe compilation with clang-17 ( #151344 )
...
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
241 | index_ = value_.template Construct(std::forward<Args>(args)...);
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
258 | if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
265 | if (!value_.template Assign(index_, std::forward<T>(value))) {
| ^
3 errors generated.
```
Fixes https://github.com/pytorch/pytorch/issues/151316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi , https://github.com/seemethere
2025-04-15 22:18:06 +00:00
ad5e9065ac
[Profiler/Easy] Remove temp flag for on-demand Memory Snapshot ( #151068 )
...
Summary: Now that the profiler implementation is in, we don't need the temporary flag. Submodule update too.
Test Plan: CI
Reviewed By: sanrise
Differential Revision: D72672186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
78b3d71ece
Docs: Add missing whitespace in the cmake warning message ( #150929 )
...
A trailing whitespace is needed for the message to be concatenated with the following string correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150929
Approved by: https://github.com/Skylion007
2025-04-10 02:50:56 +00:00
ec5f2e3028
[Build] Fix fbgemm build with gcc-12+ ( #150847 )
...
By suppressing more warnings
TODO: fbgemm pin really needs to get updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150847
Approved by: https://github.com/atalman , https://github.com/Skylion007
2025-04-08 16:03:40 +00:00
99c9a31386
[submodule] [Snapshot/Profiler] Memory Snapshot On Demand ( #150559 )
...
Summary:
Profiler side of memory snapshot.
1. Add an API to actually take a snapshot when the client interface is called.
2. Add ifdefs to the builds so that kineto hooks the snapshot correctly.
Design philosophy: there is one interesting part of this implementation, and it is export. For export we call the Python implementation rather than the C++ one, even though we are already in C++, because it is better to have a single export path rather than two. Personally, I want parity between auto-trace and on-demand, so if we can limit the side paths we will have an easier time maintaining this relationship.
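For reference, the auto-trace counterpart that this aims for parity with is driven from Python via the (private but documented) snapshot API; a minimal sketch:

```py
import torch

# record allocator history, run a workload, then export a snapshot that can
# be loaded at https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(max_entries=100000)
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```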
Test Plan: {F1976563426}
Reviewed By: sanrise
Differential Revision: D70733247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
55e62ff74a
bf16 grouped gemm ( #150374 )
...
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.
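The PR text doesn't name the public entry point, so here is only a reference for the semantics grouped gemm implements (one independent bf16 GEMM per group; the real fused kernel computes all groups in a single launch):

```py
import torch

def grouped_mm_reference(As, Bs):
    # semantics only: one independent GEMM per group
    return [a @ b for a, b in zip(As, Bs)]

As = [torch.randn(64, 128, dtype=torch.bfloat16) for _ in range(4)]
Bs = [torch.randn(128, 32, dtype=torch.bfloat16) for _ in range(4)]
outs = grouped_mm_reference(As, Bs)
print([tuple(o.shape) for o in outs])  # [(64, 32)] * 4
```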
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
91666eef60
Update gloo submodule ( #150320 )
...
That updates its minimum CMake version (via https://github.com/facebookincubator/gloo/pull/424 ) and removes the cmake-4.0.0 workarounds for gloo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320
Approved by: https://github.com/atalman
2025-03-31 22:40:27 +00:00