43390d8b13
ROCm Sparsity through HipSparseLT ( #150578 )
...
TLDR:
- This pull request introduces support for hipSPARSELt in ROCm; the current use case is semi-structured sparsity.
- Requires **ROCm 6.4** and **gfx942/gfx950**.
- The average performance uplift (compared to the dense operation) is ~20% on ROCm 6.4, with further gains expected along the way.
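For context, a minimal sketch of how semi-structured (2:4) sparsity is driven from Python via the existing `torch.sparse.to_sparse_semi_structured` API, assuming a supported ROCm 6.4 / gfx942+ build; exact shape and dtype constraints depend on the backend:

```py
import torch
from torch.sparse import to_sparse_semi_structured

# build a tensor with a 2:4 pattern: two zeros in every group of four elements
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()  # 128 x 128
B = torch.rand(128, 64).half().cuda()

A_sparse = to_sparse_semi_structured(A)  # compressed values + metadata
out = torch.mm(A_sparse, B)  # dispatches to hipSPARSELt on supported ROCm builds
torch.testing.assert_close(out, torch.mm(A, B), rtol=1e-3, atol=1e-3)
```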
### Dense vs. Sparse Performance Comparison
#### **NT (Row-major)**
**Average Uplift**: `1.20`
| M | N | K | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-------|--------|--------|-------------------------|-------------------------------|--------|
| 14336 | 8 | 4096 | 20.05 | 25.3 | 1.26 |
| 4096 | 8 | 14336 | 21.07 | 25.28 | 1.20 |
| 3072 | 3072 | 10240 | 299.05 | 351.82 | 1.18 |
| 3072 | 1536 | 768 | 18.56 | 20.05 | 1.08 |
| 3072 | 17664 | 768 | 163.13 | 173.91 | 1.07 |
| 3072 | 196608 | 768 | 1717.30 | 1949.63 | 1.14 |
| 3072 | 24576 | 768 | 206.84 | 242.98 | 1.17 |
| 3072 | 6144 | 768 | 53.90 | 56.88 | 1.06 |
| 3072 | 98304 | 768 | 833.77 | 962.28 | 1.15 |
| 768 | 1536 | 768 | 8.53 | 19.65 | 2.30 |
| 768 | 17664 | 768 | 46.02 | 46.84 | 1.02 |
| 768 | 196608 | 768 | 463.15 | 540.46 | 1.17 |
| 768 | 24576 | 768 | 54.32 | 59.55 | 1.10 |
| 768 | 6144 | 768 | 19.47 | 20.15 | 1.03 |
| 768 | 98304 | 768 | 231.88 | 258.73 | 1.12 |
---
#### **NN (Row-major)**
**Average Uplift**: `1.13`
| M | N | K | hipsparselt-bench (us) | hipblaslt-bench get all (us) | Uplift |
|-----|--------|-------|-------------------------|-------------------------------|--------|
| 768 | 1536 | 3072 | 27.50 | 28.78 | 1.05 |
| 768 | 17664 | 3072 | 125.06 | 158.94 | 1.27 |
| 768 | 196608 | 3072 | 1568.38 | 1767.12 | 1.13 |
| 768 | 24576 | 3072 | 171.05 | 203.49 | 1.19 |
| 768 | 6144 | 3072 | 58.72 | 60.39 | 1.03 |
| 768 | 98304 | 3072 | 787.15 | 887.60 | 1.13 |
-------------------------
This pull request introduces support for hipSPARSELt in ROCm, alongside various updates and improvements to the codebase and test suite. The changes primarily involve adding configuration flags, updating conditional checks, and ensuring compatibility with hipSPARSELt.
### ROCm and hipSPARSELt Support:
* [`BUILD.bazel`](diffhunk://#diff-7fc57714ef13c3325ce2a1130202edced92fcccc0c6db34a72f7b57f60d552a3R292): Added `@AT_HIPSPARSELT_ENABLED@` substitution to enable hipSPARSELt support.
* [`aten/CMakeLists.txt`](diffhunk://#diff-0604597797bb21d7c39150f9429d6b2ace10b79ab308514ad03f76153ae8249bR104-R110): Introduced a conditional flag to enable hipSPARSELt support based on ROCm version.
* [`aten/src/ATen/CMakeLists.txt`](diffhunk://#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777R37): Added `AT_HIPSPARSELT_ENABLED` configuration.
* [`aten/src/ATen/cuda/CUDAConfig.h.in`](diffhunk://#diff-8bb82da825ca87c28233abacffa1b0566c73a54990b7a77f3f5108d3718fea15R11): Defined `AT_HIPSPARSELT_ENABLED` macro.
* `caffe2/CMakeLists.txt`, `cmake/Dependencies.cmake`, `cmake/public/LoadHIP.cmake`: Included hipSPARSELt in the ROCm dependencies. [[1]](diffhunk://#diff-c5ee05f1e918772792ff6f2a3f579fc2f182e57b1709fd786ef6dc711fd68b27R1380) [[2]](diffhunk://#diff-12e8125164bbfc7556b1781a8ed516e333cc0bf058acb7197f7415be44606c72L1084-R1084) [[3]](diffhunk://#diff-b98e27b9a5f196a6965a99ee5a7bb15b3fc633d6375b767635b1b04ccb2fd3d5R153)
### Codebase Updates:
* [`aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp`](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6): Added hipSPARSELt support checks and initialization functions. Updated various methods to conditionally handle hipSPARSELt. [[1]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R1-R6) [[2]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R22-R67) [[3]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R78-R85) [[4]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R97-R109) [[5]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R183-R188) [[6]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L134-R200) [[7]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3R213-R222) [[8]](diffhunk://#diff-ae921dd1584ab98fdd9c25a3521047795de702223f5b65fdaa45a5bd92b4d1f3L217-R285)
### Test Suite Updates:
* [`test/test_sparse_semi_structured.py`](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65): Added checks for hipSPARSELt availability and updated test conditions to skip tests not supported on ROCm. [[1]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR50-R65) [[2]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR228) [[3]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR239) [[4]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR250) [[5]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR579) [[6]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR624) [[7]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR661) [[8]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR695) [[9]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR730) [[10]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR755) [[11]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR771) [[12]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR809) [[13]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR844) [[14]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cL840-R854) [[15]](diffhunk://#diff-b7b57bc1e34145ef89c7929751d5d26aeecc8edfb37da9c60e9d3f0a1335133cR1005)
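A hedged sketch of the kind of availability gating the tests add (this assumes the `torch.backends.cusparselt` probe, which the cuSPARSELt/hipSPARSELt backends share; the exact helper names in the test file may differ):

```py
import unittest
import torch

# hypothetical gate mirroring what the tests check: hipSPARSELt needs a ROCm
# build and backend support
HAS_HIPSPARSELT = torch.version.hip is not None and torch.backends.cusparselt.is_available()

@unittest.skipUnless(HAS_HIPSPARSELT, "hipSPARSELt not available")
class TestSemiStructuredSparseROCm(unittest.TestCase):
    def test_backend_available(self):
        self.assertTrue(HAS_HIPSPARSELT)
```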
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150578
Approved by: https://github.com/jeffdaily
2025-05-31 02:03:40 +00:00
ad26ec6abe
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 01:54:35 +00:00
108422ac26
Revert "Use 3.27 as the minimum CMake version ( #153153 )"
...
This reverts commit 78624679a876a21acb14bf075ba6beccff21b9a0.
Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2923785799 ))
2025-05-31 00:28:03 +00:00
78624679a8
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-31 00:01:52 +00:00
22641f42b6
[Binary-builds]Use System NCCL by default in CI/CD. ( #152835 )
...
Use system NCCL by default. The correct NCCL version is already built into the Manylinux Docker image.
Will follow up with a PR on detecting whether the user has NCCL installed, and enabling USE_SYSTEM_NCCL by default in that case.
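A quick way to confirm which NCCL a given build ended up with, using the standard `torch.cuda.nccl` API:

```py
import torch

# prints the NCCL version tuple linked into this build, e.g. (2, 21, 5)
print(torch.cuda.nccl.version())
```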
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
2025-05-30 18:51:48 +00:00
7e8532077f
Revert "Use 3.27 as the minimum CMake version ( #153153 )"
...
This reverts commit 1ece53b157db4425ad12cae31fb570c591dc19e7.
Reverted https://github.com/pytorch/pytorch/pull/153153 on behalf of https://github.com/cyyever due to It still breaks windows debug builds ([comment](https://github.com/pytorch/pytorch/pull/153153#issuecomment-2922369830 ))
2025-05-30 13:16:33 +00:00
1ece53b157
Use 3.27 as the minimum CMake version ( #153153 )
...
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as `CUDA::nvperf_host`, making it possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783 .
It also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153153
Approved by: https://github.com/malfet
2025-05-30 11:25:30 +00:00
66f53889d5
[nativert] port semaphore to c10 util ( #153504 )
...
Summary:
nativert RFC: https://github.com/zhxchen17/rfcs/blob/master/RFC-0043-torch-native-runtime.md
To land the runtime into PyTorch core, we will gradually land logical parts of the code into the Github issue and get each piece properly reviewed.
This diff adds a simple semaphore interface to c10 until we move to C++20, which provides `std::counting_semaphore`.
We're going to need an OSS build export to take a look at this...
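For readers unfamiliar with the interface being added, a minimal Python analogue of counting-semaphore semantics (illustrative only; the actual c10 class is C++):

```py
import threading

sem = threading.Semaphore(value=0)  # counting semaphore with initial count 0

def worker():
    sem.acquire()  # blocks until a permit is available
    print("permit acquired")

t = threading.Thread(target=worker)
t.start()
sem.release()  # add one permit, unblocking the worker
t.join()
```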
Test Plan: CI
Differential Revision: D73882656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153504
Approved by: https://github.com/zhxchen17
2025-05-28 19:17:30 +00:00
1bebe0424e
Fix platform detection in MKLDNN CMake file ( #142067 )
...
When building PyTorch with `USE_XPU=True` and Clang, the user sees misleading errors caused by incorrect platform detection, which assumed that all users not using the GNU compilers are on Windows. We can fix this by simply using CMake's built-in platform detection variables.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142067
Approved by: https://github.com/EikanWang , https://github.com/min-jean-cho , https://github.com/guangyey
2025-05-26 06:09:37 +00:00
8fe7ec6721
Add /Zc:preprocessor for torch libraries in MSVC builds ( #147825 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147825
Approved by: https://github.com/janeyx99
2025-05-24 06:57:46 +00:00
7421c21b5e
remove unused code. ( #153979 )
...
Remove the unused cmake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153979
Approved by: https://github.com/albanD
2025-05-22 17:50:11 +00:00
0e5f2339d0
[ROCm][Windows] Run hipcc with compatibility flags. ( #153986 )
...
See also https://github.com/ROCm/TheRock/issues/590 . Including the `-Wno-ignored-attributes` flag here avoids 700MB of warning spam in the logs while compiling, and the `-fms-extensions` flag seems beneficial to include: https://clang.llvm.org/docs/MSVCCompatibility.html .
Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com >
Co-authored-by: Scott Todd <scott.todd0@gmail.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153986
Approved by: https://github.com/Skylion007 , https://github.com/jeffdaily
Co-authored-by: Aaryaman Vasishta <jem456.vasishta@gmail.com >
2025-05-21 20:26:52 +00:00
daa68e7a93
Update USE_XCCL option if USE_XPU is OFF ( #153936 )
...
# Motivation
Disable `USE_XCCL` when `USE_XPU` is turned `OFF` to ensure configuration consistency. This is required because XCCL depends on XPU functionality.
In particular, ensure that `USE_XCCL` is correctly set to `OFF` when `caffe2_update_option(USE_XPU OFF)` is invoked (cmake/Dependencies.cmake, L97 at commit 1075bb37d3).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153936
Approved by: https://github.com/Skylion007
2025-05-21 01:32:41 +00:00
179e7d8624
Fix vs2022 caused AVX512 illegal instruction issue. ( #153480 )
...
Fixes #145702
Add `/d2implyavx512upperregs-` to disable an over-aggressive compiler optimization that caused AVX512 registers to be used on AVX2 machines.
Reference to: https://github.com/pytorch/pytorch/issues/145702#issuecomment-2874029459
Local test passed (screenshot in the PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153480
Approved by: https://github.com/Blackhex , https://github.com/cyyever , https://github.com/atalman
2025-05-20 20:37:00 +00:00
05bc78e64f
[submodule] Update fbgemm pinned version ( #153950 )
...
Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751
Included changes:
- Update the fbgemm external dependencies directory in setup.py
- Add the DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec
Test Plan: PyTorch OSS CI
Differential Revision: D75073516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007 , https://github.com/ngimel
2025-05-20 20:24:27 +00:00
2ade886412
[XPU] [Windows] Auto turn on kineto XPU build when compiler version support. ( #153681 )
...
Since SYCL compiler 20250101, the dependency on the Level Zero header has been removed, so we can turn on Kineto XPU by default.
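With Kineto XPU enabled, profiling XPU kernels goes through the standard profiler API; a sketch, assuming an XPU device and a build that exposes `ProfilerActivity.XPU`:

```py
import torch
from torch.profiler import profile, ProfilerActivity

# assumes an XPU device and a Kineto-XPU-enabled build
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    x = torch.randn(1024, 1024, device="xpu")
    (x @ x).sum().item()
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```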
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153681
Approved by: https://github.com/chuanqi129 , https://github.com/cyyever , https://github.com/EikanWang
2025-05-19 03:07:14 +00:00
7ae7324ac4
[submodule] Update google benchmark to v1.9.3 ( #153676 )
...
And remove `include_directories`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153676
Approved by: https://github.com/Skylion007
2025-05-16 23:31:53 +00:00
084c4aa614
Revert "Reapply "Delete TorchScript based Android demo app and point to ExecuTorch ( #153633 )" ( #153656 )"
...
This reverts commit 7ed377f5776578aec4a6a9bc4eeef221a6b80a77.
Reverted https://github.com/pytorch/pytorch/pull/153656 on behalf of https://github.com/larryliu0820 due to Still being used internally so can't remove ([comment](https://github.com/pytorch/pytorch/pull/153656#issuecomment-2887665403 ))
2025-05-16 21:00:11 +00:00
7ed377f577
Reapply "Delete TorchScript based Android demo app and point to ExecuTorch ( #153633 )" ( #153656 )
...
This reverts commit ae0e8f0c7316addab3f415dc767a9d34f58b0dae.
Keep android/libs/fbjni because it's being used by other components of
PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153656
Approved by: https://github.com/malfet
2025-05-16 04:35:42 +00:00
9d3b6ee4c1
[submodule] Update gtest to v1.17.0 ( #153618 )
...
And remove some outdated CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153618
Approved by: https://github.com/malfet
2025-05-16 01:24:19 +00:00
d1dd2c1fc8
gloo: cuda ( #153406 )
...
This enables Gloo CUDA when used with a backend that supports GPUDirect, which is currently only the IBVERBS backend.
This requires some changes to Gloo, which are in https://github.com/pytorch/gloo/pull/441
Since we're now depending on gloo_cuda, we need to split ProcessGroupGloo into two pieces: one with the CPU bits (libtorch_cpu) and one with the CUDA kernels (libtorch_cuda). This unfortunately requires some major refactoring, as some CPU code is shared across both.
The gloo submodule is updated to depend on the new Gloo changes.
Test plan:
```py
import os
import time
transport = "TCP"
#transport = "IBVERBS"
os.environ["GLOO_DEVICE_TRANSPORT"] = transport
rank = int(os.environ["RANK"])
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)
ibv = "mlx5_0:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_9:1,mlx5_10:1,mlx5_11:1".split(",")[rank]
ibv_name, ibv_port = ibv.split(":")
os.environ["TORCH_GLOO_IBV_NAME"] = ibv_name
os.environ["TORCH_GLOO_IBV_PORT"] = ibv_port
os.environ["TORCH_GLOO_IBV_INDEX"] = "3"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
# initial sanity check
#device = "cpu"
#t = torch.zeros(10, device=device)
#dist.all_reduce(t)
#print("sanity complete")
device = "cpu"
iters = 10
warmup_iters = 2
for nelem in [10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000]:
    t = torch.zeros(nelem, device=device)
    torch.cuda.current_stream().synchronize()
    for i in range(warmup_iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    start = time.perf_counter()
    for i in range(iters):
        dist.all_reduce(t)
    torch.cuda.current_stream().synchronize()
    dur = (time.perf_counter() - start)
    qps = iters/dur
    bandwidth_gb = t.nbytes * iters / dur / 1e9
    gb = t.nbytes / 1e9
    if rank == 0:
        print(f"{transport=} {device=} {iters=} {nelem=} {qps=} {gb=} {bandwidth_gb=}\n", end="")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153406
Approved by: https://github.com/fduwjj
2025-05-16 01:13:13 +00:00
e5e06d9cab
[submodule] Update kleidiai to v1.8.0 ( #153592 )
...
And cleans up some CMake instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153592
Approved by: https://github.com/malfet
2025-05-15 10:14:05 +00:00
0ca91af6b8
Define USE_C10D_XCCL and USE_XCCL in pytorch ( #147593 )
...
### Motivation:
Add `USE_XCCL` and `USE_C10D_XCCL` to enable support of XCCL backend building in stock PyTorch, similar to `USE_NCCL` and `USE_C10D_NCCL`.
By default, `USE_XCCL` is OFF and may be set to ON explicitly.
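Once built with `USE_XCCL=ON`, usage mirrors the NCCL backend; a hedged sketch, assuming Intel GPUs are visible and that the backend registers under the name "xccl":

```py
import torch
import torch.distributed as dist

# run under torchrun; assumes a USE_XCCL=ON build with XPU devices available
dist.init_process_group(backend="xccl")
rank = dist.get_rank()
t = torch.ones(8, device=f"xpu:{rank}")
dist.all_reduce(t)  # each element becomes world_size
```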
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147593
Approved by: https://github.com/guangyey , https://github.com/malfet , https://github.com/albanD , https://github.com/cyyever
2025-05-15 05:39:00 +00:00
7d39e73c57
Fix more URLs ( #153277 )
...
Or ignore them.
Found by running the lint_urls.sh script locally with https://github.com/pytorch/pytorch/pull/153246
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153277
Approved by: https://github.com/malfet
2025-05-14 16:23:50 +00:00
9c3cef437c
gloo: support ibverbs in cmake ( #153425 )
...
This updates the gloo submodule in PyTorch to a version that supports the new ibverbs backend.
Test plan:
```
sudo dnf install rdma-core-devel
USE_GLOO_IBVERBS=ON python setup.py develop
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
```
```py
"""
run with:
torchrun --nproc_per_node 2 ~/scripts/gloo_ibverbs_test.py
"""
import os
os.environ["GLOO_DEVICE_TRANSPORT"] = "IBVERBS"
import torch
import torch.distributed as dist
dist.init_process_group("gloo")
rank = dist.get_rank()
if rank == 0:
    device = "cpu"
else:
    device = "cuda"
print(device)
t = torch.full((10, 100), fill_value=(rank+1), device=device)
target = torch.full((10, 100), fill_value=3, device=device)
dist.all_reduce(t)
torch.testing.assert_close(t, target)
t = torch.full((10, 100), fill_value=(rank+1), device=device)
if rank == 0:
    dist.send(t, dst=1)
else:
    dist.recv(t, src=0)
torch.testing.assert_close(t, torch.full_like(t, 1))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153425
Approved by: https://github.com/fduwjj
2025-05-13 17:09:00 +00:00
15e08f9571
[submodule] Update ONNX to 1.18 ( #152200 )
...
Update ONNX to 1.18.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152200
Approved by: https://github.com/justinchuby , https://github.com/malfet
2025-05-13 04:18:45 +00:00
f11d7a5978
[ROCm] Update spack includes ( #152569 )
...
* Cleans up code in `caffe2/CMakeLists.txt` to remove individual ROCm library include paths and use `ROCM_INCLUDE_DIRS` CMake var instead
* `ROCM_INCLUDE_DIRS` CMake var is set in `cmake/public/LoadHIP.cmake` by adding all the ROCm packages that PyTorch depends on
* `rocm_version.h` is provided by the `rocm-core` package, so use the include directory for that component to be compliant with Spack
* Move `find_package_and_print_version(hip REQUIRED CONFIG)` earlier so that `hip_version.h` can be located in the hip package include dir for Spack
* `list(REMOVE_DUPLICATES ROCM_INCLUDE_DIRS)` to remove duplicate `/opt/rocm/include` entries in the non-Spack case
* Remove user-provided env var `ROCM_INCLUDE_DIRS` since `ROCM_PATH` already exists as a user-provided env var, which should be sufficient to locate the include directories for ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152569
Approved by: https://github.com/renjithravindrankannath , https://github.com/jeffdaily
Co-authored-by: Renjith Ravindran <Renjith.RavindranKannath@amd.com >
2025-05-09 21:36:38 +00:00
642e9305eb
Fixes detection of ArmPL on Linux platform ( #150031 )
...
On Linux, detection failed because the check wasn't looking for `armpl-info`, the only file in ArmPL's `bin` directory on Linux. This change also adds a link to the math library, which is required when checking for LAPACK functions.
Fixes #149610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150031
Approved by: https://github.com/fadara01 , https://github.com/malfet
2025-05-07 19:47:21 +00:00
2f09e79142
Fix Codegen.cmake warning ( #153023 )
...
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
A logical block opening on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)
closes on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)
with mis-matching arguments.
```
by removing the condition in `endif`.
We could instead fix the condition, but that is not best practice: cmake_lint warns about it, and the CMake documentation says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01 , https://github.com/Skylion007
2025-05-07 12:45:20 +00:00
ac792a0dca
[submodule] Bump ITTAPI to 3.25.5 ( #150263 )
...
It hasn't been updated for 3 years. This also removes the CMake 4 workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150263
Approved by: https://github.com/sraikund16
2025-05-06 01:02:18 +00:00
2107d87dc9
[BE] remove outdated warning about TORCH_CUDA_ARCH_LIST ( #152715 )
...
I saw this warning when compiling a 3rd-party lib and did not agree with it. I'm not sure of the original reason why we would want to force people to pass TORCH_CUDA_ARCH_LIST to cmake rather than set it as an env var. As a developer, it's much easier to set it as an env var or have it be autodetected. I also realized this warning dates from before 2018!!! 7 years ago! And there are no plans to actually enforce it (nor should there be), so let's remove this misleading warning.
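For reference, the env-var workflow the warning discouraged, plus the call to inspect what an installed build was compiled for:

```py
import os
import torch

# set before building CUDA extensions; picked up by torch.utils.cpp_extension
# and the PyTorch build itself
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0;9.0"

# inspect what the installed torch binary was compiled for
print(torch.cuda.get_arch_list())  # e.g. ['sm_80', 'sm_90']
```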
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152715
Approved by: https://github.com/malfet , https://github.com/zou3519
2025-05-02 23:00:51 +00:00
07290bdcdc
Skip search for MKL on ARM cpus ( #145850 )
...
It will not find it anyway, and this makes it a bit easier to parse the CMake log on non-x86 systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145850
Approved by: https://github.com/atalman
2025-05-02 18:39:49 +00:00
e9e1aacef8
Enable -Wunused on torch targets ( #150077 )
...
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.
-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.
-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument ),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable ),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function ),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label ), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture ),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef ),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field ),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar ),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value ), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable ).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519 , https://github.com/wdvr
2025-05-02 07:14:19 +00:00
6dadfc4457
Revert "Enable -Wunused on torch targets ( #150077 )"
...
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.
Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828 ))
2025-05-02 06:53:20 +00:00
688adc9941
Enable -Wunused on torch targets ( #150077 )
...
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.
-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.
-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.
-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument ),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable ),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function ),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label ), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture ),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef ),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field ),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar ),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value ), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable ).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
e872bf8f88
Avoid linking multiple OMP runtimes in libtorch_cpu.so if BLAS used is OpenBLAS. ( #147725 )
...
When PyTorch is built with OpenBLAS support and libopenblas is directly linked against libgomp.so, libtorch_cpu.so ends up with multiple OpenMP runtimes linked into it. This may result in unexpected runtime behaviour/regressions. This patch fixes the issue by avoiding linking against libomp.so when OpenBLAS is linked against libgomp.so.
Fixes #146603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
2025-04-29 23:39:48 +00:00
c039cb1a06
submodules: point gloo to new home in pytorch/ ( #152438 )
...
Gloo moved to the PyTorch GitHub org. This updates PyTorch to point to the new location.
https://github.com/pytorch/gloo
Test plan:
CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152438
Approved by: https://github.com/fduwjj
2025-04-29 20:42:24 +00:00
fcbbb03d48
Extend vec backend with BF16 SVE intrinsics ( #143666 )
...
- Following the work in https://github.com/pytorch/pytorch/pull/119571 , BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors
@aditew01 @maajidkhann please have a look
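The ops covered are exercised straightforwardly from Python; a small timing sketch (CPU, bf16; the stated speedup only applies on SVE-capable AArch64 builds):

```py
import time
import torch

x = torch.randn(4096, 4096, dtype=torch.bfloat16)

start = time.perf_counter()
for _ in range(50):
    torch.nn.functional.silu(x)
    torch.softmax(x, dim=-1)
print(f"{time.perf_counter() - start:.3f}s")  # compare SVE vs. non-SVE builds
```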
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet , https://github.com/aditew01 , https://github.com/nikhil-arm
Co-authored-by: Aditya Tewari <aditya.tewari@arm.com >
2025-04-28 18:25:44 +00:00
7cae7902a2
Add scripts to check xrefs and urls ( #151844 )
...
Traverses the docs and code to find any broken links
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151844
Approved by: https://github.com/huydhn
2025-04-28 09:30:07 +00:00
e2f9759bd0
Fix broken URLs ( #152237 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn , https://github.com/malfet
2025-04-27 09:56:42 +00:00
65b845f82b
Remove useless options for third-party ONNX build ( #147616 )
...
Treat ONNX CMake targets properly and remove unneeded options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147616
Approved by: https://github.com/malfet
2025-04-26 02:34:08 +00:00
b8f4dc5a9f
[ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs ( #146264 )
...
In this approach, we catch any lanes within a wave that are doing fast atomics to the same destination address and compute the sum on the CU. This leads to a 3x improvement in scatter_add performance and a 2x improvement in index_select.
scatter_add performance on MI300x:
| dtype | Baseline (before optimizations) | Opportunistic fastatomics |
|-------|---------------------------------|---------------------------|
| f32 | 1.389425039 | 0.430447996 |
| fp16 | 2.195472956 | 0.779729486 |
| bf16 | 2.194051027 | 0.784599513 |
Using the following reproducer
```
import torch
import triton
def main():
    dtype = torch.float32
    dim = 1305301
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)
    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```
co-authored by: @amd-hhashemi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily , https://github.com/mxz297
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com >
2025-04-22 21:55:40 +00:00
b74be52454
[CUDA][NVTX] Move nvtx3 code from cmake/public/cuda.cmake to cmake/Dependencies.cmake ( #151583 )
...
Fixes [#147220 ]
Context: In the CUDA NVTX world, there are NVTX v2 and NVTX v3. As announced in CUDA release notes, e.g. [CUDA 12.8 Update 1](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-or-dropped-operating-systems): "NVTX v2 is deprecated. To migrate to NVTX v3, change your code from `#include <nvtoolsext.h>` to `#include "nvtx3/nvtoolsext.h"`. This header is included in the toolkit."
On the PyTorch side, the TORCH_CUDA_USE_NVTX3 compile-time macro is set to true when (most of the time) nvtx3 is found. nvtx3 is found in two cases: 1) with USE_SYSTEM_NVTX=0 (the default), the torch build process automatically looks for nvtx3 in pytorch/third_party/nvtx; this is the most common and default case. 2) with USE_SYSTEM_NVTX=1, nvtx3 is found in the installed CUDA toolkit (e.g. CUDA 12.8 and even some earlier CUDA versions).
As described in #147220 , the reason it can find pytorch/third_party/nvtx is the `PROJECT_SOURCE_DIR` usage in pytorch/cmake/public/cuda.cmake (L176 at commit 6f035d8462).
Before this PR:
The PyTorch build would succeed in finding nvtx3 through the process described above; everything is good. But downstream projects like torchvision *can* fail, and by default would fail, because the following happen:
1) USE_SYSTEM_NVTX=0 is used (and most likely it is this case because it is the default)
2) NVTX v2 can no longer be found (e.g. future CUDA versions because deprecation would eventually become removal)
3) TorchVision cannot find NVTX3 either, because torchvision invokes pytorch/cmake/public/cuda.cmake but PROJECT_SOURCE_DIR now points at the torchvision source rather than the pytorch source!
4) One workaround is to set USE_SYSTEM_NVTX=1, but users have to set this explicitly and do the plumbing work.
After this PR:
PyTorch can still find nvtx3 because the part of the code that finds nvtx3 is just moved to a new place. The CI logs are showing it being able to find nvtx3. e.g. [this job](https://productionresultssa14.blob.core.windows.net/actions-results/47f8efaa-0afe-4e1f-bc94-0a82629941cb/workflow-job-run-dc8201b1-845b-5da1-a6ea-d3360ce1b508/logs/job/job-logs.txt?rsct=text%2Fplain&se=2025-04-18T20%3A38%3A05Z&sig=yMd6egC%2Banl3lR%2BudXFX18bfUH189z0DTGLtscHQJwY%3D&ske=2025-04-19T06%3A21%3A45Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2025-04-18T18%3A21%3A45Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2025-01-05&sp=r&spr=https&sr=b&st=2025-04-18T20%3A28%3A00Z&sv=2025-01-05 ), which reads "`Found nvtx3: C:/actions-runner/_work/pytorch/pytorch/pytorch/third_party/NVTX/c/include`"
Torchvision still invokes pytorch/cmake/public/cuda.cmake, but that file no longer tries to find nvtx3, since torchvision does not use nvtx3 (if it does in the future, it can set USE_SYSTEM_NVTX=1 by default). This avoids the error reported in [#147220 ].
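From the Python side, the NVTX annotations that ultimately compile against these headers use the standard `torch.cuda.nvtx` API:

```py
import torch

# NVTX ranges show up as named regions in Nsight/Kineto timelines
torch.cuda.nvtx.range_push("forward")
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.nvtx.range_pop()
```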
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151583
Approved by: https://github.com/eqy , https://github.com/atalman , https://github.com/malfet
2025-04-18 21:18:09 +00:00
331423e5c2
Fix tensorpipe compilation with clang-17 ( #151344 )
...
By suppressing `missing-template-arg-list-after-template-kw` warning, which seems to be required to compile Google's libnop, which is in a semi-abandoned state now
```
In file included from /Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/base/variant.h:21:
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:241:30: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
241 | index_ = value_.template Construct(std::forward<Args>(args)...);
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:258:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
258 | if (!value_.template Assign(TypeTag<T>{}, index_, std::forward<U>(value))) {
| ^
/Users/malfet/git/pytorch/pytorch/third_party/tensorpipe/third_party/libnop/include/nop/types/variant.h:265:26: error: a template argument list is expected after a name prefixed by the template keyword [-Wmissing-template-arg-list-after-template-kw]
265 | if (!value_.template Assign(index_, std::forward<T>(value))) {
| ^
3 errors generated.
```
Fixes https://github.com/pytorch/pytorch/issues/151316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151344
Approved by: https://github.com/ZainRizvi , https://github.com/seemethere
2025-04-15 22:18:06 +00:00
ad5e9065ac
[Profiler/Easy] Remove temp flag for on-demand Memory Snapshot ( #151068 )
...
Summary: Now that the profiler implementation is in, we don't need the temporary flag. Submodule update too.
Test Plan: CI
Reviewed By: sanrise
Differential Revision: D72672186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151068
Approved by: https://github.com/davidberard98
2025-04-11 18:50:25 +00:00
78b3d71ece
Docs: Add missing whitespace in the cmake warning message ( #150929 )
...
A trailing whitespace is needed for the message to be concatenated with the following string correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150929
Approved by: https://github.com/Skylion007
2025-04-10 02:50:56 +00:00
ec5f2e3028
[Build] Fix fbgemm build with gcc-12+ ( #150847 )
...
By suppressing more warnings
TODO: fbgemm pin really needs to get updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150847
Approved by: https://github.com/atalman , https://github.com/Skylion007
2025-04-08 16:03:40 +00:00
99c9a31386
[submodule] [Snapshot/Profiler] Memory Snapshot On Demand ( #150559 )
...
Summary:
Profiler side of memory snapshot.
1. Add an API to actually take a snapshot when the client interface is called.
2. Add ifdefs to the builds so that kineto hooks the snapshot correctly.
Design philosophy: there is one interesting part of this implementation, and it is export. For export we call the Python implementation rather than the C++ one, even though we are already in C++, because it is better to have a single export path rather than two. Personally, I want parity between auto-trace and on-demand, so if we can limit the side paths we will have an easier time maintaining this relationship.
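For reference, the auto-trace counterpart that this aims for parity with is driven from Python via the (private but documented) snapshot API; a minimal sketch:

```py
import torch

# record allocator history, run a workload, then export a snapshot that can
# be loaded at https://pytorch.org/memory_viz
torch.cuda.memory._record_memory_history(max_entries=100000)
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.memory._dump_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```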
Test Plan: {F1976563426}
Reviewed By: sanrise
Differential Revision: D70733247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150559
Approved by: https://github.com/sanrise
2025-04-07 13:04:38 +00:00
55e62ff74a
bf16 grouped gemm ( #150374 )
...
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.
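The PR text doesn't name the public entry point, so here is only a reference for the semantics grouped gemm implements (one independent bf16 GEMM per group; the real fused kernel computes all groups in a single launch):

```py
import torch

def grouped_mm_reference(As, Bs):
    # semantics only: one independent GEMM per group
    return [a @ b for a, b in zip(As, Bs)]

As = [torch.randn(64, 128, dtype=torch.bfloat16) for _ in range(4)]
Bs = [torch.randn(128, 32, dtype=torch.bfloat16) for _ in range(4)]
outs = grouped_mm_reference(As, Bs)
print([tuple(o.shape) for o in outs])  # [(64, 32)] * 4
```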
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
91666eef60
Update gloo submodule ( #150320 )
...
That updates its minimum CMake version (via https://github.com/facebookincubator/gloo/pull/424 ) and removes the cmake-4.0.0 workarounds for gloo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150320
Approved by: https://github.com/atalman
2025-03-31 22:40:27 +00:00