Shared memory is allocated by creating a file in /dev/shm (by default), a filesystem that can run out of space. PyTorch reserves the file size by calling ftruncate(), which creates a sparse file, so the call succeeds even when sufficient disk space is not available.
This can lead to a situation where a shared memory region is created successfully, but a later access to one of its pages raises SIGBUS because the disk is full.
Using posix_fallocate() instead of ftruncate() eliminates this problem: the former always allocates the space and returns an error if the disk is full.
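A minimal Python sketch of the failure mode and the fix described above (the path and size are made up for illustration; this is not PyTorch's implementation, which does the equivalent in C++):
```python
import errno
import os

path = "/dev/shm/example_region"   # hypothetical file name
size = 1 << 30                     # try to reserve 1 GiB

fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
try:
    # ftruncate() only sets the file size; the pages are not backed yet,
    # so this succeeds even if /dev/shm is nearly full, and a later write
    # to the mapped region can die with SIGBUS.
    os.ftruncate(fd, size)

    # posix_fallocate() reserves the space up front and reports ENOSPC
    # instead, which can be handled as an ordinary error.
    try:
        os.posix_fallocate(fd, 0, size)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            print("/dev/shm does not have enough free space")
        else:
            raise
finally:
    os.close(fd)
    os.unlink(path)
```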
Related to https://github.com/pytorch/pytorch/issues/5040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161910
Approved by: https://github.com/mikaylagawarecki
----
- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user-overridable, i.e. even when building for SM_10, one should be able to pass `USE_FBGEMM_GENAI=0` and skip the build
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164165
Approved by: https://github.com/Skylion007
Note. This is a replica PR of #155901 which will be closed. I had to create a new PR in order to add it into my ghstack as there are some later commits which depend on it.
### Summary
🚀 This PR moves the prioritized text linker optimization from setup.py to cmake (and enables it by default on Linux aarch64 systems).
This change consolidates what was previously manual CI logic into a single location (cmake), ensuring consistent behavior across local builds, CI pipelines, and developer environments.
### Motivation
Prioritized text layout has measurable performance benefits on Arm systems by reducing code padding and improving cache utilization. This optimization was previously triggered manually via CI scripts (.ci/aarch64_linux/aarch64_ci_build.sh) or user-set environment variables. By detecting the target architecture within setup.py, this change enables the optimization automatically where applicable, improving maintainability and usability.
Note:
Due to ninja/cmake graph-generation issues we cannot apply the linker file globally to all targets, so the targets must be defined manually. See CMakeLists.txt: the main libraries torch_python, torch, torch_cpu, torch_cuda, and torch_xpu have been targeted, which should be enough to maintain the performance benefits outlined above.
Co-authored-by: Usamah Zaheer <usamah.zaheer@arm.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160078
Approved by: https://github.com/seemethere
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.11b:
* Invoke AITER Assembly kernels on gfx942/gfx950 when inputs meet requirements
- AITER ASM kernels deliver over 500 TFLOPS of training performance. See the
[AOTriton 0.11b Release Page](https://github.com/ROCm/aotriton/releases/tag/0.11b) for more
details.
* Now returns a natural-log-based `logsumexp` tensor, matching CUDA's behavior
- PR #156903 is reverted in this PR as well since it is not needed anymore.
* Enables `CausalVariant.LOWER_RIGHT`
The build system changes significantly, along with the new packaging scheme of AOTriton 0.11:
* AOTriton 0.11 packs GPU images separately from AOTriton runtime
* `aotriton.cmake` now selectively downloads image packs according to
`PYTORCH_ROCM_ARCH`
* `aotriton.cmake` now only uses a pre-compiled runtime library that exactly
matches the ROCm version in the build environment. For PyTorch builds with ROCm
versions not listed in the file, the build process builds the AOTriton
runtime from source, without GPU images.
- This avoids any further ABI breaks like ROCm 6.4 -> 7.0
- Recursive git clone is disabled, since building the AOTriton runtime does not
require submodules.
Bug fixes:
* Fix a kernel bug introduced when implementing SWA
Known Problems:
* The gfx1100 target (Radeon RX 7000 Series) is moved back to experimental status
due to accuracy issues. Triton compiler fixes are needed to restore full
support.
* Enabling TF32 tests affects the accuracy of later non-TF32 tests on ROCm 7.0.
This issue is under investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161754
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
This pull request adds the following ops for sparse matrices using the Eigen library:
```python
add(a_csr, b_csr)
add(a_csc, b_csc)
addmm(c_csr, a_csr, b_csr)
addmm(c_csr, a_csr, b_csc)
addmm(c_csr, a_csc, b_csc)
addmm(c_csr, a_csc, b_csr)
addmm(c_csc, a_csr, b_csr)
addmm(c_csc, a_csr, b_csc)
addmm(c_csc, a_csc, b_csc)
addmm(c_csc, a_csc, b_csr)
```
Currently, the operations for sparse matrices on CPU are available through MKL only. Because MKL does not exist on `aarch64`, these ops are unavailable on any machine with an ARM-based CPU, including Apple Silicon, AWS Graviton, and NVIDIA Grace. This PR addresses the issue by using Eigen as a backend for the ops above.
This is a refactored version of my previous PR #101814. The main difference from the old one is that it does not enable Eigen by default.
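A minimal usage sketch of the ops above (the values are made up; whether these calls succeed on a given machine depends on the sparse backend available in the build, i.e. MKL or, with this PR, Eigen):
```python
import torch

# Two small sparse matrices in CSR layout (CSC works via to_sparse_csc()).
a_csr = torch.tensor([[1.0, 0.0], [0.0, 2.0]]).to_sparse_csr()
b_csr = torch.tensor([[0.0, 3.0], [4.0, 0.0]]).to_sparse_csr()

# Element-wise addition of two CSR matrices: add(a_csr, b_csr).
sum_csr = torch.add(a_csr, b_csr)

# addmm(c_csr, a_csr, b_csr): c + a @ b with sparse inputs and a sparse result.
c_csr = torch.zeros(2, 2).to_sparse_csr()
out = torch.addmm(c_csr, a_csr, b_csr)

print(sum_csr.to_dense())
print(out.to_dense())
```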
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155357
Approved by: https://github.com/pearu, https://github.com/eqy
Co-authored-by: Eli Uriegas <eliuriegas@meta.com>
See https://cmake.org/cmake/help/latest/command/file.html#path-conversion. Paths stored in environment variables may use `/` or `\` (e.g. on Windows), while cmake-style paths always use `/`.
This fixes configure errors like:
```
CMake Error at D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2 (set):
Syntax error in cmake code at
D:/b/pytorch_main/build/CMakeFiles/CMakeScratch/TryCompile-srhq07/CMakeLists.txt:2
when parsing string
D:\projects\TheRock\external-builds\pytorch\.venv\Lib\site-packages\_rocm_sdk_devel/cmake/;D:/b/pytorch_main/cmake/Modules
Invalid character escape '\p'.
CMake Error at D:/projects/TheRock/external-builds/pytorch/.venv/Lib/site-packages/cmake/data/share/cmake-3.31/Modules/Internal/CheckSourceCompiles.cmake:108 (try_compile):
Failed to configure test project build system.
```
(note the mixed usage of `\` and `/` in that string)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159080
Approved by: https://github.com/jeffdaily
Merge the recent commits of FBGEMM and remove unnecessary CMake code.
Specifically, we
1. enable `fbgemm_autovec` since the target is now correctly handled.
2. remove the option `USE_FAKELOWP`, which is not used.
3. remove the `CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS` check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158210
Approved by: https://github.com/q10
Refactors how the enablement/disablement of CK GEMMs and SDPA works.
- Adds the USE_ROCM_CK_GEMM compile flag for enabling CK GEMMs.
- USE_ROCM_CK_GEMM is set to True by default on Linux.
- Renames USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default.
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release.)
- Prevents these CK libraries from being used unless PyTorch has been built with the functionality AND is running on a system architecture that supports it.
- The getters for these library backends also perform validity checking in case the user changed the backend via an environment variable. If the selection is invalid (i.e. one of the cases mentioned above is false), the backend is reset to the current non-CK default; see the sketch below.
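A hedged sketch of how a user might opt into the CK BLAS backend at runtime on such a build. The `torch.backends.cuda.preferred_blas_library` setter/getter exists in recent PyTorch, but whether the `"ck"` value is accepted depends on the build flags above; treat the exact string as an assumption to verify against your version:
```python
import torch

# Assumption: "ck" is only a valid choice when PyTorch was built with
# USE_ROCM_CK_GEMM=1 on a supported ROCm architecture.
try:
    torch.backends.cuda.preferred_blas_library("ck")
except Exception as e:
    print(f"CK BLAS backend not available, keeping the default: {e}")

# Reading the preference back goes through the getter, which (per this PR)
# validates the selection and falls back to the non-CK default if invalid.
print(torch.backends.cuda.preferred_blas_library())
```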
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library was linked. However, Gloo has changed such that all users of Gloo must link the native library and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This had been updated when building/linking with vendored Gloo, but not when using system Gloo.
Fixes: #146239
Reported-by: Adam J Stewart <ajstewart426@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compile with gcc-14.x and newer by including `cstddef` in a few headers
- Avoid using deprecated enums for macOS builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.
---
For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:
```
CMake Error at cmake/Dependencies.cmake:566 (if):
if given arguments:
"NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"
Unknown arguments specified
Call Stack (most recent call first):
CMakeLists.txt:868 (include)
```
Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99