Fix incorrect linking of Gloo's libraries when building with system Gloo. Previously, either Gloo's native library or Gloo's CUDA library was linked. However, Gloo has changed such that all users of Gloo must link the native library and can optionally link the CUDA or HIP library for Gloo + CUDA/HIP support.
This was already updated for building/linking with vendored Gloo, but not for system Gloo.
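The linking rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the CMake code touched by this change; the function name and the library names `gloo`, `gloo_cuda`, and `gloo_hip` are assumptions made for the example.

```python
from typing import List


def gloo_link_targets(use_cuda: bool = False, use_rocm: bool = False) -> List[str]:
    """Return the Gloo libraries to link (names are illustrative assumptions)."""
    if use_cuda and use_rocm:
        raise ValueError("choose at most one of CUDA and ROCm/HIP")
    # The native library is required for every user of Gloo.
    targets = ["gloo"]
    # The accelerator library is an optional addition, never a replacement.
    if use_cuda:
        targets.append("gloo_cuda")
    elif use_rocm:
        targets.append("gloo_hip")
    return targets


print(gloo_link_targets())               # CPU-only build
print(gloo_link_targets(use_cuda=True))  # native + CUDA
```

The old behavior corresponds to returning only one of the two libraries, which drops the native library whenever CUDA is enabled.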
Fixes: #146239
Reported-by: Adam J Stewart <ajstewart426@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146637
Approved by: https://github.com/malfet
The PyTorch build has been failing on Power systems since commit ec24f8f58a74502c5a2488f5d9e85a817616dda0.
***Build Failure Logs***
**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
302 | if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
| ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
303 | (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
| ^~~~~~~~~~~~~~~~~~~~~~~~
```
**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
19 | class Vectorized<ComplexFlt> {
| ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
633 | auto abs_a = a.abs_2_();
| ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
634 | auto abs_b = b.abs_2_();
| ^~~~~~
/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
666 | vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
673 | vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
| ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
680 | vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```
***Build logs with this change***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
-- USE_TENSORPIPE=True
-- USE_GLOO=True
-- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc
```
This patch fixes the PyTorch build issue on Power; with it, I am able to build successfully.
Hi @malfet @albanD
Please review this PR for the PyTorch build issue that we are observing on Power.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
In oneDNN v3.7, SDPA has the following defects:
1. The dtype of intermediate values is the same as QKV, while PyTorch uses FP32 for intermediate values to ensure better accuracy.
2. Only head dimension sizes <= 256 are supported.
3. An implicit causal mask is not supported when QKV is FP32, so we need to build an attention mask explicitly with aten ops.
oneDNN v3.8 addresses these defects. Since the corresponding changes are small, I decided to put them in a single PR.
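To make the third point concrete, here is a minimal pure-Python sketch of what an explicit causal mask contains. It is not the aten-based code from this PR; the function name and the nested-list representation are assumptions chosen so the example runs without PyTorch.

```python
from typing import List


def causal_mask(seq_len_q: int, seq_len_k: int) -> List[List[bool]]:
    """Build a lower-triangular mask: entry [i][j] is True when query
    position i is allowed to attend to key position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len_k)] for i in range(seq_len_q)]


for row in causal_mask(3, 3):
    print(row)
```

With an implicit causal mask the backend applies this pattern internally, so no mask tensor has to be materialized.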
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
Use system NCCL by default. The correct NCCL version is already built into the Manylinux Docker image.
Will follow up with a PR that detects whether the user has NCCL installed and enables USE_SYSTEM_NCCL by default in that case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152835
Approved by: https://github.com/malfet
When PyTorch is built with OpenBLAS support and libopenblas is directly linked with libgomp.so, libtorch_cpu.so ends up with multiple OpenMP runtimes linked against it. This may result in unexpected runtime behaviour or regressions. This patch fixes this by avoiding linking against libomp.so if OpenBLAS is linked against libgomp.so.
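The check can be illustrated with a small Python sketch that inspects `ldd`-style output for OpenBLAS. This is not the CMake logic of the patch; the function names, the list of runtime names, and the sample output (including its paths and addresses) are assumptions made for the example.

```python
import re
from typing import Set

# Illustrative ldd-style output; paths and addresses are made up.
SAMPLE_LDD = """\
\tlinux-vdso.so.1 (0x00007ffc00000000)
\tlibgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f0000000000)
\tlibc.so.6 => /usr/lib/libc.so.6 (0x00007f0000100000)
"""

KNOWN_OPENMP_RUNTIMES = ("libgomp", "libomp", "libiomp5")


def openmp_runtimes(ldd_output: str) -> Set[str]:
    """Return the OpenMP runtimes named in ldd-style output."""
    found = set()
    for line in ldd_output.splitlines():
        match = re.match(r"\s*(lib[\w+-]+)\.so", line)
        if match and match.group(1) in KNOWN_OPENMP_RUNTIMES:
            found.add(match.group(1))
    return found


def should_link_libomp(openblas_ldd_output: str) -> bool:
    """Skip libomp when OpenBLAS already brings in libgomp."""
    return "libgomp" not in openmp_runtimes(openblas_ldd_output)


print(openmp_runtimes(SAMPLE_LDD))
print(should_link_libomp(SAMPLE_LDD))
```

In practice the input would come from running `ldd` on the installed libopenblas shared object.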
Fixes #146603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147725
Approved by: https://github.com/albanD
For #149075
* Add a graceful cmake error instead of a cryptic one if the SYCL runtime is not found:
```
The link interface of target "c10_xpu" contains:
torch::xpurt
but the target was not found.
```
* Suppress an unclear cmake error if the SYCL compiler is not available and the subsequent version query fails:
```
CMake Error at /home/dvrogozh/pytorch/torch/share/cmake/Caffe2/FindSYCLToolkit.cmake:37 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
```
CC: @gujinghui @EikanWang @fengyuan14 @guangyey @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149353
Approved by: https://github.com/guangyey, https://github.com/malfet
When clang-cl parses its command line arguments, it expects MSVC-style arguments (beginning with `/`, such as `/WX`, `/MD`, etc.) and expects clang-style arguments to be preceded by `-Xclang`. Otherwise, the clang-style parameters are ignored because they are interpreted as unrecognized compiler options.
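A deliberately simplified Python model of this rule is shown below. It is not the CMake change itself: the function name is hypothetical, `-some-clang-flag` is a placeholder rather than a real option, and treating every argument that does not start with `/` as clang-style is an assumption of the sketch.

```python
from typing import List


def to_clang_cl_args(flags: List[str]) -> List[str]:
    """Prefix clang-style flags with -Xclang; keep MSVC-style flags as is."""
    result = []
    for flag in flags:
        if flag.startswith("/"):
            # MSVC-style option, understood by clang-cl directly.
            result.append(flag)
        else:
            # Clang-style option, forwarded explicitly via -Xclang.
            result.extend(["-Xclang", flag])
    return result


print(to_clang_cl_args(["/WX", "/MD", "-some-clang-flag"]))
```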
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148097
Approved by: https://github.com/jeffdaily
ACL is already built with PyTorch as a shared library when USE_MKLDNN_ACL is set.
Currently, it is only used indirectly in ATen via oneDNN for AArch64 targets. However, there are cases where it makes sense to use ACL directly without oneDNN as an intermediary, e.g. quantization. See #145942, #147337, #146620.
This patch enables such use cases by exposing ACL to ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148584
Approved by: https://github.com/malfet
# Motivation
Currently, Intel GPU is moving forward rapidly with feature development. We (Intel GPU) want independent version control over the oneDNN component so that we can quickly adopt optimizations and bug fixes provided by the oneDNN team.
This PR does not change the behavior of other backends such as Intel CPU and ARM. They can keep using the stable version contained in `third_party/ideep`.
# Detail
At compilation time, we `git clone` oneDNN from `https://github.com/oneapi-src/oneDNN` and check out the tag/commit that the Intel GPU backend prefers. This is supported by the CMake `ExternalProject_Add` command.
The following is an example build log:
```bash
[11/60] Performing download step (git clone) for 'xpu_mkldnn_proj'
Cloning into 'xpu_mkldnn_proj'...
HEAD is now at 5e92240360 meta: updated citation file
[12/60] Performing update step for 'xpu_mkldnn_proj'
-- Already at requested tag: v3.7
[13/60] No patch step for 'xpu_mkldnn_proj'
```
The log demonstrates that we explicitly download the source files and check out a specific tag. The oneDNN source files are located at `build/xpu_mkldnn_proj-prefix/src/xpu_mkldnn_proj`.
# Runtime verification
Running UT for CPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit fc3f17ad469b8a6da7192ae12d32625faa509f1e)
onednn_verbose,v1,info,cpu,runtime:OpenMP,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:none
onednn_verbose,v1,info,graph,backend,0:dnnl_backend
onednn_verbose,v1,primitive,info,template:operation,engine
```
Running UT for Intel GPU
```bash
onednn_verbose,v1,info,oneDNN v3.7.0 (commit 5e9224036021433d2577548ed0539fe9a53256bc)
onednn_verbose,v1,info,cpu,runtime:threadpool,nthr:24
onednn_verbose,v1,info,cpu,isa:Intel AVX-512 with Intel DL Boost
onednn_verbose,v1,info,gpu,runtime:DPC++
onednn_verbose,v1,info,gpu,engine,sycl gpu device count:2
```
We can see that Intel GPU uses commit `5e922` (tag v3.7), while CPU uses `fc3f17`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147926
Approved by: https://github.com/EikanWang
Co-authored-by: leizhenyuan <zhenyuan.lei@intel.com>
According to the [APL documentation](https://developer.arm.com/documentation/101004/2404/General-information/Arm-Performance-Libraries-example-programs), libraries ending with _mp are OpenMP multi-threaded libraries.
When a project is compiled with MSVC and the -openmp flag, the vcomp library (Visual C++ implementation of OpenMP) is used for runtime calls.
However, the current APL implementation uses the libomp.dll (LLVM) variant.
As a result, there are unexpected behaviors at runtime.
---
For example:
```python
import torch
# Create a sparse tensor
# Input (Sparse Tensor):
# [[0, 1],
# [1, 0]]
indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1, 1], dtype=torch.float32)
size = torch.Size([2, 2])
sparse_tensor = torch.sparse_coo_tensor(indices, values, size)
# Convert sparse tensor to dense tensor
dense_tensor = sparse_tensor.to_dense()
# Expected Output (Dense Tensor):
# [[0, 1],
# [1, 0]]
print("\nDense Tensor:")
print(dense_tensor)
```
However, it prints unexpected outputs such as:
```python
# [[0, 11],
# [10, 0]]
```
The issue arises because the following code does not function as expected at runtime:
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/ParallelOpenMP.h#L30
```c++
// Returns 1; however, since OpenMP is enabled, it should return the total number of threads.
int64_t num_threads = omp_get_num_threads();
```
---
At runtime, loading multiple OpenMP libraries (in this case `libomp` and `vcomp`) causes unexpected behaviour.
So we changed the libraries from the `_mp` versions to the non-`_mp` versions and used `vcomp` for OpenMP calls.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145215
Approved by: https://github.com/ozanMSFT, https://github.com/malfet
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the SYCL library is named `sycl7.lib` in the old compiler but `sycl.lib` in the new compiler.
2. On Linux, to support ABI=0 we have to link `libsycl-preview.so` with the old compiler, but we can link `libsycl.so` with the new compiler to get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code remains backward compatible with the old compiler. The new features introduced by the new compiler (Event elapsed_time, memory summary, and device architecture property) are now guarded by the macro `SYCL_COMPILER_VERSION`.
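The library naming differences in points 1 and 2 can be summarized with a short Python sketch. This is only an illustration, not the CMake logic of this PR: the function name, the tuple encoding of the version, and the use of 2025.0.0 as the exact cut-over are assumptions of the example.

```python
from typing import Tuple

# Assumption: compilers at or above this version count as "new".
NEW_COMPILER = (2025, 0, 0)


def sycl_library_name(platform: str, compiler_version: Tuple[int, int, int]) -> str:
    """Pick the SYCL runtime library name for a platform and compiler version."""
    is_new = compiler_version >= NEW_COMPILER
    if platform == "windows":
        return "sycl.lib" if is_new else "sycl7.lib"
    if platform == "linux":
        # The old compiler needs the preview library for ABI=0 support.
        return "libsycl.so" if is_new else "libsycl-preview.so"
    raise ValueError("unsupported platform: " + platform)


print(sycl_library_name("windows", (2024, 1, 0)))
print(sycl_library_name("linux", (2025, 0, 0)))
```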
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
# Motivation
If XPU is built with oneAPI 2024.2, the build fails because `sycl-preview.lib` exists on Windows, and linking this unexpected library results in `error LNK2019: unresolved external symbol`.
# Solution
Explicitly use `sycl-preview` in the Linux build only.
# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
# Motivation
This PR intends to support ABI=0 build for XPU backend.
# Additional Context
The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). The reasons are:
- We use gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL.
- We use icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code.
- `libsycl-preview.so` is an ABI-neutral library, but `libsycl.so` is not.
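The flag selection above can be summarized in a small Python sketch. It is not the build-system code of this PR; the function name, the dictionary keys, and the `-l...` spelling of the link options are assumptions made for illustration.

```python
from typing import Dict, List


def xpu_build_flags(abi_neutral: bool) -> Dict[str, List[str]]:
    """Return per-stage flags for an XPU build (layout is illustrative)."""
    if not abi_neutral:
        return {"host_compile": [], "device_compile": [], "link": ["-lsycl"]}
    return {
        # gcc compiles host code against the ABI-neutral SYCL API.
        "host_compile": ["-D__INTEL_PREVIEW_BREAKING_CHANGES"],
        # icpx builds ABI-neutral device kernel code.
        "device_compile": ["-fpreview-breaking-changes"],
        # libsycl-preview.so is the ABI-neutral runtime library.
        "link": ["-lsycl-preview"],
    }


for stage, flags in xpu_build_flags(abi_neutral=True).items():
    print(stage, flags)
```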
This PR depends on https://github.com/pytorch/pytorch/pull/131643.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
`exec_program` has been deprecated since CMake 3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html
This makes the following warning disappear:
```
CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM):
Policy CMP0153 is not set: The exec_program command should not be called.
Run "cmake --help-policy CMP0153" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.
Use execute_process() instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714
Approved by: https://github.com/kit1980