Commit Graph

1328 Commits

Author SHA1 Message Date
cca34be584 Update XNNPACK Version (#139913)
Updating XNNPACK Version to 4ea82e595b36106653175dcb04b2aa532660d0d8

submodule update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139913
Approved by: https://github.com/digantdesai, https://github.com/huydhn
2024-11-18 18:16:31 +00:00
cyy
40fb738197 Use Wextra-semi (#140236)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140236
Approved by: https://github.com/ezyang
2024-11-13 02:15:16 +00:00
891ba2ec8a Fix xpu cmake typo (#140374)
# Motivation
This PR aims to fix a typo in the CMake build. The typo impacts the XPU Windows build and results in PyTorch being built without XPU, which is unexpected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140374
Approved by: https://github.com/EikanWang, https://github.com/ezyang, https://github.com/atalman
2024-11-13 00:26:35 +00:00
a290c1d748 Fix building with system GLOO (#140275)
Leverage existing FindGloo CMake module to locate system's library and headers. Add system's gloo headers to include path rather than the gloo from third party when USE_SYSTEM_GLOO is specified.

Fixes #140274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
2024-11-11 22:58:39 +00:00
8051ee802c Add XPU compiler version control in cmake to keep BC (#139258)
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.

# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler(2024.1), for examples:
1. On Windows, sycl library is named `sycl7.lib` in the old compiler but is named `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` in the old compiler but we could link `libsycl.so` in the new compiler to have the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` to support our new code has good backward compatibility with the old compiler. Now the new feature(Event elapsed_time, memory summary, and device architecture property) introduced by the new compiler will be controlled within the macro `SYCL_COMPILER_VERSION`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
929a647363 [Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM OPs. (#139025)
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.

Motivation: There are two parts of aten ops for XPU, one is in-tree ops like GEMM related OPs and the other is out-off-tree ops in torch-xpu-ops. For the in-tree part,since Pytorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate c shim wrappers, we also need to enable this mechanism for XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2024-11-09 13:09:27 +00:00
7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
77b72d686e [BE][MPS] Make metal shaders compile cleanly (#139522)
I.e. without warnings, by deleting dead code and fixing one
signed-unsigned comparison warning

Also, pass `-Werror` to metal compiler if WERROR options is set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522
Approved by: https://github.com/Skylion007
2024-11-01 23:22:47 +00:00
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but recently more and more often one had to implement custom kernel here that were simply embedded in the operator codebase and were compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (first metal kernel to MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 )
Later on, as number of operator grew, those were refactored into `MetalShaderLibrary` convenience class (see  https://github.com/pytorch/pytorch/pull/125550 )

But as number of kernels keeps growing, it's time to make a next step and properly compile them into `.metalib`

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds check on whether full Xcode installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into .metallib for Metal-3.0(Available on MacOS 13) and Metal-3.1 standard (available on MacOS 14, can use bfloat) and bundles both using `-sectcreate` linker option and `getsectiondata` API call. `metallib_dummy.cpp` file is used to properly express dependencies between metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only DeveloperTools CLI is installed, automatically wraps .metal into `_metallib.h` that contains shader source wrapped in `MetalShaderLibrary`

Bulk of changes introduced in this PR are just moving code around. I.e. for every file that contains non-templated shader definition in `aten/src/ATen/native/mps/operators` folder, corresponding `.metal` file is created in `aten/src/ATen/native/mps/kernels` folder and embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross repeat and view)  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - Build process is a bit more complicated that it used to be
  - Need to maintain two codepath (as our CI builders only has DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
beb15c80fb print USE_STATIC_MKL for further debug. (#138902)
print `USE_STATIC_MKL` for further debug.
<img width="257" alt="image" src="https://github.com/user-attachments/assets/cd45bada-c28a-441a-b271-35956cfe1f21">
if we use `MKL`, then show its link method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138902
Approved by: https://github.com/ezyang
2024-10-27 18:08:30 +00:00
b021486405 Enable Windows Arm64 (#133088)
This PR enables Pytorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries) as a BLAS option, which is introduced in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet

Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com>
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com>
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
2024-10-24 16:10:44 +00:00
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We did a lot of optimization for PyTorch Windows, and we got good progress of it. But still some models have performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog conclusion, we found the `ResNet50` is typical case of it.

Let's focus on the `ResNet50`, and collect the profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
We found the major kernel consume CPU resource is `aten::mkldnn_convolution`. It was dispatched to `MKLDNN`.
Acturally, we had optimized memory allocation via integrated mimalloc to pytorch C10 module. It helps PyTorch Windows boost a lot, but it does not cover `MKL` and `MKLDNN`'s intermediary temporary memory.
We still have potential to improve PyTorch Windows performance via optimize `MKL` and `MKLDNN`'s intermediary temporary memory.

So, I discussed with Intel MKL team, and get a method to register high performance memory allocation API to MKL, and it would help MKL to boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR is optimize MKL memory alloction performance on Windows, via register mi_malloc to MKL. PR Changes:
1. Add cmake option: `USE_MIMALLOC_ON_MKL`, It is sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in C10, when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register allocation APIs to MKL, when `USE_MIMALLOC_ON_MKL` is `ON`.

For `oneDNN`, it is still tracking in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
3cfd244495 Add USE_SYSTEM_NVTX option (#138287)
## Summary

We are currently [updating](https://github.com/conda-forge/pytorch-cpu-feedstock/pull/277) the [`conda-forge::pytorch`](https://anaconda.org/conda-forge/pytorch) package to version 2.5.0. This update includes a new dependency, the third_party/NVTX submodule. However, like other package management frameworks (e.g., apt), conda-forge prefers using system-installed packages instead of vendor-provided third-party packages.

This pull request aims to add an option, `USE_SYSTEM_NVTX`, to select whether to use the vendored nvtx or the system-installed one, with the default being the vendored one (which is the current behavior).

## Test Plan

The `USE_SYSTEM_NVTX` option is tested by building the `conda-forge::pytorch` package with the change applied as a [patch](cd1d2464dd/recipe/patches/0005-Use-system-nvtx3.patch).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138287
Approved by: https://github.com/albanD
2024-10-19 04:26:01 +00:00
8cda774a03 Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773)
# Motivation
Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-10-18 02:28:08 +00:00
cyy
af8bd323e8 Remove legacy Caffe2 pthreadpool from CMake (#134936)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
2c94c54f10 Export XPU libs to be public (#136974)
# Motivation
Export XPU-related libs to be public. Now they are included in `TORCH_LIBRARIES`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136974
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-10-16 03:41:01 +00:00
e1e6417d4c Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-15 18:52:44 +00:00
dac0b4e62b Revert "Add SVE implementation of embedding_lookup_idx (#133995)"
This reverts commit 770c134998d3422bc2fa3b90baa235ed0c409e62.

Reverted https://github.com/pytorch/pytorch/pull/133995 on behalf of https://github.com/clee2000 due to breaking internal tests, I wondering if this just needs a targets change for buck? ([comment](https://github.com/pytorch/pytorch/pull/133995#issuecomment-2414596554))
2024-10-15 17:23:50 +00:00
770c134998 Add SVE implementation of embedding_lookup_idx (#133995)
Adds an accelerated version of the embedding_lookup_idx perfkernels. This is done via a python codegen file similarly to `caffe2/perfkernels/hp_emblookup_codegen.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133995
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-10-14 10:17:27 +00:00
bd63ec4f45 [ROCm] LoadHIP CMake cleanup (#137112)
Should help mitigate issues reported here: https://github.com/pytorch/pytorch/issues/128313

While working on https://github.com/pytorch/pytorch/pull/136700, we realized that some of the ROCm CMake can be streamlined.

This PR does not fix any bugs or provide any new functionality. Strictly clean-up.

The remaining `${ROCM_ROCTX_LIB}` will be removed when we transition to the rocprofiler-sdk (to be done in a separate PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137112
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-13 00:06:41 +00:00
b9c9f7f0fa Document ROCm environment variables and improve CMake messaging to user (#137308)
Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCM builds.

- Documentation: There were two environment variables for ROCm builds that are now documented. `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-10 03:08:08 +00:00
851b9732aa Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603)
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.

This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
2024-09-26 18:05:51 +00:00
e8f1dd6ba0 Fix hardcoded ROCm paths in Caffe2Targets.cmake (#136283)
Fixes #131701

Use CMake imported targets more consistently to eliminate hardcode paths.

Here is the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```

```
set_target_properties(torch_hip PROPERTIES
  INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
  INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```

HIPCUB dependency was not actually used; which is why it is removed here as the imported target had undesirable side effects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-26 00:34:43 +00:00
4fd16dd8aa Clarify that libtorch API is C++17 compatible (#136471)
As it relies on some common C++17 primitives, such as `std::optional`
Replace all docs references from C++14 to C++17

Fixes https://github.com/pytorch/pytorch/issues/133205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-24 02:03:33 +00:00
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
cyy
c459430558 Pass Werror to CUDA host compiler (#130213)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213
Approved by: https://github.com/ezyang
2024-09-21 08:01:06 +00:00
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
5a6ddbcc3b Extending the Pytorch vec backend for SVE (ARM) (#119571)
**Motivation:**
In Pytorch, Aten vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256bit & 512bits registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes.

**Reference Link:** https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec

**This PR:**

* Our goal with this contribution is to add support for SVE backend for Vec in the Aten vectorization for CPU backend which can be benefitted by any ARM architecture supported CPU's that supports SVE.

* More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)

* We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec.

* Currently we are adding support only for SVE ISA with the vector length of 256 bits (SVE 256). In future, we plan to extend this SVE support for other vector lengths as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571
Approved by: https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>
2024-09-18 18:59:10 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
Cmake LINUX variable is available on starting from cmake 3.25. Better to use CMAKE_SYSTEM_NAME instead to relax cmake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
Fix to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
a258844a32 Properly handle empty CPUINFO variable (#134916)
Fixes https://github.com/pytorch/pytorch/issues/134915

But I did not root cause why CPUINFO is totally empty to begin with...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134916
Approved by: https://github.com/Skylion007
2024-09-03 15:59:59 +00:00
3402a5d865 fix windows xpu build issue (#133845)
# Motivation
If build XPU via oneAPI 2024.2, it will fail because `sycl-preview.lib` exists in windows. And linking the unexpected lib results in `error LNK2019: unresolved external symbol`.

# Solution
Use explicitly `sycl-preview` in linux build only.

# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-08-29 23:53:32 +00:00
416a7894fe [Windows][XPU] Disable Kineto PTI on Windows only (#134620)
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-29 20:58:55 +00:00
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
5fd670e0ef [ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866)
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly

Fixes #125230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet
2024-08-27 18:24:29 +00:00
c638a40a93 [Caffe2] Remove unused AVX512 code (#133160)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160
Approved by: https://github.com/albanD
2024-08-23 23:16:16 +00:00
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.  The advantage of NVTX3 over NVTX is that it is a header-only library so that linking with NVTX3 can greatly simplify our CMake and other building scripts for finding libraries in user environments. In addition, NVTX are indeed still present in the latest CUDA versions, but they're no longer a compiled library: It's now a header-only library. That's why there isn't a .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU became a PyTorch built-in device, the profiler support is indispensable part of functionality completeness. This PR is associated with the PR to introduce XPU profiler plugin into the kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option will be suppressed accordingly, which allows kineto to build with XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
92bebb46fa Support XPU ABI=0 build (#130110)
# Motivation
This PR intends to support ABI=0 build for XPU backend.

# Additional Context
The major change is adding a compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler(gcc) and `-fpreview-breaking-changes` for XPU device kernel code compiler(icpx), why?
Because we use
- gcc to compile host code and link SYCL runtime. So we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler invoking the ABI-neutral API included in SYCL. And
- use icpx to compile device kernel code and link SYCL runtime. So we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler building ABI-neutral code. Besides,
- `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.

This PR depends on https://github.com/pytorch/pytorch/pull/131643.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-08-01 21:42:14 +00:00
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
cyy
803c5b8640 [CMake] Fix private compile options for CUDA code (#130546)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130546
Approved by: https://github.com/ezyang
2024-07-25 00:22:18 +00:00
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00