c17ba69ba5
[submodule] Revert "Adds support for accelerated sorting with x86-simd-sort ( #127936 )" ( #141901 )
...
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590
Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14 , https://github.com/malfet
2024-12-03 00:16:35 +00:00
0fca51bcc4
[11/N] Fix Wextra-semi warning ( #140926 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140926
Approved by: https://github.com/ezyang
2024-11-20 00:32:45 +00:00
cca34be584
Update XNNPACK Version ( #139913 )
...
Updating XNNPACK Version to 4ea82e595b36106653175dcb04b2aa532660d0d8
submodule update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139913
Approved by: https://github.com/digantdesai , https://github.com/huydhn
2024-11-18 18:16:31 +00:00
a290c1d748
Fix building with system GLOO ( #140275 )
...
Leverage existing FindGloo CMake module to locate system's library and headers. Add system's gloo headers to include path rather than the gloo from third party when USE_SYSTEM_GLOO is specified.
Fixes #140274
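A minimal sketch of the switch described above, assuming the names from the commit message (`USE_SYSTEM_GLOO`, the existing `FindGloo` CMake module); the actual Dependencies.cmake wiring may differ:

```cmake
# Sketch only: prefer the system Gloo over the third_party copy.
if(USE_SYSTEM_GLOO)
  find_package(Gloo REQUIRED)  # uses the existing FindGloo CMake module
  # System headers first, so third_party/gloo headers are not picked up.
  include_directories(SYSTEM ${Gloo_INCLUDE_DIRS})
  list(APPEND Caffe2_DEPENDENCY_LIBS ${Gloo_LIBRARIES})
else()
  add_subdirectory(${PROJECT_SOURCE_DIR}/third_party/gloo)
endif()
```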
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140275
Approved by: https://github.com/malfet
2024-11-11 22:58:39 +00:00
8051ee802c
Add XPU compiler version control in cmake to keep BC ( #139258 )
...
# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described here. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the SYCL library is named `sycl7.lib` with the old compiler but `sycl.lib` with the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, but with the new compiler we can link `libsycl.so` and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that new code keeps backward compatibility with the old compiler. The new features introduced by the new compiler (Event elapsed_time, memory summary, and device architecture property) are now guarded by the `SYCL_COMPILER_VERSION` macro.
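A sketch of how such version gating might look in CMake; the version encoding and variable names here are assumptions for illustration, not the PR's actual code:

```cmake
# Sketch only: branch on old (2024.1) vs. new (2025.0.0) SYCL compiler.
if(WIN32)
  if(SYCL_COMPILER_VERSION LESS 20250000)
    set(SYCL_LIBRARY sycl7)        # old compiler ships sycl7.lib
  else()
    set(SYCL_LIBRARY sycl)         # new compiler ships sycl.lib
  endif()
else()
  if(SYCL_COMPILER_VERSION LESS 20250000)
    set(SYCL_LIBRARY sycl-preview) # old compiler: ABI=0 needs libsycl-preview.so
  else()
    set(SYCL_LIBRARY sycl)         # new compiler: libsycl.so has the same ABI compatibility
  endif()
endif()
# Expose the version so C++ code can guard new-compiler-only features.
add_compile_definitions(SYCL_COMPILER_VERSION=${SYCL_COMPILER_VERSION})
```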
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang , https://github.com/atalman , https://github.com/gujinghui
2024-11-09 13:31:21 +00:00
7e65060410
Adds support for accelerated sorting with x86-simd-sort ( #127936 )
...
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.
For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
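A hypothetical sketch of wiring such a header-only submodule into the build; the option name and guard variable are assumptions, not the PR's actual diff:

```cmake
# Sketch only: expose the submodule's headers when building for x86.
option(USE_X86_SIMD_SORT "Use x86-simd-sort for AVX2/AVX512-accelerated sorting" ON)
if(USE_X86_SIMD_SORT AND CPU_INTEL)
  include_directories(${PROJECT_SOURCE_DIR}/third_party/x86-simd-sort/src)
  add_compile_definitions(USE_X86_SIMD_SORT)  # guards the SIMD sort code paths
endif()
```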
<details>
<summary><b>Contiguous Benchmarks</b></summary>
```
float32, normally distributed (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214
128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697
1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241
10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577
100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766
1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416
10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323
int32_t, uniformly distributed (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779
128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698
1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285
10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997
100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472
1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611
10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952
```
</details>
Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but that sort only handles contiguous data and a single sorting direction.
<details>
<summary><b>Discontiguous Benchmarks</b></summary>
```
float, normally distributed, discontiguous in sorted dimension (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243
128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923
1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952
10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974
100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743
1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133
10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621
int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908
128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862
1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408
10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889
100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902
1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381
10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5 , https://github.com/peterbell10 , https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
b021486405
Enable Windows Arm64 ( #133088 )
...
This PR enables PyTorch for Windows on Arm64 - CPU only.
Currently, there aren't any checks in place to build and test for Windows on Arm64, but we're working to implement those as soon as possible.
We recommend using [Arm Performance Libraries (APL)](https://developer.arm.com/Tools%20and%20Software/Arm%20Performance%20Libraries ) as a BLAS option, which is introduced in this PR.
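For illustration, selecting APL as the BLAS backend at configure time might look like this; the accepted `BLAS` value is an assumption based on the commit message, not a documented flag:

```cmake
# Sketch only: pick Arm Performance Libraries as the BLAS implementation.
set(BLAS "APL" CACHE STRING "Selected BLAS library")
```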
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133088
Approved by: https://github.com/malfet
Co-authored-by: cristian panaite <panaite.cristian2000@gmail.com >
Co-authored-by: Stefan-Alin Pahontu <56953855+alinpahontu2912@users.noreply.github.com >
Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com >
2024-10-24 16:10:44 +00:00
af8bd323e8
Remove legacy Caffe2 pthreadpool from CMake ( #134936 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134936
Approved by: https://github.com/ezyang
2024-10-17 05:22:08 +00:00
e8f1dd6ba0
Fix hardcoded ROCm paths in Caffe2Targets.cmake ( #136283 )
...
Fixes #131701
Use CMake imported targets more consistently to eliminate hardcode paths.
Here is the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```
```
set_target_properties(torch_hip PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```
The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects.
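The point of imported targets is that downstream consumers inherit the ROCm dependencies transitively instead of seeing hardcoded paths; a consumer-side sketch (project and target names hypothetical):

```cmake
# Sketch only: a downstream project consuming the exported targets.
find_package(Torch REQUIRED)                 # loads Caffe2Targets.cmake internally
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE torch)  # hip::amdhip64, MIOpen, etc. come along transitively
```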
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily , https://github.com/Skylion007 , https://github.com/jithunnair-amd , https://github.com/atalman
2024-09-26 00:34:43 +00:00
0e19522122
Revert "Adds support for accelerated sorting with x86-simd-sort ( #127936 )"
...
This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86.
Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456 ) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316 ))
2024-09-23 14:52:23 +00:00
c459430558
Pass Werror to CUDA host compiler ( #130213 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213
Approved by: https://github.com/ezyang
2024-09-21 08:01:06 +00:00
239a9ad65e
Adds support for accelerated sorting with x86-simd-sort ( #127936 )
...
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.
For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
<details>
<summary><b>Contiguous Benchmarks</b></summary>
```
float32, normally distributed (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 7.150844336 6.886271477 7.132277489 1.038420335 1.002603214
128 9.208030939 8.478154898 7.846915245 1.086089019 1.173458697
1024 37.79037627 23.60707456 16.44122627 1.600807257 2.298513241
10000 714.7355628 203.9921844 105.5683001 3.503739934 6.770361577
100000 8383.074408 721.6333354 465.3709247 11.61680593 18.01374766
1000000 97124.31945 5632.054572 3920.148401 17.24491803 24.77567416
10000000 1161974.907 86070.48988 71533.82301 13.50027063 16.24371323
int32_t, uniformly distributed (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 7.203208685 6.92212224 7.014458179 1.040606975 1.026908779
128 8.972388983 8.195516348 7.592543125 1.094792396 1.18173698
1024 32.77489477 23.6874548 15.36617105 1.383639359 2.132925285
10000 607.8824128 193.3402024 99.25090471 3.144107667 6.124703997
100000 523.9384684 608.1836536 442.3166784 0.861480682 1.184532472
1000000 5211.348627 5271.598405 3518.861883 0.988570871 1.480975611
10000000 133853.6263 81463.05084 67852.97394 1.643120714 1.972700952
```
</details>
Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but that sort only handles contiguous data and a single sorting direction.
<details>
<summary><b>Discontiguous Benchmarks</b></summary>
```
float, normally distributed, discontiguous in sorted dimension (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 3.836543679 4.011214256 3.84376061 0.956454439 0.99812243
128 5.755310194 5.755723127 4.820394962 0.999928257 1.193949923
1024 49.46946019 24.78790785 15.47874362 1.995709379 3.195960952
10000 665.2505291 236.6165959 143.9490662 2.811512551 4.621429974
100000 4328.002203 1329.001212 818.3516414 3.256582586 5.288682743
1000000 47651.5018 16693.72045 11827.39551 2.854456677 4.028909133
10000000 556655.1288 236252.6258 184215.9828 2.356185998 3.021752621
int32_t, uniformly distributed, discontiguous in sorted dimension (in microseconds)
size Default AVX2 AVX512 Default/AVX2 Default/AVX512
16 3.817994356 3.878117442 3.770039797 0.984496837 1.012719908
128 5.578731397 5.577152082 4.716770534 1.000283176 1.182743862
1024 43.3412619 23.61275801 14.55446819 1.835501887 2.977866408
10000 634.3997478 224.4322851 133.9518324 2.826686667 4.736028889
100000 4084.358152 1292.363303 781.7867576 3.16037924 5.22438902
1000000 46262.20465 16608.35284 11367.51817 2.785478192 4.06968381
10000000 541231.9104 235185.1861 180249.9294 2.301301028 3.002674742
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5 , https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
416a7894fe
[Windows][XPU] Disable Kineto PTI on Windows only ( #134620 )
...
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey , https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com >
2024-08-29 20:58:55 +00:00
3b40b07efb
Update PyTorch for XNNPACK 87ee0b4 ( #134518 )
...
Summary: Update XNNPACK library version.
Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).
Differential Revision: D61822610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
5fd670e0ef
[ROCM] Properly disable Flash Attention/Efficient Attention with environment variables ( #133866 )
...
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly
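A sketch of honoring such environment variables at configure time; the exact PyTorch logic may differ:

```cmake
# Sketch only: environment variables override the cached defaults.
foreach(flag USE_FLASH_ATTENTION USE_MEM_EFF_ATTENTION)
  if(DEFINED ENV{${flag}})
    set(${flag} $ENV{${flag}} CACHE BOOL "" FORCE)
  endif()
endforeach()
```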
Fixes #125230
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd , https://github.com/jeffdaily , https://github.com/malfet
2024-08-27 18:24:29 +00:00
018e48c337
[Reland] Add wrappers for synchronous GPUDirect Storage APIs ( #133489 )
...
Reland #130633
USE_CUFILE turned off by default in this version
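The off-by-default gating can be read as a standard CMake option; the option name comes from the commit message, the linking line is an assumption:

```cmake
# Sketch only: cuFile wrappers are opt-in in this reland.
option(USE_CUFILE "Build wrappers for synchronous GPUDirect Storage (cuFile) APIs" OFF)
if(USE_CUFILE)
  list(APPEND Caffe2_CUDA_DEPENDENCY_LIBS CUDA::cuFile)  # imported target name is an assumption
endif()
```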
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
fa1d7b0262
Revert "Remove unused Caffe2 macros ( #132979 )"
...
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.
Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332 ))
2024-08-12 18:34:56 +00:00
da65cfbdea
Remove unused Caffe2 macros ( #132979 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
05e8e87a69
[Submodule] Remove foxi ( #132976 )
...
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
26b0011fb8
[XPU][Kineto Submodule] Introduce kineto-based XPU profiler ( #130811 )
...
As XPU became a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option is suppressed accordingly, which allows kineto to build with the XPU profiler plugin.
Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961
Also updates the Kineto Submodule to include XPU changes.
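A sketch of the described suppression; the exact cache wiring is an assumption:

```cmake
# Sketch only: build kineto with the XPU PTI plugin iff XPU is enabled.
if(USE_XPU)
  set(LIBKINETO_NOXPUPTI OFF CACHE BOOL "" FORCE)  # keep the XPU profiler plugin
else()
  set(LIBKINETO_NOXPUPTI ON CACHE BOOL "" FORCE)
endif()
add_subdirectory(third_party/kineto/libkineto)
```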
Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
92bebb46fa
Support XPU ABI=0 build ( #130110 )
...
# Motivation
This PR intends to support ABI=0 build for XPU backend.
# Additional Context
The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). Why?
Because we use
- gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and
- icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code. Besides,
- `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.
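The flag split above might be wired roughly like this; the variable names are assumptions, only the flags come from the commit message:

```cmake
# Sketch only: ABI-neutral flags for host (gcc) and device (icpx) compilation.
add_compile_definitions(__INTEL_PREVIEW_BREAKING_CHANGES)  # host compiler: ABI-neutral SYCL API
string(APPEND SYCL_KERNEL_FLAGS " -fpreview-breaking-changes")  # device kernel compiler
set(SYCL_RUNTIME_LIBRARY sycl-preview)  # the ABI-neutral SYCL library
```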
This PR depends on https://github.com/pytorch/pytorch/pull/131643 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/albanD
2024-08-01 21:42:14 +00:00
e191b83462
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607 ))
2024-07-26 18:08:20 +00:00
709ddf7a9d
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
e4b5645f83
Revert "Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )"
...
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.
Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738 ))
2024-07-23 17:19:34 +00:00
5b5e0698a5
Add wrappers for synchronous GPUDirect Storage APIs ( #130633 )
...
Based in part on https://github.com/NVIDIA/apex/pull/1774
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
a6345d3477
[CMake] [3/N] Remove unused code ( #130322 )
...
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
953c6476bd
[CMAKE] Look for Development.Module instead of Development ( #129669 )
...
Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716 ) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347 ), when building a python module, the build should find the `Development.Module` component, not `Development` (which includes both `Development.Module` and `Development.Embed` and therefore expects a shared python library). After this PR and before #124613 , pytorch could be built with a static libpython (e.g. in manylinux).
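The difference in practice, using standard CMake FindPython usage (the target and command names here come from CMake's documentation, not this PR's diff):

```cmake
# Finds the interpreter plus just what is needed to build an extension module;
# unlike the Development component, this does not require a shared libpython.
find_package(Python COMPONENTS Interpreter Development.Module REQUIRED)
Python_add_library(my_ext MODULE my_ext.c WITH_SOABI)  # links against Python::Module
```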
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669
Approved by: https://github.com/malfet
2024-07-09 09:16:43 +00:00
a21d4363d2
[Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP ( #129973 )
...
Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP
Test Plan:
Ran resnet. Trace looks good
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces
Reviewed By: aaronenyeshi, swolchok
Differential Revision: D59132793
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973
Approved by: https://github.com/aaronenyeshi
2024-07-03 19:28:52 +00:00
46366888d7
Remove outdated CMake code ( #129851 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851
Approved by: https://github.com/ezyang
2024-07-02 00:40:37 +00:00
1d0efedc85
[Profiler] Add TSC Clock Callback to CUPTI ( #125036 )
...
Summary:
Right now we use the default clock for CUPTI, which is neither monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885
This diff only adds the compile flags such that the TSC format is used, and sets the converter via a libkineto call in the profiler
Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces
TBD: Add benchmarks
Differential Revision: D56584521
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
64f1111d38
Expose nlohmann json to torch ( #129570 )
...
Summary:
Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.
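One plausible way to expose a vendored header-only library to other targets, as a sketch (the real build wiring may differ, and the target name is hypothetical):

```cmake
# Sketch only: an INTERFACE target carrying the nlohmann/json.hpp include path.
add_library(nlohmann_json INTERFACE)
target_include_directories(nlohmann_json INTERFACE
  ${PROJECT_SOURCE_DIR}/third_party/nlohmann/include)
target_link_libraries(torch_cpu PRIVATE nlohmann_json)
```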
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D59035246
Pulled By: c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k , https://github.com/malfet
2024-06-26 21:59:26 +00:00
479ce5e2f4
Remove outdated CUDA code from CMake ( #128801 )
...
It's possible to simplify some CUDA handling logic in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801
Approved by: https://github.com/r-barnes , https://github.com/malfet
2024-06-21 15:00:00 +00:00
b4a0161449
Build SYCL kernels for ATen XPU ops on Native Windows (take 2) ( #127390 )
...
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.
-------
As proposed in https://github.com/pytorch/pytorch/issues/126719 , we are enabling PyTorch XPU on Native Windows on Intel GPU.
This PR enables XPU build on Windows as the first step of #126719 :
- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.
Co-authored-by: Yu, Guangye <guangye.yu@intel.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey , https://github.com/EikanWang , https://github.com/gujinghui , https://github.com/ezyang
2024-06-06 01:41:06 +00:00
df75a9dc80
Remove Caffe2/onnx ( #127991 )
...
Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991
Approved by: https://github.com/ezyang
2024-06-05 15:10:12 +00:00
597922ba21
Reapply "distributed debug handlers ( #126601 )" ( #127805 )
...
This reverts commit 7646825c3eb687030c4f873b01312be0eed80174.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
53f001c599
Revert "correct BLAS input ( #126200 )" ( #127762 )
...
This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291.
Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984
Aarch64 tests failures:
```
+ echo 'Checking that MKLDNN is available on aarch64'
Checking that MKLDNN is available on aarch64
+ pushd /tmp
/tmp /
+ python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
Error: Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762
Approved by: https://github.com/PaliC , https://github.com/malfet
2024-06-03 15:49:48 +00:00
4e7f497bb3
[Submodule] Remove ios-cmake ( #127694 )
...
It has not been updated for a long time and CI iOS builds don't rely on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694
Approved by: https://github.com/ezyang
2024-06-02 04:40:21 +00:00
ea13e9a097
correct BLAS input ( #126200 )
...
Fixes #32407
With this little correction to Dependencies.cmake it is possible to build an MKL-free version of PyTorch from version v2.0.0 onward by explicitly choosing another, MKL-free BLAS.
This pull request fulfills the "if not already present" part of the original comment in Dependencies.cmake:
"setting default preferred BLAS options if not already present."
It was tested with this GitHub Actions workflow:
```
name: Build PyTorch v2.0.0 without AVX
on:
  push:
    branches:
      - v2.0.0
  pull_request:
    branches:
      - v2.0.0
jobs:
  build:
    runs-on: ubuntu-20.04
    defaults:
      run:
        shell: bash -el {0}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          #repository: 'pytorch/pytorch'
          #ref: 'v2.3.0'
          submodules: 'recursive'
      - uses: conda-incubator/setup-miniconda@v3
        with:
          auto-activate-base: true
          activate-environment: true
          python-version: 3.10.13
      - name: Install Dependencies - Common - Linux 2
        run: |
          conda info
          conda list
          conda install nomkl
          conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses
          export PYTORCH_CPU_CAPABILITY=cpu
          export ATEN_CPU_CAPABILITY_DEFAULT=cpu
          export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
          export ATEN_CPU_CAPABILITY=default
          export USE_NNPACK=0
          export MAX_JOBS=4
          export USE_CUDA=0
          export USE_ROCM=0
          export BLAS=OpenBLAS
          export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=Release -D USE_AVX=OFF -D USE_NNPACK=OFF -D C_HAS_AVX_2=OFF -D C_HAS_AVX2_2=OFF -D CXX_HAS_AVX_2=OFF -D CXX_HAS_AVX2_2=OFF -D CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS=OFF -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))") -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DPYTHON_EXECUTABLE:FILEPATH=`which python`"
          pip install build wheel typing_extensions
          python setup.py bdist_wheel
      - name: Archive production artifacts
        uses: actions/upload-artifact@v4
        with:
          name: dist-without-markdown
          path: |
            dist
            !dist/**/*.md
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126200
Approved by: https://github.com/jgong5 , https://github.com/kit1980
2024-05-31 19:38:42 +00:00
3e66052e16
Improve python3 discovery code in CMake ( #127600 )
...
The improvement is based on my comments in #124613 and it also fixes the current linux-s390x-binary-manywheel CI failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127600
Approved by: https://github.com/Skylion007
2024-05-31 17:29:06 +00:00
0c5faee372
Replace python::python with Python::Module ( #127485 )
...
Use found Python::Module target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127485
Approved by: https://github.com/ezyang
2024-05-31 05:57:05 +00:00
7646825c3e
Revert "distributed debug handlers ( #126601 )"
...
This reverts commit 3d541835d509910fceca00fc5a916e9718c391d8.
Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987 ))
2024-05-31 01:21:24 +00:00
d44daebdbc
[Submodule] Remove deprecated USE_TBB option and TBB submodule ( #127051 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch , https://github.com/malfet
2024-05-31 01:20:45 +00:00
3d541835d5
distributed debug handlers ( #126601 )
...
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)
This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.
This adds 2 handlers out of the box:
* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder
Test plan:
```
python test/distributed/elastic/test_control_plane.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman , https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
67739d8c6f
Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule ( #127051 )"
...
This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a.
Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995 ))
2024-05-30 01:16:57 +00:00
8ea1dc8748
Use Python::NumPy target ( #127399 )
...
Now that we use FindPython, use it again for numpy detection.
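The replacement for FindNumpy in standard FindPython terms (component and target names are from CMake's documentation, not this PR's exact diff; `my_ext` is hypothetical):

```cmake
# NumPy headers come in through an imported target instead of a custom module.
find_package(Python COMPONENTS Interpreter Development NumPy REQUIRED)
target_link_libraries(my_ext PRIVATE Python::NumPy)  # adds NumPy's include dirs
```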
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127399
Approved by: https://github.com/malfet
2024-05-29 23:17:58 +00:00
0910429d72
[BE][CMake] Use FindPython module ( #124613 )
...
As FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12
Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive)
This makes PyTorch buildable with the python3 binary shipped with XCode on MacOS
TODO: Get rid of `FindNumpy` as it is part of the Python package
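The migration in miniature, comparing the deprecated module with its replacement (standard CMake usage, not the PR's exact diff):

```cmake
# Before (deprecated since CMake 3.12):
#   find_package(PythonInterp REQUIRED)   # sets PYTHON_EXECUTABLE
# After:
find_package(Python COMPONENTS Interpreter REQUIRED)
message(STATUS "Using ${Python_EXECUTABLE}")  # note the case-sensitive variable name
```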
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever , https://github.com/Skylion007
2024-05-29 13:17:35 +00:00
699db7988d
[Submodule] Remove deprecated USE_TBB option and TBB submodule ( #127051 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch , https://github.com/malfet
2024-05-29 11:58:03 +00:00
cdbb2c9acc
Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule ( #127051 )"
...
This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f.
Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735 ))
2024-05-29 03:02:35 +00:00
4fdbaa794f
[Submodule] Remove deprecated USE_TBB option and TBB submodule ( #127051 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch , https://github.com/malfet
2024-05-27 03:54:03 +00:00
95e5c994f9
[Submodule] Clear USE_QNNPACK build option ( #126941 )
...
Following the removal of the QNNPACK third-party module ( #126657 ), we can clear more build-system code. Also, third_party/neon2sse was removed because it is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941
Approved by: https://github.com/ezyang
2024-05-24 00:12:56 +00:00