# Motivation
This PR aims to maintain backward compatibility when building PyTorch XPU with the old and new compilers.
# Additional Context
The details are described below. The new compiler (2025.0.0) has some breaking changes compared with the old compiler (2024.1), for example:
1. On Windows, the SYCL library is named `sycl7.lib` in the old compiler but `sycl.lib` in the new compiler.
2. On Linux, in order to support ABI=0, we have to link `libsycl-preview.so` with the old compiler, while with the new compiler we can link `libsycl.so` and get the same ABI compatibility.
3. We added a macro `SYCL_COMPILER_VERSION` so that our new code keeps good backward compatibility with the old compiler. The new features introduced by the new compiler (Event elapsed_time, memory summary, and device architecture property) are now guarded by the `SYCL_COMPILER_VERSION` macro; a minimal sketch of the pattern follows this list.
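A purely illustrative sketch of the gating pattern (the version threshold `20250000` and the guarded helper below are placeholders, not the PR's actual code):
```cpp
#include <stdexcept>

// Hypothetical helper: gate a feature that only the newer compiler provides.
inline float xpu_event_elapsed_time_ms() {
#if defined(SYCL_COMPILER_VERSION) && SYCL_COMPILER_VERSION >= 20250000
  // New compiler (2025.0.0): the elapsed-time query is available; call it here.
  return 0.0f;  // placeholder for the real query
#else
  // Old compiler (2024.1): the feature does not exist, so report it clearly.
  throw std::runtime_error(
      "Event elapsed_time requires SYCL compiler >= 2025.0.0");
#endif
}
```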
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139258
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/gujinghui
For the most part, the PyTorch MPS backend relies on MPSGraph to provide specific operations, but recently, more and more often, one had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel was added to the MPS backend in https://github.com/pytorch/pytorch/pull/82307).
Later on, as the number of operators grew, those kernels were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550).
But as the number of kernels keeps growing, it's time to take the next step and properly compile them into a `.metallib`.
This PR does exactly that by:
- Moving shader sources into separate .metal files
- Adding a check on whether full Xcode is installed or just the DeveloperTools CLI
- If full Xcode is installed, compiling and linking shaders into `.metallib` files for the Metal 3.0 (available on macOS 13) and Metal 3.1 (available on macOS 14, can use bfloat) standards, and bundling both using the `-sectcreate` linker option and the `getsectiondata` API call; a rough sketch of the lookup appears after this list. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. The logic for generating metallibs is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
- If only the DeveloperTools CLI is installed, automatically wrapping each `.metal` file into a `_metallib.h` header that contains the shader source wrapped in a `MetalShaderLibrary`
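For illustration only, a rough sketch of how a metallib embedded with `-sectcreate` can be located at runtime via `getsectiondata`; the segment/section names used here are hypothetical and this is not the PR's actual code:
```cpp
#include <dlfcn.h>
#include <mach-o/getsect.h>
#include <mach-o/loader.h>
#include <cstdint>

// Returns a pointer to the embedded metallib bytes (and their size), or nullptr.
// "__TEXT" / "metal_basic" are placeholder segment/section names.
static const uint8_t* embedded_metallib(unsigned long* size) {
  // Locate the Mach-O header of the image that contains this function.
  Dl_info info;
  if (dladdr(reinterpret_cast<const void*>(&embedded_metallib), &info) == 0) {
    return nullptr;
  }
  auto* header = static_cast<const struct mach_header_64*>(info.dli_fbase);
  // getsectiondata returns the section contents that -sectcreate bundled in.
  return getsectiondata(header, "__TEXT", "metal_basic", size);
}
```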
The bulk of the changes introduced in this PR are just moving code around: for every file in the `aten/src/ATen/native/mps/operators` folder that contains a non-templated shader definition, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder, and the embedded shader definition is replaced with the following:
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```
Some historical stats:
| PyTorch Version | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12 | 0 | |
| 1.13 | 2 | bitwise_ops and index.out |
| 2.0 | 4 | cross, repeat and view |
| 2.1 | 9 | unary_ops, histogram, renorm, binary_ops |
| 2.2 | 11 | gamma and bucketization |
| 2.3 | 12 | naive_matmul (to workaround crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |
Pros:
- Better code structure/readability
- Eventually allows one to use shared headers (and implement something like `TensorIterator`)
- Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels
Cons:
- Build process is a bit more complicated than it used to be
- Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
We did a lot of optimization work for PyTorch on Windows and made good progress, but some models still have a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, `ResNet50` is a typical case.
Let's focus on `ResNet50` and collect a profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
model_inference 3.91% 682.427ms 100.00% 17.448s 17.448s 1
aten::conv2d 0.18% 30.906ms 64.79% 11.305s 2.133ms 5300
aten::convolution 0.45% 78.031ms 64.62% 11.275s 2.127ms 5300
aten::_convolution 0.30% 51.670ms 64.17% 11.196s 2.113ms 5300
aten::mkldnn_convolution 63.58% 11.093s 63.87% 11.145s 2.103ms 5300
aten::batch_norm 0.13% 23.536ms 20.10% 3.506s 661.580us 5300
aten::_batch_norm_impl_index 0.28% 49.486ms 19.96% 3.483s 657.139us 5300
aten::native_batch_norm 19.26% 3.360s 19.64% 3.427s 646.615us 5300
aten::max_pool2d 0.01% 1.038ms 5.84% 1.018s 10.181ms 100
aten::max_pool2d_with_indices 5.83% 1.017s 5.83% 1.017s 10.171ms 100
aten::add_ 3.38% 588.907ms 3.38% 588.907ms 85.349us 6900
aten::relu_ 0.35% 60.358ms 1.67% 292.155ms 59.624us 4900
aten::clamp_min_ 1.33% 231.797ms 1.33% 231.797ms 47.306us 4900
aten::empty 0.46% 80.195ms 0.46% 80.195ms 1.513us 53000
aten::linear 0.01% 927.300us 0.23% 39.353ms 393.532us 100
aten::addmm 0.20% 35.379ms 0.21% 37.016ms 370.155us 100
aten::empty_like 0.12% 20.455ms 0.17% 29.976ms 5.656us 5300
aten::as_strided_ 0.11% 18.830ms 0.11% 18.830ms 3.553us 5300
aten::adaptive_avg_pool2d 0.00% 419.900us 0.08% 14.265ms 142.647us 100
aten::mean 0.01% 1.737ms 0.08% 13.845ms 138.448us 100
aten::sum 0.05% 8.113ms 0.05% 8.648ms 86.479us 100
aten::resize_ 0.03% 5.182ms 0.03% 5.182ms 0.978us 5300
aten::div_ 0.01% 1.445ms 0.02% 3.460ms 34.600us 100
aten::to 0.00% 337.000us 0.01% 2.015ms 20.154us 100
aten::_to_copy 0.01% 977.500us 0.01% 1.678ms 16.784us 100
aten::copy_ 0.01% 1.474ms 0.01% 1.474ms 7.371us 200
aten::t 0.00% 775.900us 0.01% 1.410ms 14.104us 100
aten::flatten 0.00% 420.900us 0.01% 1.311ms 13.106us 100
aten::view 0.01% 889.700us 0.01% 889.700us 8.897us 100
aten::transpose 0.00% 410.700us 0.00% 634.500us 6.345us 100
aten::expand 0.00% 496.800us 0.00% 566.800us 5.668us 100
aten::fill_ 0.00% 534.800us 0.00% 534.800us 5.348us 100
aten::as_strided 0.00% 293.800us 0.00% 293.800us 1.469us 200
aten::empty_strided 0.00% 241.700us 0.00% 241.700us 2.417us 100
aten::resolve_conj 0.00% 54.800us 0.00% 54.800us 0.274us 200
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 17.448s
Execution time: 20.02380895614624
```
We found that the kernel consuming the most CPU time is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
Actually, we had already optimized memory allocation by integrating mimalloc into the PyTorch c10 module. That helps PyTorch on Windows a lot, but it does not cover `MKL` and `MKLDNN`'s intermediate temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing `MKL` and `MKLDNN`'s intermediate temporary memory.
So I discussed this with the Intel MKL team and got a way to register a high-performance memory allocation API with MKL, which helps MKL boost its memory performance. Please check the online documentation: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html
This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add a CMake option, `USE_MIMALLOC_ON_MKL`; it is a sub-option of `USE_MIMALLOC`.
2. Wrap and export mi_malloc APIs in c10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add `MklAllocationHelp.cpp` to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON` (see the sketch below).
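For reference, a minimal sketch of the registration idea (not the PR's actual `MklAllocationHelp.cpp`): oneMKL's documented memory-function redefinition works by reassigning the `i_malloc`/`i_calloc`/`i_realloc`/`i_free` pointers from `<i_malloc.h>` before the first MKL call, and mimalloc's `mi_*` functions have matching signatures.
```cpp
#include <i_malloc.h>   // oneMKL's redefinable allocation function pointers
#include <mimalloc.h>   // mi_malloc / mi_calloc / mi_realloc / mi_free

// Point MKL's allocation hooks at mimalloc. This must happen before the first
// MKL call, so it is triggered from a static initializer here.
static int register_mimalloc_with_mkl() {
  i_malloc  = mi_malloc;
  i_calloc  = mi_calloc;
  i_realloc = mi_realloc;
  i_free    = mi_free;
  return 0;
}

static const int mkl_allocator_registered = register_mimalloc_with_mkl();
```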
For `oneDNN`, this is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation.
Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.
This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
As FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12, replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive).
This makes PyTorch buildable with the python3 binary shipped with Xcode on macOS.
TODO: Get rid of `FindNumpy`, as it's part of the Python package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever, https://github.com/Skylion007
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the first runtime component we would like to upstream is `Device`, which contains the device management functions of the Intel GPU runtime. To facilitate the code review, we split the code changes into 4 PRs. This is one of the 4 PRs and covers the changes under `c10`.
# Design
An Intel GPU device is a wrapper around a SYCL device on which kernels can be executed. In our design, we will maintain a SYCL device pool containing all the GPU devices of the current machine and let PyTorch manage the status of the device pool. Thread safety is considered in this design. The corresponding C++ files related to `Device` will be placed in the `c10/xpu` folder, and we provide c10 device runtime APIs such as the following (a minimal usage sketch appears after the list):
- `c10::xpu::device_count`
- `c10::xpu::set_device`
- ...
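A minimal usage sketch of the proposed API surface; the header path and exact signatures here are assumptions based on this description, not final code.
```cpp
#include <c10/xpu/XPUFunctions.h>
#include <iostream>

int main() {
  // Number of Intel GPU devices found in the SYCL device pool.
  const auto count = c10::xpu::device_count();
  std::cout << "XPU device count: " << static_cast<int>(count) << std::endl;
  if (count > 0) {
    // Make device 0 the current device for the calling thread.
    c10::xpu::set_device(0);
  }
  return 0;
}
```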
# Additional Context
In our plan, 4 PRs should be submitted to PyTorch for `Device`:
1. for c10
2. for aten
3. for python frontend
4. for lazy initialization shared with CUDA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116019
Approved by: https://github.com/gujinghui, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet
Note about the Updates:
This PR:
1. skips more flash-attention-related UTs on MI200
2. fixes additional ATen compilation errors after hipification
3. fixes the author "root" of a specific commit
4. includes the patch from Nikita in favor of block-level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force it to run on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped with the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, and 128.
- Performance is still being optimized.
Fixes #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped with the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- [ ] Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- [ ] Only supports power-of-two sequence lengths.
- [ ] No support for varlen APIs.
- [ ] Only supports head dimensions 16, 32, 64, and 128.
- [ ] Performance is still being optimized.
Fixes https://github.com/pytorch/pytorch/issues/112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114309
Approved by: https://github.com/jeffdaily, https://github.com/malfet
---------
Co-authored-by: Joseph Groenenboom <joseph.groenenboom@amd.com>
Related to #103973, #110532, #108404, #94891
**Context:**
As commented in 6ae0554d11/cmake/Dependencies.cmake (L1198), kernel asserts are enabled by default for CUDA and disabled for ROCm.
However, this was somewhat broken, and kernel asserts were still enabled for ROCm.
Disabling kernel asserts is also needed for users who do not have PCIe atomics support. These community users have verified that disabling kernel asserts on the PyTorch/ROCm platform fixed their PyTorch workflows, such as torch.sum scripts and stable-diffusion (see the related issues).
**Changes:**
This pull request serves the following purposes:
* Refactor and clean up the logic, making it simpler for ROCm to enable and disable kernel asserts
* Fix the bug that kernel asserts for ROCm were not disabled by default.
Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
(1) This variable only applies to ROCm.
(2) The new name is more aligned with the `#define CUDA_KERNEL_ASSERT` macro.
(3) With `USE_` in front of the name, we can easily control it with an environment variable to turn this feature on and off during the build (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` will enable kernel asserts for a ROCm build).
- Got rid of `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain
- Added `#cmakedefine` to carry over the CMake variable to C++ (a simplified sketch follows)
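An illustrative sketch of the `#cmakedefine` carry-over and how the assert macro can key off it; the header path and the exact macro expansion are simplifications, not the PR's exact code.
```cpp
// In a CMake-configured header template (e.g. cmake_macros.h.in), the option is
// carried over to C++ by the configure step:
//   #cmakedefine C10_USE_ROCM_KERNEL_ASSERT

// Consumers can then gate the device-side assert on the generated macro:
#include <cassert>

#if defined(USE_ROCM) && !defined(C10_USE_ROCM_KERNEL_ASSERT)
// ROCm build with asserts disabled (the default): the assert expands to nothing.
#define CUDA_KERNEL_ASSERT(cond)
#else
// CUDA builds, or ROCm builds configured with USE_ROCM_KERNEL_ASSERT=1.
#define CUDA_KERNEL_ASSERT(cond) assert(cond)
#endif
```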
**Tests:**
(1) Build in the default mode and verify that `USE_ROCM_KERNEL_ASSERT` is OFF (0) and kernel asserts are disabled:
```
python setup.py develop
```
Verify CMakeCache.txt has correct value.
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
Tested the following code in both a ROCm build and a CUDA build, expecting different return codes.
```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This piece of code is adapted from the unit test below, to get around the fact that this unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)
```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```
Ran the following script, expecting `r == 0` since `CUDA_KERNEL_ASSERT` is defined as nothing:
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```
(2) Enable kernel asserts by building with `USE_ROCM_KERNEL_ASSERT=1` or `USE_ROCM_KERNEL_ASSERT=ON`:
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```
Verify `USE_ROCM_KERNEL_ASSERT` is `1`
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```
Run the assert test, expecting a return code not equal to 0.
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
>>> r
-6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
Summary:
This stack of PRs integrates cuSPARSELt into PyTorch.
This PR adds support for cuSPARSELt into the build process.
It adds a new flag, `USE_CUSPARSELT`, that defaults to false.
When `USE_CUSPARSELT=1` is specified, the user can also specify
`CUSPARSELT_ROOT`, which defines the path to the library.
Compiling PyTorch with cuSPARSELt support can be done as follows:
```
USE_CUSPARSELT=1 \
CUSPARSELT_ROOT=/path/to/cusparselt \
python setup.py develop
```
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
remove unused CAFFE2_VERSION macros
Summary:
Nothing reads these and they are completely subsumed by TORCH_VERSION.
Getting rid of these will be helpful for build unification, since they
are also not used internally.
Test Plan: Rely on CI.
Reviewers: sahanp
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97337
Approved by: https://github.com/malfet