## Update using Cutlass 3.x (2025/06/15)
Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS)
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91
This uses the same SM100 template. The only differences are:
- Cluster size is fixed to `<1,1,1>` since sm120 does not have the multicast feature
- ~~Tile size is fixed to `<128,128,128>` because the default kernel schedule does not support `<64,128,128>`. I will work a bit on improving perf for small M.~~ Fixed: use `KernelTmaWarpSpecializedPingpong` when `TileShape.M == 64`
Perf for small M is still bad since Cutlass does not seem to support `TileShape.M < 64` for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.
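For reference, here is roughly how those knobs map onto the Cutlass 3.x `CollectiveBuilder` (a minimal sketch, not the actual RowwiseScaledMM wiring; the arch tag, layouts, alignments and the missing epilogue/scaling pieces are assumptions on my side):
```cpp
#include <cutlass/gemm/collective/collective_builder.hpp>

// Tile/cluster shapes discussed above: the cluster is pinned to 1x1x1 on sm120
// (no multicast), and the pingpong schedule is selected when TileShape.M == 64.
using TileShape    = cute::Shape<cute::_64, cute::_128, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm120, cutlass::arch::OpClassTensorOp,     // arch tag assumed here
    cutlass::float_e4m3_t, cutlass::layout::RowMajor, 16,     // A: fp8 e4m3, 16-element aligned
    cutlass::float_e4m3_t, cutlass::layout::ColumnMajor, 16,  // B
    float,                                                    // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::KernelTmaWarpSpecializedPingpong           // schedule used for M=64 tiles
>::CollectiveOp;
```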
## Original using SM89
I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).
Simple benchmark script
```python
import torch
from torch._inductor.utils import do_bench_using_profiling
N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()
    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)
    # do_bench_using_profiling() reports latency in milliseconds
    latency_ms = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_ms / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```
M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73
64 | 4096 | 4096 | 71.84
256 | 4096 | 4096 | 86.40
1024 | 4096 | 4096 | 112.12
4096 | 4096 | 4096 | 121.24
According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS, so these results are quite bad...
However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`, I get:
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13
64|4096|4096|84.84
256|4096|4096|96.75
1024|4096|4096|110.21
4096|4096|4096|122.98
Small M improves slightly, but large M is still bad.
If I further change to `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results:
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73
This is much closer to the hardware limit, and it means this kernel is sufficient to get most of the perf out of sm120; it only needs better-tuned configs.
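For context, the knobs being swapped above are the usual Cutlass 2.x-style template parameters. Here is a sketch of just the shape/swizzle types (the full `GemmUniversal` instantiation in the sm89 row-wise scaled-mm kernel has many more parameters, and the FP8 `InstructionShape` below is my assumption):
```cpp
#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/threadblock/threadblock_swizzle.h>

// The faster large-M config tried above (borrowed from cutlass example 58).
using ThreadBlockShape = cutlass::gemm::GemmShape<128, 64, 128>;
using WarpShape        = cutlass::gemm::GemmShape<64, 32, 128>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 32>;  // sm89 fp8 mma shape (assumed)
using ThreadblockSwizzle =
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>;
constexpr int NumStages = 3;
```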
To check whether this high perf is obtainable only with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also tried `ThreadblockSwizzleStreamK` with the same `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`
M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86
A bit better than the current configs, but still very far from the hardware limit.
@alexsamardzic I noticed you chose these configs in #149978. Do you have any numbers on how the current configs perform on sm89?
Update: using the Triton kernel codegen-ed by Inductor via `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`
M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53
Better than the default configs, but still far from the config above in the compute-bound regime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
Fix
```
CMake Warning (dev) in cmake/Codegen.cmake:
A logical block opening on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:393 (if)
closes on the line
/var/lib/jenkins/workspace/cmake/Codegen.cmake:401 (endif)
with mis-matching arguments.
```
by removing the condition in `endif`.
We could instead make the `endif` condition match, but that is not best practice: cmake_lint warns about it, and CMake says
```
The optional <condition> argument is supported for backward compatibility only.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153023
Approved by: https://github.com/aditew01, https://github.com/Skylion007
Enabled bf16 grouped gemm with an API similar to `_scaled_group_gemm`, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from the scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. I factored out only a helper kernel that prepares arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
1. Add config selection for SM89.
2. Only build kernels if compiling for the given arch.
3. Factor out the CMake code that enforces compiling individual files for the needed archs into a function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149978
Approved by: https://github.com/drisspg
**Issue:**
* The ldaddal instruction is an AArch64 atomic operation available from ARMv8.1-A onwards.
* Raspberry Pi 4 (Cortex-A72) is ARMv8-A, which does not support ldaddal, leading to failures when running PyTorch built with `-march=armv8.2-a+sve` (see the sketch after this list).
* This caused crashes when running PyTorch on ARMv8-A (Raspberry Pi 4), as unsupported atomic operations were generated.
**Fix:**
* Updated the build flags to explicitly use **-march=armv8-a+sve**, ensuring GCC and Clang handle it correctly; this resolves the compatibility issue on ARMv8-A while SVE keeps working as before.
* This ensures that PyTorch builds correctly for ARMv8-A platforms (e.g., Raspberry Pi 4) while still enabling SVE for supported hardware.
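As an illustration of the failure mode (a hedged sketch, not code from this PR): with the LSE atomics implied by `-march=armv8.2-a`, compilers lower a plain `fetch_add` to a single `ldaddal`, which faults on a Cortex-A72; with `-march=armv8-a` the same code becomes an `ldxr`/`stxr` retry loop that ARMv8.0-A can execute.
```cpp
#include <atomic>

std::atomic<int> counter{0};

// Built with -march=armv8.2-a this compiles to one ldaddal (LSE);
// built with -march=armv8-a it falls back to an ldxr/stxr loop.
int bump() {
  return counter.fetch_add(1, std::memory_order_acq_rel);
}
```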
Test plan:
- Allocate `a1.4xlarge` on AWS
- Run following script using wheel produced by this PR
```python
import torch
def f(x):
    return x.sin() + x.cos()
print(torch.__version__)
f_c = torch.jit.script(f)
```
- Observe no crash
```
$ python3 foo.py
2.7.0.dev20250313+cpu
```
- Observe crash with 2.6.0
```
$ python3 foo.py
2.6.0+cpu
Illegal instruction (core dumped)
```
Fixes #146792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148070
Approved by: https://github.com/malfet
This PR provides an initial cutlass implementation of the grouped gemm API as described in this [document](https://docs.google.com/document/d/1985La6wUUVH1AGBkNhaGKUXzx-9ybtbUp567-vYVOM4/edit?tab=t.0#heading=h.g8lzbjnyzzx9). Any combination of 2d and 3d inputs is supported, with the 2d input being jagged and the offsets of the jagged input given by the device tensor `offs`. Only H100 is supported, and only fp8_e4m3 with bf16 output and rowwise scaling. All the dimensions of each individual gemm have to be a multiple of 16; that's a cutlass limitation.
I'll need to add those checks; for dynamic dimensions, unfortunately, the checks will have to be a device assert.
I had to copy-paste cutlass's `Sm90RowBroadcast` and `Sm90ColBroadcast` structs with minor changes to enable scales given as pointer arrays; ideally those should be part of cutlass itself.
I copied the schedules from the similar grouped gemm in FBGEMM, but there's a lot of room to improve perf, especially for `fast_accum=False`.
Next steps would be perf tuning and increasing coverage to B100; I don't know how the cutlass grouped gemm example handles blockwise scaling on B100.
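For clarity on the semantics only (not the CUTLASS path), here is a plain reference loop for the jagged-2d A times 3d B case, assuming `offs` holds the cumulative end row of each group and B is laid out as `[G, K, N]`; names and layout here are illustrative, and scaling/fp8 conversion are omitted:
```cpp
#include <cstdint>

// out[m, n] = sum_k A[m, k] * B[g, k, n], where the rows of A belonging to
// group g are A[offs[g-1] : offs[g], :] (offs[-1] taken as 0).
void grouped_mm_ref(const float* A, const float* B, float* out,
                    const int64_t* offs, int64_t G, int64_t K, int64_t N) {
  int64_t m_start = 0;
  for (int64_t g = 0; g < G; ++g) {
    int64_t m_end = offs[g];  // cumulative group boundary along M
    for (int64_t m = m_start; m < m_end; ++m) {
      for (int64_t n = 0; n < N; ++n) {
        float acc = 0.f;
        for (int64_t k = 0; k < K; ++k) {
          acc += A[m * K + k] * B[(g * K + k) * N + n];
        }
        out[m * N + n] = acc;
      }
    }
    m_start = m_end;
  }
}
```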
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148531
Approved by: https://github.com/drisspg
## Summary
Update cmake files and RowwiseScaledMM.cu to build on SM10.0a arch.
**NOTE**: performance optimization will be done in separate follow up PRs
## Steps to verify build
1. Access devgpu/machine with B200 GPUs, verify B200s are visible w/ `nvidia-smi`
2. Install CUDA toolkit 12.8
- e.g. see [Nvidia docs](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Rocky&target_version=9&target_type=rpm_local)
3. Verify CUDA toolkit installation
- e.g. `nvcc --version` should have `... Cuda compilation tools, release 12.8 ... ` in output
4. Set env var `TORCH_CUDA_ARCH_LIST=10.0a`
5. Build pytorch from source with this PR ([steps](https://github.com/pytorch/pytorch#from-source))
6. Uninstall `pytorch-triton` with `pip uninstall pytorch-triton`
7. Build and install triton from source: https://github.com/triton-lang/triton?tab=readme-ov-file#install-from-source
8. Run tests shown in test plan below
**NOTE**: performance optimization will be done in a separate PR. The goal of this PR is just to ensure it builds correctly.
## Test plan
- `python test/distributed/tensor/test_matrix_ops.py -k scaled_mm`: OK
- `python test/test_matmul_cuda.py -k rowwise`: OK
- `python test/test_flop_counter.py -k scaled_mm`: OK
- `python test/inductor/test_aot_inductor.py -k fp8`: OK
- `python test/inductor/test_fp8.py`: OK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148274
Approved by: https://github.com/drisspg
- Use `__gnu_parallel::sort` for GCC compilations
- Add a parallelized version of `std::sort` and `std::stable_sort` for non-GCC compilations
Using `__gnu_parallel::sort` provides a ~3.7x speedup for length-50000 sorts with NUM_THREADS=16 and NUM_THREADS=4 on aarch64.
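For reference, the GCC path is the libstdc++ parallel-mode drop-in replacement for `std::sort` (a minimal sketch; it requires building with `-fopenmp`, and the thread count is controlled by OpenMP):
```cpp
#include <parallel/algorithm>  // libstdc++ parallel mode (GCC only)
#include <random>
#include <vector>

int main() {
  std::vector<float> v(50000);
  std::mt19937 gen(0);
  std::uniform_real_distribution<float> dist(-1.f, 1.f);
  for (auto& x : v) x = dist(gen);

  // Drop-in replacement for std::sort; splits the range across OpenMP threads.
  __gnu_parallel::sort(v.begin(), v.end());
  return 0;
}
```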
The performance is measured using the following script:
```python
import torch
import torch.autograd.profiler as profiler
torch.manual_seed(0)
N = 50000
x = torch.randn(N, dtype=torch.float)
with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)
print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142391
Approved by: https://github.com/malfet
Using the same `tools/generate_torch_version.py` script.
It's already available at the Python level, but not at the C++ one.
Please note that updating the commit hash will force recompilation of fewer than 10 files, according to
```
% touch caffe2/core/macros.h; ninja -d explain -j1 -v -n torch_python
ninja explain: output caffe2/torch/CMakeFiles/gen_torch_version doesn't exist
ninja explain: caffe2/torch/CMakeFiles/gen_torch_version is dirty
ninja explain: /Users/malfet/git/pytorch/pytorch/torch/version.py is dirty
ninja explain: output third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl of phony edge with no inputs doesn't exist
ninja explain: third_party/kineto/libkineto/CMakeFiles/libkineto_defs.bzl is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546390618881 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Version.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546233600752 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/core/common.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546651089243 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/inline_container.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546224176845 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/serialize/file_adapter.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301546464535054 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/utils/threadpool/ThreadPool.cc.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301550062608920 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/runtime/static/impl.cpp.o is dirty
ninja explain: output caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o older than most recent input /Users/malfet/git/pytorch/pytorch/build/caffe2/core/macros.h (1732301547538843492 vs 1732301802196214000)
ninja explain: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/mps/MPSFallback.mm.o is dirty
```
Differential Revision: [D66468257](https://our.internmc.facebook.com/intern/diff/D66468257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141313
Approved by: https://github.com/ezyang
[Intel GPU] Support RegisterXPU.cpp codegen and compile for the in-tree XPU structured GEMM ops.
Motivation: there are two parts of ATen ops for XPU: one is in-tree ops, like the GEMM-related ops, and the other is out-of-tree ops in torch-xpu-ops. For the in-tree part, since PyTorch uses native_functions.yaml registration and is equipped with convenient codegen capabilities, we want to take advantage of these benefits as well.
At the same time, since AOT Inductor also uses native_functions.yaml to generate C shim wrappers, we also need to enable this mechanism for XPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139025
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
As FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12
Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive)
This makes PyTorch buildable with the python3 binary shipped with XCode on MacOS
TODO: Get rid of `FindNumpy` as it's part of the Python package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever, https://github.com/Skylion007
As this is the oldest gcc that is fully compatible with the C++17 standard.
- Replace a number of conditional version checks with the simpler `if(CMAKE_COMPILER_IS_GNUCXX)` or `append_cxx_flag_if_supported`.
- As the `-Wsuggest-override` condition was hidden behind an incorrect guard, add missing `override` keywords to `torch::autograd::PyFunctionTensorPostAccGradHooks::apply_with_saved`, `caffe2::python::TensorFeeder::Feed` and `caffe2::NetObserverReporterPrint::report`
Fixes https://github.com/pytorch/pytorch/issues/101839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112858
Approved by: https://github.com/Skylion007, https://github.com/albanD
`libshm.so` depends on the torch library exclusively for `at::RefcountedMapAllocator`,
so it makes sense to move it to c10 along with the other memory allocators.
This means `libshm.so` only depends on `c10` and we don't need to relink
`libshm.so` for every ATen change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109881
Approved by: https://github.com/albanD
This patch is part of the half-float performance optimization on CPU:
* add a specialization for dtype `Half` in `Vectorized<>` under both avx256 and avx512.
* add a specialization for dtype `Half` in the functional utils, e.g. `vec::map_reduce<>()`, which use float32 as the accumulation type.
Also add a helper struct `vec_hold_type<scalar_t>`, since `Vectorized<Half>::value_type` points to its underlying storage type, which is `uint16_t`, leading to errors if the kernel uses `Vec::value_type`.
Half uses the same logic as BFloat16 in `Vectorized<>`: each Half vector is mapped to 2x float vectors for computation.
Note that this patch modifies the cmake files by adding **-mf16c** to the AVX2 build; from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html we can see that all hardware platforms that support **avx2** already have **f16c**.
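To make the f16c point concrete, this is the kind of widen-compute-narrow pattern the Half path relies on, written with raw F16C/AVX intrinsics rather than the `Vectorized<>` wrappers (a sketch only; the actual ATen code goes through `Vectorized<Half>`):
```cpp
#include <immintrin.h>

// Adds 1.0 to 8 fp16 values: widen to fp32 with F16C, compute in fp32,
// narrow back to fp16. Requires compiling with -mavx2 -mf16c.
void add_one_fp16(const unsigned short* in, unsigned short* out) {
  __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));  // 8 x fp16
  __m256  f = _mm256_cvtph_ps(h);                                     // fp16 -> fp32
  f = _mm256_add_ps(f, _mm256_set1_ps(1.0f));                         // math in fp32
  __m128i r = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT);          // fp32 -> fp16
  _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
}
```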
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076
Approved by: https://github.com/malfet
To fix #78540 I committed #78983, which was reverted due to an internal CI failure. Then I committed #79215, which only fixed the failure but didn't have the full feature of #78983. This PR is another try.
This PR adds a script to dump all operators from the test models and automatically write them into `lightweight_dispatch_ops.yaml`, so we don't have to manually update the yaml file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80791
Approved by: https://github.com/raziel
This PR introduces selective build to the lightweight dispatch CI job. By doing so we can't run the `test_lite_interpreter_runtime` test suite anymore because it requires some other operators.
From now on, if we are adding a new unit test in `test_codegen_unboxing`, we will have to export the operators for the unit test model and add them into `lightweight_dispatch_ops.yaml`. This can be automated by introducing tracing-based selective build, but that's for the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78983
Approved by: https://github.com/kit1980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75605
Use case: Milan models have multiple backends and need to use static dispatch to save on static initialization time and to hit native functions directly from the unboxed APIs.
This change passes in `List[BackendIndex]` and adds the ability to generate code for multiple static backends with 1 or 0 kernels.
ghstack-source-id: 154525738
(Note: this ignores all push blocking failures!)
Test Plan:
Builds lite_predictor_flatbuffer with multiple backends
```
buck build --config pt.enable_lightweight_dispatch=1 --config pt.static_dispatch_backend=CPU,QuantizedCPU,CompositeExplicitAutograd //xplat/caffe2/fb/lite_predictor:lite_predictor_flatbuffer
```
Reviewed By: larryliu0820
Differential Revision: D35510644
fbshipit-source-id: f985718ad066f8578b006b4759c4a3bd6caac176
(cherry picked from commit a6999729c8cc26c54b8d5684f6585d6c50d8d913)
Summary:
RFC: https://github.com/pytorch/rfcs/pull/40
This PR (re)introduces python codegen for unboxing wrappers. Given an entry of `native_functions.yaml` the codegen should be able to generate the corresponding C++ code to convert ivalues from the stack to their proper types. To trigger the codegen, run
```
tools/jit/gen_unboxing.py -d cg/torch/share/ATen
```
Merged changes on CI test. In https://github.com/pytorch/pytorch/issues/71782 I added an e2e test for static dispatch + codegen unboxing. The test exports a mobile model of mobilenetv2, then loads and runs it on a new binary for the lite interpreter: `test/mobile/custom_build/lite_predictor.cpp`.
## Lite predictor build specifics
1. Codegen: `gen.py` generates `RegisterCPU.cpp` and `RegisterSchema.cpp`. Now with this PR, once `static_dispatch` mode is enabled, `gen.py` will not generate `TORCH_LIBRARY` API calls in those cpp files, hence avoids interaction with the dispatcher. Once `USE_LIGHTWEIGHT_DISPATCH` is turned on, `cmake/Codegen.cmake` calls `gen_unboxing.py` which generates `UnboxingFunctions.h`, `UnboxingFunctions_[0-4].cpp` and `RegisterCodegenUnboxedKernels_[0-4].cpp`.
2. Build: `USE_LIGHTWEIGHT_DISPATCH` adds generated sources into `all_cpu_cpp` in `aten/src/ATen/CMakeLists.txt`. All other files remain unchanged. In reality all the `Operators_[0-4].cpp` are not necessary, but we can rely on the linker to strip them out.
## Current CI job test coverage update
Created a new CI job `linux-xenial-py3-clang5-mobile-lightweight-dispatch-build` that enables the following build options:
* `USE_LIGHTWEIGHT_DISPATCH=1`
* `BUILD_LITE_INTERPRETER=1`
* `STATIC_DISPATCH_BACKEND=CPU`
This job triggers `test/mobile/lightweight_dispatch/build.sh` and builds `libtorch`. Then the script runs the C++ tests written in `test_lightweight_dispatch.cpp` and `test_codegen_unboxing.cpp`. Recent commits added tests to cover as many C++ argument types as possible: in `build.sh` we install the PyTorch Python API so that we can export test models in `tests_setup.py`. Then we run the C++ test binary to run these models on the lightweight-dispatch-enabled runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69881
Reviewed By: iseeyuan
Differential Revision: D33692299
Pulled By: larryliu0820
fbshipit-source-id: 211e59f2364100703359b4a3d2ab48ca5155a023
(cherry picked from commit 58e1c9a25e3d1b5b656282cf3ac2f548d98d530b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65851
Design doc: https://docs.google.com/document/d/12rtlHnPUpaJ-I52Iob3L0WA3rKRr_OY7fXqeCvn2MVY/edit
First read the design doc to understand the user syntax. In this PR, we have converted add to use ufunc codegen; most of the cpp changes delete the preexisting implementations of add, and ufunc/add.h contains the new implementations in the ufunc format.
The bulk of this PR is in the new codegen machinery. Here's the order to read the files:
* `tools/codegen/model.py`
* Some self-explanatory utility classes: `ScalarType`, `DTYPE_CLASSES`
* New classes for representing ufunc entries in `native_functions.yaml`: `UfuncKey` and `UfuncInnerLoop`, as well as parsing logic for these entries. UfuncKey has some unusual entries (e.g., CPUScalar) that don't show up in the documentation (more on these below).
* A predicate `is_ufunc_dispatch_key` for testing which dispatch keys should get automatically generated when an operator opts into ufuncs (CPU and CUDA, for now!)
* `tools/codegen/api/types.py`
* More self-explanatory utility stuff: ScalarTypeToCppMapping mapping ScalarType to CppTypes; Binding.rename for changing the name of a binding (used when we assign constructor variables to member variables inside CUDA functors)
* New VectorizedCType, representing `at::vec::Vectorized<T>`. This is used inside vectorized CPU codegen.
* New `scalar_t` and `opmath_t` BaseCppTypes, representing template parameters that we work with when doing codegen inside ufunc kernel loops (e.g., where you previously had Tensor, now you have `scalar_t`)
* `StructuredImplSignature` represents a `TORCH_IMPL_FUNC` definition, and straightforwardly follows from preexisting `tools.codegen.api.structured`
* `tools/codegen/translate.py` - Yes, we use translate a LOT in this PR. I improved some of the documentation, the only substantive changes are adding two new conversions: given a `scalar_t` or a `const Scalar&`, make it convertible to an `opmath_t`
* `tools/codegen/api/ufunc.py`
* OK, now we're at the meaty stuff. This file represents the calling conventions of three important concepts in ufunc codegen, which we'll describe shortly. All of these APIs are relatively simple, since there aren't any complicated types by the time you get to kernels.
* stubs are the DispatchStub trampolines that CPU kernels use to get to their vectorized versions. They drop all Tensor arguments (as they are in TensorIterator) but otherwise match the structured calling convention
* ufuncs are the inner loop template functions that you wrote in ufunc/add.h which do the actual computation in question. Here, all the Tensors and Scalars have been converted into the computation type (`opmath_t` in CUDA, `scalar_t` in CPU)
* ufunctors are a CUDA-only concept representing functors that take some of their arguments on a host-side constructor, and the rest in the device-side apply. Once again, Tensors and Scalars are converted into the computation type, `opmath_t`, but for clarity all the functions take `scalar_t` as argument (as this is the type that is most salient at the call site). Because the constructor and apply are code generated separately, `ufunctor_arguments` returns a teeny struct `UfunctorBindings`
* `tools/codegen/dest/ufunc.py` - the workhorse. This gets its own section below.
* `tools/codegen/gen.py` - just calling out to the new dest.ufunc implementation to generate UfuncCPU_add.cpp, UFuncCPUKernel_add.cpp and UfuncCUDA_add.cu files per ufunc operator. Each of these files does what you expect (small file that registers kernel and calls stub; CPU implementation; CUDA implementation). There is a new file manager for UFuncCPUKernel files as these need to get replicated by cmake for vectorization. One little trick to avoid recompilation is we directly replicate code generated forward declarations in these files, to reduce the number of headers we depend on (this is codegen, we're just doing the preprocessor's job!)
* I'll talk about build system adjustments below.
OK, let's talk about tools/codegen/dest/ufunc.py. This file can be roughly understood in two halves: one for CPU code generation, and the other for CUDA code generation.
**CPU codegen.** Here's roughly what we want to generate:
```
// in UfuncCPU_add.cpp
using add_fn = void (*)(TensorIteratorBase&, const at::Scalar&);
DECLARE_DISPATCH(add_fn, add_stub);
DEFINE_DISPATCH(add_stub);
TORCH_IMPL_FUNC(ufunc_add_CPU)
(const at::Tensor& self, const at::Tensor& other, const at::Scalar& alpha, const at::Tensor& out) {
add_stub(device_type(), *this, alpha);
}
// in UfuncCPUKernel_add.cpp
void add_kernel(TensorIteratorBase& iter, const at::Scalar& alpha) {
at::ScalarType st = iter.common_dtype();
RECORD_KERNEL_FUNCTION_DTYPE("add_stub", st);
switch (st) {
AT_PRIVATE_CASE_TYPE("add_stub", at::ScalarType::Bool, bool, [&]() {
auto _s_alpha = alpha.to<scalar_t>();
cpu_kernel(iter, [=](scalar_t self, scalar_t other) {
return ufunc::add(self, other, _s_alpha);
});
})
AT_PRIVATE_CASE_TYPE(
"add_stub", at::ScalarType::ComplexFloat, c10::complex<float>, [&]() {
auto _s_alpha = alpha.to<scalar_t>();
auto _v_alpha = at::vec::Vectorized<scalar_t>(_s_alpha);
cpu_kernel_vec(
iter,
[=](scalar_t self, scalar_t other) {
return ufunc::add(self, other, _s_alpha);
},
[=](at::vec::Vectorized<scalar_t> self,
at::vec::Vectorized<scalar_t> other) {
return ufunc::add(self, other, _v_alpha);
});
})
...
```
The most interesting change about the generated code is that what previously was an `AT_DISPATCH` macro invocation is now an unrolled loop. This makes it easier to vary behavior per-dtype (you can see in this example that the entry for bool and float differ) without having to add extra conditionals on top.
Otherwise, to generate this code, we have to hop through several successive API changes:
* In TORCH_IMPL_FUNC(ufunc_add_CPU), go from StructuredImplSignature to StubSignature (call the stub). This is normal argument massaging in the classic translate style.
* In add_kernel, go from StubSignature to UfuncSignature. This is nontrivial, because we must do various conversions outside of the inner kernel loop. These conversions are done by hand, setting up the context appropriately, and then the final ufunc call is done using translate. (BTW, I introduce a new convention here, call on a Signature, for code generating a C++ call, and I think we should try to use this convention elsewhere)
The other piece of nontrivial logic is the reindexing by dtype. This reindexing exists because the native_functions.yaml format is indexed by UfuncKey:
```
Generic: add (AllAndComplex, BFloat16, Half)
ScalarOnly: add (Bool)
```
but when we do code generation, we case on dtype first, and then we generate a `cpu_kernel` or `cpu_kernel_vec` call. We also don't care about CUDA code generation (which Generic hits). To do this, we lower these keys into two low-level keys, CPUScalar and CPUVector, which represent the CPU scalar and CPU vectorized ufuncs, respectively (Generic maps to CPUScalar and CPUVector, while ScalarOnly maps to CPUScalar only). Reindexing then gives us:
```
AllAndComplex:
CPUScalar: add
CPUVector: add
Bool:
CPUScalar: add
...
```
which is a good format for code generation, but too wordy to force native_functions.yaml authors to write. Note that when reindexing, it is possible for there to be a conflicting definition for the same dtype; we just define a precedence order and have one override the other, so that it is easy to specialize on a particular dtype if necessary. Also note that because CPUScalar/CPUVector are part of UfuncKey, technically you can manually specify them in native_functions.yaml, although I don't expect this functionality to be used.
**CUDA codegen.** CUDA code generation has many of the same ideas as CPU codegen, but it needs to know about functors, and stubs are handled slightly differently. Here is what we want to generate:
```
template <typename scalar_t>
struct CUDAFunctorOnSelf_add {
using opmath_t = at::opmath_type<scalar_t>;
opmath_t other_;
opmath_t alpha_;
CUDAFunctorOnSelf_add(opmath_t other, opmath_t alpha)
: other_(other), alpha_(alpha) {}
__device__ scalar_t operator()(scalar_t self) {
return ufunc::add(static_cast<opmath_t>(self), other_, alpha_);
}
};
... two more functors ...
void add_kernel(TensorIteratorBase& iter, const at::Scalar & alpha) {
TensorIteratorBase& iter = *this;
at::ScalarType st = iter.common_dtype();
RECORD_KERNEL_FUNCTION_DTYPE("ufunc_add_CUDA", st);
switch (st) {
AT_PRIVATE_CASE_TYPE("ufunc_add_CUDA", at::ScalarType::Bool, bool, [&]() {
using opmath_t = at::opmath_type<scalar_t>;
if (false) {
} else if (iter.is_cpu_scalar(1)) {
CUDAFunctorOnOther_add<scalar_t> ufunctor(
iter.scalar_value<opmath_t>(1), (alpha).to<opmath_t>());
iter.remove_operand(1);
gpu_kernel(iter, ufunctor);
} else if (iter.is_cpu_scalar(2)) {
CUDAFunctorOnSelf_add<scalar_t> ufunctor(
iter.scalar_value<opmath_t>(2), (alpha).to<opmath_t>());
iter.remove_operand(2);
gpu_kernel(iter, ufunctor);
} else {
gpu_kernel(iter, CUDAFunctor_add<scalar_t>((alpha).to<opmath_t>()));
}
})
...
REGISTER_DISPATCH(add_stub, &add_kernel);
TORCH_IMPL_FUNC(ufunc_add_CUDA)
(const at::Tensor& self,
const at::Tensor& other,
const at::Scalar& alpha,
const at::Tensor& out) {
add_kernel(*this, alpha);
}
```
The functor business is the bulk of the complexity. Like CPU, we decompose CUDA implementation into three low-level keys: CUDAFunctor (normal, all CUDA kernels will have this), and CUDAFunctorOnOther/CUDAFunctorOnScalar (these are to support Tensor-Scalar specializations when the Scalar lives on CPU). Both Generic and ScalarOnly provide ufuncs for CUDAFunctor, but for us to also lift these into Tensor-Scalar specializations, the operator itself must be eligible for Tensor-Scalar specialization. At the moment, this is hardcoded to be all binary operators, but in the future we can use tags in native_functions.yaml to disambiguate (or perhaps expand codegen to handle n-ary operators).
The reindexing process not only reassociates ufuncs by dtype, but it also works out if Tensor-Scalar specializations are needed and codegens the ufunctors necessary for the level of specialization here (`compute_ufunc_cuda_functors`). Generating the actual kernel (`compute_ufunc_cuda_dtype_body`) just consists of, for each specialization, constructing the functor and then passing it off to `gpu_kernel`. Most of the hard work is in functor generation, where we take care to make sure `operator()` has the correct input and output types (which `gpu_kernel` uses to arrange for memory accesses to the actual CUDA tensor; if you get these types wrong, your kernel will still work, it will just run very slowly!)
There is one big subtlety with CUDA codegen: this won't work:
```
Generic: add (AllAndComplex, BFloat16, Half)
ScalarOnly: add_bool (Bool)
```
This is because, even though there are separate Generic/ScalarOnly entries, we only generate a single functor to cover ALL dtypes in this case, and the functor has the ufunc name hardcoded into it. You'll get an error if you try to do this; to fix it, just make sure the ufunc is named the same consistently throughout. In the code, you see this because after testing for the short circuit case (when a user provided the functor themselves), we squash all the generic entries together and assert their ufunc names are the same. Hypothetically, if we generated a separate functor per dtype, we could support differently named ufuncs but... why would you do that to yourself. (One piece of nastiness is that the native_functions.yaml syntax doesn't stop you from shooting yourself in the foot.)
A brief word about CUDA stubs: technically, they are not necessary, as there is no CPU/CPUKernel style split for CUDA kernels (so, if you look, structured impl actually calls add_kernel directly). However, there is some code that still makes use of CUDA stubs (in particular, I use the stub to conveniently reimplement sub in terms of add), so we still register it. This might be worth revisiting at a later point in time.
**Build system changes.** If you are at FB, you should review these changes in fbcode, as there are several changes in files that are not exported to ShipIt.
The build system changes in this patch are substantively complicated by the fact that I have to implement these changes five times:
* OSS cmake build
* OSS Bazel build
* FB fbcode Buck build
* FB xplat Buck build (selective build)
* FB ovrsource Buck build
Due to technical limitations in the xplat Buck build related to selective build, it is required that you list every ufunc header manually (this is done in tools/build_variables.bzl)
The OSS cmake changes are entirely in cmake/Codegen.cmake: there is a new set of files cpu_vec_generated (corresponding to UfuncCPUKernel files) which is wired up in the same way as other files. These files are different because they need to get compiled multiple times under different vectorization settings. I adjust the codegen, slightly refactoring the inner loop into its own function so I can use a different base path calculation depending on whether the file is traditional (in the native/cpu folder) or generated (new stuff from this diff).
The Bazel/Buck changes are organized around tools/build_variables.bzl, which contain the canonical list of ufunc headers (aten_ufunc_headers), and tools/ufunc_defs.bzl (added to ShipIt export list in D34465699) which defines a number of functions that compute the generated cpu, cpu kernel and cuda files based on the headers list. For convenience, these functions take a genpattern (a string with a {} for interpolation) which can be used to easily reformat the list of formats in target form, which is commonly needed in the build systems.
The split between build_variables.bzl and ufunc_defs.bzl is required because build_variables.bzl is executed by a conventional Python interpreter as part of the OSS cmake, but we require Skylark features to implement the functions in ufunc_defs.bzl (I did some quick Googling but didn't find a lightweight way to run the Skylark interpreter in open source.)
With these new file lists, the rest of the build changes are mostly inserting references to these files wherever necessary; in particular, cpu kernel files have to be worked into the multiple vectorization build flow (intern_build_aten_ops in OSS Bazel). Most of the subtlety relates to selective build. Selective build requires operator files to be copied per overall selective build; as dhruvbird explains to me, glob expansion happens during the action graph phase, but the selective build handling of TEMPLATE_SOURCE_LIST is referencing the target graph. In other words, we can't use a glob to generate deps for another rule, because we need to copy files from wherever (including generated files) to a staging folder so the rules can pick them up.
It can be somewhat confusing to understand which bzl files are associated with which build. Here are the relevant mappings for files I edited:
* Used by everyone - tools/build_variables.bzl, tools/ufunc_defs.bzl
* OSS Bazel - aten.bzl, BUILD.bazel
* FB fbcode Buck - TARGETS
* FB xplat Buck - BUCK, pt_defs.bzl, pt_template_srcs.bzl
* FB ovrsource Buck - ovrsource_defs.bzl, pt_defs.bzl
Note that pt_defs.bzl is used by both xplat and ovrsource. This leads to the "tiresome" handling for enabled backends, as selective build is CPU only, but ovrsource is CPU and CUDA.
BTW, while I was at it, I beefed up fb/build_arvr.sh to also do a CUDA ovrsource build, which was not triggered previously.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D31306586
Pulled By: ezyang
fbshipit-source-id: 210258ce83f578f79cf91b77bfaeac34945a00c6
(cherry picked from commit d65157b0b894b6701ee062f05a5f57790a06c91c)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72869
The ordering here doesn't really matter, but in a future patch
I will make a change where vectorized CPU codegen does have to
be here, and moving it ahead of time (with no code changes)
will make the later diff cleaner.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D34282229
Pulled By: ezyang
fbshipit-source-id: 3397cb0e062d63cc9853f6248f17c3558013798b
(cherry picked from commit 98c616024969f9df90c7fb09741ed9be7b7a20f1)
Summary:
https://github.com/pytorch/pytorch/issues/66406
Implemented z arch 14/15 vector SIMD additions.
So far, besides bfloat, all other types have their SIMD implementation.
It has 99% coverage and is currently passing the local tests.
It is concise, and the main SIMD file is only one header file.
It mostly uses template metaprogramming, but there are still a few macros left, with the intention of not modifying PyTorch much.
Sleef supports z15.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66407
Reviewed By: mrshenli
Differential Revision: D33370163
Pulled By: malfet
fbshipit-source-id: 0e5a57f31b22a718cd2a9ac59753fb468cdda140