pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-14 06:07:55 +08:00

Author	SHA1	Message	Date
Nikhil Patel	77b70970f7	[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#167182 ) Summary: This is a reland of https://github.com/pytorch/pytorch/pull/165036, which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs. Test Plan: Inductor test (fbcode): `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"` Tritonbench (fbcode): `clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Tritonbench(oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Unit Tests(oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py` Differential Revision: D86376880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/167182 Approved by: https://github.com/mlazos, https://github.com/jananisriram	2025-11-06 19:55:38 +00:00
PyTorch MergeBot	5c639466f7	Revert "[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#167003 )" This reverts commit 658c5f879c37142b1df51c7eb6c5a5bb06318597. Reverted https://github.com/pytorch/pytorch/pull/167003 on behalf of https://github.com/atalman due to regressed vllm signal: [GH job link](https://github.com/pytorch/pytorch/actions/runs/19093785744/job/54553796743) [HUD commit link](`658c5f879c`) ([comment](https://github.com/pytorch/pytorch/pull/167003#issuecomment-3491527704))	2025-11-05 14:30:15 +00:00
Nikhil Patel	658c5f879c	[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#167003 ) Summary: This is a reland of https://github.com/pytorch/pytorch/pull/165036?fbclid=IwY2xjawN3RL1leHRuA2FlbQIxMQBicmlkETExOEcxcnVhNVA1TzRSVmhiAR63GOEpJbZA-JhQ0CSj9ji8H_RHBUhDwYNDtxjOYfDol56OGqmC4r7jPP96Fw_aem_bWvtMfVifLQrnpv1YB_fJA, which previously contained a minor bug in the logic that determined whether the kernel should be enabled. As a result, it was incorrectly activated on non-Blackwell GPUs. Test Plan: Inductor test (fbcode): `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1"` Tritonbench (fbcode): `clear; CUDA_VISIBLE_DEVICES=7 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/opt //pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -m "ovr_config//third-party/pypi/nvidia-cutlass-dsl/constraints:4.2.1" -- --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_cute_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Tritonbench(oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Unit Tests(oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py` Differential Revision: D86231180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/167003 Approved by: https://github.com/jananisriram	2025-11-05 06:51:30 +00:00
PyTorch MergeBot	d77c24caac	Revert "[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#165036 )" This reverts commit 0e1a88904f4a5e30634b196678b56e1d6ec074f5. Reverted https://github.com/pytorch/pytorch/pull/165036 on behalf of https://github.com/atalman due to regressed vllm signal: [GH job link](https://github.com/pytorch/pytorch/actions/runs/19059329909/job/54439919668) [HUD commit link](`0e1a88904f`) ([comment](https://github.com/pytorch/pytorch/pull/165036#issuecomment-3487846555))	2025-11-04 20:13:33 +00:00
Nikhil Patel	0e1a88904f	[Inductor][Grouped Gemm] Add Blackwell CuTeDSL Kernel (#165036 ) Make sure you're on cutlass 4.2.0+ Test Plan: Tritonbench(oss): `clear; CUDA_VISIBLE_DEVICES=2 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 python run.py --op grouped_gemm --only aten_grouped_mm,preprocessed_pt2_triton_grouped_mm --precision bf16 --num-inputs 1 --metrics tflops,accuracy` Unit Tests(oss): `clear; python test/inductor/test_cutedsl_grouped_mm.py` Differential Revision: D82010227 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165036 Approved by: https://github.com/alexsamardzic, https://github.com/drisspg, https://github.com/mlazos	2025-11-04 05:58:58 +00:00
xinan.lin	0918bf321c	[xpu][test] Reuse native_mm and mix_order_reduction for Intel GPU. (#166384 ) This PR reused native_mm and mix_order_reduction for Intel GPU and enabled the corresponding test. Fixes #165370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166384 Approved by: https://github.com/jansel	2025-10-30 03:38:35 +00:00
nullplay	ac529df244	Native matmul (#157743 ) ### Implementation of #151705 This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly, without relying on predefined templates. To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705: 1. Basic support (this PR) 2. Lazy broadcasting for optimal performance (future PR) ### Summary of This PR This PR implements the basic functionality. It does not include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead. ### Notable Changes 1. Adds a new config flag: `config.triton.enable_native_matmul` 2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled 3. Enforces tiling suitable for matmul when the native matmul flag is enabled 4. Implements code generation for `ops.dot` 5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal; it currently takes a long time to tune, and I think there must be a better way to tackle this. @eellison @jansel @PaulZhang12 @shunting314 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743 Approved by: https://github.com/jansel	2025-10-14 04:22:30 +00:00
henrylhtsang	e9eb2096a5	[cutlass backend] Allow bmm use cases when batch stride is 0 (#160356 ) Differential Revision: [D80035771](https://our.internmc.facebook.com/intern/diff/D80035771/) The motivation for the original change was to reduce the number of parameters we pass into the kernel, which was for aesthetic reasons only. But seeing the need to use different batch strides, we should just pass in the batch stride. That would be a good long-term fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160356 Approved by: https://github.com/mlazos	2025-08-13 20:52:24 +00:00
Ruben Rodriguez Buchillon	625108ede2	[inductor] consolidate common GEMM triton param retrieval (#159383 ) # Why - Make loop iteration simpler - Have a common spot where to make modifications that affect all the GEMM Triton templates, avoiding missed spots # What - pull out common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383 Approved by: https://github.com/jansel	2025-08-05 11:42:25 +00:00
PyTorch MergeBot	acad808545	Revert "[inductor] consolidate common GEMM triton param retrieval (#159383 )" This reverts commit e7cc42df58a86bee05944f6e80c535aa1d099443. Reverted https://github.com/pytorch/pytorch/pull/159383 on behalf of https://github.com/jataylo due to sorry but rocm CI is broken due to this PR ([comment](https://github.com/pytorch/pytorch/pull/159383#issuecomment-3145604831))	2025-08-01 19:49:21 +00:00
Ruben Rodriguez Buchillon	e7cc42df58	[inductor] consolidate common GEMM triton param retrieval (#159383 ) # Why - Make loop iteration simpler - Have a common spot where to make modifications that affect all the GEMM Triton templates, avoiding missed spots # What - pull out common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383 Approved by: https://github.com/jansel	2025-07-31 13:05:04 +00:00
PyTorch MergeBot	a53db90e21	Revert "[inductor] consolidate common GEMM triton param retrieval (#158015 )" This reverts commit 9faef3d17c2e422d5d62f62b266155e2deb52c40. Reverted https://github.com/pytorch/pytorch/pull/158015 on behalf of https://github.com/henrylhtsang due to breaking tests ([comment](https://github.com/pytorch/pytorch/pull/158015#issuecomment-3115384824))	2025-07-25 00:16:50 +00:00
Ruben Rodriguez Buchillon	9faef3d17c	[inductor] consolidate common GEMM triton param retrieval (#158015 ) # Why - Make loop iteration simpler - Have a common spot where to make modifications that affect all the GEMM Triton templates, avoiding missed spots # What - pull out common logic of taking the BaseConfig objects and turning them into kwargs to feed into maybe_append_choice for Triton GEMM templates Differential Revision: [D78081314](https://our.internmc.facebook.com/intern/diff/D78081314) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158015 Approved by: https://github.com/PaulZhang12, https://github.com/jansel	2025-07-24 19:17:48 +00:00
Zeina Migeed	4f5be56612	[Pyrefly][Refactor] Replace dict() calls with literal dict syntax for improved readability (#157735 ) There are 31 places that I spotted which construct literal dictionaries. This PR refactors dictionary construction by replacing `dict(...)` calls with literal `{...}` syntax where applicable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157735 Approved by: https://github.com/ezyang, https://github.com/Skylion007	2025-07-08 18:10:33 +00:00
Laith Sakka	26f7ca3972	Unify dynamic shapes APIs naming 2 (expect_true and check) attempt2 (#156518 ) Summary: The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function. This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically, - it introduces size_vars.expect_true and size_vars.check. - guard_lt becomes check_lt - guard_leq becomes check_leq - guard_equals becomes check_equals I am also seeing a couple of wrong usages that I will fix in the next PR. Test Plan: OSS and cont Rollback Plan: Differential Revision: D77054177 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156518 Approved by: https://github.com/bobrenjc93	2025-06-24 21:01:38 +00:00
Paul Zhang	86996c15dc	[Inductor] Allow exhaustive autotuning across all GEMM options (#156610 ) Differential Revision: D76843916 Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs of the following nature: excessive register spillage; using much larger amounts of shared memory than available on the hardware. This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive: ``` AUTOTUNE mm(1152x21504, 21504x1024) strides: [21504, 1], [1, 21504] dtypes: torch.bfloat16, torch.bfloat16 mm 0.1167 ms 100.0% triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0 SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610 Approved by: https://github.com/jansel	2025-06-24 01:42:05 +00:00
Xuehai Pan	6ff6630375	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-23 02:57:12 +00:00
PyTorch MergeBot	f1331f3f1b	Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 )" This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a. Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	3627270bdf	[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh	2025-06-22 08:43:09 +00:00
PyTorch MergeBot	503362d019	Revert "Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776 )" This reverts commit 603a54a9b33e1aabe1407721d7935b881a160968. Reverted https://github.com/pytorch/pytorch/pull/155776 on behalf of https://github.com/atalman due to failing internal build ([comment](https://github.com/pytorch/pytorch/pull/155776#issuecomment-2977041192))	2025-06-16 15:13:53 +00:00
Laith Sakka	603a54a9b3	Unify dynamic shapes APIs naming 2 (expect_true and check) (#155776 ) The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function. This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically, - it introduces size_vars.expect_true and size_vars.check. - guard_lt becomes check_lt - guard_leq becomes check_leq - guard_equals becomes check_equals I am also seeing a couple of wrong usages that I will fix in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155776 Approved by: https://github.com/bobrenjc93 ghstack dependencies: #154774	2025-06-14 17:13:53 +00:00
David Berard	c3ecabf059	[inductor][triton pin] add support for new TMA API for mm.py templates (#155723 ) Triton 3.4 will remove the experimental TMA APIs: https://github.com/triton-lang/triton/pull/6488 For mm.py templates, this PR adds support for using the new APIs when they are available (and otherwise falls back to the experimental APIs). For flex_attention, we'll remove TMA support for Triton 3.2 and 3.3 (versions of triton that don't have the new API). For mm_scaled_grouped.py, https://github.com/pytorch/pytorch/pull/150944 will remove TMA support for Triton 3.2. Note: we attempted this earlier with https://github.com/pytorch/pytorch/pull/154858, but this broke TMA usage in Triton 3.2. Differential Revision: [D76444471](https://our.internmc.facebook.com/intern/diff/D76444471) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155723 Approved by: https://github.com/NikhilAPatel	2025-06-12 06:25:47 +00:00
Aleksandar Samardžić	f8baec8984	Update auto-tuning support for _scaled_grouped_mm (#150944 ) 1. Enable strided inputs 2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs 3. Fix non-TMA load variant 4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor 5. Fix cases when group size along K dimension is not multiple of block size along K 6. Updated meta registration 7. Update synthetic offsets creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944 Approved by: https://github.com/ngimel, https://github.com/davidberard98	2025-06-11 19:12:52 +00:00
PyTorch MergeBot	e12597090c	Revert "Update auto-tuning support for _scaled_grouped_mm (#150944 )" This reverts commit 09328eb02f5412d2211b5fd638ce82d0e03b9c1f. Reverted https://github.com/pytorch/pytorch/pull/150944 on behalf of https://github.com/davidberard98 due to breaks internal usage & complicates triton pin update - more details in https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957246463 ([comment](https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957248841))	2025-06-09 23:12:56 +00:00
Aleksandar Samardžić	09328eb02f	Update auto-tuning support for _scaled_grouped_mm (#150944 ) 1. Enable strided inputs 2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs 3. Fix non-TMA load variant 4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor 5. Fix cases when group size along K dimension is not multiple of block size along K 6. Updated meta registration 7. Update synthetic offsets creation Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944 Approved by: https://github.com/ngimel	2025-06-08 10:18:13 +00:00
Joaquin	cb56df55dc	[Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331 ) Fixes #153298 This PR is the 3rd and final step of #147479 All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated. All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary. [henrylhtsang](https://github.com/henrylhtsang) Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331 Approved by: https://github.com/henrylhtsang	2025-05-29 20:29:58 +00:00
Henry Tsang	ee2d104c05	[cutlass backend] Add (limited) bmm dynamic shape support (#152393 ) Differential Revision: D73626732 In this PR, we add support for bmm dynamic shape, provided that the batch stride is the biggest in the stride for A, B, and D. For example, for A of size `(B, M, K)`, we support stride `(MK, K, 1)` and `(MK, 1, M)`. With this assumption, we can infer the batch stride from existing arguments. The reason is we don't want to add 2-3 more runtime params. The concerns are complexity and possible perf regression, though we didn't verify the latter. We can revisit this if there is a need for that. We also remove `B = 1` for normal mm and addmm. We tested it and didn't see perf regression. But open to revisiting this as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152393 Approved by: https://github.com/ColinPeppler	2025-04-30 04:36:24 +00:00
Bert Maher	2d187bf7e6	Support tuning of _scaled_grouped_mm (#150421 ) This includes the default aten implementation, as well as a Triton implementation imported from FBGEMM (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421 Approved by: https://github.com/ngimel	2025-04-11 23:03:49 +00:00
PyTorch MergeBot	6a65f2c4fe	Revert "Support tuning of _scaled_grouped_mm (#150421 )" This reverts commit 8efcf21fff327d155350bf26ccba769bab58c077. Reverted https://github.com/pytorch/pytorch/pull/150421 on behalf of https://github.com/malfet due to Looks like it broke lint, see `a0ab243c3a/1` ([comment](https://github.com/pytorch/pytorch/pull/150421#issuecomment-2795218547))	2025-04-10 21:36:41 +00:00
Bert Maher	8efcf21fff	Support tuning of _scaled_grouped_mm (#150421 ) This includes the default aten implementation, as well as a Triton implementation imported from FBGEMM (https://github.com/pytorch/FBGEMM/blob/main/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150421 Approved by: https://github.com/ngimel	2025-04-10 20:34:16 +00:00
Jack Taylor	2299087220	[ROCm] Introduce AMD specific inductor gemm tuning (#147315 ) Replaces https://github.com/pytorch/pytorch/pull/143286 Adds ROCm specific MM configs for max-autotune incorporating ROCm specific triton tuning kernargs such as waves_per_eu, kpack, matrix_instr_nonkdim. This PR also introduces behavior to allow tuning for GROUP_M in triton gemm case. Dynamo huggingface inference benchmarks: `TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS="TRITON" python huggingface.py --performance --inference --bfloat16 --backend=inductor` GEOMEAN speedup (before): \| 1.35x GEOMEAN speedup (after): \| 1.42x name \| Eager - abs latency \| old - abs_latency \| old - speedup \| new - abs_latency \| new - speedup -- \| -- \| -- \| -- \| -- \| -- AlbertForMaskedLM \| 26.22 \| 26.52 \| 98.86% \| 24.58 \| 106.67% AlbertForQuestionAnswering \| 25.96 \| 26.40 \| 98.33% \| 24.10 \| 107.73% AllenaiLongformerBase \| 21.03 \| 10.65 \| 197.50% \| 10.49 \| 200.58% BartForCausalLM \| 7.77 \| 9.76 \| 79.63% \| 8.79 \| 88.46% BartForConditionalGeneration \| 14.44 \| 12.86 \| 112.26% \| 11.96 \| 120.70% BertForMaskedLM \| 8.10 \| 8.82 \| 91.89% \| 8.57 \| 94.53% BertForQuestionAnswering \| 6.82 \| 7.32 \| 93.20% \| 7.10 \| 96.18% BlenderbotForCausalLM \| 10.97 \| 11.39 \| 96.34% \| 10.10 \| 108.65% BlenderbotSmallForCausalLM \| 5.91 \| 5.44 \| 108.72% \| 4.82 \| 122.67% BlenderbotSmallForConditionalGeneration \| 12.64 \| 9.65 \| 130.94% \| 9.11 \| 138.83% CamemBert \| 8.35 \| 9.15 \| 91.24% \| 8.86 \| 94.27% DebertaForMaskedLM \| 10.92 \| 6.09 \| 179.44% \| 5.90 \| 185.05% DebertaForQuestionAnswering \| 14.29 \| 7.70 \| 185.59% \| 7.26 \| 196.75% DebertaV2ForMaskedLM \| 15.47 \| 10.22 \| 151.32% \| 9.34 \| 165.55% DebertaV2ForQuestionAnswering \| 14.98 \| 6.11 \| 245.28% \| 6.28 \| 238.40% DistilBertForMaskedLM \| 8.37 \| 8.70 \| 96.30% \| 8.22 \| 101.92% DistilBertForQuestionAnswering \| 10.21 \| 10.54 \| 96.88% \| 10.39 \| 98.36% DistillGPT2 \| 8.77 \| 6.78 \| 129.40% \| 6.31 \| 138.88% ElectraForCausalLM \| 10.32 \| 4.70 \| 219.45% \| 4.60 \| 224.29% ElectraForQuestionAnswering \| 11.48 \| 5.62 \| 204.20% \| 5.44 \| 210.95% GPT2ForSequenceClassification \| 6.21 \| 5.72 \| 108.50% \| 5.58 \| 111.26% GoogleFnet \| 26.51 \| 20.81 \| 127.37% \| 19.91 \| 133.11% LayoutLMForMaskedLM \| 12.09 \| 7.99 \| 151.28% \| 7.66 \| 157.80% LayoutLMForSequenceClassification \| 10.62 \| 6.49 \| 163.67% \| 6.25 \| 169.95% M2M100ForConditionalGeneration \| 14.98 \| 10.20 \| 146.79% \| 9.89 \| 151.42% MBartForCausalLM \| 7.67 \| 9.78 \| 78.44% \| 8.87 \| 86.55% MBartForConditionalGeneration \| 13.45 \| 12.69 \| 105.99% \| 12.03 \| 111.82% MT5ForConditionalGeneration \| 19.96 \| 5.32 \| 375.37% \| 5.08 \| 393.01% MegatronBertForCausalLM \| 13.22 \| 7.86 \| 168.07% \| 7.18 \| 184.01% MegatronBertForQuestionAnswering \| 15.62 \| 11.81 \| 132.21% \| 11.02 \| 141.68% MobileBertForMaskedLM \| 26.63 \| 10.82 \| 245.99% \| 11.95 \| 222.73% MobileBertForQuestionAnswering \| 23.53 \| 7.55 \| 311.51% \| 9.53 \| 247.03% OPTForCausalLM \| 7.33 \| 7.64 \| 95.93% \| 7.56 \| 96.90% PLBartForCausalLM \| 8.73 \| 7.63 \| 114.40% \| 7.37 \| 118.58% PLBartForConditionalGeneration \| 10.46 \| 8.50 \| 122.98% \| 8.16 \| 128.13% PegasusForCausalLM \| 7.18 \| 7.37 \| 97.42% \| 6.64 \| 108.22% PegasusForConditionalGeneration \| 16.47 \| 16.66 \| 98.87% \| 14.18 \| 116.13% RobertaForCausalLM \| 10.30 \| 9.95 \| 103.52% \| 9.52 \| 108.25% RobertaForQuestionAnswering \| 6.37 \| 7.13 \| 89.28% \| 6.79 \| 93.87% T5ForConditionalGeneration \| 12.40 \| 6.72 \| 184.51% \| 6.48 \| 191.16% T5Small \| 12.02 \| 6.66 \| 180.55% \| 6.32 \| 190.33% TrOCRForCausalLM \| 14.12 \| 13.31 \| 106.11% \| 12.45 \| 113.41% XGLMForCausalLM \| 16.48 \| 6.23 \| 264.52% \| 6.35 \| 259.51% XLNetLMHeadModel \| 74.87 \| 62.23 \| 120.32% \| 57.95 \| 129.19% YituTechConvBert \| 20.21 \| 10.50 \| 192.48% \| 9.97 \| 202.72% We are also seeing improvement ~9% on internal addmm benchmark. This PR will also slightly reduce the compilation time on AMD max-autotune as before this change we assess every config with matrix_instr_nonkdim [0, 16] but we remove this and use 16 for all configs with this update. No CI to test the max-autotune perf currently but this will be enabled via https://github.com/pytorch/pytorch/pull/148672 after which we can investigate more tuning updates and config pruning Pull Request resolved: https://github.com/pytorch/pytorch/pull/147315 Approved by: https://github.com/jansel, https://github.com/eellison	2025-04-09 14:34:30 +00:00
PaulZhang12	e62d958f02	[Inductor] Reland Merge Triton ScaledMM as epilogue to MM template #150045 (#150441 ) Merges https://github.com/pytorch/pytorch/pull/150438 and https://github.com/pytorch/pytorch/pull/150045. https://github.com/pytorch/pytorch/pull/150045 was already landed, but did not include a change that makes it unable to land internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150441 Approved by: https://github.com/clee2000	2025-04-02 17:49:32 +00:00
PyTorch MergeBot	f04cf13bdd	Revert "Merge Triton ScaledMM as epilogue to MM template (#150045 )" This reverts commit 981048854da154eae8ff0bd439e72e1256ae00da. Reverted https://github.com/pytorch/pytorch/pull/150045 on behalf of https://github.com/PaulZhang12 due to Need to add PR 150415 fixes for internal merge ([comment](https://github.com/pytorch/pytorch/pull/150045#issuecomment-2770252452))	2025-04-01 17:54:28 +00:00
PaulZhang12	981048854d	Merge Triton ScaledMM as epilogue to MM template (#150045 ) Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM, Grouped GEMM in inductor. Currently, there is still a separate template for TMA+persistent version of scaled_mm. The current mm lowering has a separate template for TMA + Persistent version. Will hopefully consolidate the extra scaled_mm TMA+persistent template when the consolidation for the mm template is done. TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045 Approved by: https://github.com/drisspg	2025-03-31 23:20:14 +00:00
Jack Taylor	32299e5f9a	Reland "Introduce new template heuristic for triton autotune configs" (#147452 ) This change was reverted in https://github.com/pytorch/pytorch/pull/147388 for regressing an internal workload. I have removed the additional ir.device_type calls in mm_scaled and unpack_mixed_mm.py which could be contributing to the additional compile time. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147452 Approved by: https://github.com/jansel	2025-03-26 15:47:06 +00:00
Bert Maher	a439524be6	[inductor] Add the largest matmul tile size to default tuning set (#149790 ) While we probably don't want to expand the set of default matmul tunings too much, this is the largest tile size usable by H100 and A100, and is usually the top performing tile size for large matmuls. E.g. on H100 adding this tile size improves perf of multiplying 8192-square matrices from 600->700 tflops. (cuBLAS 12.6 gets 780, so Triton still isn't SOTA, but closer) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149790 Approved by: https://github.com/jansel	2025-03-24 16:32:53 +00:00
Jason Ansel	b040dc3a53	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential [disconnected] Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-12 15:52:16 +00:00
PyTorch MergeBot	5ada4e6a53	Revert "Reland: [inductor] Simplify grid handling (#148305 )" This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178. Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))	2025-03-12 14:58:43 +00:00
Jason Ansel	8d08b49015	Reland: [inductor] Simplify grid handling (#148305 ) Summary: Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583 Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Differential Revision: D70471332 Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-03-11 18:51:06 +00:00
Jack Taylor	8059ead823	[ROCm] Incorporate ROCm triton specific tuning parameters (#148437 ) Splitting https://github.com/pytorch/pytorch/pull/147315 into two PRs. This PR adds general support for kpack and waves_per_eu triton kernel args for AMD backend. More detail in the PR above. A follow up PR will update the configs used by ROCm but this requires https://github.com/pytorch/pytorch/pull/147452 to land first Pull Request resolved: https://github.com/pytorch/pytorch/pull/148437 Approved by: https://github.com/eellison, https://github.com/jansel	2025-03-07 18:09:47 +00:00
henrylhtsang	b020d166f2	stage 1 of depreate silent fallback of tuning gemm (#147798 ) Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/) context: https://github.com/pytorch/pytorch/issues/147479 For the most part, this should not change the behavior. For int_mm, I also removed ``` # TODO: Re-enable eager mode implementation once cuBLAS is fixed if use_cutlass or use_triton_template(layout, enable_int32=True): choices = [] ``` because I think it is unwanted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798 Approved by: https://github.com/eellison	2025-03-05 05:15:59 +00:00
PyTorch MergeBot	608377d341	Revert "[import][inductor] Simplify grid handling (#147583 )" This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e. Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))	2025-03-03 00:49:32 +00:00
Jason Ansel	b59776d857	[import][inductor] Simplify grid handling (#147583 ) Before this PR, calling a triton kernel would look like: ```py kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0) ``` where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg: ```py kernel.run(a, b, xnumel, stream=stream0) ``` instead now the grid computation is included in the kernel launcher, with something like: ```py def launcher(in_ptr0, out_ptr0, xnumel, stream): grid_0 = ((xnumel + 1023) >> 10) grid_1 = 1 grid_2 = 1 runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel) ``` This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`. It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid. This unification allows this PR to be a net deletion of code. Note the attached diff contains some minor fbcode-only changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583 Approved by: https://github.com/eellison, https://github.com/shunting314	2025-03-02 07:31:07 +00:00
PyTorch MergeBot	1919e0de9a	Revert "stage 1 of depreate silent fallback of tuning gemm (#147798 )" This reverts commit 297c00264e54cfb192f289e23a41775b81cb9cb8. Reverted https://github.com/pytorch/pytorch/pull/147798 on behalf of https://github.com/wdvr due to failing internal builds, discussed with author ([comment](https://github.com/pytorch/pytorch/pull/147798#issuecomment-2692390551))	2025-03-01 20:04:23 +00:00
henrylhtsang	297c00264e	stage 1 of depreate silent fallback of tuning gemm (#147798 ) Differential Revision: [D70045778](https://our.internmc.facebook.com/intern/diff/D70045778/) context: https://github.com/pytorch/pytorch/issues/147479 For the most part, this should not change the behavior. For int_mm, I also removed ``` # TODO: Re-enable eager mode implementation once cuBLAS is fixed if use_cutlass or use_triton_template(layout, enable_int32=True): choices = [] ``` because I think it is unwanted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147798 Approved by: https://github.com/eellison	2025-02-28 19:51:55 +00:00
Xuehai Pan	1cb4e2df65	[BE][PYFMT] migrate PYFMT for `torch._inductor` to `ruff format` (#144550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144550 Approved by: https://github.com/jansel	2025-02-28 13:33:19 +00:00
eellison	4b7604ec10	Delete Mixed MM Special Casing (#147151 ) Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup: - prologue fusion is often unprofitable, even for int8 mm. because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not. - similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](`5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)`) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path. It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication. The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. it's still possible to generate with prologue fusion but not currently exactly as the current [impl](`bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)`). But the current impl does not compare to a cublas baseline so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). this should be fine to delete. Future optimizations could include: - cutlass prologue path - making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit. 
Differential Revision: [D70114858](https://our.internmc.facebook.com/intern/diff/D70114858) Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch	2025-02-25 04:29:54 +00:00
PyTorch MergeBot	3409cbd177	Revert "Delete Mixed MM Special Casing (#147151 )" This reverts commit d6bb1d7f0a9dc3d11d2864da9ab46872377a6e52. Reverted https://github.com/pytorch/pytorch/pull/147151 on behalf of https://github.com/jeanschmidt due to Broke a few internal signals, see comments on D69994157 ([comment](https://github.com/pytorch/pytorch/pull/147151#issuecomment-2676312215))	2025-02-22 17:14:32 +00:00
henrylhtsang	76ce194b8e	For addmm and bmm, check if config.autotune_fallback_to_aten before using aten as a fallback. Also fix bmm cutlass backend (#147148 ) This PR also fixes BMM, which was silently failing for a while. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147148 Approved by: https://github.com/eellison	2025-02-21 18:41:52 +00:00
eellison	d6bb1d7f0a	Delete Mixed MM Special Casing (#147151 ) Now that torchinductor supports prologue fusion we can delete all the mixed mm code. When I benchmarked int8 weight only mm in the new path compared to int8mm in the old path in the [following benchmark](https://gist.github.com/eellison/46e321709572c11c077d0612cb3492b7) I got a 1.244x geomean speedup comparing Huggingface linear shapes with bias. There's a couple reasons for the speedup: - prologue fusion is often unprofitable, even for int8 mm. because the current mixed mm benchmarking only compares triton_int8_mm vs (dtype_conversion + cublas), we miss out on scenarios where the triton template is profitable but the prologue fusion is not. - similarly, we miss out on potential epilogue fusions like bias if we dispatch to the [fallback mixed mm](`5006932cbc/torch/_inductor/kernel/mm.py (L750-L751)`) that mixed_mm will dispatch to instead of the deferred epilogue tuning in current path. It's possible some of the speedups would be smaller on larger models where the epilogue might get fused into a following kernel. Nonetheless, even if this is perf neutral it is worth landing for code deduplication. The one kernel that is a little special and would not fall out of the prologue fusion is the uint4x2_mixed_mm kernel. it's still possible to generate with prologue fusion but not currently exactly as the current [impl](`bd370c138a/torch/_inductor/kernel/unpack_mixed_mm.py (L43-L49)`). But the current impl does not compare to a cublas baseline so I found that it is making things slower (35% slower on a not particularly big 1024, 1024, 1024 mm shape on h100). this should be fine to delete. Future optimizations could include: - cutlass prologue path - making prologue fusion support the persistent tma based mm template. from @drisspg's experience this led to nice wins with fp8 but not as nice wins with bf16 mm. I think similarly, lower memory bandwidth int8 mm would benefit. 
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147151 Approved by: https://github.com/drisspg, https://github.com/cpuhrsch	2025-02-21 16:02:40 +00:00
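
Note on the launcher quoted in the "Simplify grid handling" commit messages (#147583, #148305): the expression `grid_0 = ((xnumel + 1023) >> 10)` reads as a ceiling division of `xnumel` by 1024, written as an add and a shift. The block size of 1024 is inferred from the constants in that snippet and is not stated in the commit message. A minimal sketch of the same arithmetic in plain Python follows; `ceil_div` and `make_grid` are hypothetical helpers for illustration and are not part of `torch._inductor`.

```python
def ceil_div(numel: int, block: int) -> int:
    """Number of blocks of size `block` needed to cover `numel` elements."""
    return -(-numel // block)


def make_grid(xblock: int):
    """Return a function computing a 1-D launch grid for a power-of-two block size.

    Hypothetical helper for illustration; not an Inductor API.
    """
    if xblock <= 0 or xblock & (xblock - 1):
        raise ValueError("xblock must be a positive power of two")
    shift = xblock.bit_length() - 1  # log2(xblock), e.g. 10 for 1024

    def grid(xnumel: int) -> tuple[int, int, int]:
        # Same shape as the quoted launcher: add (block - 1), then shift right.
        return ((xnumel + xblock - 1) >> shift, 1, 1)

    return grid


grid_1024 = make_grid(1024)
for n in (0, 1, 1024, 1025, 4096, 5000):
    # The shift form agrees with ceiling division for every tested size.
    assert grid_1024(n)[0] == ceil_div(n, 1024)
```

Because the block size is a compile-time constant for a given `triton.Config`, the grid expression can be specialized per config like this, which is the specialization the commit message describes.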

110 Commits