# why
- make it easier to integrate into the lookup table later
# what
- the current version generates templates on the fly and uses them to generate a single choice
- the lookup table and performance model work best when there is a stable set of templates (with predictable names) that are then parametrized
- this change introduces a single DecomposeK template with a stable name, where the k split is the only parametrization (sketched below)
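A minimal sketch of the idea, not the actual Inductor code (the class and function names below are hypothetical): one template with a stable, predictable name, parametrized only by the k split, so a lookup table or performance model can key choices on `(template_name, k_split)`.
```
from dataclasses import dataclass

@dataclass(frozen=True)
class DecomposeKChoice:
    template_name: str  # stable, predictable name
    k_split: int        # the only parametrization

def make_decompose_k_choices(k, candidate_splits=(2, 4, 8, 16, 32)):
    # Only splits that divide K evenly yield a valid mm -> bmm decomposition.
    return [DecomposeKChoice("decompose_k", s) for s in candidate_splits if k % s == 0]

# A lookup table can now key its entries on (template_name, k_split).
print(make_decompose_k_choices(4096))
```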
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py::TestMaxAutotune::test_max_autotune_decompose_k_dynamic_False_bfloat16_sizes1 -v
```
Differential Revision: [D80670913](https://our.internmc.facebook.com/intern/diff/D80670913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161026
Approved by: https://github.com/PaulZhang12, https://github.com/jansel
Differential Revision: D76514984
Fix subgraph as a choice for the case where a symbolic shape is passed in as an expression, e.g. `256 * s0`, which typically happens in the backward pass. The current logic assumes that all symbolic shapes are standalone symbols, i.e. a bare `s0`.
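For illustration, a minimal sketch (using sympy directly, outside of Inductor) of the distinction the fix handles: an expression such as `256 * s0` is not itself a symbol, but its free symbols are the actual inputs.
```
import sympy

s0 = sympy.Symbol("s0")

standalone = s0        # a bare symbolic dim, as the old logic assumed
expression = 256 * s0  # a derived dim, as seen in the backward pass

print(standalone.is_Symbol, standalone.free_symbols)  # True {s0}
print(expression.is_Symbol, expression.free_symbols)  # False {s0}
```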
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
As a result of adding subgraph as a choice to Inductor (https://github.com/pytorch/pytorch/pull/149761) and enabling FP32 output from PyTorch GEMMs with FP16/BF16 inputs (https://github.com/pytorch/pytorch/pull/150812), this PR enables decompose_k as an autotuning choice for Inductor when generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.
Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton templates as well, using async compilation to avoid a large increase in compile time. Anecdotal evidence shows that Triton bmm usually performs better than aten bmm
* Add support for addmm
* Enable for inference and AOTI
Below are the results of running TritonBench on split-K shapes, comparing aten performance against pt2_triton, which now autotunes over decompose_k. On average we see a >10% speedup over aten, and for some shapes over 3x the performance of the previous best Triton mm:
<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />
TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />
We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.
Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
Add the option of providing a Subgraph as an autotuning choice in Inductor. This is crucial for implementing the split-K optimization for GEMMs by decomposing an mm into a bmm. https://github.com/pytorch/pytorch/pull/150654 uses these changes to add decomposeK as a default autotuning choice for aten.mm in Inductor.
Using https://github.com/pytorch/pytorch/pull/150654 and a simple script:
```
import torch

def f(a, b):
    return torch.matmul(a, b)

def decompose_func(a_in, b_in):
    M, K = a_in.shape
    K, N = b_in.shape
    # TODO: Ideally we want to autotune over this parameter
    kPartitions = 256
    assert K % kPartitions == 0, "K must be divisible by kPartitions"
    B = K // kPartitions
    a_reshaped = a_in.reshape(M, B, kPartitions).transpose(
        0, 1
    )  # Shape: (B, M, kPartitions)
    b_reshaped = b_in.reshape(B, kPartitions, N)  # Shape: (B, kPartitions, N)
    result = torch.bmm(a_reshaped, b_reshaped)  # Shape: (B, M, N)
    return result.sum(dim=0).to(torch.float16)  # Sum over B dimension, Shape: (M, N)

for k in [4096, 8192, 12288, 16384, 20480, 24576, 28672, 32768]:
    a = torch.randn(32, k, dtype=torch.float16, device="cuda", requires_grad=True)
    b = torch.randn(k, 32, dtype=torch.float16, device="cuda", requires_grad=True)
    compiled_res = torch.compile(f, dynamic=False)(a, b)
    decompose_res = decompose_func(a, b)
    print(f"Compiled mm result close to aten: {torch.allclose(f(a, b), compiled_res, atol=1e-5, rtol=0.5)}")
    print(f"Compiled mm result close to decompose: {torch.allclose(decompose_res, compiled_res, atol=1e-5, rtol=0.5)}")
```
we are able to autotune the decomposeK decomposition against aten and the traditional Triton templates in Inductor. DecomposeK is about 10% faster than aten on average and achieves a >4x speedup over the best Triton templates on an H100 machine, e.g.:
```
AUTOTUNE mm(32x28672, 28672x32)
decompose_k_mm 0.0126 ms 100.0%
mm 0.0144 ms 87.5%
triton_mm_69 0.0579 ms 21.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_75 0.0677 ms 18.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_76 0.0850 ms 14.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_68 0.1444 ms 8.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=5, num_warps=4
triton_mm_72 0.1546 ms 8.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
triton_mm_74 0.1819 ms 6.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=4, num_warps=4
triton_mm_67 0.1917 ms 6.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=2, num_warps=4
triton_mm_73 0.2766 ms 4.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=32, BLOCK_N=32, EVEN_K=True, GROUP_M=8, num_stages=3, num_warps=4
```
https://pastebin.com/g3FMaauT is the generated code from Inductor containing the subgraph decomposition for aten.mm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150653
Approved by: https://github.com/eellison