For prologues that only do loads (like gathers) or dtype conversions, and no actual arithmetic on lower-precision types, we can codegen them without upcasting to fp32 and without changing numerics.
Prologues that actually do arithmetic will need to use invoke quant, but I would like to support upcasts/gathers out of the box.
We could potentially extend this in the future to avoid upcasting max pooling operations as well, if there were perf benefits to be had (less likely).
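To make the "no numerics change" claim concrete, here is a minimal standalone check (a hypothetical illustration, not a test from this PR): a pure dtype conversion produces bitwise-identical values whether or not it routes through fp32.
```
import torch

# bf16 -> fp16 directly vs. via an fp32 upcast: both paths round the same way,
# so a conversion-only prologue can skip the upcast without changing what feeds tl.dot.
a = torch.randn(64, 64, dtype=torch.bfloat16)
direct = a.to(torch.float16)
via_fp32 = a.to(torch.float32).to(torch.float16)
assert torch.equal(direct, via_fp32)
```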
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142402
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400, #142401
We load inputs to prologue fusion with a mask. The masked-out values must still be zero before we run `tl.dot`. Previously, we would always apply the mask:
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tl.where(a_mask, tmp1, 0.0)
```
Now we skip the `tl.where` when it is not needed:
```
tmp0 = tl.load(in_ptr1 + (tl.broadcast_to(xindex, xindex.shape)), a_mask, eviction_policy='evict_last')
tmp1 = tmp0.to(tl.float32)
a = tmp1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142401
Approved by: https://github.com/jansel
ghstack dependencies: #134532, #142350, #142400
This PR extends our ability to fuse pointwise nodes onto the outputs of triton templates (epilogue fusion) with the ability to fuse pointwise nodes onto the inputs of triton templates - prologue fusion.
Similar to the store_output api:
`{{store_output(("idx_m", "idx_n"), "acc", "mask")}}`
And the modification api:
```
{{ modification(
    subgraph_number=0,
    output_name="post_mod_scores",
    score="qk",
    out="qk"
) | indent_except_first(1) }}
```
We have:
```
{{load_input("B", "b", ("idx_m", "idx_n"), mask=None if EVEN_K else "b_mask", indent_width=8)}}
```
Because we are now loading the input with explicit indices and mask, I needed to rewrite the mm kernel to no longer update the [pointers by BLOCK_K](bb03ef7aca/torch/_inductor/kernel/mm.py (L110-L111)) on every iteration and instead compute indices from the k_idx of each loop iteration. This did not have any perf difference.
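As a plain-Python analogy of the addressing change (a hypothetical sketch, not the actual mm template), the difference is advancing a running offset versus recomputing the K indices from the loop counter, which is what lets `load_input` emit explicit indices plus a mask:
```
import numpy as np

def k_blocks_pointer_advance(a, BLOCK_K):
    # Old style: keep a running offset and bump it by BLOCK_K every iteration.
    blocks, off = [], 0
    for _ in range(0, a.shape[1], BLOCK_K):
        blocks.append(a[:, off:off + BLOCK_K])
        off += BLOCK_K
    return blocks

def k_blocks_from_indices(a, BLOCK_K):
    # New style: derive the K indices from the loop counter each iteration and
    # mask off the tail, mirroring the explicit indices + mask used by load_input.
    blocks = []
    for k_start in range(0, a.shape[1], BLOCK_K):
        k_idx = k_start + np.arange(BLOCK_K)
        k_idx = k_idx[k_idx < a.shape[1]]
        blocks.append(a[:, k_idx])
    return blocks
```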
There are a couple main use cases for prologue fusion:
- Fusing dequants into a matmul, particularly for more bandwidth-bound scenarios (see the sketch after this list).
- Fusing a gather into a matmul. This is useful particularly in MoE. See https://github.com/pytorch/pytorch/issues/134535 for more details.
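A PyTorch-level sketch of both use cases (hypothetical function and variable names, chosen for illustration): in each case the pointwise/gather work on the matmul input is what prologue fusion can pull inside the template.
```
import torch

def dequant_mm(w_int8: torch.Tensor, scale: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # The dequant (cast + scale) is a pointwise prologue on the matmul input; fusing it
    # avoids materializing a full dequantized copy of the weight in global memory.
    return (w_int8.to(x.dtype) * scale) @ x

def gather_mm(w: torch.Tensor, expert_idx: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # The gather selects rows (e.g. per-expert weights in an MoE layer); fusing it
    # avoids writing out the gathered matrix before the matmul reads it back.
    return w[expert_idx] @ x
```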
Prologue fusion is generally much less profitable than epilogue fusion, because it must be applied to an element of an input on each loop iteration of the matmul, compared to only once in the epilogue (gather into matmul is a potential exception). Accordingly, we are much less aggressive in attempting prologue fusion. We only attempt fusion if it does not increase the number of memory bytes read inside the triton template, multiplied by a small factor to allow gathers. This rules out reliably unprofitable fusions like an fp32->fp16 conversion inside the kernel. In a future PR we could potentially add an API for being more aggressive if we know we are in a bandwidth-bound regime. See: https://github.com/pytorch/pytorch/pull/134532/files#diff-d2539c9c8dc6a3d7e457767a880612e96d3c85752a77ead49a9e4e00a3e4c3c7R3060-R3066
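In pseudocode, the profitability check amounts to something like the following (a hypothetical sketch of the heuristic described above, not the actual implementation):
```
def should_fuse_prologue(fused_bytes_read: int, unfused_input_bytes: int,
                         gather_allowance: float = 2.0) -> bool:
    # Fuse only if the prologue does not read meaningfully more bytes than the
    # unfused template input would; the small allowance factor leaves room for
    # gathers, which read an extra index tensor.
    return fused_bytes_read <= gather_allowance * unfused_input_bytes
```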
Other notes:
By default we upcast to fp32 inside every kernel. This matches eager numerics. That is fine for epilogues because the upcast is only done once (although it is probably unnecessary for, say, a relu), but it tanks perf for prologues. I am currently using the `codegen_upcast_to_fp32` option to avoid it, but that will not work for libdevice calls that require fp32. We will need https://github.com/pytorch/pytorch/pull/136778/ and dtype-aware codegen to upcast fp16 ops into libdevice calls.
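For reference, opting out looks roughly like this (the flag name comes from the description above; its exact location under the inductor config is an assumption):
```
import torch
from torch._inductor import config

# Keep fp16/bf16 prologue ops in their original dtype instead of upcasting to fp32.
config.triton.codegen_upcast_to_fp32 = False
```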
With prologue fusion, we now have essentially separate kernels for each input and for the output. I had to increase the number of fields that are swapped out in `set_subgraph_body` by a large number :/ I also updated the fusion logic because the inputs will have a different group than the outputs. Maybe as part of enabling multiple outputs, this could get cleaned up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134532
Approved by: https://github.com/jansel
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
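A minimal sketch of such a helper (the name and exact location are assumptions, not necessarily what the PR landed):
```
def prefix_is_reduction(prefix: str) -> bool:
    # Covers both the old single prefix "r" and multi-dimensional reduction
    # prefixes such as "r0_" and "r1_".
    return prefix.startswith("r")
```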
Tested by the existing CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/128063 .
Now for this snippet
```
def f(x):
    y = torch.sum(torch.sum(x, dim=-1))
    z = x / 10.0
    z_t = z.t().contiguous().t()
    return y, z, z_t
```
Inductor could generate a single kernel for the first reduction and the two pointwise kernels (if loop ordering after fusion is enabled), and the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise kernels may each access `x` once even if they are fused.)
The PR needs to fix 2 subtle bugs regarding LOAF (loop ordering after fusion):
1. When we reorder loops for a FusedSchedulerNode, we check if each sub-node's sizes match. But some nodes have their sizes as a `list` (if the loop is not reordered) while others have their sizes as a `tuple` (if the loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`, but without strong enforcement, future code could break this. So I just convert the sizes to a uniform type before comparison (see the sketch after this list).
2. We have a cache for the tiling decision of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel.
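As referenced in item 1, a minimal sketch of the normalization (a hypothetical helper, not the exact code in the PR):
```
from typing import Sequence, Tuple

def normalize_sizes(sizes: Sequence[Sequence[int]]) -> Tuple[Tuple[int, ...], ...]:
    # Per-node sizes may arrive as lists (loop never reordered) or tuples (loop
    # reordered); convert everything to tuples so equality checks between the
    # sub-nodes of a FusedSchedulerNode don't fail on container type alone.
    return tuple(tuple(s) for s in sizes)

# e.g. normalize_sizes([[8, 64], (8, 64)]) == ((8, 64), (8, 64))
```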
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.
This PR checks the value to determine which dtype to use for indexing: if it is known to be less than int32's max, then use int32 (and add guards if relevant); if we can't check (e.g. an unbacked symint), then use int64.
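The selection rule is roughly the following (a hedged sketch; the helper name and signature are assumptions):
```
import torch

INT32_MAX = 2**31 - 1

def index_dtype_for(max_value_hint) -> torch.dtype:
    # Use int32 only when the value is provably below int32's max (a guard is
    # added where relevant); fall back to int64 when no bound is known, e.g. for
    # an unbacked symint.
    if max_value_hint is not None and max_value_hint <= INT32_MAX:
        return torch.int32
    return torch.int64
```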
Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
Related issue: #125077
### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.
This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.
Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[:, None, None]`.)
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 8
xnumel = 8
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)
This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.
Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary:
1. Move the debug printer call a level lower, to here: https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen
The benefit of having the debug printer call happen in a more centralized place is that it 1) reduces the duplicated debug printer logic scattered throughout the codebase, and 2) handles more triton kernel codegen paths: as long as a path invokes `generate_kernel_call()`, debug printing comes for free. For example, it automatically handles/supports the user_defined_kernel case, which is a pretty common use case we encounter in debugging.
Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```
Also verified that templateKernel codegen path still works
Differential Revision: D61949020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
Summary:
Follow-up small diff to fix a couple of issues:
- Add a condition for the cuda/gpu case to only print the kernel name list in the second pass, i.e. when we do the cpp wrapper codegen
- Other minor fixes around the `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option
Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```
Differential Revision: D60954888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
Fixes #125077
**Feature**
This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
1. All discontiguous reads/writes have the same shape.
2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.
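For reference, opting in looks like this (assuming the flag lands under `torch._inductor.config.triton` as described above):
```
import torch
from torch._inductor import config

config.triton.prefer_nd_tiling = True  # off by default

# e.g. an elementwise add over discontiguous views, as in the kernels shown below
f = torch.compile(lambda a, b: a + b)
```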
Here's an example kernel (elementwise add of views) with ND tiling disabled:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 21
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex % 7
x1 = (xindex // 7)
x2 = xindex
tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
And here's the version with it enabled:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 3
xnumel = 7
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.
**Test plan**
This PR adds some tests for pointwise ops with discontiguous tensors.
- Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
- Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
- Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.
This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.
Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.
**Limitations and follow-ups**
I can see two main improvements which would expand the usefulness of this feature:
1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.
2. The usefulness of this feature depends on `config.triton.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
**Context:**
Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)
The way we were using this function was a “manual” process: we inject this function into the generated output.cpp file, then recompile and reload the file. This diff automates the value-printing process.
**Changes:**
1. Added a simple initial debug printer helper to print out tensor values
2. Added a filter option to selectively dump tensor values.
**Usage:**
Sample cmd:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```
Sample outputs :
```
[ before_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - triton_poi_fused_0 - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ before_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf1 ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0
[ after_launch - aoti_torch_cuda_addmm_out - buf0 ]:
0.6331
1.6358
-0.3459
1.0196
-0.4122
1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s
OK
```
The user is able to filter which kernels' values are printed by specifying the env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT`, and can see the choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']
```
In a follow-up diff, I will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can later be loaded with `torch.load()`.
Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)
`AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code" python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`
Differential Revision: D60538496
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)""
The original diff D54134695 was reverted because of failures in the ads nightly cogwheel tests.
The root cause: the logic for generating masks in Triton kernels needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.
See D54134695 or #124969 for more details.
Test Plan:
Originally failed tests
f585704630
f585733786
Diff patched:
f586664028
f586663820
Differential Revision: D60458597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
Python's set is non-deterministic. There is an internal failure which we recently ran into that did not consistently fail.
See the repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow-ups we could also consider adding a lint rule for uses of either `set()` or set literals.
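A small illustration of the fix direction (a hypothetical snippet, not the PR's actual change): replacing a plain set with an insertion-ordered container keeps iteration order stable across runs.
```
names = ["buf2", "buf0", "buf1", "buf0"]

# set(): iteration order depends on hashing and can differ between interpreter runs
# (e.g. with string keys under hash randomization), which makes codegen non-deterministic.
unordered = set(names)

# dict.fromkeys(): deduplicates while preserving insertion order, so iteration is stable.
ordered = list(dict.fromkeys(names))  # ['buf2', 'buf0', 'buf1']
```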
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
Persistent kernels are sometimes able to remove intermediate buffers that would otherwise be needed for the non-persistent reduction kernel. This makes multi-kernel's codegen more complicated, as it needs to drop these extra arguments at runtime after selecting the correct kernel to run.
Instead, this PR updates the persistent kernel's `must_keep_buffers` so these aren't dropped during codegen, and both kernels have the same signature.
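Conceptually, keeping the signatures aligned lets the runtime choice stay trivial (a hypothetical sketch, not inductor's actual multi-kernel codegen):
```
def run_multi_kernel(pick_persistent, persistent_kernel, reduction_kernel, args):
    # Both kernels accept the same argument list, so selecting one at runtime
    # does not require filtering out buffers that only one of them needs.
    kernel = persistent_kernel if pick_persistent() else reduction_kernel
    return kernel(*args)
```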
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044