This PR enables all PIE rules in ruff. Some rules from this family were already enabled; the newly added rules are
```
PIE796 Enum contains duplicate value: {value}
PIE808 Unnecessary start argument in range
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
Fixes the case described below which occurs when:
- A user `torch.compile`s a function that uses a triton kernel.
- `TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1` is set.
Problem:
If the user-defined triton kernel is not autotuned:
```python
import os
os.environ["TORCHINDUCTOR_DUMP_LAUNCH_PARAMS"] = "1"

import torch
import triton
import triton.language as tl

@triton.jit
def kernel(..., BLOCK_SIZE: tl.constexpr):
    ...

@torch.compile
def fn(..):
    kernel[..](..., 128)

fn(..)
```
Then, in `triton_heuristics._interpret_args_grid`, the `filtered_signature` function:
```python
def filtered_signature() -> list[str]:
    # constexprs are not passed in as args
    return [
        x
        for x in self.triton_meta["signature"].keys()
        if x not in cfg.kwargs.keys()
    ]
```
Because `triton.autotune` is not used on the `triton.jit` function, `cfg` above will be empty, so `BLOCK_SIZE` will not be removed from the signature even though it is a constexpr and is removed from the arguments passed to `_interpret_args_grid`. This results in a mismatch between the number of parameters in the signature and the number of arguments, which leads to the error `NameError: name '_grid_2' is not defined`.
Fix:
Use the triton jit kernel's `constexprs` to determine which args to remove. Not sure if this is a good fix, so suggestions are welcome.
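A rough sketch of what that could look like (not the PR's actual code; it assumes `self.fn` is the underlying `triton.JITFunction` and that its `arg_names`/`constexprs` attributes are available here):
```python
def filtered_signature(self) -> list[str]:
    # Use the JIT function's own constexpr metadata instead of cfg.kwargs,
    # so constexpr args are dropped even when no autotune configs exist.
    constexpr_names = {self.fn.arg_names[i] for i in self.fn.constexprs}
    return [
        x
        for x in self.triton_meta["signature"].keys()
        if x not in constexpr_names
    ]
```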
Test plan:
Added a parameter to an existing triton kernel to test for this edge case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161924
Approved by: https://github.com/davidberard98
Short-term fix for https://github.com/pytorch/pytorch/issues/160333
The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's `__torch_dispatch__`.
The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.
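Roughly, the guard being copied looks like this (a sketch only; the exact type checks live in FunctionalTensorMode's `__torch_dispatch__`):
```python
import torch
from torch._subclasses.fake_tensor import FakeTensor
from torch._subclasses.functional_tensor import FunctionalTensor

def should_defer_to_subclass(types) -> bool:
    # If the dispatch involves a tensor type that functionalization does not
    # recognize, return NotImplemented from the decomposition so the
    # subclass's own __torch_dispatch__ gets to handle the op.
    unrecognized = [
        t
        for t in types
        if t not in (torch.Tensor, FunctionalTensor)
        and not issubclass(t, FakeTensor)
    ]
    return len(unrecognized) > 0
```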
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
Fixes #155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
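For illustration, a user-defined kernel of the shape that triggers this (the kernel body is a made-up example); the docstring's triple quotes collide with the triple-quoted block Inductor emits:
```python
import triton
import triton.language as tl

@triton.jit
def add_one(x_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    """Add 1 to each element of x, in place."""  # these quotes end up inside
    # the triple-quoted source block that Inductor generates around the kernel
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x + 1, mask=mask)
```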
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`
Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.
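For context, a minimal sketch of a kernel using the new device-side TMA API (shapes, strides, and the copy itself are illustrative; in eager Triton the caller would also need to register a scratch allocator, e.g. via `triton.set_allocator`):
```python
import triton
import triton.language as tl

@triton.jit
def tma_copy_kernel(in_ptr, out_ptr, M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Descriptors are built on the device; Triton materializes them in the
    # global scratch space that the host (here: AOTI codegen) allocates.
    in_desc = tl.make_tensor_descriptor(
        in_ptr, shape=[M, N], strides=[N, 1], block_shape=[BLOCK_M, BLOCK_N]
    )
    out_desc = tl.make_tensor_descriptor(
        out_ptr, shape=[M, N], strides=[N, 1], block_shape=[BLOCK_M, BLOCK_N]
    )
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    tile = in_desc.load([pid_m * BLOCK_M, pid_n * BLOCK_N])
    out_desc.store([pid_m * BLOCK_M, pid_n * BLOCK_N], tile)
```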
To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs
This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)
Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).
For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there are two triton kernels - one for the old API and one for the new API - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.
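For reference, a sketch of the kind of user kernel this covers (block shapes, sizes, and the import path follow recent Triton releases and may need adjusting):
```python
import torch
import triton
import triton.language as tl
from triton.tools.tensor_descriptor import TensorDescriptor

@triton.jit
def desc_copy_kernel(in_desc, out_desc, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid = tl.program_id(0)
    tile = in_desc.load([pid * BLOCK_M, 0])
    out_desc.store([pid * BLOCK_M, 0], tile)

x = torch.randn(256, 64, device="cuda")
y = torch.empty_like(x)
BLOCK_M, BLOCK_N = 64, 64
# Stable API: the TMA descriptor is built on the host and passed to the kernel.
in_desc = TensorDescriptor.from_tensor(x, block_shape=[BLOCK_M, BLOCK_N])
out_desc = TensorDescriptor.from_tensor(y, block_shape=[BLOCK_M, BLOCK_N])
desc_copy_kernel[(x.shape[0] // BLOCK_M,)](in_desc, out_desc, BLOCK_M, BLOCK_N)
```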
Major changes:
* `TMADescriptorMetadata` has changed: previously it was a `dict[str, tuple[list[int], list[int], int]]`. Now there are two metadata formats, one for the experimental API and one for the stable API: the metadata is `dict[str, tuple[str, tuple[...]]]`, where `tuple[...]` is `tuple[list[int], list[int], int]` for the experimental API and `tuple[list[int],]` for the stable API. Most handling of the metadata then has to branch on whether it represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (i.e. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)
Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.
On-device TMA works like this:
```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
    tl._experimental_descriptor_store(workspace_ptr, data, ...)
```
The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain the descriptor data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.
The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); then, during mutation analysis, any TMA creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) actually registers as a mutation to the associated global data pointer (e.g. `A_ptr` in the above example).
Consider this example, which I'll use to describe the pros/cons of this approach.
```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)

@triton.jit
def kernel(A, B, workspace_ptr):
    create_tma(A, workspace_ptr)
    workspace_B = workspace_ptr + 128
    create_tma(B, workspace_B)
    data = tl._experimental_descriptor_load(workspace_ptr, ...)
    tl._experimental_descriptor_store(workspace_B, data, ...)
```
An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).
A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.
Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.
When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows
Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic
Run 1, Invocation 2: We detect varying inputs, do a dynamic compile, get a dynamic graph, and save to PGO. Crucially, what we save to PGO is a superset of what is actually dynamic: if we notice an input varying, we mark it as dynamic in PGO even if that value later gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where, although we produce the same isomorphic graph, PGO makes the second run cache miss. Let's see how...
Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.
We solve this problem by hashing the source names, which gives a reasonably stable symbol assignment. To avoid catastrophic symbol collisions, we use linear probing, which guarantees no two sources share a symbol.
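A toy illustration of the scheme (not the actual dynamo code): the symbol index comes from a hash of the source name, and linear probing walks to the next free slot on collision, so assignment is stable across runs without ever colliding.
```python
import hashlib

def assign_symbol_indices(source_names, table_size=1 << 16):
    used = set()
    assignment = {}
    for name in source_names:
        idx = int(hashlib.sha256(name.encode()).hexdigest(), 16) % table_size
        # Linear probing: step forward until an unused slot is found.
        while idx in used:
            idx = (idx + 1) % table_size
        used.add(idx)
        assignment[name] = idx
    return assignment

# The same sources map to the same indices on every run, so a cached graph
# keeps matching even when PGO over-marks inputs as dynamic.
print(assign_symbol_indices(["L['x']", "L['y']"]))
```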
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
Summary: We currently fail the mutation analysis for all inline_asm ops. In this diff, we handle the case when `is_pure` is set to True, since it indicates the operation doesn't mutate its inputs.
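For reference, a sketch of the kind of kernel this affects (PTX, constraints, and dtypes are illustrative, assuming int32 inputs):
```python
import triton
import triton.language as tl

@triton.jit
def asm_add_one(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # is_pure=True declares that the asm only computes a value and never
    # writes through a pointer, which the mutation analysis can now rely on.
    y = tl.inline_asm_elementwise(
        "add.s32 $0, $1, 1;",
        "=r,r",
        [x],
        dtype=tl.int32,
        is_pure=True,
        pack=1,
    )
    tl.store(out_ptr + offs, y, mask=mask)
```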
Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel
```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok
----------------------------------------------------------------------
Ran 2 tests in 0.706s
OK
```
Differential Revision: D69878591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.
**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)
```python
@triton.jit
def branch_with_multiple_yield_args(
in_ptr0,
in_ptr1,
out_ptr,
conditional_ptr,
n_elements,
BLOCK_SIZE: "tl.constexpr",
):
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
conditional = tl.load(conditional_ptr)
if conditional:
in0 = in_ptr0 + 1
in1 = in_ptr1 + 1
out = out_ptr + 1
else:
in0 = in_ptr0
in1 = in_ptr1
out = out_ptr
x = tl.load(in0 + offsets, mask=mask)
y = tl.load(in1 + offsets, mask=mask)
tl.store(out + offsets, x + y, mask=mask)
```
The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters. When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.
The if branch gets converted to TTIR like this:
```mlir
%21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
  ...
  scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
} else {
  scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
} loc(#loc7)
```
and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).
**This PR** duplicates the `scf.yield`, adding one artificial `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.
Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
One of the tests in this file was setting `torch._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file.
This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.
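A minimal sketch of the pattern (the PR may restore the exact prior log settings rather than hard-coding the default):
```python
import contextlib
import torch

@contextlib.contextmanager
def dump_output_code():
    # Enable output_code logging only for the duration of the block, then
    # switch it back off so later tests in the file see the default behavior.
    torch._logging.set_logs(output_code=True)
    try:
        yield
    finally:
        torch._logging.set_logs(output_code=False)
```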
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants).
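For illustration, what this implies for a hypothetical kernel `def k(x_ptr, n, BLOCK: tl.constexpr)` launched with `n == 1` (field values are made up):
```python
triton_meta = {
    "signature": {
        "x_ptr": "*fp32",
        "n": "constexpr",      # specialized constant (its value happens to be 1)
        "BLOCK": "constexpr",  # declared tl.constexpr
    },
    "constants": {
        "n": 1,
        "BLOCK": 128,
    },
}
```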
Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between.
To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515
Approved by: https://github.com/SamGinzburg
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.
We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.
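A toy sketch of the idea (helper and field names are illustrative, not Inductor's actual API): record the fake tensors observed at trace time in the node's meta so later passes can still see the original strides.
```python
def record_traced_inputs(node, fake_args, fake_kwargs):
    # Stash the fake tensors seen while tracing the user-defined triton kernel.
    node.meta["arg_kwarg_vals"] = (tuple(fake_args), dict(fake_kwargs))

def traced_input_strides(node):
    fake_args, _ = node.meta["arg_kwarg_vals"]
    return [t.stride() for t in fake_args if hasattr(t, "stride")]
```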
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits.
What this PR fixes:
* in triton_kernel_wrap.py, AST->TTIR parsing had to be updated for the new triton API
* ir.py - don't remove None args when using newer triton versions
* wrapper.py - update signature & constant handling
What this doesn't fix:
* correct None handling - I want to do a closer look at constant handling (including None, equal_to_1, and other constants).
* cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels)
test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: 1374074098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348
Approved by: https://github.com/jansel
ghstack dependencies: #145051
Per title. Before these changes, the tests below:
```
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
```
fail with the following message:
```
__________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test
    test(self, **param_kwargs)
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks
    eager_out = f(a, b)
                ^^^^^^^
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f
    t.element_size(),
    ^
UnboundLocalError: cannot access local variable 't' where it is not associated with a value
To execute this test, run the following from the base repo dir:
python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271
Approved by: https://github.com/jansel