148 Commits

Author SHA1 Message Date
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff, there are already some enabled rules from this family, the new added rules are
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
eb3fbf5b08 [inductor] in emulate_precision_casts, disable fma fusion in triton (#163073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163073
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-23 23:59:17 +00:00
03798b0f91 [inductor] Fix removal of constexpr args from the launcher signature (#161924)
Fixes the case described below which occurs when:
- A user `torch.compile`s a function that uses a triton kernel.
- `TORCHINDUCTOR_DUMP_LAUNCH_PARAMS=1` .

Problem:

If the user defined triton kernel is not autotuned:

```python
import os
os.environ["TORCHINDUCTOR_DUMP_LAUNCH_PARAMS"] = "1"
@triton.jit
def kernel(..., BLOCK_SIZE: tl.constexpr):
    ...
@torch.compile
def fn(..)
    kernel[..](..., 128)

fn(..)
```

Then In `triton_heuristics. _interpret_args_grid`, `filter_signature` function:

```python
        def filtered_signature() -> list[str]:
                # constexprs are not passed in as args
                return [
                    x
                    for x in self.triton_meta["signature"].keys()
                    if x not in cfg.kwargs.keys()
                ]
```

because `triton.autotune` is not used on the the `triton.jit` function, `cfg` above will be empty, and so `BLOCK_SIZE` will not be removed from the signature even though it is constexpr, even though it is removed from the arguments that are passed in to `interpret_args_grid`. This results in a mismatch between the number of parameters in the signature and the number of arguments, which leads to the error `NameError: name '_grid_2' is not defined`.

Fix:

Use the triton jit kernel `constexprs` for args to remove.  Not sure if this is a good fix so suggestions are welcome.

Test plan:

Added a parameter to an existing triton kernel to test for this edge case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161924
Approved by: https://github.com/davidberard98
2025-09-12 13:58:09 +00:00
81aeefa657 Add torch.compile support for triton.constexpr_function (#162106)
Fixes #161868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162106
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-09-04 16:46:55 +00:00
2c0650a00a Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)"
This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e.

Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200))
2025-08-22 19:10:35 +00:00
8dbe7f99bd [BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)
allow_tf32 is deprecated. Also, this will make it easier to support tf32x3 (i.e. #160359).

dashboard results on h100 show no change: [inference](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f), [training](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160711
Approved by: https://github.com/PaulZhang12, https://github.com/njriasan
2025-08-19 05:27:10 +00:00
10bc36fe84 Get tensor subclasses and torch.library.triton_op to dispatch correctly (#160341)
Short-term fix for https://github.com/pytorch/pytorch/issues/160333

The problem is:
1) `triton_op` adds a decomposition for FunctionalTensorMode for this operation
2) Tensor Subclasses rely on FunctionalTensorMode's `__torch_dispatch__` returning NotImplemented.
3) `triton_op`'s FunctionalTensorMode decomposition takes precedence over FunctionalTensorMode's decomposition.

The easy fix is to copy-paste the FunctionalTensorMode's NotImplemented
return logic into the decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160341
Approved by: https://github.com/drisspg
2025-08-12 04:09:37 +00:00
af10f1f86c Fix requires_cuda to requires_cuda_and_triton (#160222)
Fixes ##159399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222
Approved by: https://github.com/janeyx99
2025-08-10 07:05:52 +00:00
50f23ff6f8 rename-HAS_CUDA-to-HAS_CUDA_AND_TRITON (#159883)
Fixes #159399
"Modified torch.testing._internal.inductor_utils and test/inductor"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159883
Approved by: https://github.com/janeyx99
2025-08-08 15:44:52 +00:00
8cb91e20bc Renaming HAS_XPU to HAS_XPU_AND_TRITON (#159908)
This PR follows up on the discussion in #159399 where @Akabbaj and @janeyx99 mentioned renaming HAS_XPU to HAS_XPU_AND_TRITON for consistency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159908
Approved by: https://github.com/janeyx99, https://github.com/guangyey
2025-08-07 11:24:44 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
82eefaedd9 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-02 14:02:01 +00:00
ab6cb34480 Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)"
This reverts commit 563fd95563c5edd732ae260b3bd3d0c38822ab57.

Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))
2025-07-01 23:21:37 +00:00
563fd95563 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-01 22:51:11 +00:00
b6c00dfe24 [user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device)
* updates tests to verify this works for dynamically shaped inputs

This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)

Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
2025-06-27 04:28:04 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
c83041cac2 [test][triton pin] add device-side TMA tests (AOTI + test_triton_kernels) (#155827)
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```

These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
2025-06-15 20:24:19 +00:00
b7c95acc6c [user triton] triton_kernel_wrap support for new host-side TMA API (#155777)
This adds support for user-defined triton kernels using TensorDescriptor.from_tensor into triton_kernel_wrap: i.e. storing metadata about the TMA descriptors and doing mutation analysis.

Major changes:
* TMADescriptorMetadata has changed: previously it was a dict[str, tuple[list[int], list[int], int]]. But now there are two metadata formats: one for experimental API and one for stable API. Now the metadata format is dict[str, tuple[str, tuple[...]]], where tuple[...] is tuple[list[int], list[int], int] for experimental and tuple[list[int],] for stable API. And then most handling of the metadata has to be branched based on whether the metadata represents a stable or experimental TMA descriptor
* mutation analysis: unlike experimental TMA (where the mutation analysis / ttir analysis pretends that the TMA descriptor is actually just a tensor), we need to construct an actual TMA descriptor before getting the Triton frontend to create the TTIR (otherwise assertions fail). A TensorDescriptor (i.e. stable TMA API descriptor) passed into a python triton kernel actually turns into 1 + 2*N parameters in the TTIR (for a rank-N tensor), so the arg list also needs to be patched for this reason (in generate_ttir)
* mutation analysis: now we also need to pass tma_descriptor_metadata into the mutation analysis, in order to create the TMA descriptors that are passed into the frontend code (ie. the previous point). This is why all the mutation tests are modified with an extra return value (the tma_descriptor_metadata)

Inductor is not modified (Inductor just errors out if you use a stable API tma descriptor). This will be the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155777
Approved by: https://github.com/aakhundov
2025-06-15 20:24:19 +00:00
9328a7fb58 [triton pin][tests] refactor test_triton_kernel.py tests to test new & old API (#155510)
This splits out the tests so we can independently test both the new and old API.

Note: the new API doesn't work yet - we still need to fix those tests.

Differential Revision: [D76318840](https://our.internmc.facebook.com/intern/diff/D76318840)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155510
Approved by: https://github.com/oulgen
2025-06-11 13:52:15 +00:00
a9a0501ec4 [user triton] mutation analysis for on-device TMA (#155380)
Previously, the user-defined triton kernel mutation analysis would not detect mutation caused by TMA store, if the TMA descriptor was created via on-device TMA creation. This PR adds partial support for mutation analysis on programs that do stores via on-device TMA.

On-device TMA works like this:

```
@triton.jit
def kernel(A_ptr, workspace_ptr, ...):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, A_ptr, ...)
    tl._experimental_descriptor_store(workspace_ptr, data, ...)
```

The first call (tensormap_create2d) mutates the contents of workspace_ptr to contain a data (including the fact that this TMA descriptor points to A_ptr). The second call (experimental_descriptor_store) writes to the location specified by the data in workspace_ptr: A_ptr, in this case.

The approach here is to do a first pass to identify all the experimental_descriptor_stores (and collect the associated descriptor values); and then during mutation analysis, any tma creation on a mutated descriptor value (e.g. on `workspace_ptr` in the above example) will actually register as a mutation to the associated data pointer (e.g. `data` in the above example).

Consider this example, which I'll used to describe the pros/cons of this approach.

```
@triton.jit
def create_tma(global_ptr, workspace_ptr):
    tl.extra.cuda.experimental_device_tensormap_create2d(workspace_ptr, global_ptr, ...)

@triton.jit
def kernel(A, B, workspace_ptr):
    create_tma(A, workspace_ptr)
    workspace_B = workspace_ptr + 128
    create_tma(B, workspace_B)
    data = tl._experimental_descriptor_load(workspace_ptr, ...)
    tl._experimental_descriptor_store(workspace_B, data, ...)
```

An alternative approach could be to simply modify the `tl.extra.cuda.experimental_device_tensormap_create2d` so that it returns a descriptor, and to use that descriptor in subsequent uses (i.e. to "functionalize" the uses of the tma creation API). However, this would (a) require "functionalization" through any function calls (e.g. to `create_tma`), and (b) would lead to both `A` and `B` being marked as mutated (i.e. mutation to `workspace_B` -> mutation to `workspace_ptr` -> mutation to `A`).

A downside of the current approach is that it doesn't understand offsets into workspaces. e.g. if one were to recompute workspace_B instead of reusing the variable, the analysis pass would not understand that these values point to the same descriptor.

Differential Revision: [D76175117](https://our.internmc.facebook.com/intern/diff/D76175117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155380
Approved by: https://github.com/oulgen
2025-06-10 00:07:18 +00:00
71c8231742 fix bug with TORCHINDUCTOR_DUMP_LAUNCH_PARAMS (#153066)
Summary:
https://fb.workplace.com/groups/1028545332188949/posts/9503194033132340/?comment_id=9504669536318123&reply_comment_id=9506405459477864&notif_id=1746154132646897&notif_t=work_group_comment_mention

Aligns the arguments for the triton inputs

Differential Revision: D74085173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153066
Approved by: https://github.com/jansel
2025-05-12 23:56:49 +00:00
3d777bae10 Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #148104
2025-05-03 00:02:24 +00:00
a0e796df03 Revert "Inductor respects exact strides on custom ops by default (#150511)"
This reverts commit a4bb2f106f8cc642539d4698b6d869a87adca92f.

Reverted https://github.com/pytorch/pytorch/pull/150511 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/14357056427/job/40251630946) [HUD commit link](2e7c9d33e7) ([comment](https://github.com/pytorch/pytorch/pull/148104#issuecomment-2790369493))
2025-04-09 16:49:48 +00:00
a4bb2f106f Inductor respects exact strides on custom ops by default (#150511)
If a tag is not specified on a custom operator, then inductor will
assume that it needs exact strides.

Test Plan:
- tests + CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150511
Approved by: https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #150495, #148104
2025-04-09 16:46:48 +00:00
4d121d2b02 Implement needs_exact_strides for mutable custom operators (#148091)
Mutable custom operators get wrapped into an auto_functionalized HOP, so
we need to store the arg_kwarg_vals on the auto_functionalized HOP
itself.

When Inductor does the re-inplacing, it'll use the pattern matcher to
decompose the auto_functionalized HOP back into the original op (and
0+ other view or clone operations). The pattern matcher uses the
arg_kwarg_vals to trace the subgraph to do the decomposition, so it
ultimately sets arg_kwarg_vals on the original op's node correctly.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148091
Approved by: https://github.com/eellison
ghstack dependencies: #148046, #148063
2025-04-02 13:18:04 +00:00
c69c3c885e Add needs_exact_strides operator tag for Inductor to force exact strides (#148063)
Inductor will force exact strides on a custom operator tagged with
needs_exact_strides. I'll make this the default in a follow-up PR.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148063
Approved by: https://github.com/eellison
ghstack dependencies: #148046
2025-04-02 13:17:58 +00:00
49b7d0d84d [ROCm] Enable more inductor UTs (#149513)
Primarily enable inductor fp8 tests, also enable other inductor tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149513
Approved by: https://github.com/jeffdaily
2025-04-01 00:30:36 +00:00
f649ee73ce Use source hashing to generate consistent symbolic ids (#149665)
This PR was inspired by internal models that were cache missing due to PGO. At a high level the problem looks as follows

Run 1, Invocation 1: We do static compile, save some example values in PGO/automatic dynamic

Run 1, Invocation 2: We detect varying inputs, do dynamic compile, get a dynamic graph and save to PGO. Crucially what we save to PGO is actually a superset of what is actually dynamic. If we notice an input was varying, we mark it as dynamic in PGO even if later on that value gets specialized. When a value gets specialized, we actually remove the symbol from the graph. This results in an interesting conundrum where although we are producing the same isomorphic graph, PGO makes the second run cache miss. Let's see how....

Run 2, Invocation 1: We fetch the PGO, over-mark things as dynamic, get a fx graph, look it up in the cache and... whoops! cache miss! This is because of the aforementioned behavior where the PGO profile will cause us to over-allocate symbols. In practice this means we end up saving a graph in cache with symbols x:s1, y:s3 and on second attempt we cache miss with x:s1, y:s6 where symbols s3,s4,s5 were all optimistically marked dynamic by PGO and subsequently specialized.

We solve this problem by hashing the source names. This ensures somewhat stable assignment. To prevent catastrophic symbol collisions, we use linear probing to ensure no collisions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149665
Approved by: https://github.com/Mingming-Ding, https://github.com/laithsakka
2025-03-28 05:36:32 +00:00
b040dc3a53 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential [disconnected] Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-12 15:52:16 +00:00
5ada4e6a53 Revert "Reland: [inductor] Simplify grid handling (#148305)"
This reverts commit 8d08b4901586f230353a558ee00c16ad57f95178.

Reverted https://github.com/pytorch/pytorch/pull/148305 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/148305#issuecomment-2718177044))
2025-03-12 14:58:43 +00:00
8d08b49015 Reland: [inductor] Simplify grid handling (#148305)
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583

Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Differential Revision: D70471332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
2025-03-11 18:51:06 +00:00
5fb0f45d3b [triton 3.3] test_triton_kernel_constants fix (#148626)
Thanks @FindHao who did the initial version of this PR: https://github.com/pytorch/pytorch/pull/148505

TL;DR is that https://github.com/triton-lang/triton/pull/5961 deprecates `tl.constexpr` annotations - you're supposed to wrap the constexpr value in `tl.constexpr()` instead.

This just updates the tests to wrap with `tl.constexpr()` (and leaves the annotations - that way the old triton versions will still pass).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148626
Approved by: https://github.com/FindHao
2025-03-06 14:18:21 +00:00
608377d341 Revert "[import][inductor] Simplify grid handling (#147583)"
This reverts commit b59776d8572a56e2d2366174eac11015b1776f1e.

Reverted https://github.com/pytorch/pytorch/pull/147583 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147583#issuecomment-2693016036))
2025-03-03 00:49:32 +00:00
b59776d857 [import][inductor] Simplify grid handling (#147583)
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg.  This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
instead now the grid computation is included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```

This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.

It also allows us to unify the handling of grids between the Python and C++ wrapper code.  Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.

This unification allows this PR to be a net deletion of code.

Note the attached diff contains some minor fbcode-only changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-03-02 07:31:07 +00:00
4995e058bf [user-triton] handle inline_asm_case (#148043)
Summary: We currently failed the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True since it indicates the operation doesn't mutate the input value

Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel

```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok

----------------------------------------------------------------------
Ran 2 tests in 0.706s

OK
```

Differential Revision: D69878591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
2025-02-28 20:52:51 +00:00
9d3636283b [Inductor] Use generic GPU device in test_preserves_strides (#148006)
#147861 added a new test tagged for the generic GPU but uses the cuda GPU type for creating the tensors. Update the GPU type to also be generic. This passes with my local Intel Triton install, presumably it will work for the current pin.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148006
Approved by: https://github.com/eellison, https://github.com/etaf
2025-02-27 02:52:51 +00:00
c839fa4dd2 [Resubmit] Record input strides at time of tracing, constrain to them for triton fn (#147861)
Resubmit of https://github.com/pytorch/pytorch/pull/145448. it lost its changes on rebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147861
Approved by: https://github.com/zou3519
2025-02-26 05:05:06 +00:00
46d1422afd cpp_wrapper: fix inductor triton tests (#146109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146109
Approved by: https://github.com/desertfire
2025-02-25 19:50:37 +00:00
94969d0a40 [inductor][user triton] Handle scf.yield more accurately (#147762)
**TL;DR**: Previously, the mutation analysis for scf.if/scf.for would bundle all the scf.yield arguments into a single op (the scf.yield), such that a mutation on any returned value from the scf.if/scf.for would register as a mutation to _all_ of the scf.yield args. To fix this, this PR artificially introduces a new scf.yield op for each of the scf.yield args.

**Context**: The relevant kernel is something like this one (added as a test in test_triton_kernels.py)

```python
        @triton.jit
        def branch_with_multiple_yield_args(
            in_ptr0,
            in_ptr1,
            out_ptr,
            conditional_ptr,
            n_elements,
            BLOCK_SIZE: "tl.constexpr",
        ):
            pid = tl.program_id(axis=0)
            block_start = pid * BLOCK_SIZE
            offsets = block_start + tl.arange(0, BLOCK_SIZE)
            mask = offsets < n_elements
            conditional = tl.load(conditional_ptr)
            if conditional:
                in0 = in_ptr0 + 1
                in1 = in_ptr1 + 1
                out = out_ptr + 1
            else:
                in0 = in_ptr0
                in1 = in_ptr1
                out = out_ptr
            x = tl.load(in0 + offsets, mask=mask)
            y = tl.load(in1 + offsets, mask=mask)
            tl.store(out + offsets, x + y, mask=mask)
```

The mutation analysis starts with the `tl.store` - and then does a DFS backwards towards the parameters.  When a new op is encountered in the DFS, the analysis pass recurses on the op's arguments.

The if branch gets converted to TTIR like this:

```mlir
    %21:3 = scf.if %20 -> (!tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32>) {
      ...
      scf.yield %31, %32, %33 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc10)
    } else {
      scf.yield %arg0, %arg1, %arg2 : !tt.ptr<f32>, !tt.ptr<f32>, !tt.ptr<f32> loc(#loc11)
    } loc(#loc7)
```

and so the "source" op of the `out` variable is marked as the `scf.yield` op - and then all of the arguments to `scf.yield` are marked as mutable (including arg0, arg1, and arg2 - only one of which is actually mutated).

**This PR** we duplicate the `scf.yield` to add one `scf.yield` per return value. That way we avoid marking all the returns from the scf.if/scf.for as mutated when only some are.

Differential Revision: [D70118202](https://our.internmc.facebook.com/intern/diff/D70118202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147762
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-02-25 08:41:00 +00:00
b11d5cd584 [Inductor UT][Windows][XPU] Fix Inductor UT on XPU Windows. (#146481)
This PR fixed all the inductor UT failures for XPU backend on Windows we found in local machine(Due to resource constraints, we have not yet set up a Windows CI pipeline online.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146481
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #147347
2025-02-22 02:53:16 +00:00
7997ecf809 [BE] reduce log spew from test_triton_kernels.py (#145895)
One of the tests in this file was setting `self._logging.set_logs(output_code=True)` - which would cause logs to be printed for the rest of the tests in this file.

This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
2025-02-04 03:44:23 +00:00
8326d27093 [inductor][5/N] triton support post-#5512, fix 1 and None handling (#145515)
This fixes handling for "1" and "None" args with new Triton versions. TL;DR: triton_meta["constants"] (which is passed to ASTSource) should be a map of {"kwarg_name": constant_value} for values which are tl.constexpr, or have a value of 1 or None (i.e. "specialized" constants). For constant args, triton_meta["signature"][arg_name] should be "constexpr" (even for specialized constants).

Note: This adds support for Triton versions after 5512; but not for versions in between 5220 and 5512 (i.e. `TritonAttrsDescriptorVersion.V3_BACKENDS_TUPLE`). There's a completely different format for constants/signature in the commit range in between.

To test: I ran `test_torchinductor.py` and `test_triton_kernels.py` with the main branch of triton (~jan 27). The only failing tests are aoti-related tests (which need to be fixed as a follow-up), and test_mutable_custom_op_fixed_layout2_cuda (which is failing with or without the new triton version on my machine); additionally, the split-scan/split-reduction kernels rely on https://github.com/triton-lang/triton/pull/5723.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145515
Approved by: https://github.com/SamGinzburg
2025-02-01 02:11:48 +00:00
0d6343347f Revert "Record inputs at time of tracing, constrain to them for triton fn (#145448)"
This reverts commit a699034eeca8c096c44a690e405a60efa442d4ed.

Reverted https://github.com/pytorch/pytorch/pull/145448 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D68779678 for details ([comment](https://github.com/pytorch/pytorch/pull/145448#issuecomment-2622470810))
2025-01-29 18:07:12 +00:00
a699034eec Record inputs at time of tracing, constrain to them for triton fn (#145448)
Record input fake tensors at time of tracing and store them in the node meta. Inductor passes have the possibility of changing strides, so it is safer to record the strides of the inputs at tracing. See, https://github.com/pytorch/pytorch/issues/137979 for more context.

We can also extend this to custom ops, and user-visible outputs. If this ends up being compilation time sensitive we can just record strides (and maybe storage offset, per @zou3519) instead of the complete fake tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145448
Approved by: https://github.com/zou3519
2025-01-28 07:07:14 +00:00
b2c89bc115 [inductor][2/N] triton support post-#5512, user-defined triton kernels (#145348)
Triton commit 5220 adds tuple support in Triton (changing the indexing format in AttrsDescriptor) and commit 5512 replaces AttrsDescriptor with raw tuples. This PR fixes user-defined triton kernel handling (in most cases) for these new triton commits.

What this PR fixes:
* in triton_kernel_wrap.py, AST->TTIR parsing was to be updated for the new triton API
* ir.py - don't remove None args when using newer triton versions
* wrapper.py - update signature & constant handling

What this doesn't fix:
* correct None handling - I want to do a closer look at constant handling (including None, equal_to_1, and other constants).
* cpp wrapper (which needs to be fixed for both user-defined triton kernels and inductor-generated kernels)

test/inductor/test_triton_kernels.py passed on triton commit 74de6b46, with the exception of three tests (those shown here: 1374074098)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145348
Approved by: https://github.com/jansel
ghstack dependencies: #145051
2025-01-24 00:34:01 +00:00
e8e3c03f96 [Test][Inductor] Fix test_tma_graph_breaks (#145271)
Per title. Before these changes, below tests:
```
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_False_after_create_desc_True
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_False
test_triton_kernels.py::KernelTests::test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True
```

fail with the following message:
```
__________________________________________________________________ KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True ___________________________________________________________________
Traceback (most recent call last):
  File "/usr/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/usr/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 3114, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/testing/_internal/common_utils.py", line 557, in instantiated_test
    test(self, **param_kwargs)
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1760, in test_tma_graph_breaks
    eager_out = f(a, b)
                ^^^^^^^
  File "~/git/pytorch/test/inductor/test_triton_kernels.py", line 1740, in f
    t.element_size(),
    ^
UnboundLocalError: cannot access local variable 't' where it is not associated with a value

To execute this test, run the following from the base repo dir:
    python test/inductor/test_triton_kernels.py KernelTests.test_tma_graph_breaks_after_data_ptr_True_after_create_desc_True

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145271
Approved by: https://github.com/jansel
2025-01-22 19:18:59 +00:00
577708e6de Unskipped multiple inductor tests for ROCm (#143581)
All of them should be fine to run now after the triton fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143581
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-01-16 20:46:06 +00:00
074aca3ed2 [user triton] add support for @triton.heuristics after @triton.autotune (#142208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142208
Approved by: https://github.com/zou3519
2025-01-11 02:18:26 +00:00