Summary: This is a reland of D80285441, fixed the unit test.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
will succeed after this diff.
Rollback Plan:
Differential Revision: D80971224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161521
Approved by: https://github.com/frank-wei
The PR #160222 replaced @skipCUDAIf with @requires_cuda_and_triton in test_torchinductor_opinfo.py, which caused the CI jobs for other devices to skip this large test suite. We attempted to revert #160222 but ran into conflicts. I then opened #160936 to revert the changes from #160222, but that resulted in CPU CI job timeouts. I also filed issue #161132 for assistance, but haven’t received a response yet.
To minimize the impact, this PR re-enables the test suite on XPU first. I will continue to seek help on re-enabling it for CPU afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161477
Approved by: https://github.com/jansel
Summary: AMD specific kwargs need to be removed from the guard, otherwise a keyerror will be raised when executing the kernel.
Test Plan:
```
buck2 run mode/opt-amd-gpu -m rocm641 -c fbcode.split-dwarf=true -c fbcode.use_link_groups=true -c fbcode.enable_gpu_sections=true //hpc/new/models/feed/benchmark:feed_lower_benchmark -- --load=manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/894698382/0/gpu_lowering/new_input8 --skip-eager --skip-flop-estimation --sync-mode=0 --lower-backend=AOT_INDUCTOR
```
can succeed after this change.
Rollback Plan:
Differential Revision: D80285441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160671
Approved by: https://github.com/muchulee8
Fixes#155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
Fixes#155006
Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
Tests added:
```
python test/inductor/test_triton_kernels.py -k test_on_device_tma
python test/inductor/test_triton_kernels.py -k test_add_kernel_on_device_tma
python test/inductor/test_aot_inductor.py -k test_triton_kernel_on_device_tma
```
These pass on Triton 3.3 but not yet on Triton 3.4 (note: to support tests for both Triton versions, there's two triton kernels - one for old api and one for new api - and a given version of the test will only run if that version of the API is available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155827
Approved by: https://github.com/FindHao
ghstack dependencies: #155777, #155814
Summary:
Codegen used to generate tmp_arg_{index} as temporary args, and index is the position of the caller.
We changed the logic of codegen such that we can reuse previous generated samples, and only delete after arg is no longer used. In this case, we need to make {index} unique, since different functions could reuse the same "tmp_arg_{index}" name string, but corresponds to different args.
Test Plan: `python test/inductor/test_aot_inductor.py -k test_autotuning_args_reuse`
Differential Revision: D72297084
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150522
Approved by: https://github.com/desertfire, https://github.com/22quinn
Summary:
User defined Triton kernel sometimes rely on real inputs to determine
the path of execution. We need real inputs to invoke the correct
behavior of the user defined triton kernels (see example in test case,
where we have an early return for random inputs)
Test Plan:
Included in the commit.
python test/inductor/test_aot_inductor.py -k triton_autotuning
python test/inductor/test_aot_inductor.py -k triton_mutated_autotuning
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149553
Approved by: https://github.com/davidberard98, https://github.com/eellison
Summary:
When a Triton kernel has arguments with None values followed by arguments with value 1, AOTI attempts to remove the None arguments and update the indices of the equal_to_1 arguments in triton_meta["configs"]. However, if the same kernel is called multiple times, this optimization process is repeated. Prior to this diff, the indices of equal_to_1 arguments from subsequent calls (second and later) were based on the updated indices from the previous call, resulting in incorrect behavior.
This diff aims to localize the updated indices for equal_to_1 arguments within the optimization process of the current call, ensuring accurate and consistent results.
Test Plan:
Unit Test:
```
buck2 run mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_triton_kernel_with_none_inputs_and_equal_to_1_arg
```
Differential Revision: D69998314
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148102
Approved by: https://github.com/davidberard98, https://github.com/chenyang78
Summary: We currently failed the mutation analysis for all inline_asm ops. In this diff, we handle the case when "is_pure" is set to True since it indicates the operation doesn't mutate the input value
Test Plan:
../buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/test/inductor/__triton_kernels__/triton_kernels.par --r test_mutations_inline_asm_kernel
```
test_mutations_inline_asm_kernel_is_pure_true (caffe2.test.inductor.test_triton_kernels.MutationTests) ... W0226 18:10:34.261000 1906801 /data/users/sijiac/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:656] TTIR mutation analysis: Skipping pure tt.elementwise_inline_asm op (is_pure=True)
ok
----------------------------------------------------------------------
Ran 2 tests in 0.706s
OK
```
Differential Revision: D69878591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148043
Approved by: https://github.com/zou3519
This PR adds caching for user defined triton kernels by putting the transitive closure of source code in node.meta along with constant arguments.
One HUGE hack we do here is a node looks like
```
triton_kernel_wrapper_functional_proxy = torch.ops.higher_order.triton_kernel_wrapper_functional(kernel_idx = 0, constant_args_idx = 1, grid = [(1, 1, 1)], tma_descriptor_
metadata = {}, kwargs = {'in_ptr0': arg0_1, 'in_ptr1': arg1_1, 'out_ptr': arg0_1}, tensors_to_clone = ['out_ptr']);
```
so we use regex to remove `kernel_idx = 0, constant_args_idx = 1` parts as they are not relevant to cache hash. This is horrible and I'd like to eventually not use pickle as a hashing alternative but this is a longer project.
Differential Revision: [D65895744](https://our.internmc.facebook.com/intern/diff/D65895744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140326
Approved by: https://github.com/zou3519
This adds Dynamo tracing support for the host-side Triton TMA API (see `create_2d_tma_descriptor` calls on the host in the [Triton tutorial](https://triton-lang.org/main/getting-started/tutorials/09-persistent-matmul.html#sphx-glr-getting-started-tutorials-09-persistent-matmul-py)). A few notes:
- Here we assume the availability of the host-side TMA API added to upstream Triton in https://github.com/triton-lang/triton/pull/4498. As of time of writing, this is not a part of the PT2 OSS Triton pin (although back-ported internally). OSS Triton pin update should be done in December 2024.
- To capture the chain of calls `t.data_ptr() --> create_{1d,2d}_tma_descriptor(ptr, ...) --> kernel[grid](tma_desc, ...)`, we add three new variable trackers: `DataPtrVariable`, `CreateTMADescriptorVariable` (for the function), `TMADescriptorVariable` (for TMA descriptor object). This is to maintain the path back from the Triton kernel to the Tensor from which the TMA descriptor has been created.
- The newly introduced variables have `reconstruct` methods used in case of graph breaks.
- The `tma_descriptor_metadata` extracted from the captured `create_{1d,2d}_tma_descriptor` calls is propagated through the HOPs in Dynamo and AOTAutograd to be used by the downstream compiler (e.g., Inductor). See the unit tests for how the captured HOP arguments look like.
- In the Dynamo-captured fx graph, we replace the TMA descriptor arguments of the Triton kernel by the underlying Tensors, to be able to track the input/output relationships in terms of Tensors.
- In the Triton kernel mutation analysis pass (in AOTAutograd), we use the `tt.experimental_descriptor_store` TTIR op to detect mutations of the underlying tensors via TMA descriptors. So that downstream AOTAutograd can perform functionalizations as required.
- JIT Inductor and AOT Inductor support will be implemented in follow-up PRs.
Differential Revision: [D64404928](https://our.internmc.facebook.com/intern/diff/D64404928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137677
Approved by: https://github.com/zou3519
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
param0,
param1,
param2,
param3: tl.constexpr, # autotuned
param4, # non-constexpr
):
...
```
This is an edge case because it's a general practice to declare all constexprs params at the end.
And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...
```
> void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
< CUdeviceptr var_3;
< AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
< void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```
## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.
79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)
## Fix
We try to fix this, by calculating the right constexprs wrt `raw_args`.
An illustration
```
raw_args: [arg0, arg1, arg2, arg4]
kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: [] # this is correct wrt raw_args
```
Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.
We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Differential Revision: D60701839
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
Summary: Triton compiler adds constnat argument 1 to `equal_to_1` [only when it's an int](8c5e33c77e/python/triton/runtime/jit.py (L275)). Here we restrict Inductor's `equal_to_1` in the same way.
Test Plan:
```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_float_arg
...
----------------------------------------------------------------------
Ran 1 test in 6.528s
OK
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_equal_to_1_arg
...
----------------------------------------------------------------------
Ran 2 tests in 10.142s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123886
Approved by: https://github.com/oulgen
ghstack dependencies: #123703
Summary:
Now we can parse statements like
```
%22 = tt.make_tensor_ptr %20, [%21, %c128_i64], [%c2048_i64, %c1_i64], [%1, %c0_i32]
```
Test Plan:
Added new test
```
buck2 test mode/opt //hammer/ops/tests/inductor:ragged_hstu_test
```
now passes again with optimizations
Differential Revision: D53975130
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120263
Approved by: https://github.com/aakhundov, https://github.com/sijiac
This is a lot of files changed! Don't panic! Here's how it works:
* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.
In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.
The codemod was done with this script authored by GPT-4:
```
import glob
exclude_patterns = [
...
]
for pattern in exclude_patterns:
for filepath in glob.glob(pattern, recursive=True):
if filepath.endswith('.py'):
with open(filepath, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write('# mypy: ignore-errors\n\n' + content)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
Don't require using it as `@requires_cuda()` -> `@requires_cuda` instead No need for the partial function invoked many times
Split out this change from the initial large refactoring in #117741 to hopefully get merged before conflicts arise
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118281
Approved by: https://github.com/ezyang
Summary: Currently, we [`clone`](19207b9183/torch/_inductor/lowering.py (L5273)) every `TensorBox` argument of custom Triton kernels while lowering them to the Inductor IR, during which the stride information of the kernel inputs is lost. This is problematic in the common case when the strides of a `torch.Tensor` argument are passed as scalars to a custom Triton kernel alongside the tensor itself (due to the underlying Triton code interpreting the tensors as raw pointers, so the contained stride semantics of the `torch.Tensor` is lost).
In this PR, we add an extended version of the existing [`clone` lowering](19207b9183/torch/_inductor/lowering.py (L2289))---`clone_preserve_reinterpret_view`---which carries over the `ir.ReinterpretVew` layers (if any) from the source `TensorBox` to the cloned one. The rationale behind adding a new function (and switching to it in the `triton_kernel_wrap` only for now) as opposed to extending the existing `clone` is keeping the semantics of the latter untouched, as it is a lowering of `torch.clone` (albeit incomplete, as the `memory_format` is currently ignored). Changing the existing `clone` would change the semantics which is not necessarily desirable in general. Open to suggestions, though.
Test Plan:
```
$ python test/dynamo/test_functions.py -k test_triton_kernel_strided_input
...
----------------------------------------------------------------------
Ran 1 test in 5.568s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116219
Approved by: https://github.com/jansel
Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
Noticed that on many MRS kernels the grid wrapper for autotuning is huge with a bunch of duplicates due to num_warps and num_stages not being needed for grid calculation. Lets deduplicate these entries.
Previously, we would see wrapper like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
now it looks like
```
def grid_wrapper_for_add_kernel_2d_autotuned_0(meta):
if meta['BLOCK_SIZE_X'] == 128 and meta['BLOCK_SIZE_Y'] == 128: return (4, 2, 1)
if meta['BLOCK_SIZE_X'] == 64 and meta['BLOCK_SIZE_Y'] == 64: return (8, 4, 1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115849
Approved by: https://github.com/jansel
Summary: `None` arguments are codegened as `*i8` in the `triton_meta` of the generated or user-defined Triton kernels:
85aa372374/torch/_inductor/codegen/triton_utils.py (L33-L36)
Due to this, in contrary to the conventional Triton, we actually should pass `nullptr` to the Triton kernels in C++ wrapper codegen instead of passing nothing (as normally `None` doesn't make it to the generated PTX parameters, just like `tl.constexpr` args).
This PR adds two things:
1. Proper C++ wrapper codegening (ABI and non-ABI) of `nullptr` and `c10::nullopt`, as the prior way codegening `c10::nullopt` as tensor breaks (also `c10` breaks in the ABI mode).
2. Skipping `tl.constexpr` args when calling the loaded-from-cubin compiled Triton kernel in the C++ wrapper codegen. As a side effect, this also resolves an issue with string arguments: now they are simply omitted in the C++ wrapper codegen.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_triton_kernel_with_none_input
...
----------------------------------------------------------------------
Ran 4 tests in 40.364s
OK (skipped=2)
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114475
Approved by: https://github.com/oulgen