Fixes https://github.com/pytorch/pytorch/issues/136640
Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.
We should already have this information by the time we get to inductor, because our FakeTensor compute will have appropriately branched/guarded on whether any ops performed broadcasting.
In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3)))))), 1)
```
I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.
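A minimal sketch of that check (attribute names like `shape_env.guards` and `guard.expr` are assumptions here, simplified from the real ShapeEnv):
```python
import sympy

def size_is_one(size, shape_env) -> bool:
    # Plain integer sizes: just compare.
    if isinstance(size, int):
        return size == 1
    # Symbolic sizes: scan the shape env's existing guards for Eq(size, 1).
    for guard in shape_env.guards:
        g = guard.expr
        if isinstance(g, sympy.Eq) and g.lhs == size and g.rhs == 1:
            return True
    return False
```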
I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:
(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions.
(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.
Checking the guards feels pretty simple and easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards of the form `Eq(LHS, 1)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
Based on https://github.com/pytorch/pytorch/pull/130956.
Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
- When we pad, it is always aligned to the next multiple of 128 bytes.
- Strides smaller than 1024 are not padded.
- Only intermediate values are padded, not outputs.
The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.
This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
- `config.pad_outputs`: choose whether to pad outputs (default: `False`)
- `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
- `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
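For example, to pad everything, including outputs, one might set the following (a sketch; these knobs are assumed to apply on top of `config.comprehensive_padding`):
```python
import torch._inductor.config as inductor_config

# Enable Inductor's padding pass, then override the GPU-centric defaults.
inductor_config.comprehensive_padding = True
inductor_config.pad_outputs = True             # also pad graph outputs (default: False)
inductor_config.padding_alignment_bytes = 128  # alignment target (default: 128)
inductor_config.padding_stride_threshold = 0   # pad all unaligned strides (default: 1024)
```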
**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.
These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314
Co-authored-by: Yueming Hao <yhao@meta.com>
Restarting the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since the old one is hard to rebase. Some code is expectedly copy/pasted from the previous PR, and the main idea is the same.
Previously we saw a relatively large compilation-time increase due to too many loop orders being considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).
Some manually created cases where loop ordering matters are added as unit tests; the PR makes sure inductor does not miss fusion opportunities for them.
This PR should solve the missed-fusion problem in https://github.com/pytorch/pytorch/issues/130015.
Right now there is still a significant increase in compilation time, so I'll disable the feature by default for now. Later on, after the compilation-time issue is resolved, I'll enable it by default.
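Since it's off by default, opting in would look something like this (the flag name is an assumption based on this PR's config addition):
```python
import torch._inductor.config as inductor_config

# Opt in to the on-demand loop-reordering pass (assumed flag; disabled by default).
inductor_config.loop_ordering_after_fusion = True
```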
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
Fixes #130394
TorchInductor doesn't respect the original strides of outputs by default; this opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first-priority goal, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes both dense and non-dense outputs' strides follow the strides required by the semantics.
The comparison between the original code and the code after this fix for the test is below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
`buf1` is created with the exact strides required by the user, and its values are written with the same strides as the input.
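For reference, a lowering that needs to pin an output's layout can call the new API roughly like this (a minimal sketch; the surrounding lowering context is omitted):
```python
from torch._inductor import ir

# Require `x` to have strides (16, 1) exactly, realizing a copy if necessary.
# allow_padding=False means the strides must match exactly rather than only
# up to padding.
x = ir.ExternKernel.require_exact_strides(x, (16, 1), allow_padding=False)
```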
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
    param0,
    param1,
    param2,
    param3: tl.constexpr,  # autotuned
    param4,  # non-constexpr
):
    ...
```
This is an edge case because the general practice is to declare all constexpr params at the end.
This is an issue for AOTI because it fails to codegen all 4 kernel args, which surfaces as a device-side error: a CUDA illegal memory access (IMA), an invalid argument, etc.
```
> void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
< CUdeviceptr var_3;
< AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
< void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```
## Root cause
* `kernel.constexprs` from the kernel side-table contains the indices of all `constexpr` params, including autotuned params.
* `raw_args`, which gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexprs` and `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.
79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)
## Fix
We fix this by calculating the right constexpr indices with respect to `raw_args`.
An illustration:
```
raw_args: [arg0, arg1, arg2, arg4]
kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: [] # this is correct wrt raw_args
```
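A minimal sketch of that remapping (names follow the illustration above, not the exact wrapper-codegen code):
```python
def constexpr_indices_wrt_raw_args(arg_names, constexprs, raw_arg_names):
    # `constexprs` indexes into `arg_names`, the full kernel signature
    # (including autotuned params). Re-derive the indices against
    # `raw_arg_names`, which excludes autotuned params.
    constexpr_names = {arg_names[i] for i in constexprs}
    return [i for i, name in enumerate(raw_arg_names) if name in constexpr_names]

# From the illustration: param3 is autotuned and absent from raw_args,
# so no entry of raw_args is a constexpr.
assert constexpr_indices_wrt_raw_args(
    ["param0", "param1", "param2", "param3", "param4"],
    [3],
    ["param0", "param1", "param2", "param4"],
) == []
```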
Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
Fixes #128084
The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
involve a clone; if there was a clone and then a mutation into it,
we copy_ the result of the mutation back.
The reason I went with this approach is that it was the easiest, and
Inductor already works really hard to remove additional clones/copy_s.
There are some cases where this doesn't generate efficient code. For
example, if the tensor is a view, we don't change the base of the view
to have the right stride order; instead, we do a clone.
The view case isn't very common, so I'm ignoring it for now, but we
could improve this in the future.
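In pseudocode, the clone-then-copy_-back pattern looks roughly like this (a hypothetical helper for illustration, not the actual Inductor code):
```python
import torch

def mutate_with_required_strides(x: torch.Tensor, mutating_op):
    # Give the op an input with the required layout (here: contiguous),
    # cloning if the original doesn't already match.
    tmp = x.contiguous()
    mutating_op(tmp)   # the mutation lands in the clone
    x.copy_(tmp)       # copy_ the mutated result back into the original
    return x
```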
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So we're landing these `type: ignore`s for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Partially fixes #122980
- change the C++ type mapping for complex64 to `std::complex<float>`
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`
- add `expensiveCopyToTensor()` to convert the `ArrayRefTensor<T>` type to the `AtenTensorHandle` type
- to fully fix #122980, we still need to make `ArrayRef` and `MiniArrayRef` consider the number of elements in the underlying storage. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347 broke some internal tests, so we need more work before landing it).
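An end-to-end way to exercise the new complex64 support under AOTInductor might look like this (a hedged sketch; `torch._export.aot_compile` is assumed as the entry point of that era, and newer releases package this differently):
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # complex64 arithmetic exercising the std::complex<float> mapping
        return x * torch.tensor(1 + 2j, dtype=torch.complex64)

x = torch.randn(4, dtype=torch.complex64)
# Compiles the module ahead of time into a shared library.
so_path = torch._export.aot_compile(M(), (x,))
```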
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Python's set is non-deterministic: iteration order can vary from run to run. We recently ran into an internal failure that did not fail consistently.
See the repro here: P1453035092.
Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either `set()` or set literals.
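The usual fix is to swap `set` for an order-preserving structure, e.g. (a generic sketch, not the exact change in this PR):
```python
# Non-deterministic: iteration order of a set of strings depends on hash
# randomization and can differ across interpreter runs.
deps = {"relu", "add"}

# Deterministic alternative: dict keys preserve insertion order (Python 3.7+).
ordered_deps = dict.fromkeys(["relu", "add"])
for dep in ordered_deps:  # always iterates "relu", then "add"
    print(dep)
```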
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of recalculating it from scratch.
What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.
## What's a post-compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run
To run these steps, we save all the necessary metadata in CompiledFxGraph and use it on a cache hit to reconstruct the object.
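Conceptually, the cached object carries something like the following (field names are illustrative, not the exact `CompiledFxGraph` definition):
```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class CompiledFxGraphMetadata:
    # Illustrative subset of what is persisted for post-compile steps.
    output_strides: list[Optional[tuple[int, ...]]] = field(default_factory=list)
    cudagraph_fail_reasons: list[str] = field(default_factory=list)  # from cache-miss testing
    fx_kwargs: dict[str, Any] = field(default_factory=dict)  # already part of the cache key
```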
## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.
Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do the above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store them on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)
This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify dependencies, e.g. from the `i+1`-th all-gather group node to the `i`-th all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
Closes #129507
This makes two changes to the sort kernel:
1. Use int16 for the indices, since we only operate on small dims anyway.
2. Instead of passing an explicit mask, we pass `rnumel` and derive the
mask from it, which saves an additional reduction in the sort kernel's
inner loop.
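The mask-from-`rnumel` idea in isolation (a simplified Triton sketch, not the actual generated sort kernel):
```python
import triton
import triton.language as tl

@triton.jit
def masked_load_kernel(in_ptr, out_ptr, rnumel, RBLOCK: tl.constexpr):
    rindex = tl.arange(0, RBLOCK)
    # Derive the mask from rnumel instead of passing an explicit mask,
    # avoiding an extra reduction over a materialized mask tensor.
    rmask = rindex < rnumel
    # Out-of-range lanes get +inf so they would sort to the end.
    vals = tl.load(in_ptr + rindex, mask=rmask, other=float("inf"))
    tl.store(out_ptr + rindex, vals, mask=rmask)
```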
In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
This PR enables AutoHeuristic for kernel choice selection, where feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:
When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
After autotuning in select_algorithm.py has happened, if there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores them.
I enabled AutoHeuristic for mixed_mm as an example to test it on. We probably want to add more context, and also add an augment_context function; I will add support for this in another PR.
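A minimal sketch of the deferred-feedback flow (hypothetical names mirroring the description above, not the actual AutoHeuristic code):
```python
# Hypothetical registry living in select_algorithm.py:
autoheuristic_registry: dict = {}

def register_autoheuristic(choice_key, on_results):
    # Called from the AutoHeuristic constructor: remember who wants feedback.
    autoheuristic_registry[choice_key] = on_results

def after_autotuning(choice_key, timings):
    # Called once autotuning has produced a timing per kernel choice.
    if choice_key in autoheuristic_registry:
        autoheuristic_registry[choice_key](timings)  # AutoHeuristic stores these
```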
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison