Fix https://github.com/pytorch/pytorch/issues/128063 .
Now for this snippet
```
def f(x):
y = torch.sum(torch.sum(x, dim=-1))
z = x / 10.0
z_t = z.t().contiguous().t()
return y, z, z_t
```
Inductor could generate a single kernel for the first reduction and the two ponitwise kernels (if loop-ordering after fusion is enabled). And the generated kernel read `x` only ONCE. (with no proper handling, the two pointwise's may each access x once even if they are fused).
The PR needs fix 2 subtile bugs regarding LOAF .
1. when we reorder loops for a FusedSchedulerNode, we check if each sub-node's sizes matches. But some node has sizes in `list` type (if its loop is not reordered) while others have its sizes in `tuple` type (if its loop is reordered). I could change the upstream code to uniformly use either `list` or `tuple`. But without strong enforcement, future code could break this. So I just convert sizes to uniform type before comparison.
2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in (very) bad kernel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376
Approved by: https://github.com/jansel, https://github.com/eellison
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
Summary:
I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer.
Implements:
https://github.com/pytorch/pytorch/issues/132826
Test Plan:
New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383
Approved by: https://github.com/eellison
```
NOTE [lowering-time collective optimization]
In collective communication libraries such as NCCL, every rank maintains
communication buffers that are remotely accessible by some peers. Depending
on the underlying transport, remote accessibility may be established via
mechanisms such as ib_reg_mr, CUDA P2P, or CUDA multicast. Typically, these
buffers are private to the communication library by default, and
communication ops copy user data in and out of these buffers.
To prevent these copies, an optimization commonly known as "user buffer
registration" can be employed. This allows direct establishment of remote
accessibility on user buffers, eliminating the need for copying. However,
this optimization introduces stringent usage requirements, which are
typically hard to satisfy without being intrusive to the user code:
- Establishing remote accessibility is expensive and often done ahead of
time. In such implementations, all ranks must agree on the set of allocations
used for every collective op. Failing to meet this requirement can
lead to runtime errors or even silent correctness issues.
- Even if the collective communication library supports gracefully falling
back to "unregistered" implementations, the fallback mechanism would nullify
the optimization.
- Some communication mechanisms impose stricter requirements than others. For
example, CUDA's multicast + multi-mem instructions require all ranks to agree
not only on the allocations used for every collective but also on the offsets
within these allocations.
To support all different mechanisms with optimal results, we aim to satisfy
the strictest requirement for this family of optimizations - we ensures that
every collective op invocation is guaranteed to operate on the same
allocation, at the same offset, in every iteration.
For eligible collective ops, we identify communication buffers at lowering
time and optionally choose to lower the op to a different kernel
(ommunication libraries like NCCL handle both registered and non-registered
buffers transparently within the same op, though some may require different
ops for different cases). Later, the codegen will perform "persistent
allocation" to satisfy the aforementioned constraints, and optionally,
perform buffer planning to optimize overall memory usage.
```
### Changes
- Created `comm_lowering.py` for the lowerings of `_c10d_functional` ops. This is to prevent cluttering `lowering.py` as we add more lowering-time collective optimizations. This PR moved the lowerings for `all_reduce` and `all_reduce_` to the file.
- Added `comm_buffer_type: Dict[str, str]` to `GraphLowering` to track whether a buffer is a comm buffer and the type of the comm buffer.
- Added codegen allocation support for comm buffers of type "symm_mem".
- Added support for auto-lowering `_c10d_functional.all_reduce_` to `symm_mem.one_shot_all_reduce`.
- Added an Inductor config for collective optimizations in general (`config._collective`).
### Limitation
Currently, each persistently allocated comm buffer is dedicated to a single callsite. This is not viable in terms of memory usage. However, this is a neccesary intermediate state before we tackle memory planning for comm buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138029
Approved by: https://github.com/Chillee
ghstack dependencies: #138028
Previously, instances of `SchedulerNode` and `FusedSchedulerNode` would explicitly check whether the compilation target is Triton when codegen'ing debug strings. Generating debug triton code is instead implemented as a callback set on scheduler nodes by `TritonScheduling`. This makes the codegen more device-agnostic and allows schedulers to customise the codegen output as opposed to it being closely coupled to the debug string codegen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135015
Approved by: https://github.com/jansel
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
When we are autotuning matmuls the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. So long as the output shape and stride of the matmul matches the slice of the cat we're planning, we can realize the mm directly into the cat.
Discussion for reviewers:
It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While is this correct, it might lead to passing non performant output strides to cublas.. I guess this is better than a copy ? Not sure. We could also introduce a Layout that denotes a Fixed shape and stride which we control allocation
```
class AllocatedFixedLayout(FixedLayout)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
Summary:
Given an op, with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion, and skips inlining if it is so.
Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.
Fixes#120217
Here's a diagram of the issue:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
**Motivations**:
A topological order of the scheduler nodes that optimize the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).
**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory
**Results**:
On some models we can reduce the peak memory significantly:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet | 128 | 1.17 | 0.99 | 1.19 |
| vgg16 | 64 | 4.10 | 3.57 | 1.15 |
| DebertaV2ForQuestionAnswering | 1 | 11.60 | 10.56 | 1.10 |
In the presence of compiler based AC, peak memory can be further reduced:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM | 4 | 6.87 | 6.43 | 1.07 |
| AlbertForQuestionAnswering | 4 | 8.69 | 7.76 | 1.12 |
| MobileBertForQuestionAnswering | 128 | 4.67 | 3.90 | 1.20 |
[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.
**Other infos:**
* neutral model runtime, because the the reordering happens after fusion. So memory saving is _for free_.
* minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all hugglingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
It's a bit surprised that the code added in Scheduler.fusable_read_and_write would increase compilation time.
Here are some number I get from a H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel