Graph partition analyzes read_writes to get partition input names. However, a weak dep is a fake dependency that is not actually read or written, so we should not include weak deps in the graph partition input names.
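A minimal sketch of the filtering, assuming scheduler-node accessors like `read_writes.reads` and `get_buffer_names()` (the exact helpers and import path in Inductor may differ):
```python
# Sketch only: the real logic lives in Inductor's graph partition code.
from torch._inductor.dependencies import WeakDep  # import path is an assumption

def partition_input_names(partition_nodes):
    """Collect names read by the partition, skipping weak deps (fake
    dependencies that are never actually read or written)."""
    read_names = set()
    produced_names = set()
    for node in partition_nodes:
        produced_names.update(node.get_buffer_names())
        for dep in node.read_writes.reads:
            if isinstance(dep, WeakDep):
                # Weak deps only order nodes; they carry no real data.
                continue
            read_names.add(dep.name)
    # Anything read but not produced inside the partition must come in as an input.
    return sorted(read_names - produced_names)
```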
The following test failure is fixed by removing weak dependency from partition_input_names:
`PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py TestTorchDeviceTypeCUDA.test_params_invalidated_with_grads_invalidated_between_unscale_and_step_Adam_cuda_float32`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152863
Approved by: https://github.com/eellison
Summary:
Solves https://github.com/pytorch/pytorch/issues/151925
Currently, AOTI only generates runtime asserts for unbacked symints. We should generate asserts for all `_assert_scalar` calls in the input graph.
Also factored out the runtime assertion logic into a separate function.
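A minimal sketch of what collecting those calls from the input FX graph could look like (illustrative only, not the actual AOTI codegen path):
```python
import torch

def collect_assert_scalar_nodes(gm: torch.fx.GraphModule):
    """Illustrative only: gather every _assert_scalar call so codegen can
    emit a runtime check for each one, not just for unbacked symints."""
    asserts = []
    for node in gm.graph.nodes:
        if (
            node.op == "call_function"
            and node.target is torch.ops.aten._assert_scalar.default
        ):
            expr, message = node.args  # boolean sym expression and error message
            asserts.append((expr, message))
    return asserts
```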
We need to generate runtime asserts directly in Inductor instead of just re-using the asserts from the input graph, because we reuse the same ShapeEnv as before. In particular, on subsequent graph passes we would immediately turn all of these assertions into no-ops, because when we evaluated their expressions, the deferred runtime assert already recorded in the ShapeEnv would tell us "oh, of course this expression is True".
One example is below:
```
class Model(torch.nn.Module):
    def forward(self, a, b, c):
        nz = torch.nonzero(a)
        ones = a.new_ones([nz.size(0), b.size(0)])
        torch._check(ones.size(0) >= 1)
        equals = torch.add(ones, c)
        return equals

torch._dynamo.mark_dynamic(c, 0)
```
When we re-use the ShapeEnv in Inductor lowering, the check that `a` and the nonzero result have the same shape would be evaluated to True after we resolve unbacked bindings using the ShapeEnv.
See test_unbacked_equals_input_size_runtime_assertion in test_aot_inductor.
In addition to the Inductor generated runtime asserts, we also
need the runtime asserts from the input graph, because some derived
runtime asserts are not generated in Inductor. One example is
below:
```
class Model(torch.nn.Module):
    def forward(self, x):
        y = x.reshape(100, -1).clone()
        y = y + 1
        return y

dynamic_shapes = {
    "x": {0: torch.export.Dim.DYNAMIC},
}

# x.shape[0] needs to be a multiple of 100.
```
See test_aoti_runtime_asserts_backed_symint in test_aot_inductor.
Example:
```
def forward(self):
    arg0_1: "f32[s35]";
    arg0_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
    sym_size_int: "Sym(s35)" = torch.ops.aten.sym_size.int(arg0_1, 0)
    #
    mod: "Sym(Mod(s35, 100))" = sym_size_int % 100; sym_size_int = None
    eq_2: "Sym(Eq(Mod(s35, 100), 0))" = mod == 0; mod = None
    _assert_scalar = torch.ops.aten._assert_scalar.default(eq_2, "Runtime assertion failed for expression Eq(Mod(s35, 100), 0) on node 'eq'"); eq_2 = _assert_scalar = None
    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:11 in forward, code: y = x.reshape(100, -1).clone()
    view: "f32[100, (s35//100)]" = torch.ops.aten.reshape.default(arg0_1, [100, -1]); arg0_1 = None
    clone: "f32[100, (s35//100)]" = torch.ops.aten.clone.default(view); view = None
    # File: /data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/73a672eb896e7996/scripts/shangdiy/__pt__/pt#link-tree/scripts/shangdiy/pt.py:12 in forward, code: y = y + 1
    add_6: "f32[100, 1]" = torch.ops.aten.add.Tensor(clone, 1); clone = None
    return (add_6,)
```
Generated cpp code:
```
auto inputs = steal_from_raw_handles_to_raii_handles(input_handles, 1);
auto arg0_1 = std::move(inputs[0]);
auto arg0_1_size = arg0_1.sizes();
int64_t s35 = arg0_1_size[0];
inputs.clear();
auto& kernels = static_cast<AOTInductorModelKernels&>(*this->kernels_.get());
if (!((s35 % 100L) == 0L)) { throw std::runtime_error("Expected Eq(Mod(s35, 100), 0) to be True but received " + std::to_string(s35)); }
```
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_runtime_asserts_backed_symint
```
Differential Revision: D73596786
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152125
Approved by: https://github.com/henrylhtsang, https://github.com/jingsh
This PR adds an optimal reordering for minimizing #partitions.
## Optimal reordering for minimizing #partitions
A BFS can minimize #partitions (ignoring peak memory for now):
1. For each node, compute node_to_indegree: dict[node, int].
2. Maintain 2 queues: cudagraphable_nodes, and non_cudagraphable_nodes. Iterate through all nodes and add nodes to one of these 2 queues if node_to_indegree[node] == 0.
3. While non_cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
4. While cudagraphable_nodes is not empty: Pop 1 node, schedule it, update the indegree of all its successors, and add its successor nodes to one of the queues if node_to_indegree[successor] == 0.
5. Repeat steps 3 and 4 until all nodes have been scheduled.
We call this strategy `reorder_for_minimizing_partition`; a minimal sketch is shown below.
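A minimal sketch of the two-queue BFS, assuming a hypothetical `is_cudagraphable(node)` predicate and a plain successor map instead of the real scheduler data structures:
```python
from collections import deque

def reorder_for_minimizing_partition(nodes, successors, is_cudagraphable):
    """Sketch: schedule all ready non-cudagraphable nodes before ready
    cudagraphable ones, so nodes of the same kind cluster into fewer partitions."""
    indegree = {n: 0 for n in nodes}
    for n in nodes:
        for succ in successors[n]:
            indegree[succ] += 1

    cg, non_cg = deque(), deque()
    for n in nodes:
        if indegree[n] == 0:
            (cg if is_cudagraphable(n) else non_cg).append(n)

    order = []

    def drain(queue):
        while queue:
            n = queue.popleft()
            order.append(n)
            for succ in successors[n]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    (cg if is_cudagraphable(succ) else non_cg).append(succ)

    while cg or non_cg:
        drain(non_cg)  # step 3: flush all ready non-cudagraphable nodes
        drain(cg)      # step 4: then all ready cudagraphable nodes
    return order
```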
**Q: Why is this optimal?**
Suppose this is not optimal; then we have a counterexample with 2 non_cudagraphable regions:
```
[non_cudagraphable1, cudagraphable2, non_cudagraphable3]
```
where we can reorder to only 1 non_cudagraphable region:
```
[non_cudagraphable1, non_cudagraphable3, cudagraphable2]
```
This reorder means non_cudagraphable3 does not depend on cudagraphable2. So after we schedule non_cudagraphable1, both non_cudagraphable3 and cudagraphable2 have in-degree 0. But then Step 3 would have already scheduled non_cudagraphable3 before cudagraphable2, so this counterexample cannot exist.
This shows we cannot find such a counterexample, and the BFS is optimal for minimizing #partitions.
## Minimize peak memory
`reorder_for_peak_memory` currently uses topological_sort_dfs, topological_sort_lpmf, and topological_sort_bfs, where the latter 2 are BFS-based. ILP brings small benefits and can hardly scale to more than 100 nodes, according to @xuanzhang816. So ILP is not used for peak-memory reordering in Inductor.
Heuristic strategy (see the sketch below):
- Conduct reorder_for_peak_memory as the default order.
- Conduct reorder_for_minimal_partitions and get results as list[tuple[partition, bool]], where partition is a list[BaseSchedulerNode] and the bool indicates whether the partition is cudagraphable.
- If the reorder increases peak memory too much, we use the default order.
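A rough sketch of the heuristic, with the reordering passes and the memory estimator passed in as hypothetical callables:
```python
def choose_order(nodes, reorder_for_peak_memory, reorder_for_minimizing_partition,
                 estimate_peak_memory, threshold=1.1):
    """Sketch: prefer the partition-minimizing order unless it regresses the
    estimated peak memory by more than `threshold`x over the default order."""
    default_order = reorder_for_peak_memory(nodes)
    candidate_order = reorder_for_minimizing_partition(nodes)
    if estimate_peak_memory(candidate_order) > threshold * estimate_peak_memory(default_order):
        return default_order
    return candidate_order
```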
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151968
Approved by: https://github.com/eellison
This PR reduces #graph partitions by reordering nodes when the `should_partition` nodes have simple dependencies. Specifically, for `should_partition` nodes:
a. If a node has no dependencies or only depends on graph inputs: move it to the front. A use case is moving symints to a cuda tensor for PaddedTensorSubclass.
b. If the only user of a node is OutputNode: move it to the end.
#### Example
The following example shows a padded tensor subclass use case where we copy a symint to a cuda tensor (aka a mask) in the middle of the function. Reordering still generates 1 cudagraph by moving the mask creation to the front.
```python
import torch

torch._inductor.config.graph_partition = True

# Two reasons for a custom op here:
# 1. We want to reuse the same mask for many masked_fill calls
# 2. Prevent inductor from fusing this op into other ops (e.g. masked_fill)
#    so we can still reorder in the scheduler
@torch.library.custom_op("mylib::create_mask", mutates_args=(), tags=(torch._C.Tag.cudagraph_unsafe,))
def create_mask(padded_size: int, original_size: int, device: torch.device) -> torch.Tensor:
    mask = torch.zeros((padded_size,), dtype=torch.bool, device=device)
    mask[original_size:] = True
    return mask

@create_mask.register_fake
def _(padded_size, original_size, device):
    return torch.empty((padded_size,), dtype=torch.bool, device=device)

def f(padded_tensor, original_tensor, weight):
    original_size = original_tensor.size()[0]
    padded_size = padded_tensor.size()[0]

    # element-wise ops, so we don't care about the padding value
    padded_tensor = padded_tensor + 1
    padded_tensor = torch.nn.functional.relu(padded_tensor)

    # dot product requires padding with 0
    dot_res = padded_tensor.dot(weight)
    padded_tensor += dot_res

    # min requires padding with inf, so we create the mask now
    mask = create_mask(padded_size, original_size, padded_tensor.device)
    min_res = torch.min(
        torch.ops.aten.masked_fill(padded_tensor, mask, float("inf"))
    )

    # max requires padding with -inf. we can reuse the previous mask
    max_res = torch.max(
        torch.ops.aten.masked_fill(padded_tensor, mask, -float("inf"))
    )

    return min_res + max_res + padded_tensor

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(padded_size, original_size):
    padded_tensor = torch.randn(padded_size, device="cuda")
    padded_tensor[original_size:] = 0
    original_tensor = torch.randn(original_size, device="meta")
    weight = torch.randn(padded_size, device="cuda")

    eager_out = f(padded_tensor, original_tensor, weight)
    compiled_out = compiled_f(padded_tensor, original_tensor, weight)
    assert torch.allclose(eager_out[0], compiled_out[0])
    assert torch.allclose(eager_out[1], compiled_out[1])

# new cudagraph
run(8, 4)

# new cudagraph due to recompile
run(8, 6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150814
Approved by: https://github.com/eellison
Summary:
This fixes the error in https://fb.workplace.com/groups/1075192433118967/permalink/1640802133224658/
I tried really hard but I couldn't come up with a test case to repro the issue, but I confirmed with the OP that this issue has been fixed.
```
Traceback (most recent call last):
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 746, in _compile_fx_inner
mb_compiled_graph = fx_codegen_and_compile(
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1343, in fx_codegen_and_compile
return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/compile_fx.py", line 1232, in codegen_and_compile
compiled_module = graph.compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2087, in compile_to_module
return self._compile_to_module()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2095, in _compile_to_module
self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 2002, in codegen
self._update_scheduler()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/graph.py", line 1996, in _update_scheduler
self.scheduler = Scheduler(self.operations)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1954, in __init__
self._init(nodes)
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 1974, in _init
self.update_zero_dim_cpu_tensor()
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/scheduler.py", line 4433, in update_zero_dim_cpu_tensor
and buffer.get_size() == []
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3903, in get_size
return [*self.get_layout().size]
File "/dev/shm/uid-99/d2b830f6-seed-nspid4026547915_cgpid362302-ns-4026547912/torch/_inductor/ir.py", line 3914, in get_layout
raise NotImplementedError(type(self.layout).__name__)
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: NoneLayout
```
Test Plan: OP said the issue is fixed
Differential Revision: D72575808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151321
Approved by: https://github.com/BoyuanFeng
Prior to this PR, `rng_state` is in `V.graph.graph_inputs` but not in the read_writes of any IRNode. As a result, it is not identified as a partition input:
```python
def partition_0(args):
    primals_2, primals_1 = args
    ...
    buf0 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten.rand.default, [4, 4], dtype=torch.float32, device=device(type='cuda', index=1), pin_memory=False, rng_state=fwd_rng_state_0)
    # <----- accesses fwd_rng_state_0 but it's not an input
    ...

def call(self, args):
    primals_1, primals_2, fwd_rng_state_0 = args
    ...
    partition0_args = [primals_2, primals_1]
    (buf2, primals_2, primals_1) = self.partitions[0](partition0_args)
    # <---- fwd_rng_state_0 is in graph_inputs but is not passed to partitions[0]
    ...
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150958
Approved by: https://github.com/eellison
Originally, I excluded constant_pad_nd from fusion to be conservative on compilation time. But, on benchmarking, you do occasionally get speedups by fusing it. This also includes a fix for producing a single, contiguous dep for prologues.
For instance, the following benchmark gets a 7% speedup by fusing in the constant_pad_nd.
```
import torch
import torch.nn.functional as F

torch._inductor.config.force_disable_caches = True

padded_N = 2048
n_pad_rows = 100
K, N = 2048, 4096

tensor1 = torch.randn(padded_N - n_pad_rows, 4096, device="cuda").to(torch.bfloat16)
tensor2 = torch.randn(4096, 4096, device="cuda").to(torch.bfloat16)

@torch.compile(mode='max-autotune-no-cudagraphs')
def masked_linear(input, weight, n_pad_input_rows):
    """
    Linear layer with input padded by `n_pad_input_rows` rows
    """
    # Use constant_pad_nd to pad with zeros for the invalid rows
    padded_input = F.pad(tensor1, (0, 0, 0, n_pad_input_rows), "constant", 0)
    return F.linear(padded_input, weight)

# Invoke the function
masked_linear(tensor1, tensor2, n_pad_rows)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149947
Approved by: https://github.com/drisspg
This PR supports symbol inputs to graph partition functions. Before this PR, we relied on `node.read_writes` to get partition inputs. However, this does not cover symbol inputs.
In this PR, for each graph partition, we collect all symbol inputs which are required to be in scope to successfully perform codegen, including:
- free symbols used in partition nodes.
- free symbols in partition input/node shapes, strides, and offsets. This is needed for recording cudagraphs for tensors with dynamic shapes. A sketch of the collection is shown below.
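A minimal sketch of the collection using `sympy` free symbols, where `index_exprs`, `sizes`, `strides`, and `offset` are hypothetical accessors standing in for the real Inductor IR walk:
```python
import sympy

def collect_partition_symbol_inputs(partition_nodes, symbols_defined_outside):
    """Sketch: gather every free symbol that must be in scope for the partition."""
    needed: set[sympy.Symbol] = set()
    for node in partition_nodes:
        # Free symbols used directly by the partition's index expressions.
        for expr in node.index_exprs:
            needed |= expr.free_symbols
        # Free symbols in sizes/strides/offsets, needed when recording
        # cudagraphs for tensors with dynamic shapes.
        for expr in (*node.sizes, *node.strides, node.offset):
            if isinstance(expr, sympy.Expr):
                needed |= expr.free_symbols
    # Only symbols defined outside the partition become partition inputs.
    return sorted(needed & symbols_defined_outside, key=str)
```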
### Note1: MutationLayout
In this example, node.layout is MutationLayoutSHOULDREMOVE. The symint from index `n` does not appear in the size, offset, or strides of node.layout; it appears in node.layout.target. So we need extra handling for it.
```python
x = torch.zeros(7, device="cuda")

def fn(n, a):
    a[n] = -1
    return a

opt_fn = torch.compile(fn, fullgraph=True)

for n in range(2, x.shape[0]):
    opt_fn(n, x)
```
### Note2: Composability with Padded Tensor Subclass
W/o graph partition, Padded Tensor subclass lifts outer shapes to input arguments (i.e., arg0_1 for s0, arg1_1 for s1) but does not lift inner shapes (i.e., s2 and s3). Since cudagraph cache relies on integer inputs, it will cache on outer shapes and ignore inner shapes, which is bad.
```
def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((s2, s3), (s3, 1), torch.float32)
        # Topologically Sorted Source Nodes: [x1, mul], Original ATen: [aten.add, aten.mul]
        triton_poi_fused_add_mul_0_xnumel = s2*s3
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_mul_0.run(arg2_1, buf0, triton_poi_fused_add_mul_0_xnumel, stream=stream0)
        del arg2_1
    return (buf0, s0, s1, s1, )
```
w/ graph partition, the partition function only includes tensor and inner shapes as inputs, to make sure the cudagraph caching is correct. Full Comparison: [code](https://www.internalfb.com/intern/diffing/?paste_number=1761674743)
```python
def call(self, args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1 = args
    args.clear()
    s0 = arg0_1
    s1 = arg1_1
    arg2_1_size = arg2_1.size()
    s2 = arg2_1_size[0]
    s3 = arg2_1_size[1]
    assert_size_stride(arg2_1, (s2, s3), (s3, 1))
    partition0_args = [arg2_1, s2, s3]
    del arg2_1
    (buf0,) = self.partitions[0](partition0_args)
    del partition0_args
    return (buf0, s0, s1, s1, )
```
The number of cudagraphs is validated below (also added as a test):
```python
import torch
from padded_tensor import PaddedTensor

# Turning off graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=6
# at the end, which is wrong.
# torch._inductor.config.graph_partition = False

# Turning on graph_partition leads to
# torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id=4
# at the end, which is correct.
torch._inductor.config.graph_partition = True

def f(x):
    x1 = x + 1
    return x1 * 2

compiled_f = torch.compile(f, mode="reduce-overhead")

def run(shape):
    x = torch.randn(*shape, device="cuda")
    pad_x = PaddedTensor.from_tensor(x, multipliers={0: 4, 1: 4})
    assert hasattr(pad_x, "multipliers"), breakpoint()

    eager_out = f(pad_x)
    for _ in range(3):
        compiled_out = compiled_f(pad_x)
    compiled_out = compiled_f(pad_x)

    assert eager_out.shape == compiled_out.shape
    assert eager_out.tensor.shape == compiled_out.tensor.shape
    assert torch.allclose(eager_out.tensor, compiled_out.tensor)

# static shape. record a NEW cudagraph. 1 cudagraph in total now.
run((2, 3))

# outer shape is dynamic, leading to a new dynamo graph.
# this new dynamo graph forces a NEW cudagraph. 2 cudagraphs in total now.
run((3, 4))

# outer shape changed but inner shape does not change,
# so NO new cudagraph is recorded.
run((2, 2))

# inner shape is dynamic now, leading to a new dynamo graph.
# this new dynamo graph forces a NEW cudagraph. 3 cudagraphs in total now.
run((5, 6))

# does NOT record a new cudagraph
run((7, 8))

# record a NEW cudagraph. 4 cudagraphs in total now
run((10, 11))

assert torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id().id == 4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149458
Approved by: https://github.com/eellison
Cudagraphs is careful to not allow any memory recorded to escape globally without having a reference to the tensor. This is because we may later reclaim that memory for a cudagraph recording, and we need to mark the tensor as erroring on access. Very occasionally, a stray tensor will have been allocated locally but not yet cleaned up. In this case, we enter the slow path and try gc.collect() to deallocate it. For a hard-to-repro internal use case, this was fixed by an additional `cuda.synchronize()`.
I also snuck in the removal of an outdated comment and a duplicate line.
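A rough sketch of that slow path (the real logic lives in Inductor's cudagraph trees; names here are illustrative):
```python
import gc
import torch

def clear_stray_pool_memory(check_pool_is_empty):
    """Sketch: before reclaiming memory for a cudagraph recording, try to
    deallocate stray locally-allocated tensors that still hold pool memory."""
    if check_pool_is_empty():
        return True
    # Slow path: drop unreachable tensors that still hold pool memory.
    gc.collect()
    # Stray deallocations may still be pending on the GPU; synchronize so the
    # caching allocator actually sees the memory as freed.
    torch.cuda.synchronize()
    return check_pool_is_empty()
```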
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149741
Approved by: https://github.com/BoyuanFeng, https://github.com/Skylion007
Summary:
Fix logging error like:
```
in combinable_nodes
log.debug(
Message: 'ComboKernels: %d template nodes are filtered'
Arguments: (OrderedSet([8]),)
--- Logging error ---
Traceback (most recent call last):
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 1100, in emit
msg = self.format(record)
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 943, in format
return fmt.format(record)
File "/data/users/guorachel/fbsource/buck-out/v2/gen/fbcode/854b9ed00d28c5c5/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark#link-tree/torch/_logging/_internal.py", line 818, in format
record.message = record.getMessage()
File "/usr/local/fbcode/platform010/lib/python3.10/logging/__init__.py", line 368, in getMessage
msg = msg % self.args
TypeError: %d format: a real number is required, not OrderedSet
```
encountered when running a prod model with the combo kernel feature enabled.
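The likely shape of the fix, shown here only as an illustration: pass a number to the `%d` placeholder instead of the OrderedSet itself.
```python
import logging

log = logging.getLogger(__name__)
filtered_nodes = {8}  # stand-in for the OrderedSet from the report

# Broken: "%d" needs a real number, so formatting the record raises TypeError.
# log.debug("ComboKernels: %d template nodes are filtered", filtered_nodes)

# One possible fix (illustrative): pass the count instead of the set.
log.debug("ComboKernels: %d template nodes are filtered", len(filtered_nodes))
```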
Test Plan: CI
Differential Revision: D71512220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149575
Approved by: https://github.com/ColinPeppler
This PR implements cudagraph partition, following the previous PR on inductor graph partition (#147038). Since there are many ops that cudagraph cannot support, this PR focuses on `cpu ops` and will add more partition rules in the next PR.
## Example
```python
import torch

torch._inductor.config.graph_partition = True

def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()

x, y = [torch.ones(2, 2, device="cuda") for _ in range(2)]
x_cloned, y_cloned = [tmp.clone() for tmp in [x, y]]
eager_out = f(x, y)

f_compiled = torch.compile(f, mode="reduce-overhead")

for _ in range(5):
    compiled_out = f_compiled(x_cloned, y_cloned)
    assert torch.allclose(eager_out, compiled_out)
```
w/o graph partition, we will skip cudagraph:
```
skipping cudagraphs due to skipping cudagraphs due to cpu device (device_put). Found from :
File "/home/boyuan/playground/cudagraph/graph_partition/graph_partition.py", line 9, in f
y_cpu = y1.cpu() + 1 # 3
```
w/ graph partition, we can see two cudagraphify calls under the same torch-compiled region:

## Design
PR #147038 splits the `def call(args)` function into multiple `def partition_id(args)` functions. In this PR, we use `recursively_apply_fns()` to wrap each `partition_id()` function with `cudagraphify`. One major design point is that `cudagraphify` takes metadata such as static_input_idxs, and we need to provide such metadata for each graph partition. However, we previously only had such metadata for the original graph, not for the graph partitions.
The [idea](https://github.com/pytorch/pytorch/pull/147038#discussion_r1964124800) is:
- compute a mapping from the partition metadata (e.g., input/output idx) to the graph metadata, stored in `GraphPartitionMap`.
- during post_compile, get the `CudagraphMetadata` for each partition based on the graph-level metadata and `GraphPartitionMap`, via `get_partition_cudagraph_metadata()`.
- finally, in `cudagraph_partition_pos_compile`, we compute the `CudagraphMetadata` and apply cudagraphify for each graph via `recursively_apply_fns` (a simplified sketch of the index translation is below).
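A highly simplified sketch of the metadata translation, with field and function names invented for illustration (the real `GraphPartitionMap`/`CudagraphMetadata` live in Inductor):
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PartitionMapSketch:
    # For each partition input, the index of the corresponding graph-level
    # input, or None if the partition input is an intermediate buffer.
    # Field name invented for illustration.
    input_index_to_graph_input_index: list[Optional[int]]

def partition_static_input_idxs(graph_static_input_idxs, partition_map):
    """Sketch: translate graph-level static input indices into partition-level ones."""
    static = set(graph_static_input_idxs)
    return [
        partition_idx
        for partition_idx, graph_idx in enumerate(
            partition_map.input_index_to_graph_input_index
        )
        if graph_idx is not None and graph_idx in static
    ]
```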
#### Q: How does it work with codecache?
While we have multiple graph partitions, we still have 1 file and 1 `call` function for 1 dynamo graph. The major difference is that we need to additionally load a `recursively_apply_fns()` for graph partition. We also add `partition_maps: Optional[list[GraphPartitionMap]]` to `CompiledFxGraph` so it can be serialized and deserialized later.
## Edge Case 1
PyTorch has an assumption on input/output orders. For example, backward inputs take saved tensors first and then tangents. In graph partition, we respect such orders via `graph_partition_signature_reorder`.
## Edge Case 2
Cudagraphifying the `call` function gives 2 cudagraph-managed tensors, `buf0` and `primals_1`. However, cudagraphifying `partition_0` gives only 1 cudagraph-managed tensor, `buf0`. This leads to a semantic difference between cudagraphs w/ and w/o graph partition. [full code comparison](https://www.internalfb.com/intern/diffing/?paste_number=1747654420)

To achieve the same semantics, we return an input tensor as an output if it is not freed in a graph partition. This allows more cudagraph-managed tensors and is important for handling saved tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147648
Approved by: https://github.com/eellison
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential [disconnected] Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
Summary:
Relands D69965761 / https://github.com/pytorch/pytorch/pull/147583
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Differential Revision: D70471332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148305
Approved by: https://github.com/shunting314, https://github.com/eellison
This allows each device type to check current devices for Triton compatibility and ensure their Triton backend is present.
This PR replaces the `has_triton()` global method, which was previously used for this task, and moves the initial check for each Inductor backend onto its associated `BaseScheduler` subclass. This means that other backends, such as Halide, can also implement their own availability checks.
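A minimal sketch of the pattern, with invented class and method names (the actual hook lives on Inductor's scheduling backends):
```python
class BaseSchedulingSketch:
    """Illustrative base class: each backend reports its own availability."""

    @classmethod
    def raise_if_unavailable(cls, device) -> None:
        # Default: assume the backend is always usable.
        return None


class TritonSchedulingSketch(BaseSchedulingSketch):
    @classmethod
    def raise_if_unavailable(cls, device) -> None:
        try:
            import triton  # noqa: F401
        except ImportError:
            raise RuntimeError(f"Triton is not installed; cannot compile for {device}")
        # A real implementation would also check that the device type has a
        # registered Triton backend and a supported compute capability.
```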
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139171
Approved by: https://github.com/jansel
Before this PR, calling a triton kernel would look like:
```py
kernel.run(a, b, xnumel, grid=grid(xnumel), stream=stream0)
```
where the `grid=` was passed as a callable (function closure) arg. This PR removes the grid arg:
```py
kernel.run(a, b, xnumel, stream=stream0)
```
Instead, the grid computation is now included in the kernel launcher, with something like:
```py
def launcher(in_ptr0, out_ptr0, xnumel, stream):
    grid_0 = ((xnumel + 1023) >> 10)
    grid_1 = 1
    grid_2 = 1
    runner(grid_0, grid_1, grid_2, stream, function, metadata, None, launch_enter_hook, launch_exit_hook, in_ptr0, out_ptr0, xnumel)
```
This should be faster, since we remove multiple function/dict calls and are able to specialize the grid computation for each `triton.Config`.
It also allows us to unify the handling of grids between the Python and C++ wrapper code. Before this, C++ wrapper code didn't actually support dynamic grid sizes and instead burned in a static grid.
This unification allows this PR to be a net deletion of code.
Note the attached diff contains some minor fbcode-only changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147583
Approved by: https://github.com/eellison, https://github.com/shunting314
As title.
Many changes adapted from https://github.com/pytorch/pytorch/pull/129537.
Also, this diff only covers *static* methods of torchbind *attributes*. Some cases that are not supported/tested:
- dynamic torchbind objects
- torchbind objects as an input to the module.
Note that in JIT Inductor, the attributes are lifted as inputs. So even if we just have torchbind objects as attributes, they will show up as inputs in the graph.
Example generated python code in torch.compile with inductor backend for the test case in `inductor/test_torchbind.py` (P1730554370):
```python
async_compile.wait(globals())
del async_compile

def call(args):
    arg1_1, arg2_1, arg3_1 = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    assert_size_stride(arg2_1, (2, 3), (3, 1))
    buf2 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, arg2_1, buf2)
    del arg1_1
    del arg2_1
    # Topologically Sorted Source Nodes: [x, takes_foo_tuple_return], Original ATen: [aten.add]
    buf3 = torch.ops._TorchScriptTesting.takes_foo_tuple_return.default(arg3_1, buf2)
    buf4 = buf3[0]
    assert_size_stride(buf4, (2, 3), (3, 1))
    buf5 = buf3[1]
    assert_size_stride(buf5, (2, 3), (3, 1))
    buf6 = buf4; del buf4  # reuse
    cpp_fused_add_1(buf6, buf5)
    del buf5
    # Topologically Sorted Source Nodes: [y, b], Original ATen: [aten.add]
    buf7 = torch.ops._TorchScriptTesting.takes_foo.default(arg3_1, buf6)
    del buf3
    del buf6
    buf8 = buf7
    assert_size_stride(buf8, (2, 3), (3, 1))
    # Topologically Sorted Source Nodes: [c], Original ATen: []
    buf9 = torch.ops.higher_order.call_torchbind(arg3_1, 'add_tensor', buf2)
    del arg3_1
    del buf7
    buf10 = buf9
    assert_size_stride(buf10, (2, 3), (3, 1))
    del buf9
    buf11 = buf2; del buf2  # reuse
    cpp_fused_add_2(buf11, buf8, buf10)
    return (buf11, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg1_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((2, 3), (3, 1), device='cpu', dtype=torch.float32)
    import pickle
    global arg3_1
    arg3_1 = pickle.loads(b'\x80\x04\x95[\x00\x00\x00\x00\x00\x00\x00\x8c\x05torch\x94\x8c\x0cScriptObject\x94\x93\x94)\x81\x94]\x94(K\nK\x14e\x8c0__torch__.torch.classes._TorchScriptTesting._Foo\x94\x86\x94b.')
    fn = lambda: call([arg1_1, arg2_1, arg3_1])
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146927
Approved by: https://github.com/angelayi
### Big idea
This PR extends https://github.com/pytorch/pytorch/pull/144288 by combining calling triton in worker processes with the future cache: we kick off triton compilation in the worker processes earlier, during inductor codegen. Basically, instead of calling async_compile.triton for the first time only after the entire code has been generated, we start compiling as soon as we know we'll need to compile the kernel. Then, when loading the generated inductor code, we can simply read from our in-memory future cache, considerably increasing the parallelism.
### Implementation Overview
In total, the diff does the following:
- Converts TritonFuture to LambdaFuture, only calling triton.compile on worker processes
- Now that triton.compile() isn't called on the main process, we call TritonBundler on all compiled kernels when we get them back from workers
- Extend @eellison's future cache to a class, mostly as a refactor
- Finally, call async_compile.triton ahead of time in Scheduler.codegen if workers are warmed up. This causes the subsequent async_compile.triton call that occurs after codegen to cache hit on cold start.
In the diffs after this, I will add more to CompiledTritonKernels so that, on a warm start, TritonBundler automatically populates the in-memory cache with the existing triton kernels, avoiding calling triton altogether on warm starts.
Because LambdaFutures are much faster to kick off than TritonFutures, due to not needing to load from TritonCodeCache at all, the time spent kicking off these worker jobs is pretty minimal for inductor codegen.
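A minimal sketch of such an in-memory future cache, with an invented class name and a generic process pool standing in for the compile workers:
```python
from concurrent.futures import Future, ProcessPoolExecutor

class CompiledKernelFutureCache:
    """Sketch: cache compile futures by kernel source so that kicking off the
    same kernel twice (once during codegen, once when loading the generated
    module) only compiles it once."""

    def __init__(self, executor: ProcessPoolExecutor):
        self._executor = executor
        self._futures: dict[str, Future] = {}

    def compile(self, kernel_source: str, compile_fn) -> Future:
        fut = self._futures.get(kernel_source)
        if fut is None:
            # First request: start compilation on a worker process.
            fut = self._executor.submit(compile_fn, kernel_source)
            self._futures[kernel_source] = fut
        return fut

    def wait(self, kernel_source: str):
        # Block until the (possibly already in-flight) compilation finishes.
        return self._futures[kernel_source].result()
```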
Differential Revision: [D69123174](https://our.internmc.facebook.com/intern/diff/D69123174/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146417
Approved by: https://github.com/jansel
When we attempt prologue or epilogue fusion with a TritonTemplate, we benchmark it at compile time in order to determine profitability. This avoids slowdowns/register spilling, and allows us to pick fusion when a base triton template is slower than cublas but faster when considering an epilogue. However, that fused benchmarking does not do the same async compilation as we do for the base TritonTemplate. The Base TritonTemplate is async compiled during lowering, then later waited on and benchmarked.
This PR extends a similar process to benchmarking fused TritonTemplates in the scheduler. We keep a list of pending fusions which have async compilations, and we resolve any pending fusions a node is in before attempting to fuse it with any other node.
Initially, I saw some slowdowns with this because we kick off async compilations of identical fusions in parallel. To address this, I added source-code caching at the `async_compile` level (we also already cache benchmark runs, but that would not happen in parallel).
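A rough sketch of the pending-fusion bookkeeping (names invented; the real logic lives in the scheduler's fusion pass):
```python
from collections import defaultdict

class PendingFusionsSketch:
    """Sketch: track in-flight compilations of candidate fused kernels and
    resolve them for a node before that node participates in a new fusion."""

    def __init__(self):
        # node -> futures of fused-kernel compilations the node is part of
        self._pending = defaultdict(list)

    def add(self, node_pair, future):
        for node in node_pair:
            self._pending[node].append(future)

    def resolve(self, node):
        # Wait for every async compilation involving this node, so its
        # benchmark results are available before we try further fusions.
        for future in self._pending.pop(node, []):
            future.result()
```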
Compilation speedups:
<img width="717" alt="image" src="https://github.com/user-attachments/assets/8e8f7d6c-7824-4210-83f9-a2a0f6db5ac9" />
This should also let us be a bit more aggressive with configs, or with benchmarking other fusions whose profitability is hard to determine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143408
Approved by: https://github.com/jansel, https://github.com/shunting314