[Graph Partition] add graph partition doc (#159450)

This PR adds documentation for graph partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159450
Approved by: https://github.com/eellison
Boyuan Feng
2025-07-30 17:01:07 +00:00
committed by PyTorch MergeBot
parent 6c6e11c206
commit 435edbcb5d


@@ -219,6 +219,7 @@ may skip CUDAGraph when necessary. Here, we list common reasons for skipping CUD
[dynamic shapes](https://pytorch.org/docs/stable/torch.compiler_dynamic_shapes.html).
CUDAGraph Trees currently record a CUDAGraph for every unique set of input tensor shapes.
Please see *Dynamic Shape Support* for more details.
- **CUDAGraph-unsafe custom ops**: Some custom ops may include CUDAGraph-unsafe ops (e.g., CPU ops), which causes CUDAGraph to be skipped. Please see *CUDAGraph Unsafe Custom Ops* for more details.
- **Incompatible operators**: CUDAGraph Trees skip a function if it contains incompatible
operators. Please replace these operators in a function with supported operators. We
show an exhaustive list of incompatible operators:
@@ -249,6 +250,49 @@ aten._local_scalar_dense
aten._assert_scalar
```
### CUDAGraph Unsafe Custom Ops
Custom ops are assumed to be safe for CUDAGraph by default. However, some custom ops may include unsupported ops such as CPU ops. Since custom ops are treated as black boxes by the compiler, users must explicitly mark these ops as unsafe for CUDAGraph by setting the `torch._C.Tag.cudagraph_unsafe` tag, as demonstrated in the example below. When a function contains CUDAGraph-unsafe custom ops, it will be skipped by CUDAGraph unless *CUDAGraph partition* is enabled.
```python
import torch

# Tag the custom op as CUDAGraph-unsafe because it contains CPU ops.
@torch.library.custom_op(
    "mylib::modify",
    mutates_args=(),
    tags=(torch._C.Tag.cudagraph_unsafe,),
)
def modify(pic: torch.Tensor) -> torch.Tensor:
    pic1 = pic + 1
    pic1_cpu = (pic1.cpu() + 1) * 2  # CPU computation inside the custom op
    return pic1_cpu.cuda() + pic

# Fake (meta) implementation used by the compiler when tracing.
@modify.register_fake
def _(pic):
    return torch.empty_like(pic)
```
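As a hedged usage sketch (assuming a CUDA device is available; the caller `g` below is hypothetical), compiling with `mode="reduce-overhead"` enables CUDAGraph Trees. Because the op is tagged `cudagraph_unsafe`, CUDAGraph is skipped for this function unless CUDAGraph partition is enabled:
```python
import torch

def g(x):
    # Calls the cudagraph-unsafe custom op defined above.
    return modify(x) + 1

# mode="reduce-overhead" enables CUDAGraph Trees.
compiled_g = torch.compile(g, mode="reduce-overhead")

x = torch.ones(4, device="cuda")
out = compiled_g(x)  # CUDAGraph is skipped here due to the cudagraph_unsafe tag
```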
### CUDAGraph Partition
As we discussed earlier, CUDAGraph does not support some ops (e.g., CPU ops), which may limit its adoption. CUDAGraph partition is a compiler solution that automatically splits off these ops, reorders ops to reduce the number of partitions, and applies CUDAGraph to each partition individually. Please set `torch._inductor.config.graph_partition=True` to enable CUDAGraph partition.
Consider the following example where `x` and `y` are GPU inputs but `y_cpu` is a CPU tensor. Without graph partition, this function must be skipped due to the CPU ops. With graph partition, the CPU ops are split off, and the remaining GPU ops are cudagraphified, resulting in two separate CUDAGraphs.
```python
def f(x, y):
    x1 = x + 1
    y1 = y + 1
    y_cpu = y1.cpu() + 1  # CPU ops: split off under CUDAGraph partition
    z = x @ y
    return x1 + y1 + z + y_cpu.cuda()
```
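A rough sketch of enabling it (assuming a CUDA device is available): with the config set, compiling `f` with `mode="reduce-overhead"` cudagraphifies the GPU partitions while the CPU ops run outside the graphs:
```python
import torch

# Enable CUDAGraph partition so the CPU ops in f are split off
# instead of causing the whole function to skip CUDAGraph.
torch._inductor.config.graph_partition = True

compiled_f = torch.compile(f, mode="reduce-overhead")

x = torch.randn(4, 4, device="cuda")
y = torch.randn(4, 4, device="cuda")
out = compiled_f(x, y)  # the GPU partitions are replayed via CUDAGraph
```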
Currently, CUDAGraph partition supports splitting off the following types of ops:
- **Non-GPU Ops**: Popular examples include computation on CPU tensors.
- **Device Copy Ops**: Data transfers between devices, such as the `y1.cpu()` in the example above.
- **Control Flow Ops**: [Control flow ops](https://docs.pytorch.org/docs/stable/cond.html) are split off since they are not yet supported by CUDAGraph (see the sketch after this list).
- **CUDAGraph Unsafe Custom Ops**: Custom ops tagged with `torch._C.Tag.cudagraph_unsafe` are split off. See the *CUDAGraph Unsafe Custom Ops* section for details.
- **Unbacked Symints**: Please refer to the *Dynamic Shape Support* section for more information.
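For illustration, a hypothetical sketch of the control-flow case (the function `h` below is not from the original doc): under CUDAGraph partition, the `torch.cond` call is split off while the surrounding GPU ops can still be captured:
```python
import torch

def h(x):
    x1 = x.sin() + 1  # GPU op: captured by CUDAGraph
    # Control flow op: split off into a non-CUDAGraph partition.
    y = torch.cond(
        x1.sum() > 0,
        lambda v: v.cos(),
        lambda v: v.sin(),
        (x1,),
    )
    return y * 2  # GPU op: captured in another partition
```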
### Limitations
Because CUDA Graph fixes memory addresses, CUDA Graphs do not have a great way of handling live tensors from a previous invocation.
@@ -284,4 +328,4 @@ tensors of a prior iteration (outside of torch.compile) before you begin the nex
|---------------|------------------------------------------------------------|------------------------------------------------------------------------|
| Memory Can Increase | On each graph compilation (new sizes, etc.) | If you are also running non-cudagraph memory |
| Recordings | On any new invocation of a graph | Will re-record on any new, unique path you take through your program |
| Footguns | Invocation of one graph will overwrite prior invocation | Cannot persist memory between separate runs through your model - one training loop iteration, or one run of inference |