convert: rst to myst pr2/2 (#155911)

Fixes #155038
Parent [PR](https://github.com/pytorch/pytorch/pull/155375) (split into two PRs to pass the sanity check).
This PR converts the following three .rst files, along with the references mentioned in each file:

- [torch.compiler_faq](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_faq.rst)
  - torch.compiler_troubleshooting
  - nonsupported_numpy_feats
  - torchdynamo_fine_grain_tracing

- [torch.compiler_fine_grain_apis](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_fine_grain_apis.rst)
  - None

- [torch.compiler_get_started](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_get_started.rst)
  - torch.compiler_overview
  - torch.compiler_api
  - torchdynamo_fine_grain_tracing

I made the edits suggested by the maintainers in the comments on the parent PR.
(I used `git mv` on all files, yet the change still shows up as a delete-and-create.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155911
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Author: Dhia naouali
Date: 2025-06-16 00:44:44 +00:00
Committed by: PyTorch MergeBot
parent c83041cac2
commit c620d0b5c7
5 changed files with 823 additions and 884 deletions


@ -0,0 +1,630 @@
# Frequently Asked Questions
**Author**: [Mark Saroufim](https://github.com/msaroufim)
## Does `torch.compile` support training?
`torch.compile` supports training, using AOTAutograd to capture backwards:
1. The `.forward()` graph and `optimizer.step()` are captured by
TorchDynamo's Python `evalframe` frontend.
2. For each segment of `.forward()` that TorchDynamo captures, it uses
AOTAutograd to generate a backward graph segment.
3. Each pair of forward and backward graphs is (optionally) min-cut
partitioned to save the minimal state between forward and backward.
4. The forward and backward pairs are wrapped in `autograd.function` modules.
5. User code calling `.backward()` still triggers eager's autograd engine,
which runs each *compiled backward* graph as if it were one op, and also runs
any non-compiled eager ops' `.backward()` functions (see the sketch below).
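For a concrete picture, here is a minimal sketch of a compiled training step (a hypothetical toy model; names and shapes are illustrative only):
```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
compiled_model = torch.compile(model)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(compiled_model(x), y)  # forward captured by TorchDynamo
loss.backward()   # eager autograd runs the AOTAutograd-generated backward graphs
optimizer.step()  # optimizer runs as usual
```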
## Do you support Distributed code?
`torch.compile` supports `DistributedDataParallel` (DDP).
Support for other distributed training libraries is being considered.
The main reason distributed code is challenging with Dynamo is that
AOTAutograd unrolls both the forward and backward pass and
provides two graphs for backends to optimize. This is a problem for
distributed code because we would ideally like to overlap communication
operations with computations. Eager PyTorch accomplishes this in
different ways for DDP/FSDP, using autograd hooks, module hooks, and
modifications/mutations of module states. In a naive application of
Dynamo, hooks that should run directly after an operation during
the backward pass may be delayed until after the entire compiled region of
backward ops, due to how AOTAutograd-compiled functions interact with
dispatcher hooks.
The basic strategy for optimizing DDP with Dynamo is outlined in
[distributed.py](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/backends/distributed.py)
where the main idea will be to graph break on [DDP bucket
boundaries](https://pytorch.org/docs/stable/notes/ddp.html#internal-design).
When each node in DDP needs to synchronize its weights with the other
nodes, it organizes its gradients and parameters into buckets, which
reduces communication times and allows a node to broadcast a fraction of
its gradients to other waiting nodes.
Graph breaks in distributed code mean you can expect Dynamo and its
backends to optimize the compute overhead of a distributed program but
not its communication overhead. Graph breaks may interfere with
compilation speedups if the reduced graph size robs the compiler of
fusion opportunities. However, there are diminishing returns with
increasing graph size since most of the current compute optimizations
are local fusions, so in practice this approach may be sufficient.
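As a hedged sketch (it assumes the script is launched with `torchrun`, which sets the usual rank environment variables), wrapping a DDP model looks like this:
```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Sketch only: assumes launch via torchrun, which sets RANK/WORLD_SIZE/LOCAL_RANK.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)

model = nn.Linear(16, 16).to(device)
ddp_model = DDP(model, device_ids=[local_rank])
compiled_model = torch.compile(ddp_model)  # Dynamo graph-breaks on DDP bucket boundaries

loss = compiled_model(torch.randn(8, 16, device=device)).sum()
loss.backward()
dist.destroy_process_group()
```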
## Do I still need to export whole graphs?
For the vast majority of models you probably don't, and you can use
`torch.compile()` as is, but there are a few situations where
full graphs are necessary, and you can ensure a full graph by simply
running `torch.compile(..., fullgraph=True)` (see the sketch after this list).
These situations include:
- Large-scale training runs, such as $250K+ jobs, that require pipeline parallelism
and other advanced sharding strategies.
- Inference optimizers like [TensorRT](https://github.com/pytorch/TensorRT)
or [AITemplate](https://github.com/facebookincubator/AITemplate) that
rely on fusing much more aggressively than training optimizers.
- Mobile training or inference.
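A minimal sketch of the `fullgraph=True` behavior (the failing `print` here is just an illustrative unsupported construct):
```python
import torch

def fn(x):
    print("this print forces a graph break")
    return torch.sin(x)

# With fullgraph=True the graph break surfaces as an error instead of silently
# splitting the function into multiple graphs.
try:
    torch.compile(fn, fullgraph=True)(torch.randn(4))
except Exception as err:
    print(type(err).__name__, err)
```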
Future work will include tracing communication operations into graphs,
coordinating these operations with compute optimizations, and optimizing
the communication operations.
## Why is my code crashing?
If your code ran just fine without `torch.compile` and started to
crash when it was enabled, then the most important first step is figuring
out which part of the stack your failure occurred in. To troubleshoot that,
follow the steps below, and only try the next step if the previous one
succeeded (a short sketch of this bisection follows the list).
1. `torch.compile(..., backend="eager")` only runs TorchDynamo
forward graph capture and then runs the captured graph with PyTorch.
If this fails, then there's an issue with TorchDynamo.
2. `torch.compile(..., backend="aot_eager")`
runs TorchDynamo to capture a forward graph, and then AOTAutograd
to trace the backward graph without any additional backend compiler
steps. PyTorch eager will then be used to run the forward and backward
graphs. If this fails, then there's an issue with AOTAutograd.
3. `torch.compile(..., backend="inductor")` runs TorchDynamo to capture a
forward graph, and then AOTAutograd to trace the backward graph with the
TorchInductor compiler. If this fails, then there's an issue with TorchInductor.
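A hedged sketch of that bisection (a toy function stands in for your failing model):
```python
import torch
import torch._dynamo

def fn(x):
    return torch.nn.functional.gelu(x).sum()

x = torch.randn(8, 8)

# Try progressively deeper backends to localize the failure
# (TorchDynamo -> AOTAutograd -> TorchInductor).
for backend in ("eager", "aot_eager", "inductor"):
    torch._dynamo.reset()  # drop previously compiled graphs between attempts
    out = torch.compile(fn, backend=backend)(x)
    print(f"{backend}: ok ({out.item():.4f})")
```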
## Why is compilation slow?
- **Dynamo Compilation** TorchDynamo has a built-in stats function for
collecting and displaying the time spent in each compilation phase.
These stats can be accessed by calling `torch._dynamo.utils.compile_times()`
after executing `torch._dynamo`. By default, this returns a string
representation of the compile time spent in each TorchDynamo function, by name
(see the sketch after this list).
- **Inductor Compilation** TorchInductor has a built-in stats and trace function
for displaying the time spent in each compilation phase, the output code, the output
graph visualization, and the IR dump. Run it with `env TORCH_COMPILE_DEBUG=1 python repro.py`.
This is a debugging tool designed to make it easier to debug/understand the
internals of TorchInductor with an output that will look something like
[this](https://gist.github.com/jansel/f4af078791ad681a0d4094adeb844396).
Each file in that debug trace can be enabled/disabled via
`torch._inductor.config.trace.*`. The profile and the diagram are both
disabled by default since they are expensive to generate. See the
[example debug directory
output](https://gist.github.com/jansel/f4af078791ad681a0d4094adeb844396)
for more examples.
- **Excessive Recompilation**
When TorchDynamo compiles a function (or part of one), it makes certain
assumptions about locals and globals in order to allow compiler
optimizations, and expresses these assumptions as guards that check
particular values at runtime. If any of these guards fail, Dynamo will
recompile that function (or part) up to
`torch._dynamo.config.recompile_limit` times. If your program is
hitting the recompile limit, you will first need to determine which guard is
failing and what part of your program is triggering it.
Use `TORCH_TRACE`/`tlparse` or `TORCH_LOGS=recompiles` to trace the root cause of the issue; see {ref}`torch.compiler_troubleshooting` for more details.
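A small sketch of the compile-time stats mentioned above (the toy function is illustrative; run the script with `TORCH_LOGS=recompiles` to also see why recompiles happen):
```python
import torch
import torch._dynamo.utils

@torch.compile
def fn(x):
    return (x * 2).relu()

fn(torch.randn(8))
fn(torch.randn(16))  # a new shape may add guards and trigger a recompile

# Per-phase compilation time, broken down by TorchDynamo function name.
print(torch._dynamo.utils.compile_times())
```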
## Why are you recompiling in production?
In some cases, you may not want unexpected compiles after a program has
warmed up, for example, if you are serving production traffic in a
latency-critical application. For this, TorchDynamo provides an
alternate mode where prior compiled graphs are used, but no new ones are
generated:
```python
import torch._dynamo as dynamo

# `toy_example` must already have been compiled (warmed up); dynamo.run reuses
# prior compiled graphs and never compiles new ones.
frozen_toy_example = dynamo.run(toy_example)
frozen_toy_example(torch.randn(10), torch.randn(10))
```
## How are you speeding up my code?
There are three major ways to accelerate PyTorch code:
1. Kernel fusion:
   - Vertical fusion fuses sequential operations to avoid excessive
     reads/writes. For example, fusing two subsequent cosines means you can do
     one read and one write instead of two reads and two writes.
   - Horizontal fusion: the simplest example is batching, where a single matrix
     is multiplied with a batch of examples, but the more general scenario is a
     grouped GEMM, where a group of matrix multiplications are scheduled together.
2. Out-of-order execution: a general optimization for compilers. By looking ahead
   at the exact data dependencies within a graph, we can decide on the most
   opportune time to execute a node and which buffers can be reused.
3. Automatic work placement: similar to the out-of-order execution point,
   but by matching nodes of a graph to resources like physical hardware or
   memory, we can design an appropriate schedule.
The above are general principles for accelerating PyTorch code, but
different backends will each make different tradeoffs on what to
optimize. For example, Inductor first takes care of fusing whatever it
can and only then generates [Triton](https://openai.com/blog/triton/)
kernels.
Triton additionally offers speedups because of automatic memory
coalescing, memory management, and scheduling within each Streaming
Multiprocessor, and it has been designed to handle tiled computations.
However, regardless of the backend you use, it's best to take a
benchmark-and-see approach, so try out the PyTorch profiler, visually inspect the
generated kernels, and try to see what's going on for yourself.
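For instance, a hedged profiling sketch (a toy compiled function stands in for your model):
```python
import torch
from torch.profiler import profile, ProfilerActivity

@torch.compile
def fn(x):
    return torch.softmax(x @ x.T, dim=-1)

x = torch.randn(256, 256)
fn(x)  # warm up first so compilation itself is not profiled

with profile(activities=[ProfilerActivity.CPU]) as prof:
    fn(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```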
(torch.compiler_graph_breaks)=
## Why am I not seeing speedups?
### Graph Breaks
The main reason you won't see the speedups you'd like from using Dynamo
is excessive graph breaks. So what's a graph break?
Given a program like:
```python
def some_fun(x):
    ...

torch.compile(some_fun)(x)
...
```
TorchDynamo will attempt to compile all of the torch/tensor operations
within `some_fun()` into a single FX graph, but it may fail to capture
everything into one graph.
Some graph break reasons are insurmountable to TorchDynamo. For example, calling
into a C extension other than PyTorch is invisible to TorchDynamo and
could do arbitrary things without TorchDynamo being able to introduce the
necessary guards to ensure that the compiled program is safe to reuse.
> To maximize performance, it's important to have as few graph breaks
> as possible.
### Identifying the cause of a graph break
To identify all graph breaks in a program and the associated reasons for
the breaks, `torch._dynamo.explain` can be used. This tool runs
TorchDynamo on the supplied function and aggregates the graph breaks
that are encountered. Here is an example usage:
```python
import torch
import torch._dynamo as dynamo

def toy_example(a, b):
    x = a / (torch.abs(a) + 1)
    print("woo")
    if b.sum() < 0:
        b = b * -1
    return x * b

explanation = dynamo.explain(toy_example)(torch.randn(10), torch.randn(10))
print(explanation)
"""
Graph Count: 3
Graph Break Count: 2
Op Count: 5
Break Reasons:
  Break Reason 1:
    Reason: builtin: print [<class 'torch._dynamo.variables.constant.ConstantVariable'>] False
    User Stack:
      <FrameSummary file foo.py, line 5 in toy_example>
  Break Reason 2:
    Reason: generic_jump TensorVariable()
    User Stack:
      <FrameSummary file foo.py, line 6 in torch_dynamo_resume_in_toy_example_at_5>
Ops per Graph:
  ...
Out Guards:
  ...
"""
```
To throw an error on the first graph break encountered, you can
disable Python fallbacks by using `fullgraph=True`. This should be
familiar if you've worked with export-based compilers.
```python
def toy_example(a, b):
    ...

torch.compile(toy_example, fullgraph=True, backend=<compiler>)(a, b)
```
### Why didn't my code recompile when I changed it?
If you enabled dynamic shapes by setting
`env TORCHDYNAMO_DYNAMIC_SHAPES=1 python model.py`, then your code
won't recompile on shape changes. We've added support for dynamic shapes,
which avoids recompilations in the case when shapes vary by less than a
factor of 2. This is especially useful in scenarios like varying image
sizes in CV or variable sequence lengths in NLP. In inference scenarios,
it's often not possible to know what a batch size will be beforehand,
because you take what you can get from different client apps.
In general, TorchDynamo tries very hard not to recompile things
unnecessarily, so if, for example, TorchDynamo finds 3 graphs and your
change only modified one graph, then only that graph will recompile. So
another tip to avoid potentially slow compilation times is to warm up a
model by compiling it once, after which subsequent compilations will be
much faster. Cold start compile time is still a metric we track
visibly.
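As a hedged sketch (the exact flags have evolved across releases), you can also opt into dynamic shapes from Python rather than via the environment variable, either globally with `dynamic=True` or per dimension with `mark_dynamic`:
```python
import torch
import torch._dynamo

@torch.compile(dynamic=True)
def fn(x):
    return x.sin().sum()

fn(torch.randn(8))
fn(torch.randn(32))  # ideally reuses the compiled graph instead of recompiling

# Alternatively, mark just one dimension of a specific input as dynamic.
y = torch.randn(16, 4)
torch._dynamo.mark_dynamic(y, 0)
fn(y)
```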
## Why am I getting incorrect results?
Accuracy issues can also be minified if you set the environment variable
`TORCHDYNAMO_REPRO_LEVEL=4`; it operates with a similar git bisect
model, and a full repro might be something like
`TORCHDYNAMO_REPRO_AFTER="aot" TORCHDYNAMO_REPRO_LEVEL=4`. The reason
we need this is that downstream compilers generate code, whether it's
Triton code or the C++ backend, and the numerics from those downstream
compilers can differ in subtle ways yet have a dramatic impact on
your training stability. So the accuracy debugger is very useful for us
to detect bugs in our codegen or in a backend compiler.
If you'd like to ensure that random number generation is the same across both torch
and Triton, then you can enable `torch._inductor.config.fallback_random = True`.
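A minimal sketch of that RNG setting (apply it before compiling; it is mainly useful when comparing compiled and eager runs that involve dropout or other random ops):
```python
import torch
import torch._inductor.config

# Make Inductor fall back to ATen for random number generation so compiled and
# eager runs draw the same random numbers when comparing accuracy.
torch._inductor.config.fallback_random = True
```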
## Why am I getting OOMs?
Dynamo is still an alpha product, so there are a few sources of OOMs. If
you're seeing an OOM, try disabling the following configurations in this
order and then open an issue on GitHub so we can solve the root problem:
1. If you're using dynamic shapes, try disabling them; we've disabled
them by default: `env TORCHDYNAMO_DYNAMIC_SHAPES=0 python model.py`.
2. CUDA graphs with Triton are enabled by default in Inductor, but removing
them may alleviate some OOM issues: `torch._inductor.config.triton.cudagraphs = False`.
## Does `torch.func` work with `torch.compile` (for `grad` and `vmap` transforms)?
Applying a `torch.func` transform to a function that uses `torch.compile`
does work:
```python
import torch

@torch.compile
def f(x):
    return torch.sin(x)

def g(x):
    # torch.func.grad requires a scalar output, hence the .sum()
    return torch.func.grad(lambda y: f(y).sum())(x)

x = torch.randn(2, 3)
g(x)
```
### Calling `torch.func` transform inside of a function handled with `torch.compile`
### Compiling `torch.func.grad` with `torch.compile`
```python
import torch

def wrapper_fn(x):
    return torch.func.grad(lambda x: x.sin().sum())(x)

x = torch.randn(3, 3, 3)
grad_x = torch.compile(wrapper_fn)(x)
```
### Compiling `torch.vmap` with `torch.compile`
```python
import torch

def my_fn(x):
    return torch.vmap(lambda x: x.sum(1))(x)

x = torch.randn(3, 3, 3)
output = torch.compile(my_fn)(x)
```
### Compiling functions besides the ones which are supported (escape hatch)
For other transforms, as a workaround, use `torch._dynamo.allow_in_graph`.
`allow_in_graph` is an escape hatch. If your code does not work with
`torch.compile`, which introspects Python bytecode, but you believe it
will work via a symbolic tracing approach (like `jax.jit`), then use
`allow_in_graph`.
When using `allow_in_graph` to annotate a function, you must make sure
your code meets the following requirements:
- All outputs in your function only depend on the inputs and
do not depend on any captured Tensors.
- Your function is functional. That is, it does not mutate any state. This may
be relaxed; we actually support functions that appear to be functional from
the outside: they may have in-place PyTorch operations, but may not mutate
global state or inputs to the function.
- Your function does not raise data-dependent errors.
```python
import torch

@torch.compile
def f(x):
    return torch._dynamo.allow_in_graph(torch.vmap(torch.sum))(x)

x = torch.randn(2, 3)
f(x)
```
A common pitfall is using `allow_in_graph` to annotate a function that
invokes an `nn.Module`. This is because the outputs now depend on the
parameters of the `nn.Module`. To get this to work, use
`torch.func.functional_call` to extract the module state.
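A hedged sketch of that workaround (hypothetical module and shapes; `functional_call` passes the module state in explicitly so the outputs depend only on the function's inputs):
```python
import torch
import torch.nn as nn
from torch.func import functional_call

model = nn.Linear(4, 1)

@torch.compile
def run(state, x):
    # The module is called functionally, so outputs depend only on explicit inputs.
    return functional_call(model, state, (x,))

state = {**dict(model.named_parameters()), **dict(model.named_buffers())}
y = run(state, torch.randn(8, 4))
```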
## Does NumPy work with `torch.compile`?
Starting in 2.1, `torch.compile` understands native NumPy programs that
work on NumPy arrays, and mixed PyTorch-NumPy programs that convert from PyTorch
to NumPy and back via `x.numpy()`, `torch.from_numpy`, and related functions.
(nonsupported-numpy-feats)=
### Which NumPy features does `torch.compile` support?
NumPy within `torch.compile` follows NumPy 2.0 pre-release.
Generally, `torch.compile` is able to trace through most NumPy constructions,
and when it cannot, it falls back to eager and lets NumPy execute that piece of
code. Even then, there are a few features where `torch.compile` semantics
slightly deviate from those of NumPy:
- NumPy scalars: We model them as 0-D arrays. That is, `np.float32(3)` returns
a 0-D array under `torch.compile`. To avoid a graph break, it is best to use this 0-D
array. If this breaks your code, you can work around this by casting the NumPy scalar
to the relevant Python scalar type `bool/int/float`.
- Negative strides: `np.flip` and slicing with a negative step return a copy.
- Type promotion: NumPy's type promotion will change in NumPy 2.0. The new rules
are described in [NEP 50](https://numpy.org/neps/nep-0050-scalar-promotion.html).
`torch.compile` implements NEP 50 rather than the current soon-to-be deprecated rules.
- `{tril,triu}_indices_from/{tril,triu}_indices` return arrays rather than a tuple of arrays.
There are other features for which we do not support tracing, and we gracefully
fall back to NumPy for their execution:
- Non-numeric dtypes like datetimes, strings, chars, void, structured dtypes and recarrays.
- Long dtypes `np.float128/np.complex256` and some unsigned dtypes `np.uint16/np.uint32/np.uint64`.
- `ndarray` subclasses.
- Masked arrays.
- Esoteric ufunc machinery like `axes=[(n,k),(k,m)->(n,m)]` and ufunc methods (e.g., `np.add.reduce`).
- Sorting / ordering `complex64/complex128` arrays.
- NumPy `np.poly1d` and `np.polynomial`.
- Positional `out1, out2` args in functions with 2 or more returns (`out=tuple` does work).
- `__array_function__`, `__array_interface__` and `__array_wrap__`.
- `ndarray.ctypes` attribute.
### Can I compile NumPy code using `torch.compile`?
Of course you can! `torch.compile` understands NumPy code natively and treats it
as if it were PyTorch code. To do so, simply wrap NumPy code with the `torch.compile`
decorator.
```python
import torch
import numpy as np

@torch.compile
def numpy_fn(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    return np.sum(X[:, :, None] * Y[:, None, :], axis=(-2, -1))

X = np.random.randn(1024, 64)
Y = np.random.randn(1024, 64)
Z = numpy_fn(X, Y)
assert isinstance(Z, np.ndarray)
```
Executing this example with the environment variable `TORCH_LOGS=output_code`, we can see
that `torch.compile` was able to fuse the multiplication and the sum into one C++ kernel.
It was also able to execute them in parallel using OpenMP (native NumPy is single-threaded).
This can easily make your NumPy code `n` times faster, where `n` is the number of cores
in your processor!
Tracing NumPy code this way also supports graph breaks within the compiled code.
### Can I execute NumPy code on CUDA and compute gradients via `torch.compile`?
Yes you can! To do so, simply execute your code within a `torch.device("cuda")`
context. Consider the example:
```python
import torch
import numpy as np

@torch.compile
def numpy_fn(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    return np.sum(X[:, :, None] * Y[:, None, :], axis=(-2, -1))

X = np.random.randn(1024, 64)
Y = np.random.randn(1024, 64)
with torch.device("cuda"):
    Z = numpy_fn(X, Y)
assert isinstance(Z, np.ndarray)
```
In this example, `numpy_fn` will be executed on CUDA. For this to be
possible, `torch.compile` automatically moves `X` and `Y` from CPU
to CUDA, and then it moves the result `Z` from CUDA to CPU. If we are
executing this function several times in the same program run, we may want
to avoid all these rather expensive memory copies. To do so, we just need
to tweak our `numpy_fn` so that it accepts CUDA tensors and returns tensors.
We can do so by using `torch.compiler.wrap_numpy`:
```python
@torch.compile(fullgraph=True)
@torch.compiler.wrap_numpy
def numpy_fn(X, Y):
    return np.sum(X[:, :, None] * Y[:, None, :], axis=(-2, -1))

X = torch.randn(1024, 64, device="cuda")
Y = torch.randn(1024, 64, device="cuda")
Z = numpy_fn(X, Y)
assert isinstance(Z, torch.Tensor)
assert Z.device.type == "cuda"
```
Here, we explicitly create the tensors in CUDA memory, and pass them to the
function, which performs all the computations on the CUDA device.
`wrap_numpy` is in charge of marking any `torch.Tensor` input as an input
with `np.ndarray` semantics at a `torch.compile` level. Marking tensors
inside the compiler is a very cheap operation, so no data copy or data movement
happens during runtime.
Using this decorator, we can also differentiate through NumPy code!
```python
@torch.compile(fullgraph=True)
@torch.compiler.wrap_numpy
def numpy_fn(X, Y):
    return np.mean(np.sum(X[:, :, None] * Y[:, None, :], axis=(-2, -1)))

X = torch.randn(1024, 64, device="cuda", requires_grad=True)
Y = torch.randn(1024, 64, device="cuda")
Z = numpy_fn(X, Y)
assert isinstance(Z, torch.Tensor)
Z.backward()
# X.grad now holds the gradient of the computation
print(X.grad)
```
We have been using `fullgraph=True` as graph breaks are problematic in this context.
When a graph break occurs, we need to materialize the NumPy arrays. Since NumPy arrays
do not have a notion of `device` or `requires_grad`, this information is lost during
a graph break.
We cannot propagate gradients through a graph break, as the graph break may execute
arbitrary code that we don't know how to differentiate. On the other hand, in the case of
the CUDA execution, we can work around this problem as we did in the first example, by
using the `torch.device("cuda")` context manager:
```python
@torch.compile
@torch.compiler.wrap_numpy
def numpy_fn(X, Y):
    prod = X[:, :, None] * Y[:, None, :]
    print("oops, a graph break!")
    return np.sum(prod, axis=(-2, -1))

X = torch.randn(1024, 64, device="cuda")
Y = torch.randn(1024, 64, device="cuda")
with torch.device("cuda"):
    Z = numpy_fn(X, Y)
assert isinstance(Z, torch.Tensor)
assert Z.device.type == "cuda"
```
During the graph break, the intermediary tensors still need to be moved to CPU, but when the
tracing is resumed after the graph break, the rest of the graph is still traced on CUDA.
Given this CUDA <> CPU and CPU <> CUDA movement, graph breaks are fairly costly in the NumPy
context and should be avoided, but at least they allow tracing through complex pieces of code.
### How do I debug NumPy code under `torch.compile`?
Debugging JIT compiled code is challenging, given the complexity of modern
compilers and the daunting errors that they raise.
{ref}`The torch.compile troubleshooting doc <torch.compiler_troubleshooting>`
contains a few tips and tricks on how to tackle this task.
If the above is not enough to pinpoint the origin of the issue, there are still
a few other NumPy-specific tools we can use. We can discern whether the bug
is entirely in the PyTorch code by disabling tracing through NumPy functions:
```python
from torch._dynamo import config
config.trace_numpy = False
```
If the bug lies in the traced NumPy code, we can execute the NumPy code eagerly (without `torch.compile`)
using PyTorch as a backend by replacing the import with `import torch._numpy as np`.
This should just be used for **debugging purposes** and is in no way a
replacement for the PyTorch API, as it is **much less performant** and, as a
private API, **may change without notice**. At any rate, `torch._numpy` is a
Python implementation of NumPy in terms of PyTorch, and it is used internally by `torch.compile` to
transform NumPy code into PyTorch code. It is rather easy to read and modify,
so if you find any bug in it, feel free to submit a PR fixing it or simply open
an issue.
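A small sketch of this debugging workflow (remember that `torch._numpy` is private and may change; the function below is the earlier toy example):
```python
# Debugging only: run the NumPy program eagerly on top of PyTorch, no torch.compile.
import torch._numpy as np

def numpy_fn(X, Y):
    return np.sum(X[:, :, None] * Y[:, None, :], axis=(-2, -1))

X = np.random.randn(64, 8)
Y = np.random.randn(64, 8)
Z = numpy_fn(X, Y)  # if this misbehaves, the bug is likely in torch._numpy
```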
If the program does work when importing `torch._numpy as np`, chances are
that the bug is in TorchDynamo. If this is the case, please feel free to open an issue
with a {ref}`minimal reproducer <torch.compiler_troubleshooting>`.
### I `torch.compile` some NumPy code and I did not see any speed-up.
The best place to start is the
[tutorial with general advice for how to debug these sort of torch.compile issues](https://pytorch.org/docs/main/torch.compiler_faq.html#why-am-i-not-seeing-speedups).
Some graph breaks may happen because of the use of unsupported features. See
{ref}`nonsupported-numpy-feats`. More generally, it is useful to keep in mind
that some widely used NumPy features do not play well with compilers. For
example, in-place modifications make reasoning difficult within the compiler and
often yield worse performance than their out-of-place counterparts. As such, it is best to avoid
them. The same goes for the use of the `out=` parameter. Instead, prefer
out-of-place ops and let `torch.compile` optimize the memory use. The same goes
for data-dependent ops like masked indexing through boolean masks, or
data-dependent control flow like `if` or `while` constructions.
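For illustration, a hedged sketch of the out-of-place style that tends to compile better (the in-place masked assignment it replaces is shown in the comment):
```python
import numpy as np
import torch

# In-place style, harder for the compiler to reason about:
#     x[x < 0] = 0.0
# An out-of-place equivalent that is usually friendlier to torch.compile:
@torch.compile
def relu_like(x: np.ndarray) -> np.ndarray:
    return np.where(x < 0, 0.0, x)

y = relu_like(np.random.randn(8, 8))
```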
## Which API to use for fine grain tracing?
In some cases, you might need to exclude small parts of your code from
`torch.compile` compilation. This section provides some of the answers, and
you can find more information in {ref}`torchdynamo_fine_grain_tracing`.
### How do I graph break on a function?
A graph break on a function is not enough to fully express what you want
PyTorch to do. You need to be more specific about your use case. Some of the
most common use cases you might want to consider:
- If you want to disable compilation on this function frame and the recursively
invoked frames, use `torch._dynamo.disable` (see the sketch after this list).
- If you want a particular operator, such as `fbgemm`, to run in eager mode,
use `torch._dynamo.disallow_in_graph`.
Some of the uncommon use cases include:
- If you want to disable TorchDynamo on the function frame but enable it back
on the recursively invoked frames, use `torch._dynamo.disable(recursive=False)`.
- If you want to prevent inlining of a function frame, add `torch._dynamo.graph_break`
at the beginning of that function.
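A hedged sketch of the two most common cases above (the logging helper and the choice of `torch.sub` as the disallowed operator are purely illustrative):
```python
import torch
import torch._dynamo

@torch._dynamo.disable
def log_stats(x):
    # Skipped by the compiler; by default, frames it calls are skipped too.
    print("mean:", x.mean().item())

@torch.compile
def fn(x):
    log_stats(x)
    return torch.cos(x) + torch.sin(x)

fn(torch.randn(4))

# Operator-level exclusion: keep torch.sub out of the extracted graphs so it runs eagerly.
torch._dynamo.disallow_in_graph(torch.sub)
```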
### What's the difference between `torch._dynamo.disable` and `torch._dynamo.disallow_in_graph`
Disallow-in-graph works at the level of operators, or more specifically,
the operators that you see in the TorchDynamo extracted graphs.
Disable works at the function frame level and decides if TorchDynamo
should look into the function frame or not.
### What's the difference between `torch._dynamo.disable` and `torch._dynamo_skip`
:::{note}
`torch._dynamo_skip` is deprecated.
:::
You most likely need `torch._dynamo.disable`. But in an unlikely scenario, you
might need even finer control. Suppose you want to disable the tracing on just
the `a_fn` function, but want to continue the tracing back in `aa_fn` and
`ab_fn`. The image below demonstrates this use case:
:::{figure} _static/img/fine_grained_apis/call_stack_diagram.png
:alt: diagram of torch.compile + disable(a_fn, recursive=False)
:::
In this case, you can use `torch._dynamo.disable(recursive=False)`.
In previous versions, this functionality was provided by `torch._dynamo.skip`.
This is now supported by the `recursive` flag inside `torch._dynamo.disable`.



@ -1,15 +1,15 @@
(torchdynamo_fine_grain_tracing)=
# TorchDynamo APIs for fine-grained tracing
:::{note}
In this document `torch.compiler.compile` and `torch.compile` are used interchangeably.
Both versions will work in your code.
:::
`torch.compile` performs TorchDynamo tracing on the whole user model.
However, it is possible that a small part of the model code cannot be
handled by `torch.compiler`. In this case, you might want to disable
the compiler on that particular portion, while running compilation on
the rest of the model. This section describes the existing APIs that you can
use to define parts of your code in which you want to skip compilation
@ -18,6 +18,7 @@ and the relevant use cases.
The APIs that you can use to define portions of the code on which to
disable compilation are listed in the following table:
```{eval-rst}
.. csv-table:: TorchDynamo APIs to control fine-grained tracing
:header: "API", "Description", "When to use?"
:widths: auto
@ -29,24 +30,25 @@ disable compilation are listed in the following table:
"``torch.compiler.is_compiling``", "Indicates whether a graph is executed/traced as part of torch.compile() or torch.export()."
"``torch.compiler.is_dynamo_compiling``", "Indicates whether a graph is traced via TorchDynamo. It's stricter than torch.compiler.is_compiling() flag, as it would only be set to True when TorchDynamo is used."
"``torch.compiler.is_exporting``", "Indicates whether a graph is traced via export. It's stricter than torch.compiler.is_compiling() flag, as it would only be set to True when torch.export is used."
```
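As a hedged illustration of the last three entries in the table above (the function below is made up), these flags let code detect at runtime whether it is currently being traced:

```python
import torch

def fn(x):
    # True while this code is traced as part of torch.compile or torch.export,
    # False when it runs in plain eager mode.
    if torch.compiler.is_compiling():
        return torch.sin(x)
    return torch.cos(x)

compiled_fn = torch.compile(fn)
compiled_fn(torch.randn(4))  # traced: takes the torch.sin branch
fn(torch.randn(4))           # eager: takes the torch.cos branch
```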
``torch.compiler.disable``
~~~~~~~~~~~~~~~~~~~~~~~~~~
## `torch.compiler.disable`
``torch.compiler.disable`` disables compilation on the decorated function frame and all the function frames recursively invoked from the decorated function frame.
`torch.compiler.disable` disables compilation on the decorated function frame and all the function frames recursively invoked from the decorated function frame.
TorchDynamo intercepts the execution of each Python function frame. So, suppose you have a code structure (image below) where the function ``fn`` calls functions ``a_fn`` and ``b_fn``. And ``a_fn`` calls ``aa_fn`` and ``ab_fn``. When you use the PyTorch eager mode rather than ``torch.compile``, these function frames run as is. With ``torch.compile``, TorchDynamo intercepts each of these function frames (indicated by the green color):
TorchDynamo intercepts the execution of each Python function frame. So, suppose you have a code structure (image below) where the function `fn` calls functions `a_fn` and `b_fn`. And `a_fn` calls `aa_fn` and `ab_fn`. When you use the PyTorch eager mode rather than `torch.compile`, these function frames run as is. With `torch.compile`, TorchDynamo intercepts each of these function frames (indicated by the green color):
.. figure:: _static/img/fine_grained_apis/api_diagram.png
:alt: Callstack diagram of different apis.
:::{figure} _static/img/fine_grained_apis/api_diagram.png
:alt: Callstack diagram of different apis.
:::
Let's imagine, that function ``a_fn`` is causing troubles with ``torch.compile``.
And this is a non-critical portion of the model. You can use ``compiler.disable``
on function ``a_fn``. As shown above, TorchDynamo will stop looking at frames
originating from the ``a_fn`` call (white color indicates original Python behavior).
Let's imagine that the function `a_fn` is causing trouble with `torch.compile`
and that it is a non-critical portion of the model. You can use `compiler.disable`
on the function `a_fn`. As shown above, TorchDynamo will stop looking at frames
originating from the `a_fn` call (white color indicates original Python behavior).
To skip compilation, you can decorate the offending function with
``@torch.compiler.disable``.
`@torch.compiler.disable`.
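For example, a minimal sketch of the structure from the figure above (the function bodies are invented for illustration):

```python
import torch

@torch.compiler.disable
def a_fn(x):
    # Troublesome but non-critical code; this frame, and anything it calls,
    # runs as regular eager Python.
    print("debugging:", x.shape)
    return torch.cos(x)

def b_fn(x):
    return torch.sin(x)

@torch.compile
def fn(x):
    return a_fn(x) + b_fn(x)

fn(torch.randn(4))
```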
You can also use the non-decorator syntax if you don't want to change the source
code.
@ -54,54 +56,53 @@ However, we recommend that you avoid this style if possible. Here, you have to
take care that all users of the original function are now using the patched
version.
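A sketch of that non-decorator style, where `offending_fn` stands in for a function whose source you would rather not touch:

```python
import torch

def offending_fn(x):
    return torch.cos(x)

# Patch the function in place instead of editing its definition.
offending_fn = torch.compiler.disable(offending_fn)

@torch.compile
def fn(x):
    return offending_fn(x) + 1

fn(torch.randn(4))
```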
``torch._dynamo.disallow_in_graph``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## `torch._dynamo.disallow_in_graph`
``torch._dynamo.disallow_in_graph`` disallows an operator but not the function
`torch._dynamo.disallow_in_graph` disallows an operator but not the function
to be present in the TorchDynamo extracted graph. Note that this is suitable
for operators and not general functions as in the case of ``_dynamo.disable``.
for operators and not general functions as in the case of `_dynamo.disable`.
Let's imagine you compile your model with PyTorch. TorchDynamo is able to
extract a graph, but then you see the downstream compiler failing. For example,
the meta kernel is missing, or some Autograd dispatch key is set incorrectly
for a particular operator. Then you can mark that operator as
``disallow_in_graph``, and TorchDynamo will cause a graph break and run that
`disallow_in_graph`, and TorchDynamo will cause a graph break and run that
operator by using the PyTorch eager mode.
The catch is that you will have to find the corresponding Dynamo level operator,
and not the ATen level operator. See more in the Limitations section of the doc.
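As a hedged sketch, suppose the downstream compiler mishandles `torch.sinh` (purely an example); disallowing it keeps the rest of the function compiled while that one operator runs eagerly:

```python
import torch

# Force a graph break around torch.sinh so it runs in eager mode.
torch._dynamo.disallow_in_graph(torch.sinh)

@torch.compile
def fn(x):
    a = torch.cos(x)
    b = torch.sinh(a)  # executed by eager PyTorch
    return torch.sin(b)

fn(torch.randn(8))
```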
.. warning::
``torch._dynamo.disallow_in_graph`` is a global flag. If you are comparing
different backend compilers, you might have to call ``allow_in_graph`` for
the disallowed operator when switching to the other compiler.
:::{warning}
`torch._dynamo.disallow_in_graph` is a global flag. If you are comparing
different backend compilers, you might have to call `allow_in_graph` for
the disallowed operator when switching to the other compiler.
:::
``torch.compiler.allow_in_graph``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## `torch.compiler.allow_in_graph`
``torch.compiler.allow_in_graph`` is useful when the relevant function frame
`torch.compiler.allow_in_graph` is useful when the relevant function frame
has some known hard-to-support TorchDynamo feature, such as hooks and
``autograd.Function``, and you are confident that downstream PyTorch components
`autograd.Function`, and you are confident that downstream PyTorch components
such as AOTAutograd can safely trace through the decorated function. When a
function is decorated with ``allow_in_graph``, TorchDynamo treats it as a
function is decorated with `allow_in_graph`, TorchDynamo treats it as a
black-box and puts it as is in the generated graph.
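A minimal sketch (here `my_helper` is a made-up function that TorchDynamo supposedly struggles to trace but that AOTAutograd can handle):

```python
import torch

def my_helper(x):
    return x * torch.sigmoid(x)

# Treat my_helper as a black box and put it into the graph as-is.
torch.compiler.allow_in_graph(my_helper)

@torch.compile
def fn(x):
    return my_helper(x) + 1

fn(torch.randn(4))
```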
.. warning::
``allow_in_graph`` skips TorchDynamo completely on the decorated function
omitting all TorchDynamo safety checks, including graph breaks, handling
closures, and others. Use `allow_in_graph` with caution. PyTorch downstream
components, such as AOTAutograd rely on TorchDynamo to handle complex Python
features, but ``allow_in_graph`` bypasses TorchDynamo. Using ``allow_in_graph``
could lead to soundness and hard-to-debug issues.
:::{warning}
`allow_in_graph` skips TorchDynamo completely on the decorated function,
omitting all TorchDynamo safety checks, including graph breaks, the handling
of closures, and others. Use `allow_in_graph` with caution. PyTorch downstream
components, such as AOTAutograd, rely on TorchDynamo to handle complex Python
features, but `allow_in_graph` bypasses TorchDynamo. Using `allow_in_graph`
could lead to soundness problems and hard-to-debug issues.
:::
Limitations
~~~~~~~~~~~
## Limitations
All the existing APIs are applied at the TorchDynamo level. Therefore, these
APIs have visibility only into what TorchDynamo sees. This can lead to confusing
scenarios.
For example, ``torch._dynamo.disallow_in_graph`` will not work for ATen operators
For example, `torch._dynamo.disallow_in_graph` will not work for ATen operators
because they are visible to AOT Autograd. For instance,
``torch._dynamo.disallow_in_graph(torch.ops.aten.add)`` will not work in the
`torch._dynamo.disallow_in_graph(torch.ops.aten.add)` will not work in the
above example.
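As a hedged illustration of the distinction (using `torch.add` only as an example):

```python
import torch

# The ATen-level op is not what TorchDynamo traces, so disallowing it here
# would not have the intended effect:
# torch._dynamo.disallow_in_graph(torch.ops.aten.add)

# The torch-level op is what appears in TorchDynamo's extracted graph:
torch._dynamo.disallow_in_graph(torch.add)
```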
@ -0,0 +1,148 @@
(torch_compiler_get_started)=
# Getting Started
Before you read this section, make sure to read the {ref}`torch.compiler_overview`.
Let's start by looking at a simple `torch.compile` example that demonstrates
how to use `torch.compile` for inference. The example uses the
`torch.cos()` and `torch.sin()` functions, which are examples of pointwise
operators: they operate element by element on a vector. This example might
not show significant performance gains but should help you form an intuitive
understanding of how you can use `torch.compile` in your own programs.
:::{note}
To run this script, you need to have at least one GPU on your machine.
If you do not have a GPU, you can remove the `.to(device="cuda:0")` code
in the snippet below and it will run on CPU. You can also set device to
`xpu:0` to run on Intel® GPUs.
:::
```python
import torch
def fn(x):
a = torch.cos(x)
b = torch.sin(a)
return b
new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
a = new_fn(input_tensor)
```
A better-known pointwise operator you might want to use would
be something like `torch.relu()`. Pointwise ops in eager mode are
suboptimal because each one needs to read a tensor from
memory, make some changes, and then write those changes back. The single
most important optimization that inductor performs is fusion. In the
example above we can turn 2 reads (`x`, `a`) and
2 writes (`a`, `b`) into 1 read (`x`) and 1 write (`b`), which
is crucial especially for newer GPUs where the bottleneck is memory
bandwidth (how quickly you can send data to a GPU) rather than compute
(how quickly your GPU can crunch floating point operations).
Another major optimization that inductor provides is automatic
support for CUDA graphs.
CUDA graphs help eliminate the overhead from launching individual
kernels from a Python program which is especially relevant for newer GPUs.
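A minimal sketch of opting into CUDA graphs (assuming a CUDA device is available) is to compile with `mode="reduce-overhead"`:

```python
import torch

def fn(x):
    return torch.sin(x) + torch.cos(x)

# "reduce-overhead" tells the inductor backend to use CUDA graphs, which
# amortizes kernel-launch overhead for small, launch-bound workloads.
opt_fn = torch.compile(fn, mode="reduce-overhead")
opt_fn(torch.randn(10000, device="cuda:0"))
```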
TorchDynamo supports many different backends, but TorchInductor specifically works
by generating [Triton](https://github.com/openai/triton) kernels. Let's save
our example above into a file called `example.py`. We can inspect the generated
Triton kernel code by running `TORCH_COMPILE_DEBUG=1 python example.py`.
As the script executes, you should see `DEBUG` messages printed to the
terminal. Closer to the end of the log, you should see a path to a folder
that contains `torchinductor_<your_username>`. In that folder, you can find
the `output_code.py` file that contains the generated kernel code similar to
the following:
```python
@pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {'in_ptr0': '*fp32', 'out_ptr0': '*fp32', 'xnumel': 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 10000
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
tmp1 = tl.cos(tmp0)
tmp2 = tl.sin(tmp1)
tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
```
:::{note}
The above code snippet is an example. Depending on your hardware,
you might see different code generated.
:::
And you can verify that fusing the `cos` and `sin` did actually occur
because the `cos` and `sin` operations occur within a single Triton kernel
and the temporary variables are held in registers with very fast access.
Read more on Triton's performance
[here](https://openai.com/blog/triton/). Because the code is written
in Python, it's fairly easy to understand even if you have not written all that
many CUDA kernels.
Next, let's try a real model like resnet50 from the PyTorch
hub.
```python
import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(1,3,64,64))
```
Inductor is not the only available backend; you can run
`torch.compiler.list_backends()` in a REPL to see all the available backends. Try out
the `cudagraphs` backend next as inspiration.
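For example, a rough sketch of switching backends (the function and tensor size are just for illustration):

```python
import torch

print(torch.compiler.list_backends())  # e.g. ['cudagraphs', 'inductor', ...]

def fn(x):
    return torch.sin(x) + torch.cos(x)

# "cudagraphs" applies CUDA graphs on top of eager kernels rather than
# generating Triton code the way "inductor" does.
opt_fn = torch.compile(fn, backend="cudagraphs")
opt_fn(torch.randn(10000, device="cuda:0"))
```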
## Using a pretrained model
PyTorch users frequently leverage pretrained models from
[transformers](https://github.com/huggingface/transformers) or
[TIMM](https://github.com/rwightman/pytorch-image-models), and one of
the design goals of TorchDynamo and TorchInductor is to work out of the box with
any model that people would like to author.
Let's download a pretrained model directly from the HuggingFace hub and optimize
it:
```python
import torch
from transformers import BertTokenizer, BertModel
# Copy pasted from here https://huggingface.co/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)
```
If you remove the `to(device="cuda:0")` from the model and
`encoded_input`, then TorchInductor will generate C++ kernels optimized for
running on your CPU instead of Triton kernels. You can inspect either the Triton
or C++ kernels for BERT. They are more complex than the trigonometry
example we tried above, but you can similarly skim through them and see if you
understand how PyTorch works.
Similarly, let's try out a TIMM example:
```python
import timm
import torch
model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64,3,7,7))
```
## Next Steps
In this section, we have reviewed a few inference examples and developed a
basic understanding of how torch.compile works. Here is what to check out next:
- [torch.compile tutorial on training](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html)
- {ref}`torch.compiler_api`
- {ref}`torchdynamo_fine_grain_tracing`
@ -1,148 +0,0 @@
.. _torch.compiler_get_started:
Getting Started
===============
Before you read this section, make sure to read the :ref:`torch.compiler_overview`.
Let's start by looking at a simple ``torch.compile`` example that demonstrates
how to use ``torch.compile`` for inference. This example demonstrates the
``torch.cos()`` and ``torch.sin()`` features which are examples of pointwise
operators as they operate element by element on a vector. This example might
not show significant performance gains but should help you form an intuitive
understanding of how you can use ``torch.compile`` in your own programs.
.. note::
To run this script, you need to have at least one GPU on your machine.
If you do not have a GPU, you can remove the ``.to(device="cuda:0")`` code
in the snippet below and it will run on CPU. You can also set device to
``xpu:0`` to run on Intel® GPUs.
.. code:: python
import torch
def fn(x):
a = torch.cos(x)
b = torch.sin(a)
return b
new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
a = new_fn(input_tensor)
A more famous pointwise operator you might want to use would
be something like ``torch.relu()``. Pointwise ops in eager mode are
suboptimal because each one would need to read a tensor from the
memory, make some changes, and then write back those changes. The single
most important optimization that inductor performs is fusion. In the
example above we can turn 2 reads (``x``, ``a``) and
2 writes (``a``, ``b``) into 1 read (``x``) and 1 write (``b``), which
is crucial especially for newer GPUs where the bottleneck is memory
bandwidth (how quickly you can send data to a GPU) rather than compute
(how quickly your GPU can crunch floating point operations).
Another major optimization that inductor provides is automatic
support for CUDA graphs.
CUDA graphs help eliminate the overhead from launching individual
kernels from a Python program which is especially relevant for newer GPUs.
TorchDynamo supports many different backends, but TorchInductor specifically works
by generating `Triton <https://github.com/openai/triton>`__ kernels. Let's save
our example above into a file called ``example.py``. We can inspect the code
generated Triton kernels by running ``TORCH_COMPILE_DEBUG=1 python example.py``.
As the script executes, you should see ``DEBUG`` messages printed to the
terminal. Closer to the end of the log, you should see a path to a folder
that contains ``torchinductor_<your_username>``. In that folder, you can find
the ``output_code.py`` file that contains the generated kernel code similar to
the following:
.. code-block:: python
@pointwise(size_hints=[16384], filename=__file__, triton_meta={'signature': {'in_ptr0': '*fp32', 'out_ptr0': '*fp32', 'xnumel': 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 10000
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask, other=0.0)
tmp1 = tl.cos(tmp0)
tmp2 = tl.sin(tmp1)
tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
.. note:: The above code snippet is an example. Depending on your hardware,
you might see different code generated.
And you can verify that fusing the ``cos`` and ``sin`` did actually occur
because the ``cos`` and ``sin`` operations occur within a single Triton kernel
and the temporary variables are held in registers with very fast access.
Read more on Triton's performance
`here <https://openai.com/blog/triton/>`__. Because the code is written
in Python, it's fairly easy to understand even if you have not written all that
many CUDA kernels.
Next, let's try a real model like resnet50 from the PyTorch
hub.
.. code-block:: python
import torch
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(1,3,64,64))
And that is not the only available backend, you can run in a REPL
``torch.compiler.list_backends()`` to see all the available backends. Try out the
``cudagraphs`` next as inspiration.
Using a pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~
PyTorch users frequently leverage pretrained models from
`transformers <https://github.com/huggingface/transformers>`__ or
`TIMM <https://github.com/rwightman/pytorch-image-models>`__ and one of
the design goals is TorchDynamo and TorchInductor is to work out of the box with
any model that people would like to author.
Let's download a pretrained model directly from the HuggingFace hub and optimize
it:
.. code-block:: python
import torch
from transformers import BertTokenizer, BertModel
# Copy pasted from here https://huggingface.co/bert-base-uncased
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased").to(device="cuda:0")
model = torch.compile(model, backend="inductor") # This is the only line of code that we changed
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt').to(device="cuda:0")
output = model(**encoded_input)
If you remove the ``to(device="cuda:0")`` from the model and
``encoded_input``, then Triton will generate C++ kernels that will be
optimized for running on your CPU. You can inspect both Triton or C++
kernels for BERT. They are more complex than the trigonometry
example we tried above but you can similarly skim through it and see if you
understand how PyTorch works.
Similarly, let's try out a TIMM example:
.. code-block:: python
import timm
import torch
model = timm.create_model('resnext101_32x8d', pretrained=True, num_classes=2)
opt_model = torch.compile(model, backend="inductor")
opt_model(torch.randn(64,3,7,7))
Next Steps
~~~~~~~~~~
In this section, we have reviewed a few inference examples and developed a
basic understanding of how torch.compile works. Here is what you check out next:
- `torch.compile tutorial on training <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- :ref:`torch.compiler_api`
- :ref:`torchdynamo_fine_grain_tracing`