`grad_dtype` is a new attribute on Tensor to control gradient dtype:
- Access/setting is leaf-only.
- grad_dtype is respected when (1) when assigning to .grad, and (2) in the engine after the previous node produces incoming gradients for AccumulateGrad. (See table below for details)
- Not setting grad_dtype preserves the current behavior. Accessing it returns `t.dtype`
- `grad_dtype` cannot be set when there is already a `.grad` present and the dtypes conflict.
| `grad_dtype` setting | Setting `.grad` manually | Incoming gradient from autograd engine |
|-----------------------|--------------------------|-----------------------------------------|
| **Default (tensor’s dtype)** | `.grad` must match tensor’s dtype | Engine casts incoming grad to tensor’s dtype |
| **Set to specific dtype** | `.grad` must match that dtype | Engine casts incoming grad to the specified dtype |
| **Set to `None`** | `.grad` may be any dtype | Engine does not cast; accepts incoming grad dtype as-is |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162815
Approved by: https://github.com/albanD
Today, we always create and record an events in two places:
1) Upon seeing the first producer, we record an event on the producer, and we wait for this event in two places: (1) when the engine goes to run the consumer, the consumer stream waits for this event. (2) prior to doing accumulation, the accumulation stream waits for this event.
2) After doing accumulation, we record an event on the accumulation stream and wait for this event in a single place: when the engine goes to run the consumer.
We do not actually need to record the event in the cases where the 1st producer stream is the same as the consumer and as the accumulation stream, and where the accumulation stream is the same as the consumer stream.
Removing this unnecessary create + record event should save a few us for each instance avoided.
Fixes https://github.com/pytorch/pytorch/issues/157407
----
Manual test plan:
- [x] @eqy to confirm perf is restored
- [x] Running the repro originally reported before/after the patch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157503
Approved by: https://github.com/eqy
ghstack dependencies: #155715
Example:
```python
File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
You can disable compiled autograd for this operation by:
1. Relocating the unsupported autograd call outside the compiled region.
2. Wrapping the unsupported autograd call within a scope that disables compiled autograd.
3. Configuring the specific compilation unit to disable compiled autograd.
4. Globally disabling compiled autograd at the application's initialization.
```
No duplicate error messages for python side trace-time errors
```python
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
This word appears often in class descriptions and is not consistently spelled. Update comments and some function names to use the correct spelling consistently. Facilitates searching the codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155944
Approved by: https://github.com/Skylion007
Adds create_graph support if you don't compile or compile only with torch.compile(backend="eager").
Using a backend that uses AOTDispatch produces a post-dispatch AOT backward, where its double backward will be silently incorrect if the forward trace involved any ops that are not composite implicit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153222
Approved by: https://github.com/jansel
ghstack dependencies: #153193
Enables clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This new check was introduced in Clang-Tidy 18 and is available due to recent update of Clang-Tidy 19.
The check marks functions and variables used only in the translation unit as static. Therefore undesired symbols are not leaked into other units, more link time optimisations are possible and the resulting binaries may be smaller.
The detected violations were mostly fixed by using static. In other cases, the symbols were indeed consumed by others files, then their declaring headers were included. Still some declarations were wrong and have been fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
## Before
Previously, CA will always unpack all saved variables stored in the autograd graph before executing it. This meant that we can't capture unpack hooks as part of the CA graph, and they would fire out of order wrt to other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
All tests pass when running the CA graph directly, the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
## Before
Previously, CA will always unpack all saved variables stored in the autograd graph before executing it. This meant that we can't capture unpack hooks as part of the CA graph, and they would fire out of order wrt to other backward hooks. For memory saving APIs built on top of saved tensor hooks like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, resulting in no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The python hook and hook input themselves are wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
def unpack(self):
if self.hook:
return self.hook(self.packed_data)
else:
return self.packed_data
# This approach won't directly work when we add support for Forward AD or double-backward.
```
Directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, memory profile is expected to stay the same as when using the eager autograd engine. If AOT backward is in the autograd graph, memory profile is expected to be better than the eager autograd engine, since we can now delay saved activations unpacking into the AOT backward's execution.
All tests pass when running the CA graph directly, the remaining issues are in Dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.
This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.
We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.
Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to be
call to Python via libtorch_python. There is an indirection
(PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
our lives easier, every codegen'ed apply_with_saved passes a C++
function with the same signature
`(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
IValuePacker struct and codegen functional variants of each builtin
C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).
MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de
MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
Static variables in C++11 is guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration)
```
If multiple threads attempt to initialize the same static local variable concurrently,
the initialization occurs exactly once
(similar behavior can be obtained for arbitrary functions with std::call_once.
Usual implementations of this feature use variants
of the double-checked locking pattern,
which reduces runtime overhead for already-initialized local statics
to a single non-atomic boolean comparison.
```
Given that static c10::once_flag is used before, why not just use the associated function to initialised the related static variables? That is the motivation behind this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255
Approved by: https://github.com/albanD
GraphTask holds metadata needed for a single execution of backward(), it is 1:1 with backward calls, at least for compiled autograd. It is used for certain torch._C global autograd state APIs.
In SAC, we use torch._C._current_graph_task_id() as a dict key to store information during unpack hook execution: a5fb07af27/torch/utils/checkpoint.py (L1128)
If we don't set an active task, it will randomize the key, and will do its logic as if each unpacked tensor was from a different graph task
a5fb07af27/torch/utils/checkpoint.py (L1112-L1115)
The sketchy part of this PR is that in eager autograd, GraphTask is mutated during execution. But inspecting the struct, the mutation seems to only be used to communicate between autograd threads (created when multiple devices are involved) or for deprecated uses. We shouldn't run into the mutation case at all in compiled autograd. Also, only the graph task id is accessible from python hooks.
FIXES https://github.com/pytorch/pytorch/issues/142862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143108
Approved by: https://github.com/jansel, https://github.com/albanD
Thanks @awgu for raising this issue and the small repro
From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD