Commit Graph

23 Commits

0a2da008f8 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making the hooks a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The Python hook and the hook input are themselves wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking saved activations until the AOT backward executes.
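
A minimal sketch of the kind of workload this targets (assuming the internal `torch._dynamo.compiled_autograd.enable(...)` context manager used by the compiled autograd tests; illustrative only, not a stable API):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # Activations in the checkpointed region are recomputed during backward
    # via saved-tensor unpack hooks instead of being kept alive.
    return torch.nn.functional.gelu(x @ x)

x = torch.randn(256, 256, requires_grad=True)
out = checkpoint(block, x, use_reentrant=False).sum()

# With this change, the unpack hooks are traced into the CA graph and fire
# progressively during backward rather than all up front.
with torch._dynamo.compiled_autograd.enable(torch.compile):
    out.backward()
```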

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-26 16:37:17 +00:00
90e3a3d86d Revert "[ca] trace saved variable unpacking (#147242)"
This reverts commit 68ddca94498fd7961cc5ebcb0dffafb8c2f4baca.

Reverted https://github.com/pytorch/pytorch/pull/147242 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147242#issuecomment-2683604547))
2025-02-26 00:40:16 +00:00
68ddca9449 [ca] trace saved variable unpacking (#147242)
## Before

Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making the hooks a no-op.

## After

We add unpack hooks into the CA graph so that they can be executed progressively. The Python hook and the hook input are themselves wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
  def unpack(self):
    if self.hook:
      return self.hook(self.packed_data)
    else:
      return self.packed_data

# This approach won't directly work when we add support for Forward AD or double-backward.
```

When directly executing the CA graph (without torch.compiling it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking saved activations until the AOT backward executes.

All tests pass when running the CA graph directly; the remaining issues are in Dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
2025-02-25 20:38:51 +00:00
82b6480b0a Update SavedTensorHooks TLS stack to use SafePyObject (#131700)
Previously, we had to manually manage refcounting when updating the TLS saved variable stack. With this PR, refcounting is handled automatically by the SafePyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131700
Approved by: https://github.com/albanD
2024-08-02 16:27:16 +00:00
4b96575a09 [dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196)
Fixes #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op default saved tensor hooks in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
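
For reference, a minimal sketch of the user-facing API in question, `torch.autograd.graph.disable_saved_tensors_hooks` (the error text and variable names are illustrative):

```python
import torch

x = torch.randn(4, requires_grad=True)

with torch.autograd.graph.disable_saved_tensors_hooks("saved tensor hooks are disabled here"):
    y = (x * x).sum()   # ordinary autograd still works inside the context
    try:
        # Installing hooks while the feature is disabled raises with the message above.
        with torch.autograd.graph.saved_tensors_hooks(lambda t: t, lambda t: t):
            pass
    except RuntimeError as err:
        print(err)

y.backward()
```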

For compiled autograd, we're currently firing pack hooks once and unpack hooks twice; I'll look into this separately from this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
2024-06-14 20:28:08 +00:00
cyy 75b954b715 [4/N] Enable clang-tidy in torch/csrc/autograd (#109455)
The PR enables clang-tidy checks in torch/csrc/autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109455
Approved by: https://github.com/Skylion007
2023-09-17 17:11:50 +00:00
cyy 51d2d825ab [3/N] apply clang-tidy in torch/csrc/autograd (#109368)
This PR applies clang-tidy fixes in torch/csrc/autograd/FunctionsManual.cpp. There are also other fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109368
Approved by: https://github.com/Skylion007
2023-09-17 07:26:59 +00:00
93d7d546ff Fix saved tensor hooks to propagate errors back to Python as-is (#94456)
Mitigates the effect of https://github.com/pytorch/pytorch/issues/34172 for saved tensor hooks

BC Breaking message:
- Exceptions raised inside the pack and unpack hooks are no longer erroneously converted to RuntimeErrors. You should update your code to handle the original type of exception raised.
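
A hedged illustration of the new behavior (the exception class and messages are hypothetical):

```python
import torch

class UnpackFailed(ValueError):
    pass

def pack(t):
    return t

def unpack(t):
    raise UnpackFailed("could not restore saved tensor")

x = torch.randn(2, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = x.exp().sum()   # exp saves its output through the pack hook

try:
    y.backward()        # the unpack hook runs during backward and raises
except UnpackFailed as err:
    # The original exception type now propagates instead of being
    # converted to a RuntimeError.
    print(type(err).__name__, err)
```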

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94456
Approved by: https://github.com/albanD
2023-02-09 23:52:06 +00:00
7c72bc48d8 Add mechanism to disable the "saved tensors hooks" feature (#85971)
The rationale for this is that functorch doesn't currently work with saved
variable hooks or checkpointing, and we need some way to disable the feature.

Concretely:
- there's a context manager that does the disabling
- this feature is disabled on a thread-local basis
- one can set an error message or use the default error message that
says the feature has been disabled

Since it is thread-local, I needed to update ATen/ThreadLocalState. To
make things nicer, this PR refactors all the "saved tensors hooks"-related
TLS state into a single struct.

Test Plan:
- new test

Differential Revision: [D39970936](https://our.internmc.facebook.com/intern/diff/D39970936)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85971
Approved by: https://github.com/albanD, https://github.com/soulitzer
2022-09-30 20:03:58 +00:00
801818f9e6 Revert "Add mechanism to disable the "saved tensors hooks" feature (#85553)"
This reverts commit 5aa183d2bc7372b4deb4e4b2f31017be9f13264c.

Reverted https://github.com/pytorch/pytorch/pull/85553 on behalf of https://github.com/atalman due to a failure in build-fisp-diff-linux_platform010-opt
2022-09-30 14:31:09 +00:00
5aa183d2bc Add mechanism to disable the "saved tensors hooks" feature (#85553)
The rationale for this is that functorch doesn't currently work with saved
variable hooks or checkpointing, and we need some way to disable the feature.

Concretely:
- there's a context manager that does the disabling
- this feature is disabled on a thread-local basis
- one can set an error message or use the default error message that
says the feature has been disabled

Since it is thread-local, I needed to update ATen/ThreadLocalState. To
make things nicer, this PR refactors all the "saved tensors hooks"-related
TLS state into a single struct.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85553
Approved by: https://github.com/soulitzer
2022-09-28 22:49:28 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
a3b7dd7b78 Enable nested default hooks (#70932)
Summary:
When default hooks are set, they are pushed onto a stack.
When nesting context managers, only the innermost hooks are applied.

Special care is needed when updating the TLS code. See also https://github.com/pytorch/pytorch/issues/70940 (i.e., do we need to store the enabled flag as well?)
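
A small sketch of the nesting semantics using the `torch.autograd.graph.saved_tensors_hooks` context manager (hook bodies are illustrative):

```python
import torch

def make_hooks(tag):
    def pack(t):
        print(f"{tag} pack")
        return t
    def unpack(t):
        print(f"{tag} unpack")
        return t
    return pack, unpack

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(*make_hooks("outer")):
    with torch.autograd.graph.saved_tensors_hooks(*make_hooks("inner")):
        y = x.sin().sum()   # x is saved here, so only "inner pack" fires

y.backward()                # and only "inner unpack" fires on the way back
```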

Fixes https://github.com/pytorch/pytorch/issues/70134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70932

Reviewed By: mruberry

Differential Revision: D33530370

Pulled By: albanD

fbshipit-source-id: 3197d585d77563f36c175d3949115a0776b309f4
2022-01-11 15:03:49 -08:00
5abeac3ef7 Make saved tensors default hooks thread local (#62909)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62909

This PR makes saved tensors default hooks thread local.
This allows using default hooks in a multithreaded context.
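
A brief sketch of what thread-local default hooks allow, written with today's context-manager API (hook bodies are illustrative):

```python
import threading

import torch

def worker(tag):
    # Each thread installs its own pack/unpack pair; because the hooks are
    # thread-local, concurrent threads do not interfere with one another.
    with torch.autograd.graph.saved_tensors_hooks(
        lambda t: (tag, t), lambda packed: packed[1]
    ):
        x = torch.randn(4, requires_grad=True)
        x.exp().sum().backward()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```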

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30165416

Pulled By: Varal7

fbshipit-source-id: 10a7d580661d3d94bdaf398c4e076b7bea11c16b
2021-08-13 07:49:20 -07:00
3bda4ea842 Avoid unnecessary copying data in Saved Variable (#61927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61927

This is a refactor of `SavedVariable.cpp` to prevent ever defining the `data_` tensor if default hooks are set.

Before the refactor:

```c++
data_ = variable.tensor_data(); // this is wasteful if hooks are defined
register_hooks(Engine::get_default_engine().get_default_saved_variable_hooks());
```

After the refactor:
```c++
if (get_default_hooks_()) {
  save_metadata_(variable);
  register_hooks_(get_default_hooks_(), variable);
  return;
}
save_metadata_(variable);
data_ = variable.tensor_data(); // only needed if hooks are not defined
```

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29848524

Pulled By: Varal7

fbshipit-source-id: abca1eee37a17b47841e28d8a576490913fce1ce
2021-08-03 07:09:47 -07:00
525fa2f0b6 [reland] Catch saved tensors default hooks race condition (#62564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62564

If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.

Relanding previous PR #61957

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D30045406

Pulled By: Varal7

fbshipit-source-id: d04f74c99affbbf655e53cfc2acd42f7c5b4e6eb
2021-08-02 18:00:37 -07:00
b161ac541d [reland] Add default Saved Variable hooks (#62563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62563

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename; unpack then simply reads the content of the file and returns a tensor, e.g.:

```python
import os, tempfile, uuid
import torch

tmp_dir = tempfile.mkdtemp()

def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
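
A hypothetical end-to-end usage with the functions named above (the registration API introduced in this PR; later PyTorch versions expose a context manager instead):

```python
# Reuses the pack/unpack functions defined in the snippet above.
torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
out = torch.randn(4, requires_grad=True).exp().sum()  # saved tensors are written to disk
torch.autograd.graph.reset_saved_tensors_default_hooks()
out.backward()                                        # unpack reads them back when needed
```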

Relanding previous PR: https://github.com/pytorch/pytorch/pull/61834

Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc

Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98

The difference in the new version is that we don't need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30045405

Pulled By: Varal7

fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
2021-08-02 11:30:26 -07:00
5c47038d12 Back out D29792193 "Add default Saved Variable hooks" (#62415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62415

test error

Differential Revision: D29990361

fbshipit-source-id: 99c87dec6c5be6496c9db5c9205c3cb72a953dd9
2021-07-29 16:31:00 -07:00
dcfcefcd0b Back out D29848525 "Catch saved tensors default hooks race condition" (#62414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62414

test error

Differential Revision: D29990348

fbshipit-source-id: 1a7c668153ad7ad9e847dd1a74db678e787b6b0e
2021-07-29 16:29:46 -07:00
200b6ccdc0 Catch saved tensors default hooks race condition (#61957)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61957

If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29848525

Pulled By: Varal7

fbshipit-source-id: eb9bdcfbeed857a988834651246390ea14eedd33
2021-07-26 09:48:47 -07:00
be17d6eadf Add default Saved Variable hooks (#61834)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61834

Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.

Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.

A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.

For instance, to perform something like #28997, one could define a pack function that saves to disk whenever the tensor size is too big and returns a filename; unpack then simply reads the content of the file and returns a tensor, e.g.:

```python
import os, tempfile, uuid
import torch

tmp_dir = tempfile.mkdtemp()

def pack(x):
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D29792193

Pulled By: Varal7

fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c
2021-07-26 08:14:32 -07:00
ff82394fc0 Apply saved tensor hooks (#60975)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60975

Fixes #58512

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29466227

fbshipit-source-id: c1498d52173aceb29638b5c4f521ac05356a5958
2021-07-18 08:42:51 -07:00
ee5a97de11 Register Saved Tensors hooks (#60663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60663

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29466223

fbshipit-source-id: 65dc3a935c18a0e6b93a37e24543c696e6ae0321
2021-07-15 08:09:55 -07:00