## Before
Previously, CA would always unpack all saved variables stored in the autograd graph before executing it. This meant that we couldn't capture unpack hooks as part of the CA graph, and they would fire out of order with respect to other backward hooks. For memory-saving APIs built on top of saved tensor hooks, like non-reentrant checkpointing and offloading, we couldn't achieve any savings because all activations would be recomputed/loaded and active at the same time, making them a no-op.
## After
We add unpack hooks into the CA graph so that they can be executed progressively. The Python hook and the hook input are themselves wrapped by non-traceable code, so CA polyfills the wrapping as:
```python
# pseudocode
class SavedVariable:
    def unpack(self):
        if self.hook:
            return self.hook(self.packed_data)
        return self.packed_data

# This approach won't directly work when we add support for Forward AD
# or double-backward.
```
When the CA graph is executed directly (without torch.compile-ing it) under checkpointing/offloading, the memory profile is expected to stay the same as with the eager autograd engine. If an AOT backward is in the autograd graph, the memory profile is expected to be better than with the eager autograd engine, since we can now delay unpacking saved activations until the AOT backward executes.
All tests pass when running the CA graph directly; the remaining issues are in Dynamo.
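For context, here is a minimal sketch (assumed usage, not taken from the PR) of the kind of offloading hooks whose unpack side can now fire progressively inside the CA graph:
```python
import torch

def pack_to_cpu(t):
    # offload the saved activation to CPU at pack time
    return t.cpu()

def unpack_from_cpu(t):
    # reload only when the corresponding backward node needs it
    return t.cuda() if torch.cuda.is_available() else t

x = torch.randn(4, 4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = x.exp().sum()  # exp saves its output through pack_to_cpu
y.backward()           # unpack_from_cpu fires during backward
```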
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147242
Approved by: https://github.com/jansel
Fixes #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op default saved tensor hooks in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're currently firing pack hooks once and unpack hooks twice; I'll look into this separately from this issue.
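As a rough illustration of the mocking described above (the names below are hypothetical Python stand-ins, not the actual C API):
```python
# Hypothetical sketch: push/pop keep maintaining the default-hooks stack
# as usual, but get_hooks returns nothing while CA has it mocked out, so
# callers behave as if no default hooks are set.
class SavedTensorDefaultHooks:
    _stack = []
    _mocked_out = False  # flipped by compiled autograd in this sketch

    @classmethod
    def push(cls, pack, unpack):
        cls._stack.append((pack, unpack))

    @classmethod
    def pop(cls):
        return cls._stack.pop()

    @classmethod
    def get_hooks(cls):
        if cls._mocked_out or not cls._stack:
            return None
        return cls._stack[-1]
```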
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
The rationale for this is that functorch doesn't currently work with saved
variable hooks or checkpointing, and we need some way to disable them.
Concretely:
- there's a context manager that does the disabling
- this feature is disabled on a thread-local basis
- one can set an error message or use the default error message that
says the feature has been disabled
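A minimal sketch of the usage, assuming the context manager is exposed as torch.autograd.graph.disable_saved_tensors_hooks (its name in current PyTorch):
```python
import torch

# Inside the disabled region, installing saved-tensor hooks raises a
# RuntimeError carrying the supplied message.
with torch.autograd.graph.disable_saved_tensors_hooks(
    "saved tensor hooks are disabled while a functorch transform is active"
):
    try:
        with torch.autograd.graph.saved_tensors_hooks(lambda t: t, lambda t: t):
            pass
    except RuntimeError as e:
        print(e)  # the custom error message set above
```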
Since it is thread-local, I needed to update ATen/ThreadLocalState. To
make things nicer, this PR refactors all the "saved tensors hooks"-related
TLS into a single struct.
Test Plan:
- new test
Differential Revision: [D39970936](https://our.internmc.facebook.com/intern/diff/D39970936)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85971
Approved by: https://github.com/albanD, https://github.com/soulitzer
The rationale for this is that functorch doesn't currently work with saved
variable hooks or checkpointing, and we need some way to disable them.
Concretely:
- there's a context manager that does the disabling
- this feature is disabled on a thread-local basis
- one can set an error message or use the default error message that
says the feature has been disabled
Since it is thread-local, I needed to update ATen/ThreadLocalState. To
make things nicer, this PR refactors all the "saved tensors hooks"-related
TLS into a single struct.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85553
Approved by: https://github.com/soulitzer
Summary:
When default hooks are set, they are pushed onto a stack. When nesting
context managers, only the innermost hooks are applied.
Special care is needed when updating the TLS code. See also https://github.com/pytorch/pytorch/issues/70940 (i.e., do we need to store the enabled flag as well?)
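A small sketch of the stack semantics, using the saved_tensors_hooks context manager built on top of this stack:
```python
import torch

def outer_pack(t):
    print("outer pack")
    return t

def inner_pack(t):
    print("inner pack")
    return t

def unpack(t):
    return t

x = torch.randn(2, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(outer_pack, unpack):
    with torch.autograd.graph.saved_tensors_hooks(inner_pack, unpack):
        y = x.exp()  # exp saves its output -> prints "inner pack"
    z = x.sin()      # sin saves its input  -> prints "outer pack"
```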
Fixes https://github.com/pytorch/pytorch/issues/70134
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70932
Reviewed By: mruberry
Differential Revision: D33530370
Pulled By: albanD
fbshipit-source-id: 3197d585d77563f36c175d3949115a0776b309f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62909
This PR makes saved tensors default hooks thread local.
This allows using default hooks in a multithreaded context.
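A minimal sketch of what this enables (using the context-manager form available in current PyTorch):
```python
import threading

import torch

def worker(tag):
    # each thread installs its own pack/unpack pair; thread-local state
    # means the pairs don't interfere across threads
    with torch.autograd.graph.saved_tensors_hooks(
        lambda t: (tag, t),        # pack: tag the saved tensor
        lambda packed: packed[1],  # unpack: drop the tag
    ):
        x = torch.randn(2, requires_grad=True)
        x.exp().sum().backward()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```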
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D30165416
Pulled By: Varal7
fbshipit-source-id: 10a7d580661d3d94bdaf398c4e076b7bea11c16b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61927
This is a refactor of `SavedVariable.cpp` to prevent ever defining the `data_` tensor if default hooks are set.
Before the refactor:
```c++
data_ = variable.tensor_data(); // this is wasteful if hooks are defined
register_hooks(Engine::get_default_engine().get_default_saved_variable_hooks());
```
After the refactor:
```c++
if (get_default_hooks_()) {
  save_metadata_(variable);
  register_hooks_(get_default_hooks_(), variable);
  return;
}
save_metadata_(variable);
data_ = variable.tensor_data(); // only needed if hooks are not defined
```
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D29848524
Pulled By: Varal7
fbshipit-source-id: abca1eee37a17b47841e28d8a576490913fce1ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62564
If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.
Relanding previous PR #61957
Test Plan: Imported from OSS
Reviewed By: albanD
Differential Revision: D30045406
Pulled By: Varal7
fbshipit-source-id: d04f74c99affbbf655e53cfc2acd42f7c5b4e6eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62563
Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.
Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.
A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.
For instance, to perform something like #28997, one could define a pack function that saves the tensor to disk whenever it is too big and returns a filename; unpack then simply reads the file back and returns the tensor, e.g.:
```python
import os
import uuid

import torch

def pack(x):
    # save the tensor to disk under a unique name
    # (tmp_dir is assumed to be an existing scratch directory)
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
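Installing the hooks with the functions named above would then look like this (a sketch; later PyTorch versions wrap this pair in the torch.autograd.graph.saved_tensors_hooks context manager):
```python
import torch

torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack)
try:
    x = torch.randn(5, requires_grad=True)
    y = x.exp().sum()  # saved activations go through pack (to disk)
    y.backward()       # unpack reloads them from disk as needed
finally:
    torch.autograd.graph.reset_saved_tensors_default_hooks()
```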
Relanding previous PR: https://github.com/pytorch/pytorch/pull/61834
Original PR led to timeout error in: https://www.internalfb.com/mast/job/yuguo-release_canary_offline_training-inlinecvrp_a-canary_offline_train_28a7ecfc
Now passing: https://www.internalfb.com/mast/job/quach-release_canary_offline_training-inlinecvrp_a-canary_offline_train_9bb57e98
The difference in the new version is that we no longer need to acquire the GIL when calling `PyDefaultSavedVariableHooks::get_hooks`.
Test Plan: Imported from OSS
Reviewed By: iramazanli
Differential Revision: D30045405
Pulled By: Varal7
fbshipit-source-id: 7f6c07af3a56fe8835d5edcc815c15ea4fb4e332
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61957
If the user runs code that registers default saved tensor hooks from
multiple threads, it will fail with a nice error message most of the
time. This commit handles the very rare case where a race condition
would have made it fail silently.
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D29848525
Pulled By: Varal7
fbshipit-source-id: eb9bdcfbeed857a988834651246390ea14eedd33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61834
Expose a pair of functions to Python users: torch.autograd.graph.set_saved_tensors_default_hooks(pack, unpack) and torch.autograd.graph.reset_saved_tensors_default_hooks().
These functions control the hooks applied to saved tensors: all tensors saved in that context will be packed using the pack function, then unpacked accordingly when needed.
Currently, this works by simply calling register_hooks (cf #60975) directly at the end of the constructor of a SavedVariable. This could be optimized further by not performing the copy before registering default hooks, but this would require a small refactor. Edit: the refactor is done in #61927.
A current limitation is that if users create tensors in this context, they will not be able to register additional hooks on the saved tensor.
For instance, to perform something like #28997, one could define a pack function that saves the tensor to disk whenever it is too big and returns a filename; unpack then simply reads the file back and returns the tensor, e.g.:
```python
import os
import uuid

import torch

def pack(x):
    # save the tensor to disk under a unique name
    # (tmp_dir is assumed to be an existing scratch directory)
    name = os.path.join(tmp_dir, str(uuid.uuid4()))
    torch.save(x, name)
    return name

def unpack(name):
    return torch.load(name)
```
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D29792193
Pulled By: Varal7
fbshipit-source-id: 33e931230ef59faa3ec8b5d11ef7c05539bce77c