Marks all params/optimizer state as static addresses, and adds a finalizer which cleans up the graph attributes when the optimizer goes out of scope.
**Note: this does not mark grads as static, because doing so would increase memory usage significantly.**
There are two cases:
1. The upstream graph is cudagraphed - this case works fine out of the box.
2. The upstream graph is not cudagraphed - in this case, a lot of copies are introduced from the upstream graph (to copy the grads) into cudagraph-owned memory, unless the user explicitly marks the grads as static. Doing so also requires not deallocating the grads in zero_grad() (either the module or optimizer version) by setting them to zero instead of None; see the sketch below. There is a PR (https://github.com/pytorch/pytorch/pull/107853) in flight to throw an error if zero_grad attempts to set static grads to None.
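A minimal sketch of case 2, assuming the compiled-optimizer path described above: the optimizer step is compiled with the cudagraph ("reduce-overhead") backend, grads are explicitly marked static, and they are kept allocated by zeroing rather than setting to None. The exact marking API (`torch._dynamo.mark_static_address`) and call sites are assumptions for illustration, not the definitive usage.

```python
# Illustrative sketch only: compiled optimizer step under cudagraphs, with
# grads explicitly kept at static addresses (case 2, non-cudagraphed upstream).
import torch

model = torch.nn.Linear(1024, 1024, device="cuda")
opt = torch.optim.Adam(model.parameters())

@torch.compile(mode="reduce-overhead")  # cudagraph-backed optimizer step
def opt_step():
    opt.step()

for step in range(5):
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    if step == 0:
        # Assumed API: explicitly mark grads as static so the eager backward
        # writes into stable, cudagraph-visible addresses instead of forcing
        # copies into cudagraph-owned memory.
        for p in model.parameters():
            torch._dynamo.mark_static_address(p.grad)
    opt_step()
    # Keep the static grads allocated: zero them rather than freeing them.
    opt.zero_grad(set_to_none=False)
```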
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107504
Approved by: https://github.com/eellison
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71620
Remove `from_functional_optim` and make it the default constructor, since
that is now the only way `_OptimizerHookState` is built. Also, we no longer
need to expose the `create_functional_optim` helper function.
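For reference, a rough before/after sketch of the constructor change; the field names approximate `_OptimizerHookState` but are illustrative, not the exact class.

```python
# Illustrative sketch of the refactor: the hook state is now built directly
# from an already-constructed functional optimizer, rather than through a
# from_functional_optim classmethod.

# Before:
#   hook_state = _OptimizerHookState.from_functional_optim(functional_optim)

class _OptimizerHookState:
    __slots__ = ["functional_optimizer", "params_to_optimize"]

    def __init__(self, functional_optim, params=None):
        self.functional_optimizer = functional_optim
        if params is not None:
            self.params_to_optimize = set(params)

# After:
#   hook_state = _OptimizerHookState(functional_optim)
```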
ghstack-source-id: 147577174
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D33700593
fbshipit-source-id: ba089ce3bf66ccf8f71cffdd0f4d4bddc03e8b14
(cherry picked from commit a50b2caf0e19f9793fbf18b371d30e3dd8c5c0cf)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63462
Now that `torch.distributed.optim` gates DistributedOptimizer on RPC availability, these tests can be run on Windows.
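A minimal sketch of the gating pattern this relies on; the exact guard inside `torch.distributed.optim` may differ.

```python
# Sketch of the RPC-availability gate: DistributedOptimizer depends on
# torch.distributed.rpc, so it is only exposed when RPC is built, letting the
# rest of torch.distributed.optim (and these tests) run on Windows.
import torch.distributed.rpc as rpc

if rpc.is_available():
    from torch.distributed.optim import DistributedOptimizer
```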
ghstack-source-id: 136437635
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358923
fbshipit-source-id: 36739bdfe7214789f17de652d30c62c2bc124c73
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63383
Per title
ghstack-source-id: 135966157
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30358921
fbshipit-source-id: 965e054e525194b1ee55980340df275bab355c9b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63382
Per title
ghstack-source-id: 135966156
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D30255446
fbshipit-source-id: e6ffbf339db0bc5b4702d02b74a462309df07c75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62611
Enables optimizer overlap with backward in DDP for Adam. Additional optimizers, especially Adagrad, will be done in follow-up diffs.
1. Implement a `step_param` method based on `step` in `_FunctionalAdam` (perf permitting, we can later dedupe `step` to call `step_param`); see the sketch after this list.
2. Modify tests to cover all current functional optimizers.
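A hedged sketch of what a per-parameter/gradient `step_param` looks like; this is not the actual `_FunctionalAdam` code (amsgrad, maximize, and other options are omitted). The point is that the update is applied to a single parameter as soon as its allreduced gradient is ready, which is what lets DDP overlap the optimizer with backward.

```python
# Toy per-parameter Adam step (illustrative only, not _FunctionalAdam).
import torch


class ToyFunctionalAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.state = {}

    @torch.no_grad()
    def step_param(self, param: torch.Tensor, grad: torch.Tensor) -> None:
        # Update a single parameter as its gradient becomes available.
        st = self.state.setdefault(
            param,
            {"step": 0,
             "exp_avg": torch.zeros_like(param),
             "exp_avg_sq": torch.zeros_like(param)},
        )
        beta1, beta2 = self.betas
        st["step"] += 1
        st["exp_avg"].mul_(beta1).add_(grad, alpha=1 - beta1)
        st["exp_avg_sq"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
        bias_c1 = 1 - beta1 ** st["step"]
        bias_c2 = 1 - beta2 ** st["step"]
        denom = (st["exp_avg_sq"] / bias_c2).sqrt().add_(self.eps)
        param.addcdiv_(st["exp_avg"] / bias_c1, denom, value=-self.lr)
```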
ghstack-source-id: 135207143
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29891783
fbshipit-source-id: 321915982afd5cb0a9c2e43d27550f433bff00d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62078
Ensure that kwargs such as momentum and weight decay maintain
parity between `optimizer.step` and `step_param`.
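A hedged sketch of the kind of parity check this adds; the factory arguments stand in for the real `torch.optim` and functional optimizer constructors, whose exact signatures are not assumed here.

```python
# Illustrative parity check: one regular optimizer step vs. one step_param
# call with identical kwargs (momentum, weight_decay, ...), then compare.
import torch


def assert_step_param_parity(make_optim, make_functional_optim, **kwargs):
    p_ref = torch.randn(10, requires_grad=True)
    p_fun = p_ref.detach().clone().requires_grad_(True)
    grad = torch.randn(10)
    p_ref.grad, p_fun.grad = grad.clone(), grad.clone()

    make_optim([p_ref], **kwargs).step()
    make_functional_optim([p_fun], **kwargs).step_param(p_fun, p_fun.grad)

    torch.testing.assert_close(p_ref, p_fun)

# e.g. assert_step_param_parity(torch.optim.SGD, _FunctionalSGD,
#                               lr=0.1, momentum=0.9, weight_decay=1e-4)
```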
ghstack-source-id: 134330377
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29837942
fbshipit-source-id: 1ae39648fc26aebd8aaef1a7ac0e03b598a8ed60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61756
DDP will support running the optimizer as a communication hook with
optimizers that implement a per-parameter/gradient step function, `step_param`.
Add parity tests against regular optimizers as we implement more optimizers
that support `step_param`.
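For illustration, a hedged sketch of how such a hook could look. This is not the hook added in this stack, and the `GradBucket.parameters()`/`gradients()` pairing is an assumed API shape.

```python
# Illustrative DDP comm hook: allreduce the bucket, then run the per-parameter
# optimizer (step_param) on each parameter in the bucket so the optimizer
# overlaps with the remainder of backward.
import torch.distributed as dist


def allreduce_then_step_param(functional_optim, bucket):
    fut = dist.all_reduce(bucket.buffer(), async_op=True).get_future()

    def then_step(fut):
        buffer = fut.value()[0].div_(dist.get_world_size())
        for param, grad in zip(bucket.parameters(), bucket.gradients()):
            functional_optim.step_param(param, grad)
        return buffer

    return fut.then(then_step)

# ddp_model.register_comm_hook(state=functional_adam,
#                              hook=allreduce_then_step_param)
```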
ghstack-source-id: 134330378
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29727549
fbshipit-source-id: 18977c896f12b8e478298488b298fd107affcf5f