84 Commits

Author SHA1 Message Date
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
3721fa5612 [BE] Enable ruff's UP rules and autoformat optim/ (#105426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105426
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi, https://github.com/janeyx99
2023-07-18 21:07:43 +00:00
6878d3a157 [foreach][RAdam] Minimize use of intermediates to decrease peak memory (#104904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104904
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-11 17:08:07 +00:00
231364fd06 [optim] use lerp whenever possible (#104796)
This is a better copy (with fixes) of #104781.

Test plan:
CI will pass once https://github.com/pytorch/pytorch/pull/104784 is landed

Internal CI (and the newly enabled compiled optim tests) will pass after https://github.com/pytorch/pytorch/pull/104866 is landed.
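
As a rough illustration of why lerp can stand in for the usual moving-average update: `avg * beta + grad * (1 - beta)` is algebraically the same as interpolating from `avg` toward `grad` with weight `1 - beta`. The sketch below uses plain Python floats; the function names are hypothetical and it is not the code from this PR.

```python
import math


def ema_mul_add(avg: float, grad: float, beta: float) -> float:
    # Classic form: scale the running average, then add the weighted gradient.
    return avg * beta + grad * (1.0 - beta)


def lerp(start: float, end: float, weight: float) -> float:
    # Linear interpolation: start + weight * (end - start).
    return start + weight * (end - start)


def ema_lerp(avg: float, grad: float, beta: float) -> float:
    # Same update expressed as one interpolation step toward the gradient.
    return lerp(avg, grad, 1.0 - beta)


# The two forms agree up to floating-point rounding.
same = math.isclose(ema_mul_add(0.5, 2.0, 0.9), ema_lerp(0.5, 2.0, 0.9))
```

On tensors, the single interpolation replaces a multiply followed by an add, which is presumably where the benefit comes from.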

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104796
Approved by: https://github.com/albanD
2023-07-11 14:32:59 +00:00
e7fe2a797c Revert "[optim] use lerp whenever possible (#104796)"
This reverts commit fbe2a7e50a940ba7a12b003241a2699f7a731afb.

Reverted https://github.com/pytorch/pytorch/pull/104796 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/104796#issuecomment-1628591105))
2023-07-10 09:36:41 +00:00
fbe2a7e50a [optim] use lerp whenever possible (#104796)
This is a better copy (with fixes) of #104781.

Test plan:
CI will pass once https://github.com/pytorch/pytorch/pull/104784 is landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104796
Approved by: https://github.com/albanD
2023-07-08 07:13:38 +00:00
6d2887cc06 Reland "Move tensor grouping to ATen" (#103912)
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verify by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If the above condition is not met, VC++ raises very cryptic compilation errors instead of printing the static_assert output; see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.

Also, using `std::hash` for string should result in a faster hash function.

(cherry picked from commit 74b7a6c75e698378882d30958908073407f97fb3)

### <samp>🤖 Generated by Copilot at 5914771</samp>

This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
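
The grouping idea can be sketched in plain Python as bucketing by a `(device, dtype)` key. This is a simplified illustration only: `FakeTensor` and `group_by_device_and_dtype` are hypothetical stand-ins, and the real `_group_tensors_by_device_and_dtype` operates on actual tensors and may return more than this sketch does.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FakeTensor:
    # Hypothetical stand-in carrying only the properties used for grouping.
    name: str
    device: str
    dtype: str


def group_by_device_and_dtype(
    tensors: List[FakeTensor],
) -> Dict[Tuple[str, str], List[FakeTensor]]:
    # Bucket tensors so each bucket is homogeneous in device and dtype.
    groups: Dict[Tuple[str, str], List[FakeTensor]] = defaultdict(list)
    for t in tensors:
        groups[(t.device, t.dtype)].append(t)
    return dict(groups)


params = [
    FakeTensor("w1", "cuda:0", "float32"),
    FakeTensor("w2", "cpu", "float32"),
    FakeTensor("w3", "cuda:0", "float32"),
    FakeTensor("w4", "cuda:0", "float16"),
]
grouped = group_by_device_and_dtype(params)
```

Each bucket can then be handed to a multi-tensor kernel in one call, since all of its members share a device and dtype.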
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99
2023-06-21 09:26:33 +00:00
0cb5bc3b04 Revert "Move tensor grouping to ATen (#100007)"
This reverts commit 74b7a6c75e698378882d30958908073407f97fb3.

Reverted https://github.com/pytorch/pytorch/pull/100007 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629727 ([comment](https://github.com/pytorch/pytorch/pull/100007#issuecomment-1587861598))
2023-06-12 18:30:33 +00:00
74b7a6c75e Move tensor grouping to ATen (#100007)
rel: #94344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100007
Approved by: https://github.com/janeyx99
2023-06-09 15:44:46 +00:00
4da88447ea Disable grouping by dtype and device if compiling (#102771)
Disable grouping if we are compiling, this happens during lowering
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102771
Approved by: https://github.com/janeyx99
2023-06-02 21:04:49 +00:00
75cb99e549 [optim] Widen the cases for defaulting to foreach (#95820)
Big OOP correction continued. Also added a test this time to verify the defaulting was as expected.

The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was state_steps, which are sometimes cpu tensors. Since foreach _can_ handle cpu tensors, this should not introduce breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
2023-03-02 04:15:33 +00:00
097679478e [optim] Set defaults to foreach, NOT fused (#95241)
Rolling back the default change for Adam and rectifying the docs to reflect that AdamW never defaulted to fused.

Since our fused implementations are relatively newer, let's give them a longer bake-in time before flipping the switch for every user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95241
Approved by: https://github.com/ngimel
2023-02-22 04:47:32 +00:00
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Some cases that change the semantics should be kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
4fc19e1a71 [optim][adam] use fastest impl whenever possible, add util (#93184)
With this change, the default is switched ONLY when the user sets neither foreach nor fused. In that case Adam cascades through the implementations, defaulting to fused, then foreach, then single-tensor.

To clarify:
* if the user puts True in foreach _only_, it will run the foreach implementation.
* if the user puts True in fused _only_, it will run the fused implementation.
* if the user puts True in foreach AND for fused, it will run the fused implementation.

And:
* if the user puts False in foreach _only_, it will run the single tensor implementation.
* if the user puts False in fused _only_, it will still run the single tensor implementation.
* if the user puts False in foreach AND for fused, it will run the single tensor implementation.

I also didn't trust myself that much with the helper function, so I ran some local asserts on _default_to_fused_or_foreach. The only point left to really test is the type(p) -- torch.Tensor but I think the distributed tests will catch that in CI.
```
cuda_only_fp_list = [
    torch.rand((1, 2), device="cuda", dtype=torch.float32),
    torch.rand((1, 2), device="cuda", dtype=torch.float64),
    torch.rand((1, 2), device="cuda", dtype=torch.float16),
    torch.rand((1, 2), device="cuda", dtype=torch.bfloat16),
]

cuda_only_int_list = [
    torch.randint(1024, (1, 2), device="cuda", dtype=torch.int64),
]

cpu_list = [
    torch.rand((1, 2), device="cpu", dtype=torch.float32),
    torch.rand((1, 2), device="cpu", dtype=torch.float64),
    torch.rand((1, 2), device="cpu", dtype=torch.float16),
]

none_list = [None]

# differentiable should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], True, False) == (False, False)

# cpu lists should always make it return false for both
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, True) == (False, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cpu_list], False, False) == (False, False)
assert _default_to_fused_or_foreach([cpu_list], False, False) == (False, False)

# has fused triggers correctly
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list], False, False) == (False, True)

# ints always goes to foreach
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list], False, False) == (False, True)

# Nones don't error
assert _default_to_fused_or_foreach([cuda_only_fp_list, none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([cuda_only_fp_list, cuda_only_int_list, none_list], False, True) == (False, True)
assert _default_to_fused_or_foreach([none_list], False, True) == (True, False)
assert _default_to_fused_or_foreach([none_list], False, False) == (False, True)
```
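
The precedence rules listed above can be summarized with a small sketch. `choose_impl` and its `default_fused` / `default_foreach` parameters are hypothetical names used for illustration; in the optimizer the defaults would come from `_default_to_fused_or_foreach`, and this is not the actual Adam code.

```python
from typing import Optional


def choose_impl(
    foreach: Optional[bool],
    fused: Optional[bool],
    default_fused: bool = False,
    default_foreach: bool = False,
) -> str:
    # Defaults apply only when the user set neither flag.
    if foreach is None and fused is None:
        fused, foreach = default_fused, default_foreach
    # An unset flag (None) counts as False from here on; fused wins over foreach.
    if fused:
        return "fused"
    if foreach:
        return "foreach"
    return "single_tensor"
```

For example, `choose_impl(foreach=True, fused=True)` picks the fused path, while `choose_impl(foreach=None, fused=False)` falls back to single-tensor because an explicit user choice disables the defaulting.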

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93184
Approved by: https://github.com/albanD
2023-01-30 19:58:55 +00:00
7277247a8c [optim][radam] default to foreach when CUDA + differentiable=False (#92726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92726
Approved by: https://github.com/albanD
2023-01-21 05:31:22 +00:00
de0375e79d [optim][foreach] Do NOT inplace modify gradients (#92706)
SGD and ASGD already had out-of-place grads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92706
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-01-21 00:12:28 +00:00
3ba5eae72a [optim][radam] fix eps discrepancy for foreach (#92551)
Will likely race with https://github.com/pytorch/pytorch/pull/92365

eps was not being used at all in the mta/foreach impl. There was also a discrepancy between the docs and the implementation: the implementation computed sqrt(x) + eps, while the docs showed sqrt(x + eps).

I've fixed the docs + extended the current multi_tensor test case to capture this issue.
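
To see why the placement of eps matters, here is a minimal numeric sketch in plain Python. The function names are hypothetical and `x` stands for the second-moment estimate; the two forms differ most when `x` is close to zero.

```python
import math


def denom_eps_outside(x: float, eps: float) -> float:
    # Form used by the implementation: sqrt(x) + eps.
    return math.sqrt(x) + eps


def denom_eps_inside(x: float, eps: float) -> float:
    # Form the docs previously showed: sqrt(x + eps).
    return math.sqrt(x + eps)


eps = 1e-8
outside_at_zero = denom_eps_outside(0.0, eps)  # equals eps
inside_at_zero = denom_eps_inside(0.0, eps)    # equals sqrt(eps), far larger
```

With `x = 0` the two denominators differ by several orders of magnitude, so the resulting step sizes would differ by the same factor.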

![image](https://user-images.githubusercontent.com/31798555/213300617-61cbb763-da2d-48e0-b3b6-0190594dd049.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92551
Approved by: https://github.com/albanD
2023-01-19 14:38:59 +00:00
fbafcecf8d [optim][radam] group tensors in foreach to maximize perf (#92365)
Also noticed that eps is not being used nor tested at all for the mta impl of RAdam.

Will fix in a followup PR before turning foreach to default!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92365
Approved by: https://github.com/albanD
2023-01-18 22:32:27 +00:00
0070c546b5 [BE][optim] abstract out docstrings, add differentiable docs (#92336)
1. Abstract out common docstrings --> I'm sure there are more, but let this be a first step.
2. Add differentiable docs to the optimizers that are actually differentiable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92336
Approved by: https://github.com/albanD
2023-01-18 15:09:28 +00:00
06326a7721 [optim] skip .item calls in all optimizers when compiling with dynamo (#88173)
@mlazos: skips `item()` calls if compiling with dynamo, by defining a helper function `_get_value` which either returns the result of `.item()` or the scalar cpu tensor if compiling with dynamo. This was done because removing `item()` calls significantly regresses eager perf. Additionally, `_dispatch_sqrt` calls the appropriate sqrt function (math.sqrt, or torch.sqrt).
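
The two helpers can be pictured with a pure-Python sketch. Everything here is an assumption for illustration: `ScalarBox` is a hypothetical stand-in for a scalar CPU tensor, and the `compiling` flag is passed explicitly instead of being detected, so this is not the code of `_get_value` or `_dispatch_sqrt`.

```python
import math
from typing import Union


class ScalarBox:
    """Hypothetical stand-in for a zero-dimensional CPU tensor."""

    def __init__(self, value: float) -> None:
        self.value = value

    def item(self) -> float:
        # Unwrap to a plain Python number.
        return self.value

    def sqrt(self) -> "ScalarBox":
        # Stay in the boxed world, like a tensor op would.
        return ScalarBox(math.sqrt(self.value))


def get_value(x: ScalarBox, compiling: bool) -> Union[ScalarBox, float]:
    # When compiling, keep the boxed scalar; otherwise unwrap it
    # so the eager path keeps working on plain numbers.
    return x if compiling else x.item()


def dispatch_sqrt(x: Union[ScalarBox, float]) -> Union[ScalarBox, float]:
    # Pick the sqrt that matches the operand type.
    if isinstance(x, ScalarBox):
        return x.sqrt()
    return math.sqrt(x)
```

The point of the split is that the eager path still works on plain Python numbers, while the compiled path never has to unwrap the scalar.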

Fixes https://github.com/pytorch/torchdynamo/issues/1083

This PR will no longer be needed once symint support is default.

This PR closes all remaining graph breaks in the optimizers (!!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88173
Approved by: https://github.com/albanD
2022-12-12 17:32:35 +00:00
c63afb283c Disable dynamo on optimizer lazy initialization (#89902)
Helps with https://github.com/pytorch/torchdynamo/issues/1803

Separate out the group initialization and disable dynamo on it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89902
Approved by: https://github.com/soumith, https://github.com/albanD
2022-12-02 01:15:11 +00:00
3d47c74cfe Update code style for optimizer code (#89862)
Separating out whitespace-only changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89862
Approved by: https://github.com/albanD, https://github.com/soumith
2022-11-30 00:53:05 +00:00
c3e85d879c Mention discrepancy between original impl and our impl of RAdam (#89575)
Fixes https://github.com/pytorch/pytorch/issues/88836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89575
Approved by: https://github.com/mruberry
2022-11-24 17:11:42 +00:00
1b43883fd6 Make AdamW, NAdam & RAdam differentiable (#86183)
Blocked by #86096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86183
Approved by: https://github.com/albanD
2022-10-17 04:32:08 +00:00
71d50f4f89 Change docstring type callable to Callable for consistency (#82487)
### Description

Across PyTorch's docstrings, both `callable` and `Callable` are used for variable types. `Callable` should be capitalized, as we are referring to the `Callable` type and not the Python `callable()` function.
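
A short sketch of the distinction: `typing.Callable` describes a type in an annotation, whereas the built-in `callable()` is a runtime check. The `step` function below is a hypothetical example and not an optimizer method from the codebase.

```python
from typing import Callable, Optional


def step(closure: Optional[Callable[[], float]] = None) -> Optional[float]:
    # `Callable[[], float]` above is the type; `callable(...)` below is the function.
    if closure is not None and not callable(closure):
        raise TypeError("closure must be callable")
    return closure() if closure is not None else None
```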

### Testing

There shouldn't be any testing required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82487
Approved by: https://github.com/albanD
2022-08-01 17:26:09 +00:00
bda04e9f5e Add __all__ for torch.optim and torch.nn.modules modules (#80237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80237
Approved by: https://github.com/albanD
2022-06-24 21:34:10 +00:00
de7219e8a7 Use generators with all/any in torch/optim (#78142)
Generator comprehensions with any/all are less verbose and potentially help to save memory/CPU : https://eklitzke.org/generator-comprehensions-and-using-any-and-all-in-python
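
A small sketch of the difference, using a hypothetical `checked` helper that records which elements were actually visited: with a generator expression, `any` stops at the first match, while a list comprehension evaluates every element first.

```python
from typing import Iterable, Iterator, List


def checked(values: Iterable[int], seen: List[int]) -> Iterator[int]:
    # Yield values one at a time, recording each one that gets consumed.
    for v in values:
        seen.append(v)
        yield v


values = [1, -2, 3, 4]

seen_gen: List[int] = []
found_gen = any(v < 0 for v in checked(values, seen_gen))

seen_list: List[int] = []
found_list = any([v < 0 for v in checked(values, seen_list)])
```

Both calls find the negative value, but `seen_gen` holds only the first two elements while `seen_list` holds all four.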

To make JIT work with this change, I added code to convert GeneratorExp to ListComp. So the whole PR is basically a no-op for JIT, but potentially a memory and speed improvement for eager mode.

Also I removed a test from test/jit/test_parametrization.py. The test was bad and had a TODO to actually implement and just tested that UnsupportedNodeError is thrown, and with GeneratorExp support a different error would be thrown.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78142
Approved by: https://github.com/malfet, https://github.com/albanD
2022-06-24 17:23:45 +00:00
5948522e9c Optim foreach cleanup for RAdam (#70230)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70230

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767874

Pulled By: mikaylagawarecki

fbshipit-source-id: 9379db24266a7bbcc2c23849f87ae0af2e6729c0
(cherry picked from commit ecf7b31fc39ccfeeef36bb763ca8c96960be3577)
2022-02-09 16:52:13 +00:00
7176c92687 [optim] update step in functional and pass state_steps instead of state (#71333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71333

Updated
- Adagrad
- Adamax
- Adam
- AdamW
- RAdam
Make the multi_tensor functionals take `state_steps: List[Tensor]` instead of `states: List[Dict]`.
Change `state_steps: List[int]` to `state_steps: List[Tensor]`, where each entry is a singleton tensor, so that step can be updated within the functional.

(NAdam and ASGD) were updated in separate diffs to fold their handling of state into the functionals
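
Why a list of plain ints is not enough can be sketched in pure Python. `StepCounter` is a hypothetical mutable stand-in for a singleton tensor and the two functions are illustrative only: incrementing ints collected into a temporary list never reaches the optimizer state, while in-place updates on shared mutable objects do.

```python
from typing import Dict, List


class StepCounter:
    """Hypothetical stand-in for a singleton tensor holding the step."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def add_(self, amount: int) -> "StepCounter":
        # In-place update, visible to every holder of this object.
        self.value += amount
        return self


def functional_step_ints(state_steps: List[int]) -> None:
    for i in range(len(state_steps)):
        state_steps[i] += 1  # only changes the temporary list


def functional_step_counters(state_steps: List[StepCounter]) -> None:
    for step in state_steps:
        step.add_(1)  # mutates the object stored in the state dict


int_state: Dict[str, Dict[str, int]] = {"p0": {"step": 0}, "p1": {"step": 0}}
functional_step_ints([s["step"] for s in int_state.values()])

counter_state = {"p0": {"step": StepCounter()}, "p1": {"step": StepCounter()}}
functional_step_counters([s["step"] for s in counter_state.values()])
```

After both calls, the int-based state still reads step 0, while the counter-based state has advanced to step 1.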

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D33767872

Pulled By: mikaylagawarecki

fbshipit-source-id: 9baa7cafb6375eab839917df9287c65a437891f2
(cherry picked from commit 831c02b3d0f585f61165ead368213f94b97a99ee)
2022-02-08 16:51:19 +00:00
263125a962 Fix RAdam docstring on LR default value (#69186)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69186

Reviewed By: albanD

Differential Revision: D32759614

Pulled By: H-Huang

fbshipit-source-id: b11819c50156a538cd6003e9cddde0390c853f67
2021-12-01 14:32:07 -08:00
239366c9c2 To add Rectified Adam Description to Documentation (#63772)
Summary:
It has been discussed before that adding descriptions of optimization algorithms to the PyTorch Core documentation may result in a nice optimization research tutorial. The following tracking issue lists all the necessary algorithms and links to the originally published papers: https://github.com/pytorch/pytorch/issues/63236.

In this PR we add a description of the Rectified Adam algorithm to the documentation. For more details, we refer to the paper https://arxiv.org/abs/1908.03265

<img width="446" alt="RadamAlgo" src="https://user-images.githubusercontent.com/73658284/132587815-4764b642-df53-4e41-975f-72e0f40fdc48.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63772

Reviewed By: datumbox

Differential Revision: D30839694

Pulled By: iramazanli

fbshipit-source-id: 6f5629ce56e10c66a451433334b587b99eda1610
2021-09-09 07:10:36 -07:00
63219f1f9f To add Rectified Adam Algorithm to Optimizers (#58968)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/24892

In the paper https://arxiv.org/pdf/1908.03265.pdf, Liyuan Liu et al. proposed a new optimization algorithm that is similar in essence to the Adam algorithm.

The paper argues that, without a warmup heuristic, adaptive optimization / learning algorithms can exhibit undesirably large variance in their early stage, which can slow the overall convergence process.

The authors proposed rectifying the variance of the adaptive learning rate when it is expected to be high.

Unlike the paper, we set the variance tractability cut-off to 5 instead of 4. This adjustment is common practice and can be found in the code repository as well as in the TensorFlow Swift optimizer library:

2f03dd1970/radam/radam.py (L156)

f51ee4618d/Sources/TensorFlow/Optimizers/MomentumBased.swift (L638)
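
For orientation, the role of the cut-off can be sketched in plain Python using the rectification term from the RAdam paper. This is a minimal illustration, not the code added by this PR: `rectification` is a hypothetical helper, and `threshold` defaults to the value 5 mentioned above.

```python
import math
from typing import Optional


def rectification(
    step: int, beta2: float = 0.999, threshold: float = 5.0
) -> Optional[float]:
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated SMA at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= threshold:
        # Variance is not tractable yet: skip the adaptive term.
        return None
    # Variance rectification term r_t.
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )
```

In the first few steps the helper returns `None`, which corresponds to taking an update without the adaptive learning rate; later the returned factor grows toward 1 as `rho_t` approaches `rho_inf`.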

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58968

Reviewed By: vincentqb

Differential Revision: D29310601

Pulled By: iramazanli

fbshipit-source-id: b7bd487f72f1074f266687fd9c0c6be264a748a9
2021-06-23 18:27:57 -07:00
1abf45e37f Revert D29241736: [pytorch][PR] To add Rectified Adam Algorithm to Optimizers
Test Plan: revert-hammer

Differential Revision:
D29241736 (0d2a936176)

Original commit changeset: 288b9b1f3125

fbshipit-source-id: 56c4ec98647c6f1822b130726741a1c9ca193670
2021-06-22 12:08:31 -07:00
0d2a936176 To add Rectified Adam Algorithm to Optimizers (#58968)
Summary:
Fixes : https://github.com/pytorch/pytorch/issues/24892

In the paper https://arxiv.org/pdf/1908.03265.pdf, Liyuan Liu et al. proposed a new optimization algorithm that is similar in essence to the Adam algorithm.

The paper argues that, without a warmup heuristic, adaptive optimization / learning algorithms can exhibit undesirably large variance in their early stage, which can slow the overall convergence process.

The authors proposed rectifying the variance of the adaptive learning rate when it is expected to be high.

Unlike the paper, we set the variance tractability cut-off to 5 instead of 4. This adjustment is common practice and can be found in the code repository as well as in the TensorFlow Swift optimizer library:

2f03dd1970/radam/radam.py (L156)

f51ee4618d/Sources/TensorFlow/Optimizers/MomentumBased.swift (L638)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58968

Reviewed By: gchanan

Differential Revision: D29241736

Pulled By: iramazanli

fbshipit-source-id: 288b9b1f3125fdc6c7a7bb23fde1ea5c201c0448
2021-06-22 10:38:41 -07:00