Commit Graph

48 Commits

6951626735 [Reland] Enable tests disabled for #115607 (#123551)
Relanding https://github.com/pytorch/pytorch/pull/123314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123551
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497
2024-04-08 21:29:28 +00:00
e94b81b254 Revert "Enable tests disabled for #115607 (#123314)"
This reverts commit 9564e204c1616ce78434abfdea0f3fd428b675f3.

Reverted https://github.com/pytorch/pytorch/pull/123314 on behalf of https://github.com/atalman due to breaking TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123314#issuecomment-2040854499))
2024-04-06 01:59:22 +00:00
954d750516 Revert "Enable dynamo'd tests disabled for #115679 (#123315)"
This reverts commit d472ebf94a3f3a3dec31e9d8b2038127b2309727.

Reverted https://github.com/pytorch/pytorch/pull/123315 on behalf of https://github.com/atalman due to breaking TestOptimRenewedCPU::test_foreach_matches_forloop_Adamax_cpu_float64 ([comment](https://github.com/pytorch/pytorch/pull/123315#issuecomment-2040835229))
2024-04-06 00:57:42 +00:00
d472ebf94a Enable dynamo'd tests disabled for #115679 (#123315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123315
Approved by: https://github.com/janeyx99
ghstack dependencies: #123313, #123314
2024-04-05 23:21:53 +00:00
9564e204c1 Enable tests disabled for #115607 (#123314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123314
Approved by: https://github.com/janeyx99
ghstack dependencies: #123313
2024-04-05 23:21:53 +00:00
d7fe0603a1 Move sparse tests to TestOptimRenewed (#123146)
This is the last of the old TestOptim! With this change, everything will be migrated to use OptimizerInfo. Our sparse support is...well, sparse, and the tests try to best encapsulate which configs actually work. Note that support_sparse actually just means the optimizer supports sparse grads...we don't test sparse params.

1. This PR fixes a bug in Adagrad multi_tensor with maximize by passing the correct value of maximize (vs. False every time) when sparse values are present (see the sketch after the config listings below).

2. This PR does improve coverage. There used to only be 2 configs each, and now we have the following configs for:

Adagrad:
```
python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Adagrad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'lr': 0.1}    <--- this and above are CPU
.{'foreach': False, 'lr': 0.1}
{'foreach': True, 'lr': 0.1}
{'maximize': True, 'foreach': False, 'lr': 0.1}
{'maximize': True, 'foreach': True, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': False, 'lr': 0.1}
{'initial_accumulator_value': 0.1, 'foreach': True, 'lr': 0.1}
.
----------------------------------------------------------------------
Ran 2 tests in 227.744s

OK
```

SGD
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_SGD
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'dampening': 0.5, 'lr': 0.0048}
.{'foreach': False, 'lr': 0.0048}
{'foreach': True, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': False, 'lr': 0.0048}
{'dampening': 0.5, 'foreach': True, 'lr': 0.0048}
.
----------------------------------------------------------------------
Ran 2 tests in 112.801s

OK
```

SparseAdam
```
(pytorch-3.10) [janeyx@devgpu023.odn1 /data/users/janeyx/pytorch (bff23193)]$ python test/test_optim.py -k test_rosenbrock_sparse_with_lrsched_False_Sparse
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
{'maximize': True, 'lr': 0.04}
.{'maximize': True, 'lr': 0.04}
.
----------------------------------------------------------------------
Ran 2 tests in 35.113s

OK
```
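
A minimal sketch of the fix in (1) above (not the PR's test; it assumes `torch.optim.Adagrad`'s existing sparse-grad support): with the fix, foreach Adagrad with `maximize=True` should move the sparse-touched entries in the ascent direction.
```
import torch

p = torch.zeros(5, requires_grad=True)
opt = torch.optim.Adagrad([p], lr=0.1, maximize=True, foreach=True)

# Hand-build a sparse gradient touching entries 0 and 3.
indices = torch.tensor([[0, 3]])
values = torch.tensor([1.0, 2.0])
p.grad = torch.sparse_coo_tensor(indices, values, p.shape)

opt.step()
# With maximize=True, the touched entries should move *up* (previously the
# multi_tensor path silently used maximize=False when grads were sparse).
assert p[0] > 0 and p[3] > 0
```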

Fixes #103322. A side quest in this migration was to re-enable and track dynamo issues as they trigger on the optim tests, which is complete as of this PR. New tests may add more things to track in dynamo, but there is now an established system for doing so, and dynamo is either enabled or a bug is tracked for every migrated test in TestOptimRenewed.

Next steps:
Remove the hyperparameter constraints in common_optimizers.py defined by metadata_for_sparse (other than LR, which seems handpicked for the tests to actually pass). Doing this requires adding more sparse functionality.

Add more tests!

Maybe add more optimizers!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123146
Approved by: https://github.com/albanD
ghstack dependencies: #123134, #123139
2024-04-02 22:51:02 +00:00
f2838c99a0 Add a tensor lr test for optimizers (#123139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123139
Approved by: https://github.com/albanD
ghstack dependencies: #123134
2024-04-02 22:51:02 +00:00
cb8fc30e4a Move LRScheduler integration tests to OptimizerInfo (#123134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123134
Approved by: https://github.com/albanD
2024-04-02 22:51:02 +00:00
9d9d2af786 [BE] Move tests using functional API to OptimizerInfo (#122822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122822
Approved by: https://github.com/albanD
2024-04-02 01:35:59 +00:00
16771747c2 Add tensor step and capturable support to rprop (#122261)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Rprop step update while compiling

Also adds capturable support + testing
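
A minimal, hedged sketch of the usage this targets (not code from the PR): calling `Rprop.step()` from a `torch.compile`d function, which is what keeping `step` as a tensor is meant to support.
```
import torch

p = torch.randn(16, requires_grad=True)
opt = torch.optim.Rprop([p])

@torch.compile
def optimizer_step():
    opt.step()

for _ in range(3):
    opt.zero_grad()
    (p * p).sum().backward()
    optimizer_step()
```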

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122261
Approved by: https://github.com/janeyx99
2024-03-28 23:31:18 +00:00
caa57e4fcd Add tensor step and capturable support to rmsprop (#122264)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes RMSprop step update while compiling

Adds capturable support to RMSprop

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122264
Approved by: https://github.com/janeyx99
2024-03-28 03:39:28 +00:00
365e89a591 Add tensor step to adadelta (#122252)
Towards fixing https://github.com/pytorch/pytorch/issues/115679
Fixes Adadelta step update while compiling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122252
Approved by: https://github.com/janeyx99
2024-03-21 07:28:47 +00:00
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
f76e541ec7 [BE] NO MORE discrepancy between forloop foreach capturable YAY (#121269)
and I will not let it happen again

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121269
Approved by: https://github.com/albanD
ghstack dependencies: #121260, #121264
2024-03-08 00:00:30 +00:00
9d6c5be781 Add ASGD capturable API for forloop (#121264)
@tfsingh I got to it first--wanted to land this stack and close the gap ASAP.

This PR also fixes a discrepancy between `_init_group` and `__set_state__` by having the constants always live on the params' device.

There are some next steps though:
- ASGD can be made faster by keeping etas, mus, and steps on CPU when NOT capturable. (I had mistakenly thought foreachifying them was faster and so we landed https://github.com/pytorch/pytorch/pull/107857, but it is slower.) No one has complained yet though. ¯\_(ツ)_/¯
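
A hedged sketch of what capturable enables (assuming a CUDA device; this is the generic CUDA-graphs capture pattern, not code from this PR): because etas, mus, and steps now live on the params' device, `optimizer.step()` can be captured into a CUDA graph and replayed.
```
import torch

assert torch.cuda.is_available()
p = torch.randn(32, device="cuda", requires_grad=True)
opt = torch.optim.ASGD([p], lr=1e-2, capturable=True)

# Warm up on a side stream before capture, per the torch.cuda.graphs docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        (p * p).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

opt.zero_grad(set_to_none=True)
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    (p * p).sum().backward()
    opt.step()

g.replay()  # replays backward + step without per-iteration Python overhead
```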

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121264
Approved by: https://github.com/albanD
ghstack dependencies: #121260
2024-03-08 00:00:30 +00:00
24821fec26 Add RAdam capturable API for forloop (#121260)
Implementation thanks to @MarouaneMaatouk in https://github.com/pytorch/pytorch/pull/118697, though I've since cleaned it up a lot to save perf on the rect < 5 eager case. It also just looks better now :) Added tests and the cudagraph health check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121260
Approved by: https://github.com/mlazos
2024-03-08 00:00:30 +00:00
83d095c213 [BE] Remove unnecessary requires_cuda in common_optimizers.py (#121249)
@mlazos had already added the needed decorator on the test itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121249
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/albanD
ghstack dependencies: #121183
2024-03-07 17:57:02 +00:00
53bdae736d Add capturable single tensor Adamax (#121183)
Finishes the work started in https://github.com/pytorch/pytorch/pull/118697. Thanks @MarouaneMaatouk for the attempt, but due to inactivity I have opened this PR for Adamax. Note that the new capturable implementation is much simpler and I've modified the foreach capturable impl--it now calls fewer kernels and is more easily comparable to forloop.

Next steps:
* This PR discovered two bugs: #121178 and #121238.
* Move the now hefty graph optim tests in test_cuda to use OptimInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121183
Approved by: https://github.com/albanD
2024-03-07 17:57:02 +00:00
d621e3e3b8 Add exhaustive module and optimizer tests for torch.load(state_dict, weights_only=True) (#121049)
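A hedged illustration (not the PR's test) of the call pattern being exercised: round-tripping an optimizer `state_dict()` through `torch.save`/`torch.load` with `weights_only=True`, which restricts unpickling to tensors and plain containers.
```
import io

import torch

p = torch.randn(3, requires_grad=True)
opt = torch.optim.AdamW([p], lr=1e-3)
p.sum().backward()
opt.step()

buf = io.BytesIO()
torch.save(opt.state_dict(), buf)
buf.seek(0)

sd = torch.load(buf, weights_only=True)
opt.load_state_dict(sd)
```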
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121049
Approved by: https://github.com/janeyx99
2024-03-05 14:27:50 +00:00
f9f602fcb8 Clean up decorators (#119925)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119925
Approved by: https://github.com/eellison
2024-02-15 22:51:53 +00:00
9f44274373 Add tests to verify disabled optimizers (#118919)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118919
Approved by: https://github.com/janeyx99
2024-02-14 07:45:16 +00:00
3625ccfbea Move step global hooks test to OptimizerInfo (#119299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119299
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283, #119288
2024-02-07 15:50:31 +00:00
7b3762e6bc Move step pre/post hook tests to OptimizerInfo (#119288)
Note that this increases coverage from 1 config (vanilla SGD) to all the configs (13 optimizers with around 6-7 configs each). The test time seems fine though!
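
A hedged sketch of the per-optimizer hook API under test here (not the PR's test): `register_step_pre_hook`/`register_step_post_hook` fire around every `optimizer.step()` call.
```
import torch

p = torch.randn(4, requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1)
calls = []

def pre_hook(optimizer, args, kwargs):
    calls.append("pre")

def post_hook(optimizer, args, kwargs):
    calls.append("post")

opt.register_step_pre_hook(pre_hook)
opt.register_step_post_hook(post_hook)

(p * p).sum().backward()
opt.step()
assert calls == ["pre", "post"]
```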

With the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b6093c03)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 13.680s

OK
```

Excluding the torch cuda synchronization:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_step_pre_hook -k test_step_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
....................................................
----------------------------------------------------------------------
Ran 52 tests in 1.038s

OK
```

The old tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (916f6fe3)]$ python test/test_optim.py -k test_pre_hook -k test_post_hook
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..
----------------------------------------------------------------------
Ran 2 tests in 0.518s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119288
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #119283
2024-02-07 15:50:31 +00:00
b5ba80828f [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we didn't create the tensors on the right device in `__setstate__`. This has been fixed (a sketch follows after this list).
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks.
3. The most recent change in Adamax only adds capturable for foreach but would silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, this PR has only further cemented my preference that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.
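
A hedged illustration of bug (1) above (not the PR's test; it assumes a CUDA device): after the fix, a plain-number `step` loaded into a capturable optimizer should end up as a tensor on the params' device.
```
import torch

assert torch.cuda.is_available()
p = torch.randn(4, device="cuda", requires_grad=True)
opt = torch.optim.Adam([p], lr=1e-3, capturable=True)
p.sum().backward()
opt.step()

sd = opt.state_dict()
sd["state"][0]["step"] = 1  # simulate an old checkpoint with a non-tensor step
opt.load_state_dict(sd)

step = opt.state_dict()["state"][0]["step"]
assert isinstance(step, torch.Tensor) and step.is_cuda
```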

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable now gets included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows that it is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 19:13:00 +00:00
2964170f3a Revert "[optim] Rectify capturable testing and fix bugs! (#118326)"
This reverts commit d947b9d50011ebd75db2e90d86644a19c4fe6234.

Reverted https://github.com/pytorch/pytorch/pull/118326 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there are some relevant failures in trunk d947b9d500, may be a land race ([comment](https://github.com/pytorch/pytorch/pull/118326#issuecomment-1923125676))
2024-02-02 07:08:14 +00:00
d947b9d500 [optim] Rectify capturable testing and fix bugs! (#118326)
This PR fixes several bugs, listed in priority:
1. `load_state_dict` with a nontensor step was incorrect for capturable and fused implementations since we didn't create the tensors on the right device in `__setstate__`. This has been fixed.
2. The most recently added capturable implementations forgot the check that all tensors should be on CUDA for eager. We've now added those checks.
3. The most recent change in Adamax only adds capturable for foreach but would silently be incorrect for forloop/single-tensor. I've added erroring and modified testing with many, many skips for that. Honestly, this PR has only further cemented my preference that we should just do the single-tensor and multi-tensor capturable implementations together in the future. @mlazos
4. The conditional for adding cuda-supported configs for the optimizer infos was incorrect! So we hadn't been testing capturable! This also stands rectified and was the trigger for this PR in the first place.
5. In a similar way, the conditional for `_get_optim_inputs_including_global_cliquey_kwargs` was incorrect sometimes as well. This has also been corrected.

The following is not a bug, but is just something to make life simpler by not needing to handle Nones: `optim_input_funcs` must now mandatorily take in a `device`, which could be a string or a torch.device.

Details for posterity:
4. Running the test_foreach_matches_forloop test and printing the configs shows that capturable now gets included, which is correct.
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5d50138f)]$ python test/test_optim.py -k test_foreach_matches_forloop_AdamW_cuda
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={}, desc=default
params=None, kwargs={'lr': 0.01}, desc=non-default lr
params=None, kwargs={'weight_decay': 0.1}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'maximize': True}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True}, desc=amsgrad
params=None, kwargs={'capturable': True}, desc=capturable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True}, desc=capturable, amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True}, desc=Tensor lr with capturable and amsgrad
.
----------------------------------------------------------------------
Ran 1 test in 19.229s

OK
```
5. Running the test_optimizer_can_be_printed test (which calls `_get_optim_inputs_including_global_cliquey_kwargs`) and printing what gets run shows that it is also now correct.
```
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
params=None, kwargs={'differentiable': False}, desc=default
params=None, kwargs={'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.1, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'differentiable': True}, desc=amsgrad & differentiable
.params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.1, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.1, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable
params=None, kwargs={'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable & foreach
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable & differentiable
params=None, kwargs={'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable & fused
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=capturable, amsgrad & foreach
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=capturable, amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=capturable, amsgrad & fused
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=Tensor lr with capturable and amsgrad & foreach
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=Tensor lr with capturable and amsgrad & differentiable
params=None, kwargs={'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=Tensor lr with capturable and amsgrad & fused
.
----------------------------------------------------------------------
Ran 2 tests in 11.112s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118326
Approved by: https://github.com/mlazos
2024-02-02 02:02:58 +00:00
aca41a3a74 [optim] lbfgs: handle complex params as independent real params (#118184)
Ref: #86340

Fixes #118148

This fixes LBFGS for complex parameters. Complex parameters are handled as R^2.
I also added a test; unfortunately, due to the closure required, I could not use the existing `_test_complex_optimizer` used for all other optimizers.
LBFGS is special, as it will call the objective function multiple times internally, so I felt a one-off test for LBFGS was justifiable.
The test checks that each step taken internally by the optimizer is the same for R^2 and complex parameters.

Let me know if the approach is ok, thanks
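
A hedged, simplified version of that comparison (not the exact test in the PR): one LBFGS step on a complex parameter should match the same step taken on its real/imaginary parts stacked as an R^2 tensor.
```
import torch

z = torch.randn(3, dtype=torch.complex64, requires_grad=True)
xy = torch.view_as_real(z.detach()).clone().requires_grad_(True)

opt_z = torch.optim.LBFGS([z], lr=0.5, max_iter=1)
opt_xy = torch.optim.LBFGS([xy], lr=0.5, max_iter=1)

def closure_z():
    opt_z.zero_grad()
    loss = z.abs().pow(2).sum()
    loss.backward()
    return loss

def closure_xy():
    opt_xy.zero_grad()
    loss = xy.pow(2).sum()
    loss.backward()
    return loss

opt_z.step(closure_z)
opt_xy.step(closure_xy)
print(torch.allclose(torch.view_as_real(z), xy))  # expected: True after this fix
```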

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118184
Approved by: https://github.com/janeyx99
2024-01-31 19:24:16 +00:00
800e2e823f Add compilable foreach RAdam support (#117912)
Fixes https://github.com/pytorch/pytorch/issues/117807

This brings the number of supported optimizers with `torch.compile` to 11/13 (!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117912
Approved by: https://github.com/janeyx99
2024-01-27 04:32:27 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
15608d8cb4 Add guardrails preventing complex params in LBFGS & SparseAdam (#118161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118161
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #118160
2024-01-24 21:22:47 +00:00
17ecd1e9cd Migrate test_complex_optimizer to OptimizerInfo (#118160)
This PR does what it says and more.

1. We increase coverage by a LOT! Previously, complex was not tested for many, many configs, including foreach + maximize at the same time. Or the fused impls. Or just random configs people forgot about.
2. I rearranged the maximize conditional and the _view_as_real to preserve list-ness. This is needed for _view_as_real to function properly; I added a comment in the Files Changed. This new order also just...makes more aesthetic sense.
3. Note that LBFGS and SparseAdam are skipped--they don't support complex and now we know.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118160
Approved by: https://github.com/mikaylagawarecki
2024-01-24 21:22:47 +00:00
fc30c4d769 Migrate forloop directional tests to OptimizerInfo (#117410)
This PR is another step towards modernizing our optimizer tests by tackling the simplest foreach tests. The replaced tests are now removed in `test/optim/test_optim.py`.

**Changes in coverage?** Yes!
- This PR _decreases_ coverage (!!!!) by only checking the direction on the forloop implementations vs both the forloop and foreach. Why? I believe it should be sufficient to check the forloop only, as the foreach parity is already checked in the `foreach_matches_forloop` test.
- This PR also _increases_ coverage for SparseAdam with contiguous params on CUDA, which was previously forbidden due to an old old bug that has since been fixed.

What will it take to fully remove `test_basic_cases`?
- We need to flavor the tests with LRSchedulers
- Testing for param groups, which all just distinguish between lrs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117410
Approved by: https://github.com/albanD
2024-01-24 01:28:40 +00:00
c6be5d55a5 Migrate param_group testing to OptimizerInfo (#117675)
Today, our param_group testing does the equivalent of pitting weight and bias against each other with different optimizer hyperparams and then checking that the overall result moves in the right direction based on maximize.

This PR introduces two tests to encompass coverage:
1. For every optimizer input (no differentiable), always force bias to have 0 weight_decay, and then check that the direction is expected. This is basically a replica of today's tests, but is more methodical as the test is a real use case.
2. To ensure that the different groups have distinct behavior, I added another test where lr is basically 0 in the default group, and ensured that the param in the default group doesn't move while the loss does (see the sketch below).

Together, these tests do a better job of testing param groups than today's tests, **though we do lose some flavors**. For example, RMSProp also pits centered=True vs False across the param_groups, Adadelta has a variation on rho, and ASGD has a variation for t0. I don't think this is really a loss, as the previous test was just testing for direction and our new tests test stronger guarantees.

The leftover param group configs are used in conjunction with LRSchedulers.
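
A hedged sketch of the second scenario (not the PR's test): with the default group's lr near zero, its param should stay put while the overridden group's param learns.
```
import torch

w = torch.randn(2, 2, requires_grad=True)  # default group, lr ~ 0
b = torch.randn(2, requires_grad=True)     # overridden group, real lr

opt = torch.optim.SGD(
    [{"params": [w]}, {"params": [b], "lr": 0.1}],
    lr=1e-8,
)

w_before = w.detach().clone()
for _ in range(5):
    opt.zero_grad()
    loss = (w.sum() + b).pow(2).sum()
    loss.backward()
    opt.step()

assert torch.allclose(w, w_before, atol=1e-5)  # default group barely moves
```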

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117675
Approved by: https://github.com/albanD
2024-01-22 23:48:46 +00:00
aaae2d8bb6 Add compilable and capturable foreach adamax with tests (#117835)
Based off of https://github.com/pytorch/pytorch/pull/110345

Fixes https://github.com/pytorch/pytorch/issues/117812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117835
Approved by: https://github.com/janeyx99
2024-01-20 05:29:05 +00:00
1d14adfa66 [mta] Fused SGD (#116585)
depends on #116583

rel:
- #94791
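
A hedged example of the new knob this adds (assuming a CUDA device and a build where the fused SGD kernel is available):
```
import torch

assert torch.cuda.is_available()
p = torch.randn(1024, device="cuda", requires_grad=True)
opt = torch.optim.SGD([p], lr=0.1, fused=True)

(p * p).sum().backward()
opt.step()  # the whole update runs in a fused multi-tensor kernel
```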

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116585
Approved by: https://github.com/janeyx99
2024-01-16 23:54:38 +00:00
c329eddcb9 Migrate the rest of state_dict testing to OptimizerInfo (#117186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117186
Approved by: https://github.com/albanD
ghstack dependencies: #116509
2024-01-12 22:32:37 +00:00
bcf1f312a0 Migrate nontensor step and CUDA params state_dict tests to OptimizerInfo (#116509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116509
Approved by: https://github.com/albanD
2024-01-12 22:32:37 +00:00
90df7c008a Migrate state_dict bc test to OptimizerInfo, increase coverage (#116500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116500
Approved by: https://github.com/albanD
2024-01-10 08:19:27 +00:00
4af1c27fa8 Migrate repr, deterministic state_dict test to OptimizerInfo (#116496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116496
Approved by: https://github.com/albanD
ghstack dependencies: #116471
2023-12-28 19:49:04 +00:00
f3c4395358 [BE] Add helper in common_optimizers to get all optim inputs (#116471)
This will be a common utility in test_optim.py. Printing out the optimizer inputs when using this helper looks reasonable:

For local test plan, click below.
<details>

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d186986c)]$ python test/test_optim.py -vv -k test_step_is_noop_when_params_have_no_grad
test_step_is_noop_when_params_have_no_grad_ASGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True}, desc=amsgrad & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cpu_float32 (__main__.TestOptimRenewedCPU) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cpu_float32 (__main__.TestOptimRenewedCPU) ... ok
test_step_is_noop_when_params_have_no_grad_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.02, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.02, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': False}, desc=t0
params=None, kwargs={'t0': 100, 'foreach': True, 'differentiable': False}, desc=t0 & foreach
params=None, kwargs={'t0': 100, 'foreach': False, 'differentiable': True}, desc=t0 & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=rho
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=rho & foreach
params=None, kwargs={'rho': 0.95, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=rho & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=initial_accumulator_value
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=initial_accumulator_value & foreach
params=None, kwargs={'initial_accumulator_value': 0.1, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=initial_accumulator_value & differentiable
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=lr_decay
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=lr_decay & foreach
params=None, kwargs={'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=lr_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False, 'fused': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True, 'fused': False}, desc=default & differentiable
params=None, kwargs={'foreach': False, 'differentiable': False, 'fused': True}, desc=default & fused
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': False}, desc=non-default lr
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False, 'fused': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True, 'fused': False}, desc=non-default lr & differentiable
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False, 'fused': True}, desc=non-default lr & fused
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False, 'fused': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True, 'fused': False}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False, 'fused': True}, desc=nonzero weight_decay & fused
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=maximize & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=maximize & fused
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': False}, desc=amsgrad
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': True, 'differentiable': False, 'fused': False}, desc=amsgrad & foreach
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': True, 'fused': False}, desc=amsgrad & differentiable
params=None, kwargs={'weight_decay': 0.9, 'amsgrad': True, 'foreach': False, 'differentiable': False, 'fused': True}, desc=amsgrad & fused
ok
test_step_is_noop_when_params_have_no_grad_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_LBFGS_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_step_is_noop_when_params_have_no_grad_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=non-zero momentum_decay
params=None, kwargs={'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=non-zero momentum_decay & foreach
params=None, kwargs={'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=non-zero momentum_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': False}, desc=weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': True, 'differentiable': False}, desc=weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'foreach': False, 'differentiable': True}, desc=weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'momentum_decay': 0.006, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': False}, desc=non-default eps
params=None, kwargs={'eps': 1e-06, 'foreach': True, 'differentiable': False}, desc=non-default eps & foreach
params=None, kwargs={'eps': 1e-06, 'foreach': False, 'differentiable': True}, desc=non-default eps & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': False}, desc=decoupled_weight_decay
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': True, 'differentiable': False}, desc=decoupled_weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'decoupled_weight_decay': True, 'foreach': False, 'differentiable': True}, desc=decoupled_weight_decay & differentiable
ok
test_step_is_noop_when_params_have_no_grad_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.001, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.001, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nonzero weight_decay
params=None, kwargs={'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nonzero weight_decay & foreach
params=None, kwargs={'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nonzero weight_decay & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': False}, desc=centered
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': True, 'differentiable': False}, desc=centered & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'foreach': False, 'differentiable': True}, desc=centered & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'weight_decay': 0.9, 'centered': True, 'momentum': 0.1, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': False}, desc=non-default lr
params=None, kwargs={'lr': 0.0002, 'foreach': True, 'differentiable': False}, desc=non-default lr & foreach
params=None, kwargs={'lr': 0.0002, 'foreach': False, 'differentiable': True}, desc=non-default lr & differentiable
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': False}, desc=non-default etas
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': True, 'differentiable': False}, desc=non-default etas & foreach
params=None, kwargs={'etas': (0.5, 1.5), 'foreach': False, 'differentiable': True}, desc=non-default etas & differentiable
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': False}, desc=non-default step_sizes
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': True, 'differentiable': False}, desc=non-default step_sizes & foreach
params=None, kwargs={'step_sizes': (2e-06, 100), 'foreach': False, 'differentiable': True}, desc=non-default step_sizes & differentiable
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': False}, desc=default
params=None, kwargs={'lr': 0.01, 'foreach': True, 'differentiable': False}, desc=default & foreach
params=None, kwargs={'lr': 0.01, 'foreach': False, 'differentiable': True}, desc=default & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': False}, desc=momentum
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': True, 'differentiable': False}, desc=momentum & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'foreach': False, 'differentiable': True}, desc=momentum & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': False}, desc=dampening
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': True, 'differentiable': False}, desc=dampening & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'dampening': 0.5, 'foreach': False, 'differentiable': True}, desc=dampening & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=non-zero weight_decay
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=non-zero weight_decay & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=non-zero weight_decay & differentiable
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': False}, desc=nesterov
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': True, 'differentiable': False}, desc=nesterov & foreach
params=None, kwargs={'lr': 0.01, 'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.9, 'foreach': False, 'differentiable': True}, desc=nesterov & differentiable
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': False}, desc=maximize
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': True, 'differentiable': False}, desc=maximize & foreach
params=None, kwargs={'lr': 0.01, 'weight_decay': 0.9, 'maximize': True, 'foreach': False, 'differentiable': True}, desc=maximize & differentiable
ok
test_step_is_noop_when_params_have_no_grad_SparseAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 26 tests in 19.089s

OK
```

</details>
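
For reference, a minimal sketch of the property every config above asserts (my own toy snippet, not the actual TestOptimRenewed test body): calling step() when no parameter has a .grad should leave the parameters untouched.

```
import torch

# Hypothetical minimal illustration, not the real test body:
# step() should be a no-op when none of the params have a .grad.
param = torch.randn(2, 3, requires_grad=True)
before = param.detach().clone()

opt = torch.optim.SGD([param], lr=0.01)
opt.step()  # param.grad is None, so the step should change nothing

assert param.grad is None
assert torch.equal(param, before)
```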

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116471
Approved by: https://github.com/albanD
2023-12-28 19:49:04 +00:00
44b98c09ca [BE] migrate all assertRaises tests to OptimizerInfo test_errors (#116315)
Removes a part of the SparseAdam test and the following three tests: `test_fused_optimizer_raises`, `test_duplicate_params_across_param_groups`, `test_duplicate_params_in_one_param_group`

```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_fused_optimizer_raises -k test_duplicate_params_across_param_groups -k test_duplicate_params_in_one_param_group
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...
----------------------------------------------------------------------
Ran 3 tests in 0.023s

OK
```

Increases coverage by running the duplicate-param tests on ALL the optims instead of just one each. Also fixes a SparseAdam bug where params were accidentally unbound via torch.unbind (by iterating the Tensor with list()) instead of being wrapped in a list. This bug was caught by migrating the weird warning handling to a single warning context manager, which checks that nothing else gets raised.
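
As a quick illustration of that bug class (my own snippet, not the fixed SparseAdam code): iterating a Tensor with `list()` unbinds it along dim 0, which is very different from wrapping it in a list.

```
import torch

params = torch.zeros(3, 2)

# Iterating a Tensor with list() unbinds it along dim 0: we get three
# tensors of shape (2,) rather than the single tensor we meant to pass.
unbound = list(params)
assert len(unbound) == 3 and unbound[0].shape == (2,)

# What the optimizer constructor actually wants: the tensor wrapped in a list.
wrapped = [params]
assert len(wrapped) == 1 and wrapped[0] is params
```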

The new test_errors does not run slower than before; overhead is still king:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (d2d129de)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
..........................
----------------------------------------------------------------------
Ran 26 tests in 10.337s

OK
```

Compared to test_errors BEFORE my commit :p
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$ python test/test_optim.py -k test_errors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.............sssssssssssss
----------------------------------------------------------------------
Ran 26 tests in 11.980s

OK (skipped=13)
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (b47aa696)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116315
Approved by: https://github.com/mikaylagawarecki
2023-12-27 00:08:31 +00:00
edf1ea622d Move step is noop tests (#115299)
As stated. I do notice there is perhaps an opportunity to abstract, but the tests as written are also super understandable, and more abstraction might not be desirable.

This PR _increases coverage_. The original tests each covered 12 default configs (leaving out Rprop). Now the tests cover ~80 configs, plus foreach + fused on top of that! Test time increases over 10-fold, but this test is tiny so we are not worried:

Old:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 0.028s

OK
```

New (includes the old test):
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (5ca9672c)]$ python test/test_optim.py -k test_step_is_noop_when_params_have_no_grad
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........................
----------------------------------------------------------------------
Ran 27 tests in 0.456s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115299
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023, #115025
2023-12-20 22:49:44 +00:00
8f3a0594e9 Move tests depending on listed configs to OptimizerInfo (#115025)
Removing 4 tests:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py -v -k test_fused_optimizers_with_large_tensors -k test_fused_optimizers_with_varying_tensors -k test_multi_tensor_optimizers_with_large_tensors -k test_multi_tensor_optimizers_with_varying_tensors
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_fused_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_large_tensors (optim.test_optim.TestOptim) ... ok
test_multi_tensor_optimizers_with_varying_tensors (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 4 tests in 22.731s

OK
```

For the same 4 but more granular:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7539011b)]$ python test/test_optim.py  -v -k test_fused_large_tensor -k test_fused_mixed_device_dtype -k test_foreach_large_tensor -k test_foreach_mixed_device_dtype
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_large_tensor_ASGD_cpu_float16 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
....
test_fused_mixed_device_dtype_Adam_cpu_float32 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
test_foreach_large_tensor_ASGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adadelta_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adagrad_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_NAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RAdam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_RMSprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_Rprop_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_large_tensor_SGD_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_ASGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adadelta_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adagrad_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Adamax_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_NAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RAdam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_RMSprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_Rprop_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_mixed_device_dtype_SGD_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_AdamW_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_large_tensor_Adam_cuda_float16 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_AdamW_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_mixed_device_dtype_Adam_cuda_float32 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 50 tests in 50.785s

OK (skipped=25)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115025
Approved by: https://github.com/albanD
ghstack dependencies: #114802, #115023
2023-12-20 22:49:44 +00:00
05d60931b3 Migrate test_peak_mem_multi_tensor_optimizers to OptimizerInfo (#115023)
Replace the following:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k test_peak_mem_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
.
----------------------------------------------------------------------
Ran 1 test in 38.599s

OK
```

with 11 tests (one for each foreach optim :))
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (1bbf1c6f)]$ python test/test_optim.py -k TestOptimRenewedCUDA.test_foreach_memory
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
...........
----------------------------------------------------------------------
Ran 11 tests in 39.293s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115023
Approved by: https://github.com/albanD
ghstack dependencies: #114802
2023-12-20 22:49:44 +00:00
4fb92b591d [BE] remove redundant _test_derived_optimizers by migrating more to OptimizerInfo (#114802)
New tests look like:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py -v -k TestOptimRenewedCUDA.test_fused
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_fused_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_fused_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 2 tests in 34.591s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (af8fca04)]$ python test/test_optim.py
-v -k test_set_default_dtype_works_with_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_set_default_dtype_works_with_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... skipped 'Only runs on cuda'
...
test_set_default_dtype_works_with_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_set_default_dtype_works_with_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 32.915s

OK (skipped=11)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114802
Approved by: https://github.com/albanD
2023-12-20 22:49:44 +00:00
056a882cb9 add markDynamoStrictTest to TestOptimRenewed, removing flakiness (#115947)
fixes #115406 fixes #115394 fixes #115393 fixes #115392 fixes #115391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115947
Approved by: https://github.com/albanD, https://github.com/zou3519
2023-12-16 01:33:32 +00:00
21cca2494d Move test_multi_tensor_optimizers to use OptimizerInfos (#114797)
This PR aims for parity+ with the old testing for the simplest foreach test case.

Test coverage increase: we now test foreach optimizers on CPU as well as on GPU.
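
To make the parity point concrete, here is a hand-rolled sketch of the kind of check involved (not the OptimizerInfo-driven test itself): foreach=True and foreach=False should land on the same parameters on whichever device is available.

```
import torch

def run_sgd(foreach, device):
    # Same seed => same initial params and grads for both implementations.
    torch.manual_seed(0)
    param = torch.randn(5, device=device, requires_grad=True)
    opt = torch.optim.SGD([param], lr=0.1, momentum=0.9, foreach=foreach)
    for _ in range(3):
        opt.zero_grad()
        (param ** 2).sum().backward()
        opt.step()
    return param.detach().cpu()

device = "cuda" if torch.cuda.is_available() else "cpu"
assert torch.allclose(run_sgd(False, device), run_sgd(True, device))
```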

Before:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_multi_tensor_optimizers
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_multi_tensor_optimizers (optim.test_optim.TestOptim) ... ok

----------------------------------------------------------------------
Ran 1 test in 7.253s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Now, we get granular test cases at the cost of overhead!
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$ python test/test_optim.py -v -k test_foreach
/home/janeyx/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.0
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
test_foreach_ASGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adadelta_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adagrad_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_AdamW_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Adamax_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_NAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RAdam_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_RMSprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_Rprop_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_SGD_cpu_float64 (__main__.TestOptimRenewedCPU) ... ok
test_foreach_ASGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adadelta_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adagrad_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_AdamW_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Adamax_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_NAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RAdam_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_RMSprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_Rprop_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok
test_foreach_SGD_cuda_float64 (__main__.TestOptimRenewedCUDA) ... ok

----------------------------------------------------------------------
Ran 22 tests in 30.954s

OK
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (19136605)]$
```

Why the increase in time?
Two reasons:
1. Overhead. Any _CUDA_ *Info test (OpInfo, ModuleInfo, OptimizerInfo) will wrap itself with the `CudaNonDefaultStream` policy, and `CudaNonDefaultStream.__enter__`, when called for the first time, goes through all visible CUDA devices and synchronizes each of them, thus forcing the CUDA context to be init'd on each (see the sketch after this list). Doing this for all 8 devices takes ~10-15s. Test parametrization adds a little overhead too, but nowhere near the cost of init'ing the CUDA context.
2. We test more! Now, we have 72 configs (in the foreach optimizer world) whereas we only had 59 before.
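
A rough, hypothetical sketch of where that first-time cost in (1) comes from (not the actual CudaNonDefaultStream code): synchronizing every visible device forces each device's CUDA context to be created.

```
import time
import torch

# Hypothetical illustration of the one-time cost described in (1), not the
# actual CudaNonDefaultStream implementation: the first synchronize on each
# visible device forces that device's CUDA context to be initialized.
if torch.cuda.is_available():
    start = time.perf_counter()
    for i in range(torch.cuda.device_count()):
        torch.cuda.synchronize(device=i)  # first touch of device i inits its context
    elapsed = time.perf_counter() - start
    print(f"synchronized {torch.cuda.device_count()} device(s) in {elapsed:.1f}s")
```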

Next steps:
- consider adding more Tensor LR configs (like a Tensor LR without capturable in the single-tensor case)
- likely the next PR or two: migrate all uses of _test_derived_optimizers in test_optim to TestOptimRenewed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114797
Approved by: https://github.com/albanD
2023-12-07 19:37:56 +00:00
d78fe039eb Introduce OptimizerInfos + add a test_errors (#114178)
Introduce OptimizerInfos + use them to refactor out the error testing.

Why OptimizerInfos?
- cleaner, easier way to test all configs of optimizers
- would plug in well with devicetype to auto-enable tests for devices like MPS, meta
- would allow for more granular testing. Currently, lots of functionality is tested in `_test_basic_cases`, and some of that should be broken down more.

What did I do for error testing?
- I moved out some error cases from `_test_basic_cases` into a new test_errors parametrized test.
- The new test has to live in TestOptimRenewed (bikeshedding welcome) because the parametrized tests need to take in device and dtype and hook correctly, and not all tests in TestOptim do that.
- TestOptimRenewed is also migrating to the top-level test/test_optim.py now because importing TestOptimRenewed does not work (due to test instantiation, TestOptimRenewed gets replaced with a per-device class such as TestOptimRenewedCPU or TestOptimRenewedCUDA).

Is there any change in test coverage?
- INCREASE: The error case where a single Parameter (vs a container of them) is passed in has now expanded to all optims instead of only LBFGS
- DECREASE: Not much. The only thing is we no longer test two error cases for foreach=True AND foreach=False, which I think is redundant. (Highlighted in comments)

Possible but not urgent next step: test ALL possible error cases by going through all the constructors.
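
To make the structure concrete, here is a toy sketch of the idea under stated assumptions: `MyOptimInfo`, `ERROR_DB`, and `check_errors` are hypothetical names of my own, not the actual OptimizerInfo / optim_db API. An info record carries an optimizer class plus its expected error inputs, and one loop covers every optimizer, including the bare-Parameter case mentioned above (torch.optim rejects a lone Tensor with a TypeError and negative lr/eps with a ValueError).

```
import dataclasses
from typing import List, Tuple, Type

import torch


@dataclasses.dataclass
class MyOptimInfo:  # hypothetical stand-in for OptimizerInfo
    optim_cls: Type[torch.optim.Optimizer]
    # (constructor kwargs, expected exception, substring of the message)
    error_inputs: List[Tuple[dict, type, str]]


ERROR_DB = [  # hypothetical stand-in for the real optimizer database
    MyOptimInfo(torch.optim.SGD, [({"lr": -0.1}, ValueError, "Invalid learning rate")]),
    MyOptimInfo(torch.optim.Adam, [({"eps": -1e-8}, ValueError, "Invalid epsilon")]),
]


def check_errors(device):
    params = [torch.randn(2, device=device, requires_grad=True)]
    for info in ERROR_DB:
        # A bare Parameter (not wrapped in an iterable) should raise for every optim.
        bare = torch.nn.Parameter(torch.randn(2, device=device))
        try:
            info.optim_cls(bare, lr=0.01)
        except TypeError:
            pass
        else:
            raise AssertionError(f"{info.optim_cls.__name__} accepted a bare Parameter")
        # Per-optimizer error inputs from the info record.
        for kwargs, exc, match in info.error_inputs:
            try:
                info.optim_cls(params, **kwargs)
            except exc as e:
                assert match in str(e)
            else:
                raise AssertionError(f"{info.optim_cls.__name__} did not raise")


check_errors("cpu")
```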

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114178
Approved by: https://github.com/albanD
2023-12-05 22:58:36 +00:00