99 Commits

Author SHA1 Message Date
e925dfcc6b Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-10-17 07:27:11 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
75de5b65b4 [Dynamo] Don't guard data ptrs by default with mark_static_address (#162208)
Fixes https://github.com/pytorch/pytorch/issues/156377

Since we now re-record cudagraphs, it's not necessary to guard by default anymore and induce a full recompile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162208
Approved by: https://github.com/anijain2305
2025-09-12 07:15:10 +00:00
07f76517e7 [Inductor][WIndows] Fix Windows test case failure. (#161497)
Fixes windows test case failures:
- TritonCodeGenTests.test_inductor_sequence_nr
- TritonCodeGenTests.test_indirect_device_assert
- CompiledOptimizerTests.test_static_address_finalizer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161497
Approved by: https://github.com/jansel
2025-08-28 12:40:42 +00:00
74280d0913 [muon] Introduce Muon optimizer to PyTorch (#160213)
A single-device version of Muon. Algorithm refers Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy.

This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory.

**Usage**

```python
    model = MyModelForCausalLM
    # filter out your params manually
    muon_params = [...]
    adamw_params = [...]
    muon = Muon(
        params = muon_params
        lr=lr,
        wd=wd,
    )
    adamw = AdamW(
        params = adamw_params
        lr=lr,
        wd=wd,
    )

    # in training loop
    loss = model(input)
    loss.backward()
    muon.step()
    adamw.step()
    muon.zero_grad()
    adamw.zero_grad()
```

~~**Additional usage**~~
~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~

```python
~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~
~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~
```

As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs.

By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW.

**Testing**

~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~

As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test.

2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency.
<img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" />

**Numerics**
We evaluate our implementation with existing implementation to confirm numerical consistency.

As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap.

As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below.

```python
    # dummy model and data
    model0 = Linear(10, 10, bias=False)
    model1 = copy.deepcopy(model0)
    inputs = torch.randn(8, 10)
    targets = torch.randn(8, 10)
    loss = MSELoss()

    lr = 1e-3
    wd = 0.1
    momentum = 0.95

    opt_ref_muon = KellySingleDeviceMuon(
        params=model0.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    opt_exp_muon = Muon(
        params=model1.parameters(),
        lr=lr,
        weight_decay=wd,
        momentum=momentum,
    )

    out_ref = model0(inputs)
    loss_ref = loss(out_ref, targets)
    opt_ref_muon.zero_grad()
    loss_ref.backward()
    opt_ref_muon.step()

    out_exp = model1(inputs)
    loss_exp = loss(out_exp, targets)
    opt_exp_muon.zero_grad()
    loss_exp.backward()
    opt_exp_muon.step()

    for p_ref, p_exp in zip(model0.parameters(), model1.parameters()):
        torch.testing.assert_close(p_ref, p_exp)
```

As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training runs on training a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without.
<img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" />

**Performance**
Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP:

- adamw_ddp finishes in 13.12 min
- pytorch_muon_ddp finishes in 13.45 min

Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is *2.5%* slower than AdamW.

AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms
<img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" />

Muon: Optimizer.step() takes ~54 ms, step time ~960 ms
<img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" />

**Note**
We restrict the implementation to accept only 2D parameters.

An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful.

Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound.

**Next Steps**

1. Add `MuP`
2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices.
3. Open-source unsharded Muon co-designed with FSDP2.

****

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213
Approved by: https://github.com/janeyx99
2025-08-24 08:03:04 +00:00
af10f1f86c Fix requires_cuda to requires_cuda_and_triton (#160222)
Fixes ##159399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222
Approved by: https://github.com/janeyx99
2025-08-10 07:05:52 +00:00
cf0749c92f Use expecttest in test_compiled_optimizers.py (#155308)
Fixes #141262

## Test Result

```bash
pytest test/inductor/test_compiled_optimizers.py -vv
```

![image](https://github.com/user-attachments/assets/1886fb71-ff05-46e7-988c-82d36358a834)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155308
Approved by: https://github.com/mlazos, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
2025-06-27 06:29:51 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
fea718f062 [BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730)
Our three main users are OK with this, with two of them (foreach_map,
invoke_quant) prefering it like this.

I was originally worried about BC issues (this now means you cannot add
any positional args) but I think that's not a concern -- one can always
add kwonly args.

Test Plan
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730
Approved by: https://github.com/ydwu4, https://github.com/mlazos
2025-02-20 02:30:36 +00:00
deb1da15cc [foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454)
Adds a foreach_map backed Adam to compiled optimizer tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-12-19 03:16:47 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
6bbbb08458 [Dynamo] Replace torch._dynamo.optimize() with torch.compile() [10/N] (#142451)
> This is the last one

related commits:

- #139706
- #140238
- #140247
- #140253
- #140663
- #140688
- #140922
- #140924
- #140933
- #142451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451
Approved by: https://github.com/bdhirsh
2024-12-17 12:18:29 +00:00
ad37afd590 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit ba5253da9b30ed4d998cee1d865f92b2c27d3086.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))
2024-11-26 00:03:03 +00:00
ba5253da9b Always unspecialize float in OSS (#138922)
Fixes https://github.com/pytorch/pytorch/issues/107277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-24 01:58:13 +00:00
a8c90e5140 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit 6d779d05492813da1c19ac0c562d0d5f8473f27e.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is some slow tests failing after this land ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2495076878))
2024-11-22 23:18:36 +00:00
6d779d0549 Always unspecialize float in OSS (#138922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-22 17:54:42 +00:00
1fd4757fdc Support tensor betas in Adam and AdamW (#134171)
Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW.

Fixes https://github.com/pytorch/pytorch/issues/133898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171
Approved by: https://github.com/janeyx99
2024-11-15 21:55:55 +00:00
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As state in issue, replace `clone.detach` by `detach.clone`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
0efa590d43 [CI] Fix XPU CI failure (#138548)
# Motivation
Fix https://github.com/pytorch/pytorch/issues/138577.

# Solution
1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170
2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`.
3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py`
5. CUDA-bias code is introduced by https://github.com/pytorch/pytorch/issues/138472, we just generalize it to `GPU_TYPE`.

# Additional Context
> Why update torch-xpu-ops commit pin here?

We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364).

> What does the feature of torch-xpu-ops update?

1. Add some foreach ops, like `unary ops` and `foreach_clamp_max` etc;
2. Add some maxpool ops forward and backward, like `averge_pool3d` and `max_pool3d`
3. Add some other ops, like `log_normal_`, `index_copy`, and `mode` etc;
4. fix build failure related to `C10_UNUSED`;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-10-24 07:56:26 +00:00
0a38c0ec89 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion get kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings .

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-22 00:50:00 +00:00
ac7f52b301 Revert "[inductor] add a threshold for membw saving during fusion (#136782)"
This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8.

Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))
2024-10-19 03:43:42 +00:00
6647320de2 [inductor] add a threshold for membw saving during fusion (#136782)
Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion get kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings .

I think adding a threshold for membw saving in general is not bad.

I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782
Approved by: https://github.com/jansel
2024-10-19 00:22:43 +00:00
0b2c12cb4d Support more foreach ops for tensor beta support (#134170)
Add more foreach ops so we don't have fallbacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170
Approved by: https://github.com/eellison
2024-10-17 17:51:31 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
16f119e62a Update compiled optimizer tests for tensor betas (#134169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169
Approved by: https://github.com/anijain2305, https://github.com/eellison
ghstack dependencies: #134166, #134167, #134168
2024-08-31 10:24:39 +00:00
4fcd15a667 Fix test_sgd_weight_decay_xpu accuracy error (#134744)
Fixes #134743

This PR adds `test_sgd_weight_decay_xpu` in `KERNEL_COUNT_OVERRIDES` to override.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-08-29 15:12:40 +00:00
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
a6413d2924 Regression test for S429861 (#133376)
Adds repro test to verify that https://www.internalfb.com/sevmanager/view/429861 does not occur again.

I haven't been able to reduce the size of the repro further, if I remove any buffers the error disappears!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133376
Approved by: https://github.com/eellison
2024-08-14 06:55:05 +00:00
fdc4d6fe96 [inductor] Refactor fusion of inplace operations (#130835)
Resubmit of #128979

`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
2024-07-25 20:29:01 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
6dc64026cb Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046)
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies; `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However in the foreach case and this case specifically it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses the `buf394`, as a result we would hit the unbound local error.

@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)

To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.

Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
2024-07-08 18:25:16 +00:00
7b910285db Revert "[inductor] Refactor fusion of inplace operations (#128979)"
This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b.

Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))
2024-06-24 15:29:40 +00:00
72e3aca227 [inductor] Refactor fusion of inplace operations (#128979)
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979
Approved by: https://github.com/lezcano
ghstack dependencies: #129082, #129083
2024-06-22 12:38:22 +00:00
cb69c51b6f Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)"
This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb.

Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))
2024-05-20 12:14:22 +00:00
cf35a591b9 Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.

Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s

OK (skipped=9)

>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s

OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.

I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```

Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
2024-05-20 06:20:45 +00:00
7ed67cdbcc Add compile time smoketest for foreach (#126136)
Fixes [T175425693](https://www.internalfb.com/intern/tasks/?t=175425693)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126136
Approved by: https://github.com/yanboliang
2024-05-14 21:00:55 +00:00
812534d27e Skip two LR schedulers with eager memory leaks in compiled optim tests (#126133)
SequentialLR and ChainedLR leak memory, so disable these two schedulers until https://github.com/pytorch/pytorch/issues/126131 is fixed.

Re-enables
https://github.com/pytorch/pytorch/issues/125925
https://github.com/pytorch/pytorch/issues/125925
https://github.com/pytorch/pytorch/issues/125924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126133
Approved by: https://github.com/yanboliang, https://github.com/aorenste
2024-05-14 04:42:34 +00:00
1b1b18a7a4 Add LRScheduler Composability E2E Tests (#125653)
adds tests to verify the LRSchedulers correctly update the compiled optimizers without recompiles.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125653
Approved by: https://github.com/yanboliang
ghstack dependencies: #123751, #123752, #123753, #125383
2024-05-09 00:52:43 +00:00
8c9c169b48 LRScheduler composability kernel tests (#125383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125383
Approved by: https://github.com/eellison
ghstack dependencies: #123751, #123752, #123753
2024-05-09 00:52:43 +00:00
787afc5180 Add LR as tensor tests (#123750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750
Approved by: https://github.com/janeyx99
2024-05-01 04:46:49 +00:00
d88fcb86d8 Enable dynamo traced test_forloop_goes_right_direction (#123322)
Removed a bunch of skips, I also updated test_forloop_goes_right_direction to *not* use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322
Approved by: https://github.com/janeyx99
ghstack dependencies: #123498
2024-04-18 00:50:10 +00:00
2ac99d539b Only initialize state if needed in SGD (#123757)
Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757
Approved by: https://github.com/janeyx99
2024-04-11 08:56:06 +00:00
7c23fed12c Move step to cpu if state is already initialized (#123618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123618
Approved by: https://github.com/anijain2305
ghstack dependencies: #123496, #123497, #123551, #123552
2024-04-09 09:04:18 +00:00
d9d25076fe Reduce guards of optimizer state dict to guard once per param group (#123413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123413
Approved by: https://github.com/anijain2305
2024-04-06 00:12:59 +00:00
512759a3d7 Fix for tensor attribute missing (#123313)
Tensors would sometimes be realized after we already registered attrs on the root nn module. This ensures all stack values are realized before registering attrs on the root nn module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123313
Approved by: https://github.com/anijain2305
2024-04-04 21:11:04 +00:00
3e2b7e6052 [dynamo][guard overhead] Data ptr guard optimizer state tensors (#122858)
Stricter (but faster) guarding on optimizer state tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122858
Approved by: https://github.com/anijain2305
2024-04-03 21:42:06 +00:00