pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Yuanyuan Chen	e925dfcc6b	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang, https://github.com/mlazos	2025-10-17 07:27:11 +00:00
PyTorch MergeBot	5d7360bb03	Revert "Enable all SIM rules except disabled ones (#164645 )" This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911. Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))	2025-10-05 19:32:21 +00:00
Yuanyuan Chen	321e602692	Enable all SIM rules except disabled ones (#164645 ) `SIM` rules are useful for simplifying boolean expressions and enhances code readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645 Approved by: https://github.com/ezyang	2025-10-05 07:38:25 +00:00
Michael Lazos	75de5b65b4	[Dynamo] Don't guard data ptrs by default with mark_static_address (#162208 ) Fixes https://github.com/pytorch/pytorch/issues/156377 Since we now re-record cudagraphs, it's not necessary to guard by default anymore and induce a full recompile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162208 Approved by: https://github.com/anijain2305	2025-09-12 07:15:10 +00:00
xinan.lin	07f76517e7	[Inductor][WIndows] Fix Windows test case failure. (#161497 ) Fixes windows test case failures: - TritonCodeGenTests.test_inductor_sequence_nr - TritonCodeGenTests.test_indirect_device_assert - CompiledOptimizerTests.test_static_address_finalizer Pull Request resolved: https://github.com/pytorch/pytorch/pull/161497 Approved by: https://github.com/jansel	2025-08-28 12:40:42 +00:00
Chuanhao Zhuge	74280d0913	[muon] Introduce Muon optimizer to PyTorch (#160213 ) A single-device version of Muon. Algorithm refers Keller Jordan's [Muon blogpost](https://kellerjordan.github.io/posts/muon/), and optionally incorporates [Moonshot's](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf) learning rate adjustment strategy. This implementation maintains a minimalist API and is consistent with other optimizer conventions. PyTorch team prefers to handle parameter filtering at a higher level, with the Muon optimizer performing only the msign computation for orthogonalization on all parameters it receives. Users are responsible for grouping parameters for different optimizers as needed. An example usage is shown below, and a more detailed example will be added to the [PyTorch examples](https://github.com/pytorch/examples) directory. Usage ```python model = MyModelForCausalLM # filter out your params manually muon_params = [...] adamw_params = [...] muon = Muon( params = muon_params lr=lr, wd=wd, ) adamw = AdamW( params = adamw_params lr=lr, wd=wd, ) # in training loop loss = model(input) loss.backward() muon.step() adamw.step() muon.zero_grad() adamw.zero_grad() ``` ~~Additional usage~~ ~~Users are also able to pass in self-defined `msign` function for orthogonalization, and learning rate adjustment function. Interface defined below:~~ ```python ~~AdjustLrFn: TypeAlias = Callable[[float, torch.Size], float]~~ ~~MsignFn: TypeAlias = Callable[[Tensor, BaseMsignFnConfig], Tensor]~~ ``` As discussed with team and in comment, we prefer to make the interface simpler and cleaner, thus we removed the callback interface, and canonicalize the original NS algorithm for Muon. The only configs available to users are `ns_steps`, `coefficients`, and `eps`, configurable through kwargs. By default, we use 5-step Newton-Schulz, with coefficients proposed by [Keller](https://kellerjordan.github.io/posts/muon/). We use LR adjustment proposed by [Moonshot](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf), which grafts learning rate from AdamW. Testing ~~1. Unit tests: the newly introduced Muon is covered in `test/test_optim.py`. We updated the test cases to pass named parameters to the optimizer under test. Additionally, we introduced a new test case to verify that when the user provides an empty FQN list, Muon correctly falls back to AdamW behavior.~~ As discussed, in order not to complicate the codebase, we prefer not to include reference implementation into PyTorch. We also updated the interface so we don't need to test the FQN based filtering. Muon is covered by the existing `test_optim.py` unit test. 2. End-to-end test: we added a training script that pre-trains a QWEN-like model on `openwebtext-100k` dataset. We trained for one epoch and the resulting loss curve is compared against the Moonshot implementation to confirm behavioral consistency. <img width="1102" height="472" alt="Screenshot 2025-07-29 at 1 04 12 AM" src="https://github.com/user-attachments/assets/ceab0733-497d-4070-8032-02ae7995c64c" /> Numerics We evaluate our implementation with existing implementation to confirm numerical consistency. As discussed, our implementation closely follows the algorithm described in [Keller's post](https://kellerjordan.github.io/posts/muon/), while incorporating the learning rate adjustment from [Moonlight](https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf). This captures a key insight that allows users to reuse hyper-parameters tuned for `adamW`, making Muon a drop-in swap. As expected, the numerics difference mainly comes from `adjust_lr`, a max of ~5% relative diff in an example unit test setup below. ```python # dummy model and data model0 = Linear(10, 10, bias=False) model1 = copy.deepcopy(model0) inputs = torch.randn(8, 10) targets = torch.randn(8, 10) loss = MSELoss() lr = 1e-3 wd = 0.1 momentum = 0.95 opt_ref_muon = KellySingleDeviceMuon( params=model0.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) opt_exp_muon = Muon( params=model1.parameters(), lr=lr, weight_decay=wd, momentum=momentum, ) out_ref = model0(inputs) loss_ref = loss(out_ref, targets) opt_ref_muon.zero_grad() loss_ref.backward() opt_ref_muon.step() out_exp = model1(inputs) loss_exp = loss(out_exp, targets) opt_exp_muon.zero_grad() loss_exp.backward() opt_exp_muon.step() for p_ref, p_exp in zip(model0.parameters(), model1.parameters()): torch.testing.assert_close(p_ref, p_exp) ``` As explained above, including this `adjust_lr` is preferable. This is validated by an e2e training runs on training a qwen-2-like 0.5b model, where the curves show that training with `adjust_lr` converges more effectively than without. <img width="1179" height="464" alt="Screenshot 2025-08-18 at 10 12 33 AM" src="https://github.com/user-attachments/assets/e797d3da-c2f0-4187-b99e-5d48b7437c3c" /> Performance Training for one epoch of openwebtext-100k on eight H100 GPUs with DDP: - adamw_ddp finishes in 13.12 min - pytorch_muon_ddp finishes in 13.45 min Muon runs ~20s slower compared to AdamW. Assuming no other changes, Muon is 2.5% slower than AdamW. AdamW: Optimizer.step() takes ~13.5 ms, step time ~930 ms <img width="726" height="590" alt="Screenshot 2025-07-29 at 1 56 14 AM" src="https://github.com/user-attachments/assets/ebcd7e1c-d129-4b20-9396-39f568edf03d" /> Muon: Optimizer.step() takes ~54 ms, step time ~960 ms <img width="751" height="597" alt="Screenshot 2025-07-29 at 2 02 20 AM" src="https://github.com/user-attachments/assets/72f5b904-ebd5-4502-a6ff-d3e9e5a6da81" /> Note We restrict the implementation to accept only 2D parameters. An alternative approach is to allow parameters with more than two dimensions and apply orthogonalization over the last two dimensions. We opt not to go with this approach as it can be error-prone. For example, with a kernel shaped `[in_channel, height, width, out_channel]`, applying orthogonalization to the last two dimensions is not meaningful. Since Muon is designed to operate orthogonalization on 2D matrices, preserving this assumption keeps the implementation clean and sound. Next Steps 1. Add `MuP` 2. Open-source optimized triton kernel for symmetric matmul. A preliminary benchmark found 1.23x - 1.48x speedup on small - large (n = 256 -> 16384) matrices. 3. Open-source unsharded Muon co-designed with FSDP2. **** Pull Request resolved: https://github.com/pytorch/pytorch/pull/160213 Approved by: https://github.com/janeyx99	2025-08-24 08:03:04 +00:00
ghostspiders	af10f1f86c	Fix requires_cuda to requires_cuda_and_triton (#160222 ) Fixes ##159399 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160222 Approved by: https://github.com/janeyx99	2025-08-10 07:05:52 +00:00
zeshengzong	cf0749c92f	Use expecttest in test_compiled_optimizers.py (#155308 ) Fixes #141262 ## Test Result ```bash pytest test/inductor/test_compiled_optimizers.py -vv ``` ![image](https://github.com/user-attachments/assets/1886fb71-ff05-46e7-988c-82d36358a834) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155308 Approved by: https://github.com/mlazos, https://github.com/msaroufim Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>	2025-06-27 06:29:51 +00:00
Oguz Ulgen	a2a75be0f8	Rename inductor cache (#156128 ) Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan	2025-06-17 03:57:18 +00:00
rzou	fea718f062	[BaseHOP] change hop(subgraph, operands) to hop(subgraph, *operands) (#146730 ) Our three main users are OK with this, with two of them (foreach_map, invoke_quant) prefering it like this. I was originally worried about BC issues (this now means you cannot add any positional args) but I think that's not a concern -- one can always add kwonly args. Test Plan - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146730 Approved by: https://github.com/ydwu4, https://github.com/mlazos	2025-02-20 02:30:36 +00:00
Michael Lazos	deb1da15cc	[foreach_map] Add foreach_map Adam impl to compiled optimizer tests (#143454 ) Adds a foreach_map backed Adam to compiled optimizer tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/143454 Approved by: https://github.com/Chillee, https://github.com/eellison	2024-12-19 03:16:47 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Yuanhao Ji	6bbbb08458	[Dynamo] Replace `torch._dynamo.optimize()` with `torch.compile()` [10/N] (#142451 ) > This is the last one related commits: - #139706 - #140238 - #140247 - #140253 - #140663 - #140688 - #140922 - #140924 - #140933 - #142451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/142451 Approved by: https://github.com/bdhirsh	2024-12-17 12:18:29 +00:00
PyTorch MergeBot	ad37afd590	Revert "Always unspecialize float in OSS (#138922 )" This reverts commit ba5253da9b30ed4d998cee1d865f92b2c27d3086. Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))	2024-11-26 00:03:03 +00:00
Bob Ren	ba5253da9b	Always unspecialize float in OSS (#138922 ) Fixes https://github.com/pytorch/pytorch/issues/107277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-11-24 01:58:13 +00:00
PyTorch MergeBot	a8c90e5140	Revert "Always unspecialize float in OSS (#138922 )" This reverts commit 6d779d05492813da1c19ac0c562d0d5f8473f27e. Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is some slow tests failing after this land ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2495076878))	2024-11-22 23:18:36 +00:00
Bob Ren	6d779d0549	Always unspecialize float in OSS (#138922 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922 Approved by: https://github.com/ezyang Co-authored-by: Edward Z. Yang <ezyang@meta.com>	2024-11-22 17:54:42 +00:00
Michael Lazos	1fd4757fdc	Support tensor betas in Adam and AdamW (#134171 ) Adds support for beta1 and beta2 to be wrapped in tensor for Adam and AdamW. Fixes https://github.com/pytorch/pytorch/issues/133898 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134171 Approved by: https://github.com/janeyx99	2024-11-15 21:55:55 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As state in issue, replace `clone.detach` by `detach.clone` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Yu, Guangye	0efa590d43	[CI] Fix XPU CI failure (#138548 ) # Motivation Fix https://github.com/pytorch/pytorch/issues/138577. # Solution 1. All UTs in `test/inductor/test_compiled_optimizers.py` are fixed by https://github.com/pytorch/pytorch/pull/134170 2. UT in `test/inductor/test_pattern_matcher.py` is introduced by https://github.com/pytorch/pytorch/pull/138089, we will skip this UT due to the unsupported feature `max_autotune_gemm_backends:Triton`. 3. We have a new impl related to `histc`, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py` 4. We support `avg_pool3d` for `fp16` data type, so we remove the expected failure from `test/inductor/test_torchinductor_opinfo.py` 5. CUDA-bias code is introduced by https://github.com/pytorch/pytorch/issues/138472, we just generalize it to `GPU_TYPE`. # Additional Context > Why update torch-xpu-ops commit pin here? We have to update commit pin to avoid the build failure raised by the code change [C10_UNUSED](https://github.com/pytorch/pytorch/pull/138364). > What does the feature of torch-xpu-ops update? 1. Add some foreach ops, like `unary ops` and `foreach_clamp_max` etc; 2. Add some maxpool ops forward and backward, like `averge_pool3d` and `max_pool3d` 3. Add some other ops, like `log_normal_`, `index_copy`, and `mode` etc; 4. fix build failure related to `C10_UNUSED`; Pull Request resolved: https://github.com/pytorch/pytorch/pull/138548 Approved by: https://github.com/malfet, https://github.com/EikanWang	2024-10-24 07:56:26 +00:00
Shunting Zhang	0a38c0ec89	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion get kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings . I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-22 00:50:00 +00:00
PyTorch MergeBot	ac7f52b301	Revert "[inductor] add a threshold for membw saving during fusion (#136782 )" This reverts commit 6647320de2077c10309f5025a007d51c7fb542d8. Reverted https://github.com/pytorch/pytorch/pull/136782 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_memory starts to fail after this lands in trunk ([comment](https://github.com/pytorch/pytorch/pull/136782#issuecomment-2423549196))	2024-10-19 03:43:42 +00:00
Shunting Zhang	6647320de2	[inductor] add a threshold for membw saving during fusion (#136782 ) Fix https://github.com/pytorch/pytorch/issues/133242 . In that issue, inductor fuses 2 nodes because they access the same scalar tensor. This saving is very small (4 bytes), and if we ignore that, by default, we can not fuse. But if loop ordering after fusion get kicked in, we can reorder loops and fuse those 2 nodes. We get 33% memory bandwidth savings . I think adding a threshold for membw saving in general is not bad. I'll run a perf test. ( https://github.com/pytorch/pytorch/actions/runs/11375421752 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136782 Approved by: https://github.com/jansel	2024-10-19 00:22:43 +00:00
Michael Lazos	0b2c12cb4d	Support more foreach ops for tensor beta support (#134170 ) Add more foreach ops so we don't have fallbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134170 Approved by: https://github.com/eellison	2024-10-17 17:51:31 +00:00
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
Michael Lazos	16f119e62a	Update compiled optimizer tests for tensor betas (#134169 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169 Approved by: https://github.com/anijain2305, https://github.com/eellison ghstack dependencies: #134166, #134167, #134168	2024-08-31 10:24:39 +00:00
Stonepia	4fcd15a667	Fix test_sgd_weight_decay_xpu accuracy error (#134744 ) Fixes #134743 This PR adds `test_sgd_weight_decay_xpu` in `KERNEL_COUNT_OVERRIDES` to override. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744 Approved by: https://github.com/EikanWang, https://github.com/desertfire	2024-08-29 15:12:40 +00:00
Chien-Lin Chen	40de63be09	parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749 ) Fixes #123451 This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127. The test failure is fixed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749 Approved by: https://github.com/janeyx99	2024-08-28 16:34:06 +00:00
Li, Xingyuan	dcfa415e6e	[Inductor UT] Reuse inductor UT for intel GPU `test/inductor/test_compiled_optimizers.py` (#133083 ) [Inductor UT] Reuse Inductor test case for Intel GPU. Reuse `test/inductor/test_compiled_optimizers.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083 Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos	2024-08-17 01:15:26 +00:00
Michael Lazos	a6413d2924	Regression test for S429861 (#133376 ) Adds repro test to verify that https://www.internalfb.com/sevmanager/view/429861 does not occur again. I haven't been able to reduce the size of the repro further, if I remove any buffers the error disappears! Pull Request resolved: https://github.com/pytorch/pytorch/pull/133376 Approved by: https://github.com/eellison	2024-08-14 06:55:05 +00:00
Peter Bell	fdc4d6fe96	[inductor] Refactor fusion of inplace operations (#130835 ) Resubmit of #128979 `WeakDep`s force readers to have completed before a mutation overwrites the buffer, but we want to allow fusions to occur for inplace mutations where the same index is read and written. Currently this is achieved by: 1. Identifying the buffers used by the mutating op in its `dep_closure` 2. Not creating `WeakDep`s for buffers in the `dep_closure` 3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical` So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup. This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to `can_fuse_vertical` which selectively allows inplace operation to fuse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835 Approved by: https://github.com/lezcano	2024-07-25 20:29:01 +00:00
Xuehai Pan	134bc4fc34	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 07:49:19 +00:00
PyTorch MergeBot	b732b52f1e	Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 )" This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d. Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))	2024-07-18 06:39:58 +00:00
Xuehai Pan	aecc746fcc	[BE][Easy][12/19] enforce style for empty lines in import segments in `test/i*/` (#129763 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763 Approved by: https://github.com/jansel	2024-07-18 05:13:41 +00:00
Michael Lazos	6dc64026cb	Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046 ) In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies; `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However in the foreach case and this case specifically it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses the `buf394`, as a result we would hit the unbound local error. @eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092) To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning. Co-authored-by: David Berard <dberard@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046 Approved by: https://github.com/eellison	2024-07-08 18:25:16 +00:00
PyTorch MergeBot	7b910285db	Revert "[inductor] Refactor fusion of inplace operations (#128979 )" This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b. Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))	2024-06-24 15:29:40 +00:00
Peter Bell	72e3aca227	[inductor] Refactor fusion of inplace operations (#128979 ) `WeakDep`s force readers to have completed before a mutation overwrites the buffer, but we want to allow fusions to occur for inplace mutations where the same index is read and written. Currently this is achieved by: 1. Identifying the buffers used by the mutating op in its `dep_closure` 2. Not creating `WeakDep`s for buffers in the `dep_closure` 3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical` So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup. This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to `can_fuse_vertical` which selectively allows inplace operation to fuse. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979 Approved by: https://github.com/lezcano ghstack dependencies: #129082, #129083	2024-06-22 12:38:22 +00:00
PyTorch MergeBot	cb69c51b6f	Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 )" This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb. Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))	2024-05-20 12:14:22 +00:00
jayanth domalapalli	cf35a591b9	Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127 ) This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure. Lintrunner passed: ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Tests passed: ``` >python test_cuda.py -k test_graph_optims Ran 19 tests in 7.463s OK (skipped=9) >python test_cuda.py -k test_graph_scaling_fused_optimizers Ran 6 tests in 2.800s OK (skipped=3) ``` Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers. I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called: ``` {'foreach': False, 'maximize': True, 'weight_decay': 0} { 'foreach': True, 'maximize': True, 'weight_decay': 0} ``` I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned: ``` {'amsgrad': True} ``` Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127 Approved by: https://github.com/janeyx99	2024-05-20 06:20:45 +00:00
Michael Lazos	7ed67cdbcc	Add compile time smoketest for foreach (#126136 ) Fixes [T175425693](https://www.internalfb.com/intern/tasks/?t=175425693) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126136 Approved by: https://github.com/yanboliang	2024-05-14 21:00:55 +00:00
Michael Lazos	812534d27e	Skip two LR schedulers with eager memory leaks in compiled optim tests (#126133 ) SequentialLR and ChainedLR leak memory, so disable these two schedulers until https://github.com/pytorch/pytorch/issues/126131 is fixed. Re-enables https://github.com/pytorch/pytorch/issues/125925 https://github.com/pytorch/pytorch/issues/125925 https://github.com/pytorch/pytorch/issues/125924 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126133 Approved by: https://github.com/yanboliang, https://github.com/aorenste	2024-05-14 04:42:34 +00:00
Michael Lazos	1b1b18a7a4	Add LRScheduler Composability E2E Tests (#125653 ) adds tests to verify the LRSchedulers correctly update the compiled optimizers without recompiles. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125653 Approved by: https://github.com/yanboliang ghstack dependencies: #123751, #123752, #123753, #125383	2024-05-09 00:52:43 +00:00
Michael Lazos	8c9c169b48	LRScheduler composability kernel tests (#125383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125383 Approved by: https://github.com/eellison ghstack dependencies: #123751, #123752, #123753	2024-05-09 00:52:43 +00:00
Michael Lazos	787afc5180	Add LR as tensor tests (#123750 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123750 Approved by: https://github.com/janeyx99	2024-05-01 04:46:49 +00:00
Michael Lazos	d88fcb86d8	Enable dynamo traced test_forloop_goes_right_direction (#123322 ) Removed a bunch of skips, I also updated test_forloop_goes_right_direction to not use the closure when dynamo is tracing. The reason for this is that testing the disabled optimizer doesn't actually test anything. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123322 Approved by: https://github.com/janeyx99 ghstack dependencies: #123498	2024-04-18 00:50:10 +00:00
Michael Lazos	2ac99d539b	Only initialize state if needed in SGD (#123757 ) Fixes [T184381726](https://www.internalfb.com/intern/tasks/?t=184381726) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123757 Approved by: https://github.com/janeyx99	2024-04-11 08:56:06 +00:00
Michael Lazos	7c23fed12c	Move step to cpu if state is already initialized (#123618 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123618 Approved by: https://github.com/anijain2305 ghstack dependencies: #123496, #123497, #123551, #123552	2024-04-09 09:04:18 +00:00
Michael Lazos	d9d25076fe	Reduce guards of optimizer state dict to guard once per param group (#123413 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123413 Approved by: https://github.com/anijain2305	2024-04-06 00:12:59 +00:00
Michael Lazos	512759a3d7	Fix for tensor attribute missing (#123313 ) Tensors would sometimes be realized after we already registered attrs on the root nn module. This ensures all stack values are realized before registering attrs on the root nn module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123313 Approved by: https://github.com/anijain2305	2024-04-04 21:11:04 +00:00
Michael Lazos	3e2b7e6052	[dynamo][guard overhead] Data ptr guard optimizer state tensors (#122858 ) Stricter (but faster) guarding on optimizer state tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/122858 Approved by: https://github.com/anijain2305	2024-04-03 21:42:06 +00:00

1 2

99 Commits