pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Maggie Moss	d795fb225a	[RFC] Add pyrefly to lintrunner (#165179 ) This adds pyrefly to lintrunner as a warning only, allowing us to collect feedback about the tool before switching to pyrefly as the main type checker. References the steps outlined here: https://github.com/pytorch/pytorch/issues/163283 Test plan: run `lintrunner init`, then `lintrunner`, and confirm that when pyrefly errors are present the results look like: https://gist.github.com/maggiemoss/e6cb2d015dd1ded560ae1329098cf33f Pull Request resolved: https://github.com/pytorch/pytorch/pull/165179 Approved by: https://github.com/ezyang	2025-10-16 20:07:09 +00:00
IvanKobzarev	7d87d7052e	[inductor][bucketing] Fx collectives bucketing of multiple dtypes (#162470 ) Bucketing of multiple dtypes to be processed in one bucketed collective. First target is to bucket bf16 and f32, but already can be used with other dtypes. For now multidtype bucketing is only supported with "custom_ops" mode. Non custom_ops needs additional work on inductor side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162470 Approved by: https://github.com/eellison	2025-10-16 18:31:43 +00:00
IvanKobzarev	9272437cde	Fx collectives bucketing: add bucket all_reduce (#165351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165351 Approved by: https://github.com/eellison	2025-10-16 13:27:33 +00:00
Oguz Ulgen	5d0b22008d	Codemod inductor/fx_passes from Optional to union none (#165606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165606 Approved by: https://github.com/aorenste ghstack dependencies: #165604, #165605	2025-10-16 04:59:47 +00:00
eellison	b3f6d49b69	Overlap scheduler improvements (#165318 ) Bucketing: a number of smallish improvements: - Account for bucketing in overlap calculation: if an in-flight collective exists with the same bucket key, reduce the new collective's estimated time by its latency time - Update compute domination so we are ordering based on compute idx, as opposed to compute depth, so we never reorder compute. This makes it a bit easier to reason about memory and pre-fetching, although we can explore reordering in the future. - When we wait on a collective, force all collectives on the same process group that were enqueued prior to it to wait as well. Better Memory Handling: - Pre-fetch limiting - when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which are typically memory reducing). - When we are above peak memory, schedule waits. TODO: - for each compute node, we know its original memory in the graph. We could limit pre-fetching that goes across peak memory - By scheduling off-path collectives for overlap, we reduce memory, but if there isn't enough compute for overlap, we need to proactively schedule them. Not an issue yet on examples. - Make some hard-coded constants configurable, clean up enablement (can do in subsequent PR) On small llama 2d backward: 578 of 618 potentially hideable collectives hidden, original mem 14.4GB, rescheduled mem 15.9GB. On forward: 254/256 potentially hideable collectives hidden, original mem 5.8GB, rescheduled mem 5.8GB. WIP: adding tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/165318 Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev ghstack dependencies: #164738, #164783, #164944, #164945, #165059	2025-10-15 21:58:47 +00:00
eellison	7d59e37434	Add Comm-Compute Preserving Bucketer (#163960 ) tl;dr performs bucketing while preserving comm-compute overlap. In comm-compute overlap we will have a graph with: ``` def foo(...): ag = all_gather(...) hiding_compute = mm(...) wait(ag) ``` There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap. Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set. We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in. TODO: - need to instrument fx graph so inductor respects these relationships. - the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through - more memory aware handling Pull Request resolved: https://github.com/pytorch/pytorch/pull/163960 Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev ghstack dependencies: #163215, #163754, #163959	2025-09-30 04:53:58 +00:00
eellison	0b2fdc30a2	refactor bucketing (#163754 ) Preparatory refactor Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754 Approved by: https://github.com/IvanKobzarev ghstack dependencies: #163215	2025-09-30 04:53:58 +00:00
PyTorch MergeBot	b28e4f1f87	Revert "refactor bucketing (#163754 )" This reverts commit e1bd5b60cf243d3a026a6c89733488a6d9d4b33d. Reverted https://github.com/pytorch/pytorch/pull/163754 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see `c9b5af9a38 (51526707590-box)` ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))	2025-09-29 21:53:42 +00:00
eellison	e1bd5b60cf	refactor bucketing (#163754 ) Preparatory refactor Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754 Approved by: https://github.com/IvanKobzarev ghstack dependencies: #163215	2025-09-29 18:32:41 +00:00
IvanKobzarev	8ec01f34e9	[bucketing] custom_ops mode to hide inductor copies overhead (#161499 ) Adding "_custom_ops" bucketing to temporarily fall back to eager execution of for_each, to work around too many generated kernels on the inductor side. This PR also reverts parts of the bucketing changes for cycle detection that resulted in accuracy problems. Differential Revision: [D81152293](https://our.internmc.facebook.com/intern/diff/D81152293) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161499 Approved by: https://github.com/eellison	2025-09-08 20:03:08 +00:00
IvanKobzarev	595987d28d	[bucketing] allow convert_element_type after fsdp reduce_scatter (#161159 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161159 Approved by: https://github.com/eellison	2025-08-22 06:41:50 +00:00
eellison	b708966201	Fix bucketing introducing cycles (#160967 ) We were just looking at direct arguments, but not transitive dependencies. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160967 Approved by: https://github.com/IvanKobzarev	2025-08-20 19:38:46 +00:00
IvanKobzarev	f33ce40bc0	[bucketing] Bucket only adjacent collectives to prevent reordering (#159983 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/159983 Approved by: https://github.com/wconstab, https://github.com/eellison	2025-08-12 11:57:00 +00:00
Francisco Massa	9a680e14b7	[bucketing] Reduce CPU overhead for reduce_scatter_merge_fn_to_trace (#159723 ) The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which was adding a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`. This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723 Approved by: https://github.com/IvanKobzarev	2025-08-03 09:16:55 +00:00
Francisco Massa	d2792f51b2	[bucketing] Use max of input/output size for bucketing (#159717 ) The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PR proposes to use the max of the input and output sizes instead, so that one can use the same bucket_size value for both passes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717 Approved by: https://github.com/wconstab	2025-08-02 22:42:22 +00:00
IvanKobzarev	8aebf01287	[bucketing] Rewrite all_gather, reduce_scatter passes via tracing merge_fn (#158663 ) Rewriting bucketing of all_gather and reduce_scatter by defining a "merge graph" via a torch function: `all_gather_merge_fn_to_trace`, `reduce_scatter_merge_fn_to_trace` (instead of creating nodes and doing FakeTensor prop manually). This allows experimenting with the merge function. Used foreach_copy_ in the merging function for all_gather - added lowering for inductor for `foreach_copy_`. Adding topological sort after bucketing passes (comment in post_grad.py): ``` # Fx collectives bucketing passes require topological sort for the cases: # when bucketed collectives have users before the last collective in the bucket # AND when inputs of bucketed collective have ancestors after the first collective in the bucket. # # In this case we can not manually pick the place for bucketed collective insertion. # But we are guaranteed by the bucketing (independent collectives in the bucket), # that it is possible to reorder nodes to satisfy all ordering requirements. # # --- before bucketing --- # in0 = ... # wait_ag0 = ag(in0) # user0(wait_ag0) # ... # pre_in1 = ... # in1 = transform(pre_in1) # wait_ag1 = ag(in1) # user1(wait_ag1) # # --- after bucketing --- # # in0 = ... # user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective. # # pre_in1 = ... # in1 = transform(pre_in1) # ag_bucket(in0+in1) # wait_bucket # wait_ag0 = wait_bucket[0] # wait_ag1 = wait_bucket[1] # user1(wait_ag1) ``` Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel: [two loss-curve screenshots, dated 2025-07-22] Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663 Approved by: https://github.com/wconstab	2025-07-25 22:49:51 +00:00
IvanKobzarev	371ffaf415	[bucketing] Support case of several pgs in graph (#158632 ) Main changes: - bucketing collectives only from the same process_group by group_name - Support of groups like [0,2,4,6], [0,1,3,5] using `rank_idx_dict` for in pass operations for slice idxs etc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632 Approved by: https://github.com/wconstab	2025-07-22 14:50:39 +00:00
IvanKobzarev	8dff457f42	[simple_fsdp] Port fx pass to bucket reduce_scatters (#157780 ) Porting fx passes for reduce_scatters bucketing (similar to all_gather bucketing) for simple_fsdp and autoparallel testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157780 Approved by: https://github.com/wconstab	2025-07-10 14:04:43 +00:00
IvanKobzarev	7b392bac13	all_gather_bucketing fx pass (#157396 ) Porting passes to bucket all_gathers. The main logic of the pass is done via 1. Searching for all all_gathers from the buckets Copying tests from @wconstab's PR to test compatibility with reordering. The test checks only compatibility: because of (3), the joint all_gather is already scheduled as early as possible, leaving no room for reordering. Pass changes: Using mutation ops to match the performance of fsdp; in the future, the ideal scenario is a purely functional graph, with inductor doing all memory optimizations on its own without mutable ops. Inductor changes: Adding foreach_copy_ lowering Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396 Approved by: https://github.com/wconstab	2025-07-03 22:07:42 +00:00
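
The sizing heuristic described in d2792f51b2 (use the max of input and output size so one bucket_size works for both all_gather and reduce_scatter) can be illustrated with a small sketch. This is not the PyTorch implementation: `Collective`, `size_for_bucketing`, `greedy_buckets`, and the two constructor helpers are hypothetical names, sizes are plain byte counts rather than tensors, and the greedy grouping of adjacent collectives is only an assumed stand-in for the real pass.

```python
from dataclasses import dataclass


@dataclass
class Collective:
    """Hypothetical stand-in for a collective node; sizes are in bytes."""
    name: str
    input_bytes: int
    output_bytes: int


def all_gather(name, shard_bytes, n_gpu):
    # An all_gather output is n_gpu times larger than its input.
    return Collective(name, shard_bytes, shard_bytes * n_gpu)


def reduce_scatter(name, full_bytes, n_gpu):
    # A reduce_scatter output is n_gpu times smaller than its input.
    return Collective(name, full_bytes, full_bytes // n_gpu)


def size_for_bucketing(c):
    # Use the larger side so both collective kinds are measured alike.
    return max(c.input_bytes, c.output_bytes)


def greedy_buckets(collectives, bucket_cap_bytes):
    """Group adjacent collectives until the cap would be exceeded."""
    buckets, current, current_size = [], [], 0
    for c in collectives:
        size = size_for_bucketing(c)
        if current and current_size + size > bucket_cap_bytes:
            buckets.append(current)
            current, current_size = [], 0
        current.append(c)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


MB = 1 << 20
N_GPU = 8
ags = [all_gather(f"ag{i}", 1 * MB, N_GPU) for i in range(3)]
rss = [reduce_scatter(f"rs{i}", 8 * MB, N_GPU) for i in range(3)]

# The same 16 MB cap yields the same grouping for both kinds: [2, 1].
print([len(b) for b in greedy_buckets(ags, 16 * MB)])
print([len(b) for b in greedy_buckets(rss, 16 * MB)])
```

If the size were measured by input only, the all_gathers above would count as 1 MB each and the reduce_scatters as 8 MB each, so the same cap would group them differently depending on the number of GPUs.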

19 Commits
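
The fix in b708966201 ("We were just looking at direct arguments, but not transitive dependencies") can be sketched on a toy graph. This is an illustration under assumptions, not the code from the PR: the graph is a plain dict mapping each node name to the names of its direct inputs, and `transitive_inputs`, `safe_by_direct_args`, and `safe_by_transitive_deps` are hypothetical helpers.

```python
def transitive_inputs(inputs, node):
    """Return every node reachable from `node` by following input edges."""
    seen = set()
    stack = list(inputs.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(inputs.get(cur, ()))
    return seen


def safe_by_direct_args(inputs, a, b):
    # Incomplete check: only looks at each node's direct arguments.
    return a not in inputs.get(b, ()) and b not in inputs.get(a, ())


def safe_by_transitive_deps(inputs, a, b):
    # Merging a and b is safe only if neither depends on the other,
    # directly or through any chain of intermediate nodes.
    return (a not in transitive_inputs(inputs, b)
            and b not in transitive_inputs(inputs, a))


# ag1's input is computed from the result of ag0, via wait0 and mm.
graph = {
    "ag0": ["x"],
    "wait0": ["ag0"],
    "mm": ["wait0"],
    "ag1": ["mm"],
    "ag2": ["y"],
}

print(safe_by_direct_args(graph, "ag0", "ag1"))      # True: misses the chain
print(safe_by_transitive_deps(graph, "ag0", "ag1"))  # False: would form a cycle
print(safe_by_transitive_deps(graph, "ag0", "ag2"))  # True: independent
```

Bucketing `ag0` with `ag1` would make the merged collective both a producer and a consumer of `mm`, which is the cycle the direct-argument check fails to see.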