pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Edward Yang	09cb34c1dc	[RELAND] Always build USE_DISTRIBUTED (#160449 ) and Make distributed modules importable even when backend not built (#159889 ) (#162594 ) Summary: Original: D81957844 and D81957923 Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well #buildall Test Plan: sandcastle and oss ci Rollback Plan: Reviewed By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594 Approved by: https://github.com/H-Huang, https://github.com/dcci	2025-09-22 21:12:18 +00:00
PyTorch MergeBot	f0078941cf	Revert "[RELAND] Always build USE_DISTRIBUTED (#160449 ) and Make distributed modules importable even when backend not built (#159889 ) (#162594 )" This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0. Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530))	2025-09-22 05:39:07 +00:00
Deng, Daisy	c9485f8ff3	[Reland][2/N]Port several test files under test/distributed to Intel GPU (#159473 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - use requires_accelerator_dist_backend to allow both nccl and xccl test - enabled XPU for some test path - Change the hardcoded world_size according to device_count. - Unify some common code under torch/testing/_internal for multiple backend, for example: Added xpu for Backend.backend_capability and dist.Backend.register_backend() Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-17 06:42:27 +00:00
FFFrog	de143bf79b	[C10d] Code clean for torch.distributed.init_process_group (#163038 ) As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163038 Approved by: https://github.com/msaroufim	2025-09-16 08:15:25 +00:00
Edward Yang	6c334885d4	[RELAND] Always build USE_DISTRIBUTED (#160449 ) and Make distributed modules importable even when backend not built (#159889 ) (#162594 ) Summary: Original: D81957844 and D81957923 Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well #buildall Test Plan: sandcastle and oss ci Rollback Plan: Reviewed By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594 Approved by: https://github.com/H-Huang, https://github.com/dcci	2025-09-12 10:54:42 +00:00
PyTorch MergeBot	6b59a19242	Revert "[RELAND] Always build USE_DISTRIBUTED (#160449 ) and Make distributed modules importable even when backend not built (#159889 ) (#162594 )" This reverts commit 6e8f17c58029e5fa6bc222b2445ebbc0cbdc17c7. Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3283985880))	2025-09-12 06:52:03 +00:00
Edward Yang	6e8f17c580	[RELAND] Always build USE_DISTRIBUTED (#160449 ) and Make distributed modules importable even when backend not built (#159889 ) (#162594 ) Summary: Original: D81957844 and D81957923 Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well #buildall Test Plan: sandcastle and oss ci Rollback Plan: Reviewed By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594 Approved by: https://github.com/H-Huang, https://github.com/dcci	2025-09-12 03:56:18 +00:00
PyTorch MergeBot	92f9ed7ac3	Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473 )" This reverts commit fa1d409e83af93425a2672d62e134e8f20c5ccc0. Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an distributed tests ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3282999084))	2025-09-11 23:51:21 +00:00
suo	fe8cc619b8	[torch][c10d] fix split_group in mixed backend case (#162424 ) Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options. However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends. This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead. Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs. As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424 Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang	2025-09-11 16:29:32 +00:00
Deng, Daisy	fa1d409e83	[2/N]Port several test files under test/distributed to Intel GPU (#159473 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - use requires_accelerator_dist_backend to allow both nccl and xccl test - enabled XPU for some test path - Change the hardcoded world_size according to device_count. - Unify some common code under torch/testing/_internal for multiple backend, for example: Added xpu for Backend.backend_capability and dist.Backend.register_backend() Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-11 06:44:26 +00:00
PyTorch MergeBot	d033d11d26	Revert "[torch][c10d] fix split_group in mixed backend case (#162424 )" This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d. Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494))	2025-09-10 20:13:44 +00:00
suo	2dc2613180	[torch][c10d] fix split_group in mixed backend case (#162424 ) Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options. However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends. This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead. Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs. As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424 Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang	2025-09-10 16:59:18 +00:00
Edward Yang	dda071587f	Revert "Make distributed modules importable even when backend not built (#159889 )" (#162568 ) This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9. Revert "Always build USE_DISTRIBUTED. (#160449)" This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568 Approved by: https://github.com/huydhn	2025-09-10 04:29:42 +00:00
Edward Z. Yang	a0d026688c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-08 19:10:36 +00:00
PyTorch MergeBot	29e09a6545	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))	2025-09-08 07:04:36 +00:00
PyTorch MergeBot	ff2de5d522	Revert "[2/N]Port several test files under test/distributed to Intel GPU (#159473 )" This reverts commit 040d00af048967dde7938d358d7f5988cbd18388. Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983))	2025-09-07 21:06:38 +00:00
Codeboi007	c98ddaca6d	Fixed comment to match logic in distributed_c10d.py (#162158 ) inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162158 Approved by: https://github.com/wconstab	2025-09-06 05:37:49 +00:00
Edward Z. Yang	01edcd4df8	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-05 20:15:11 +00:00
PyTorch MergeBot	70f865ac9b	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))	2025-09-05 18:58:47 +00:00
Edward Z. Yang	ef3be6726f	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-04 20:05:50 +00:00
Edward Yang	248355faf5	Don't require FakeStore to be passed into fake backend (#162164 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162164 Approved by: https://github.com/bdhirsh, https://github.com/albanD, https://github.com/wconstab	2025-09-04 16:43:49 +00:00
PyTorch MergeBot	34aa78274d	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))	2025-09-04 13:13:52 +00:00
Deng, Daisy	040d00af04	[2/N]Port several test files under test/distributed to Intel GPU (#159473 ) For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR will work on some test files under test/distributed. We could enable Intel GPU with following methods and try the best to keep the original code styles: - instantiate_device_type_tests() - use "torch.accelerator.current_accelerator()" to determine the accelerator backend - use requires_accelerator_dist_backend to allow both nccl and xccl test - enabled XPU for some test path - Change the hardcoded world_size according to device_count. - Unify some common code under torch/testing/_internal for multiple backend, for example: Added xpu for Backend.backend_capability and dist.Backend.register_backend() Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-09-04 12:53:17 +00:00
Edward Z. Yang	4ae57d448c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-03 07:33:55 +00:00
Ke Wen	9b81fe281d	[c10d] Lessen density of barrier warning (#162015 ) Warnings are great, but too dense when there are many ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162015 Approved by: https://github.com/d4l3k, https://github.com/H-Huang	2025-09-03 02:20:54 +00:00
PyTorch MergeBot	420c52ecf3	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))	2025-09-02 20:24:01 +00:00
Edward Z. Yang	626cb7df81	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-01 23:00:21 +00:00
Howard Huang	82d2d23e85	Add batch option for send/recv_object_list (#160342 ) `send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops which means that they will create 2-rank NCCL communicators between ranks if the communicators have not been initialized. This adds an option `use_batch` which will call the send/recv with `batch_isend_irecv` which will re-use the communicators already initialized for collectives in the group. --- BatchP2P ops, creates (or use existing) communicator keyed by device index Regular P2P Ops, creates (or use existing) dedicated 2-rank communicators keyed by “rank1:rank2” See: `c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342 Approved by: https://github.com/wconstab	2025-08-30 03:29:09 +00:00
frost-intel	9b4adc4db7	[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL (#158568 ) Adds support for FlightRecorder in ProcessGroupXCCL. See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568 Approved by: https://github.com/guangyey, https://github.com/fduwjj	2025-08-22 09:03:35 +00:00
Will Constable	dd22ba09b4	[C10D] Document barrier interaction with device_id (#159389 ) Addresses #159262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389 Approved by: https://github.com/malfet, https://github.com/H-Huang, https://github.com/kwen2501, https://github.com/fduwjj	2025-08-01 18:12:21 +00:00
Junjie Wang (PyTorch)	4defea1e2c	[c10d] Fix setGroupName and setGroupDesc in `group_split` and `merge_remote_group` (#159429 ) Summary: We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it. We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh Also ncclx needs to be aware of that its Option is a subclass of BackendOption Test Plan: CI Rollback Plan: Differential Revision: D79201132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429 Approved by: https://github.com/xunnanxu	2025-07-30 19:55:55 +00:00
fduwjj	67e68e0785	[c10d] Cleanup split_group logic using the newly built splitGroup (#158488 ) with https://github.com/pytorch/pytorch/pull/157716 merged we want to further clean up the code on the python side for `split_group` API. We do need to keep some old global book keeping for bc. The rest of logic is now all in cpp. Regarding the change brought in https://github.com/pytorch/pytorch/pull/152175, we did clean up in https://github.com/pytorch/pytorch/pull/158790 (including internal changes) so that we can safely remove it. Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488 Approved by: https://github.com/d4l3k	2025-07-29 03:27:11 +00:00
Panagiotis Kourdis	fd47401536	[doc] Updates to distributed.md for XCCL backend (#155834 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834 Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>	2025-07-22 21:01:43 +00:00
Dmitry Rogozhkin	b146ca74f0	docs: add get_default_backend_for_device to distributed documentation (#156783 ) `torch.distributed.get_default_backend_for_device()` API was added to torch 2.6, but is still missing in distributed documentation. This commit addresses the gap. CC: @guangyey, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783 Approved by: https://github.com/guangyey, https://github.com/malfet	2025-07-10 05:11:30 +00:00
Will Constable	4b4c2a7b1d	Support complex numbers in DTensor redistribute (#157329 ) Add complex number unwrapping in functional collectives used by DTensor. Complex tensors are not directly supported by underlying comm kernels (e.g. nccl) but complex tensors can be viewed as real tensors of a higher rank (added size-2 tensor dim represents real vs im component). Collective output is then viewed as complex to restore the original/expected shape and dtype. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329 Approved by: https://github.com/XilunWu	2025-07-02 21:37:16 +00:00
Xuehai Pan	4ccc0381de	[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #156313, #156314	2025-06-23 02:57:28 +00:00
PyTorch MergeBot	145d4cdc11	Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 )" This reverts commit c2f0292bd5b4b3206f5b295e96f81cd6c178eb18. Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	c2f0292bd5	[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315 Approved by: https://github.com/Skylion007, https://github.com/albanD ghstack dependencies: #156313, #156314	2025-06-22 08:43:26 +00:00
Ke Wen	0f0c010714	[c10d] init_process_group supports index-only device id (#156214 ) Before: ``` acc = torch.accelerator.current_accelerator() if acc: local_idx = ... dist.init_process_group( device_id=torch.device(acc.type, local_idx) ) ``` After: ``` dist.init_process_group(device_id=local_idx) ``` That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214 Approved by: https://github.com/guangyey, https://github.com/albanD	2025-06-21 00:02:37 +00:00
lzhang2	590fe4d2d7	Skip updating the default device distributed backend if already registered (#155320 ) Motivation: PyTorch maintains a `default_device_backend_map` https://github.com/pytorch/pytorch/blob/main/torch/distributed/distributed_c10d.py#L269 , which indicates the default distributed backend when no backend name is specified in the user frontend (like `init_process_group`). Currently, `"xpu": XCCL` is also in this `default_device_backend_map`. However, if another process group name is registered as an XPU distributed backend, it immediately replaces XCCL in this default map, which is not what we want. Therefore, we would like to skip updating the default distributed backend if one is already registered in the map. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155320 Approved by: https://github.com/guangyey, https://github.com/d4l3k	2025-06-12 21:17:06 +00:00
Tsung-Hsien Lee	a6210fd07b	[c10d] Enhance `get_process_group_ranks()` to accept `group=None` (#154902 ) Summary: This diff enhances the `get_process_group_ranks()` function to accept `group=None` as an optional argument. This allows the function to return all ranks associated with the default process group if no group is specified. Test Plan: contbuild & OSS CI Rollback Plan: Differential Revision: D75817800 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154902 Approved by: https://github.com/wz337	2025-06-11 23:41:03 +00:00
fduwjj	ff92b42fc3	[c10d][gloo] Integrate vendor generic FR into gloo (#152614 ) This is a first quick prototype of FR integration for gloo. A few feature gaps remain: - Input/Output numels for each collective - Whether to use c10::Event or where to use it. - Where to dump the FR traces. (The dump api is provided in this PR) Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152614 Approved by: https://github.com/d4l3k ghstack dependencies: #154929	2025-06-03 16:12:54 +00:00
Tsung-Hsien Lee	cae25ef4e5	[c10d] Enhance Error Logging in `new_subgroups()` for Non-Divisible World Sizes (#154124 ) Summary: The error caused by the world size not being divisible by `group_size` is a common issue encountered by end-users when utilizing applications built on top of `new_subgroups()`. However, these applications may employ different variable names, such as `num_trainers_per_group`, which can make the current error messages less effective despite being correct. To address this, we have improved the error messages to display the actual numbers involved, thereby enhancing their clarity and usefulness. Test Plan: contbuild & OSS CI Differential Revision: D75226925 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154124 Approved by: https://github.com/wz337	2025-05-23 17:12:43 +00:00
Tsung-Hsien Lee	f1f54c197d	[c10d] Simplify `new_subgroups()` by using `new_subgroups_by_enumeration()` (#153843 ) Summary: The code changes in each file of the diff include removing the `subgroups` and `cur_subgroup` variables, and replacing the while loop with a call to `new_subgroups_by_enumeration()`. Test Plan: contbuild & OSS CI Differential Revision: D75007368 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153843 Approved by: https://github.com/Skylion007, https://github.com/wz337	2025-05-20 19:15:20 +00:00
Tsung-Hsien Lee	6487ea30b3	[c10d] Fix `new_subgroups(group=)` bug (#153798 ) Summary: The bug, introduced in https://github.com/pytorch/pytorch/pull/152765, was caused by passing the `group` parameter to the `get_rank()` function, which caused the function to return the rank of the entire group instead of the rank of the current process. The fix involves removing the `group` parameter from the `get_rank()` function call. Test Plan: contbuild & OSS CI Differential Revision: D74964213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/153798 Approved by: https://github.com/Skylion007	2025-05-19 17:01:10 +00:00
Deep Shah	2489b6470b	[c10d] Allow split_group to work with non nccl backends (#152175 ) Summary: Currently things are hardcoded to only work with nccl backend. Extend it to allow NCCL + custom plugin backend. The split-specific methods/attributes have not been added to the base Backend and Options as some of them are specific to backend implementations. Instead, explicit checks have been added to the split_group method for the expected methods and attributes. I am open to making them part of base Backend based if folks prefer. Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/152175 Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501	2025-05-16 00:15:29 +00:00
Tsung-Hsien Lee	dfcfad2112	[c10d] Fix unused `group` input argument in `new_subgroups()` (#152765 ) Summary: This diff fixes an unused input argument [`group`](`8faa225695/torch/distributed/distributed_c10d.py (L5341)`) in the `new_subgroups()` function. Test Plan: contbuild & OSS CI, see Differential Revision: D74132537 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152765 Approved by: https://github.com/wz337	2025-05-07 02:37:51 +00:00
Ke Wen	a8f727c439	[c10d] Fix extra CUDA context created by barrier (#149144 ) Fixes #149119. In ProcessGroup.hpp, we create a dummy tensor for dispatching. This requires a correct device index. This PR uses `device_id` given by user when calling `init_process_group`. This PR also uses `torch._C._get_accelerator()` to determine the device type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144 Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever	2025-05-06 15:27:30 +00:00
PyTorch MergeBot	cc954848d4	Revert "[c10d] Fix extra CUDA context created by barrier (#149144 )" This reverts commit 457fa820ad538c7aeadb68f0ec418d63972ba1ee. Reverted https://github.com/pytorch/pytorch/pull/149144 on behalf of https://github.com/huydhn due to Internal failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/149144#issuecomment-2852564660))	2025-05-05 22:56:50 +00:00
Ke Wen	457fa820ad	[c10d] Fix extra CUDA context created by barrier (#149144 ) Fixes #149119. In ProcessGroup.hpp, we create a dummy tensor for dispatching. This requires a correct device index. This PR uses `device_id` given by user when calling `init_process_group`. This PR also uses `torch._C._get_accelerator()` to determine the device type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149144 Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/cyyever	2025-05-03 03:13:34 +00:00
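The `new_subgroups()` commits above (#153843 and #154124) describe building subgroups by enumeration and reporting the actual numbers when the world size is not divisible by `group_size`. A minimal sketch of that idea in plain Python follows. It is not the code from those PRs: `partition_ranks` is a hypothetical helper, the contiguous-rank layout is an assumption, and the error wording is illustrative.

```python
def partition_ranks(world_size: int, group_size: int) -> list[list[int]]:
    """Hypothetical helper: split ranks 0..world_size-1 into contiguous subgroups.

    Illustration only; this is not the torch.distributed implementation.
    """
    if group_size <= 0:
        raise ValueError(f"group_size must be positive, got {group_size}")
    if world_size % group_size != 0:
        # Include the actual values so the message is useful even when the
        # caller's own variable names differ from `group_size`.
        raise ValueError(
            f"The world size ({world_size}) must be divisible by "
            f"group_size ({group_size})"
        )
    # Each subgroup holds `group_size` consecutive ranks.
    return [
        list(range(start, start + group_size))
        for start in range(0, world_size, group_size)
    ]


if __name__ == "__main__":
    print(partition_ranks(8, 4))
```

A list of rank lists in this shape matches what `new_subgroups_by_enumeration()` takes as its `ranks_per_subgroup_list` argument, which is presumably how the simplification in #153843 delegates the work.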


483 Commits