3255e7872b
Enable all flake8-logging-format rules ( #164655 )
...
These rules are enabled by removing existing suppressions.
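A minimal sketch of the pattern the logging-format (G) rules enforce; the logger and message below are illustrative, not taken from the PR:

```python
import logging

logger = logging.getLogger(__name__)

def report(task: str, attempts: int) -> None:
    # Flagged by logging-format rules (e.g. G004): an f-string is formatted
    # eagerly even when the log level is disabled.
    # logger.info(f"task {task} finished after {attempts} attempts")

    # Preferred: pass arguments and let the logging framework interpolate lazily.
    logger.info("task %s finished after %d attempts", task, attempts)
```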
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99 , https://github.com/mlazos
2025-10-19 00:59:28 +00:00
2e22b1a61e
[pytorch] Composite backend potential fix for is_backend_available ( #165061 )
...
Summary: `is_backend_available` takes in a string and expects it to be only a backend name; if it is given a composite (device:backend) string, it fails.
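A minimal sketch of the failure mode; the normalization helper below is hypothetical and only illustrates stripping the device prefix before the query, not the PR's actual fix:

```python
import torch.distributed as dist

def backend_available(spec: str) -> bool:
    # Hypothetical normalization: a composite spec such as "cuda:nccl" carries a
    # device prefix, so keep only the backend name before querying.
    backend = spec.split(":", 1)[1] if ":" in spec else spec
    return dist.is_backend_available(backend)

backend_available("nccl")       # plain backend name works directly
backend_available("cuda:nccl")  # composite form is normalized first
```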
Reviewed By: prashrock
Differential Revision: D81886736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165061
Approved by: https://github.com/H-Huang
2025-10-17 22:06:36 +00:00
fae74cd52f
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit a032510db38e8331afa08f7635d146f9cefdd0ab.
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767 ))
2025-10-17 18:55:53 +00:00
a032510db3
shrink_group implementation to expose ncclCommShrink API ( #164518 )
...
Closes #164529
To expose the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink ) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator )
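A hedged sketch of how the new API might be used from Python; the function name comes from the PR title, but the module path and keyword argument below are assumptions, not confirmed by this log:

```python
import torch.distributed as dist

# Assumed usage: after a fault-tolerance layer reports dead ranks, the surviving
# ranks shrink the communicator instead of tearing down and re-initializing.
failed_ranks = [3]  # illustrative

# `shrink_group` is named in the PR title; its location and signature here are
# assumptions for illustration only.
new_pg = dist.distributed_c10d.shrink_group(ranks_to_exclude=failed_ranks)
```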
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/Skylion007 , https://github.com/syed-ahmed , https://github.com/kwen2501
2025-10-17 17:55:03 +00:00
d2494cbb2b
Revert "[distributed] Replace assert statements with AssertionError exceptions ( #165216 )"
...
This reverts commit 74db92b21868b7e9e77cc966e5d57a8246723cbd.
Reverted https://github.com/pytorch/pytorch/pull/165216 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_pg_wrapper.py::ProcessGroupNCCLWrapperTest::test_debug_level_detail_no_gloo [GH job link](https://github.com/pytorch/pytorch/actions/runs/18492765290/job/52693842750 ) [HUD commit link](74db92b218), note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165216#issuecomment-3402838765 ))
2025-10-14 17:05:16 +00:00
fbe0d20a17
[2/N] More ruff SIM fixes ( #165031 )
...
This is a follow-up of #164695 to apply ruff SIM rules to more files. Most changes simplify `dict.get` calls, because `None` is already the default value.
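The typical simplification described above, for illustration:

```python
config = {"timeout": 30}

# Before: the explicit default is redundant, since dict.get already returns None.
retries = config.get("retries", None)

# After the SIM fix: identical behavior, one argument shorter.
retries = config.get("retries")
```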
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-14 14:22:54 +00:00
74db92b218
[distributed] Replace assert statements with AssertionError exceptions ( #165216 )
...
Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks raising `AssertionError`, so the checks cannot be disabled with the Python `-O` flag.
Fixes #164878
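The rewrite pattern in question, sketched on a made-up check (the real changes touch `torch.distributed` internals):

```python
def set_world_size(world_size: int) -> None:
    # Before: silently skipped under `python -O`, which strips assert statements.
    # assert world_size > 0, "world_size must be positive"

    # After: the check always runs, regardless of optimization flags.
    if world_size <= 0:
        raise AssertionError("world_size must be positive")
```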
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165216
Approved by: https://github.com/albanD
2025-10-14 09:58:59 +00:00
b8be796a57
Revert "[2/N] More ruff SIM fixes ( #165031 )"
...
This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956.
Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870 ))
2025-10-10 13:42:14 +00:00
70925bdf82
[1/N] Use "is" in python type comparison ( #165037 )
...
It is generally recommended to use `is`/`is not` to compare types. This series of changes applies that suggestion across the code base and aims to eventually enable the related linter checks.
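For illustration, the comparison style this change moves toward:

```python
x = 3

# Flagged form: equality comparison of types.
if type(x) == int:
    ...

# Preferred form: types are compared by identity, which is unaffected by any
# custom __eq__ and is what linters such as E721 expect.
if type(x) is int:
    ...
```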
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
2025-10-10 12:36:50 +00:00
38095fbd13
[2/N] More ruff SIM fixes ( #165031 )
...
This is a follow-up of #164695 to apply ruff SIM rules to more files. Most changes simplify `dict.get` calls, because `None` is already the default value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-10 05:37:46 +00:00
9944cac6e6
Add suppressions to torch/_inductor ( #165062 )
...
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
Split this directory into two PRs to keep them from being too large.
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (6,884 ignored)
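For illustration, step 3 might look like the sketch below; the inline comment syntax is an assumption about pyrefly's suppression convention, and the function is made up, so neither is taken from this PR:

```python
# Hypothetical example of adding a suppression so the file typechecks clean.
def cache_size(config) -> int:
    # pyrefly: ignore  # attribute is resolved dynamically in this code path
    return config.inductor_cache_size
```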
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165062
Approved by: https://github.com/oulgen , https://github.com/mlazos
2025-10-09 20:34:20 +00:00
7457d139c5
Add pyrefly suppressions to torch/distributed (7/n) ( #165002 )
...
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283
One more PR after this one.
Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check
step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199
after:
INFO 0 errors (6,884 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
5d7360bb03
Revert "Enable all SIM rules except disabled ones ( #164645 )"
...
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.
Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351 ))
2025-10-05 19:32:21 +00:00
321e602692
Enable all SIM rules except disabled ones ( #164645 )
...
`SIM` rules are useful for simplifying boolean expressions and enhance code readability.
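A small illustration of the kind of boolean simplification these rules apply:

```python
enabled = True
items = ["a", "b"]

# Before (flagged by SIM rules):
#   if enabled == True: ...
#   result = True if len(items) > 0 else False

# After: equivalent, simpler boolean expressions.
if enabled:
    result = len(items) > 0
```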
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
da003d7b95
[3/N] Import Callable from collections.abc in torch/distributed ( #164104 )
...
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is a follow-up of #164054.
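The mechanical change applied by UP035, for illustration:

```python
# Before (flagged by ruff UP035): typing.Callable is a deprecated alias.
# from typing import Callable

# After: import the ABC directly; it works the same way in annotations.
from collections.abc import Callable

Handler = Callable[[str], None]
```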
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
00059db034
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 09cb34c1dce8fe1b880bbf3115d8ddad3401d871.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/malfet due to reverted internally and now can be safely reverted in OSS ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3334176367 ))
2025-09-25 13:47:46 +00:00
09cb34c1dc
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-22 21:12:18 +00:00
f0078941cf
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530 ))
2025-09-22 05:39:07 +00:00
c9485f8ff3
[Reland][2/N]Port several test files under test/distributed to Intel GPU ( #159473 )
...
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods while trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend (see the sketch after this list)
- use requires_accelerator_dist_backend to allow both nccl and xccl tests
- enable XPU for some test paths
- change the hardcoded world_size according to device_count
- unify some common code under torch/testing/_internal for multiple backends, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
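A hedged sketch of the device-agnostic pattern the list refers to, assuming a recent PyTorch with the `torch.accelerator` and `get_default_backend_for_device` APIs:

```python
import torch
import torch.distributed as dist

# Pick the backend from the current accelerator instead of hard-coding "cuda"/"nccl".
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"
backend = dist.get_default_backend_for_device(device_type)  # e.g. nccl, xccl, gloo

# Derive world_size from the visible devices rather than hard-coding it.
world_size = torch.accelerator.device_count() if acc is not None else 1
```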
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey , https://github.com/d4l3k
2025-09-17 06:42:27 +00:00
de143bf79b
[C10d] Code clean for torch.distributed.init_process_group ( #163038 )
...
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163038
Approved by: https://github.com/msaroufim
2025-09-16 08:15:25 +00:00
6c334885d4
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-12 10:54:42 +00:00
6b59a19242
Revert "[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )"
...
This reverts commit 6e8f17c58029e5fa6bc222b2445ebbc0cbdc17c7.
Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3283985880 ))
2025-09-12 06:52:03 +00:00
6e8f17c580
[RELAND] Always build USE_DISTRIBUTED ( #160449 ) and Make distributed modules importable even when backend not built ( #159889 ) ( #162594 )
...
Summary:
Original: D81957844 and D81957923
Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well
#buildall
Test Plan:
sandcastle and oss ci
Rollback Plan:
Reviewed By: H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang , https://github.com/dcci
2025-09-12 03:56:18 +00:00
92f9ed7ac3
Revert "[2/N]Port several test files under test/distributed to Intel GPU ( #159473 )"
...
This reverts commit fa1d409e83af93425a2672d62e134e8f20c5ccc0.
Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an distributed tests ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3282999084 ))
2025-09-11 23:51:21 +00:00
fe8cc619b8
[torch][c10d] fix split_group in mixed backend case ( #162424 )
...
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl"), but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assertion failure in some user code, where ProcessGroupGloo::split expects gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall back to default options for Gloo if non-gloo options are detected.
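A hedged sketch of the scenario; the `split_group` import path and `split_ranks` keyword follow recent PyTorch but should be treated as assumptions, and the snippet assumes an env:// rendezvous (e.g. launched via torchrun):

```python
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import split_group

# Mixed-backend group: gloo for CPU tensors, nccl for CUDA tensors.
dist.init_process_group(
    backend="cpu:gloo,cuda:nccl",
    device_id=torch.device("cuda", torch.cuda.current_device()),
)

# Before this fix, the parent's NCCL options were propagated into the gloo
# backend during the split and tripped an assertion; with the fix, gloo warns
# and falls back to default options instead.
sub_pg = split_group(split_ranks=[[0, 1]])
```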
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k , https://github.com/fduwjj , https://github.com/H-Huang
2025-09-11 16:29:32 +00:00
fa1d409e83
[2/N]Port several test files under test/distributed to Intel GPU ( #159473 )
...
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods while trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl tests
- enable XPU for some test paths
- change the hardcoded world_size according to device_count
- unify some common code under torch/testing/_internal for multiple backends, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey , https://github.com/d4l3k
2025-09-11 06:44:26 +00:00
d033d11d26
Revert "[torch][c10d] fix split_group in mixed backend case ( #162424 )"
...
This reverts commit 2dc26131801a430e030a773c4fbfe874e263259d.
Reverted https://github.com/pytorch/pytorch/pull/162424 on behalf of https://github.com/clee2000 due to failure seems related, maybe a hang/timeout distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_model_diff_shape_across_ranks log classifier is pointing at the wrong line ([comment](https://github.com/pytorch/pytorch/pull/162424#issuecomment-3276360494 ))
2025-09-10 20:13:44 +00:00
2dc2613180
[torch][c10d] fix split_group in mixed backend case ( #162424 )
...
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl"), but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assertion failure in some user code, where ProcessGroupGloo::split expects gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall back to default options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k , https://github.com/fduwjj , https://github.com/H-Huang
2025-09-10 16:59:18 +00:00
dda071587f
Revert "Make distributed modules importable even when backend not built ( #159889 )" ( #162568 )
...
This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.
Revert "Always build USE_DISTRIBUTED. (#160449 )"
This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn
2025-09-10 04:29:42 +00:00
a0d026688c
Make distributed modules importable even when backend not built ( #159889 )
...
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
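For illustration only (not the PR's literal code), the stubbing idea might look like this: the import succeeds even without the backend compiled in, and the error surfaces only if the missing symbol is actually used:

```python
# Stub a symbol the C++ extension would normally provide when NCCL is built.
try:
    from torch._C._distributed_c10d import ProcessGroupNCCL
except ImportError:
    class ProcessGroupNCCL:  # type: ignore[no-redef]
        def __init__(self, *args, **kwargs):
            raise RuntimeError(
                "ProcessGroupNCCL is not available: PyTorch was built without NCCL"
            )
```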
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
29e09a6545
Revert "Make distributed modules importable even when backend not built ( #159889 )"
...
This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b.
Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053 ) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002 ))
2025-09-08 07:04:36 +00:00
ff2de5d522
Revert "[2/N]Port several test files under test/distributed to Intel GPU ( #159473 )"
...
This reverts commit 040d00af048967dde7938d358d7f5988cbd18388.
Reverted https://github.com/pytorch/pytorch/pull/159473 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal signals, @d4l3k please help the author to have this change landed. [D81718444](https://www.internalfb.com/diff/D81718444 ) ([comment](https://github.com/pytorch/pytorch/pull/159473#issuecomment-3264046983 ))
2025-09-07 21:06:38 +00:00
c98ddaca6d
Fixed comment to match logic in distributed_c10d.py ( #162158 )
...
The comment was inconsistent with the logic introduced in #162157 and modified in #142216. This update ensures the documentation matches the actual behavior of the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162158
Approved by: https://github.com/wconstab
2025-09-06 05:37:49 +00:00
01edcd4df8
Make distributed modules importable even when backend not built ( #159889 )
...
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-05 20:15:11 +00:00
70f865ac9b
Revert "Make distributed modules importable even when backend not built ( #159889 )"
...
This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9.
Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011 ))
2025-09-05 18:58:47 +00:00
ef3be6726f
Make distributed modules importable even when backend not built ( #159889 )
...
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-04 20:05:50 +00:00
248355faf5
Don't require FakeStore to be passed into fake backend ( #162164 )
...
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162164
Approved by: https://github.com/bdhirsh , https://github.com/albanD , https://github.com/wconstab
2025-09-04 16:43:49 +00:00
34aa78274d
Revert "Make distributed modules importable even when backend not built ( #159889 )"
...
This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051.
Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785 ))
2025-09-04 13:13:52 +00:00
040d00af04
[2/N]Port several test files under test/distributed to Intel GPU ( #159473 )
...
For https://github.com/pytorch/pytorch/issues/114850, we will port distributed tests to Intel GPU. This PR works on some test files under test/distributed. We enable Intel GPU with the following methods while trying our best to keep the original code style:
- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- use requires_accelerator_dist_backend to allow both nccl and xccl tests
- enable XPU for some test paths
- change the hardcoded world_size according to device_count
- unify some common code under torch/testing/_internal for multiple backends, for example:
Added xpu for Backend.backend_capability and dist.Backend.register_backend()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159473
Approved by: https://github.com/guangyey , https://github.com/d4l3k
2025-09-04 12:53:17 +00:00
4ae57d448c
Make distributed modules importable even when backend not built ( #159889 )
...
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-03 07:33:55 +00:00
9b81fe281d
[c10d] Lessen density of barrier warning ( #162015 )
...
Warnings are great, but too dense when there are many ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162015
Approved by: https://github.com/d4l3k , https://github.com/H-Huang
2025-09-03 02:20:54 +00:00
420c52ecf3
Revert "Make distributed modules importable even when backend not built ( #159889 )"
...
This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1.
Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982 ))
2025-09-02 20:24:01 +00:00
626cb7df81
Make distributed modules importable even when backend not built ( #159889 )
...
This PR is greatly simplified now that it is stacked on top of a PR that always builds with distributed. We only need to stub functions that may not be defined due to a backend not being enabled.
Signed-off-by: Edward Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-01 23:00:21 +00:00
82d2d23e85
Add batch option for send/recv_object_list ( #160342 )
...
`send_object_list` and `recv_object_list` use regular `send`/`recv` P2P ops, which means they will create 2-rank NCCL communicators between ranks if such communicators have not been initialized.
This adds an option `use_batch`, which issues the send/recv via `batch_isend_irecv` so that the communicators already initialized for collectives in the group are reused.
---
Batch P2P ops create (or use an existing) communicator keyed by device index.
Regular P2P ops create (or use existing) dedicated 2-rank communicators keyed by “rank1:rank2”.
See:
c8205cb354/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3980-L4008)
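A hedged sketch of the new option; `use_batch` is the keyword named above, the rest follows the existing `send_object_list`/`recv_object_list` API, and an initialized process group is assumed:

```python
import torch.distributed as dist

objects = [{"step": 10}, "checkpoint-ready"]

if dist.get_rank() == 0:
    # use_batch routes the transfer through batch_isend_irecv, reusing the
    # group's existing communicators instead of creating a 2-rank NCCL comm.
    dist.send_object_list(objects, dst=1, use_batch=True)
elif dist.get_rank() == 1:
    received = [None, None]
    dist.recv_object_list(received, src=0, use_batch=True)
```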
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160342
Approved by: https://github.com/wconstab
2025-08-30 03:29:09 +00:00
9b4adc4db7
[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL ( #158568 )
...
Adds support for FlightRecorder in ProcessGroupXCCL.
See https://github.com/intel/torch-xpu-ops/pull/1867 for XCCL implementation and more details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158568
Approved by: https://github.com/guangyey , https://github.com/fduwjj
2025-08-22 09:03:35 +00:00
dd22ba09b4
[C10D] Document barrier interaction with device_id ( #159389 )
...
Addresses #159262
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159389
Approved by: https://github.com/malfet , https://github.com/H-Huang , https://github.com/kwen2501 , https://github.com/fduwjj
2025-08-01 18:12:21 +00:00
4defea1e2c
[c10d] Fix setGroupName and setGroupDesc in group_split
and merge_remote_group
( #159429 )
...
Summary:
We found that we don't really set group_name correctly inside group_split, because we are setting group_name on `deviceTypeToBackend_`, which is only set after `setBackend`. The same applies to group_desc. I added more unit tests for it.
We need to set the group name correctly; otherwise this breaks the DeviceMesh use case when split_group is used in DeviceMesh.
Also, ncclx needs to be aware that its Option is a subclass of BackendOption.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
2025-07-30 19:55:55 +00:00
67e68e0785
[c10d] Cleanup split_group logic using the newly built splitGroup ( #158488 )
...
With https://github.com/pytorch/pytorch/pull/157716 merged, we want to further clean up the Python-side code for the `split_group` API. We do need to keep some of the old global bookkeeping for BC; the rest of the logic is now all in C++. Regarding the change brought in by https://github.com/pytorch/pytorch/pull/152175, we did the cleanup in https://github.com/pytorch/pytorch/pull/158790 (including internal changes) so that we can safely remove it.
Differential Revision: [D78777152](https://our.internmc.facebook.com/intern/diff/D78777152 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158488
Approved by: https://github.com/d4l3k
2025-07-29 03:27:11 +00:00
fd47401536
[doc] Updates to distributed.md for XCCL backend ( #155834 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey , https://github.com/AlannaBurke , https://github.com/d4l3k
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com >
2025-07-22 21:01:43 +00:00
b146ca74f0
docs: add get_default_backend_for_device to distributed documentation ( #156783 )
...
The `torch.distributed.get_default_backend_for_device()` API was added in torch 2.6 but is still missing from the distributed documentation. This commit addresses that gap.
CC: @guangyey, @EikanWang
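A short usage example of the documented API (the return values shown are the typical defaults, not guaranteed):

```python
import torch.distributed as dist

# Query the default backend for a device type instead of hard-coding it.
dist.get_default_backend_for_device("cuda")  # typically "nccl"
dist.get_default_backend_for_device("cpu")   # typically "gloo"
```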
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey , https://github.com/malfet
2025-07-10 05:11:30 +00:00