[c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)

Summary:
We found that group_name is not set correctly inside group_split: we set group_name on `deviceTypeToBackend_`, which is only populated after `setBackend`. The same applies to group_desc. This change also adds more unit tests for both.
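
The ordering problem can be pictured with a toy model. This is only a sketch: `ToyBackend`, `ToyProcessGroup` and `build` are hypothetical names, not the c10d API, and it assumes the name is propagated only to the backends registered at the time of the call.

```python
# Toy model only: these classes are hypothetical and are not the c10d API.
class ToyBackend:
    def __init__(self):
        self.group_name = ""


class ToyProcessGroup:
    def __init__(self):
        self.group_name = ""
        self.backends = {}  # stands in for deviceTypeToBackend_

    def set_backend(self, device_type, backend):
        self.backends[device_type] = backend

    def set_group_name(self, name):
        self.group_name = name
        # Only backends registered so far receive the name.
        for backend in self.backends.values():
            backend.group_name = name


def build(name_first):
    """Return the name seen by the backend for a given call order."""
    pg, backend = ToyProcessGroup(), ToyBackend()
    if name_first:
        pg.set_group_name("split_0")
        pg.set_backend("cuda", backend)
    else:
        pg.set_backend("cuda", backend)
        pg.set_group_name("split_0")
    return backend.group_name


print(repr(build(name_first=True)))   # name set before the backend exists
print(repr(build(name_first=False)))  # name set after the backend exists
```

In this model, the backend only picks up the name when `set_group_name` runs after `set_backend`.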

setGroupName must be applied correctly; otherwise the DeviceMesh use case breaks when split_group is used in DeviceMesh.

Also, ncclx needs to be aware that its Option is a subclass of BackendOption.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
Junjie Wang (PyTorch)
2025-07-30 19:55:55 +00:00
committed by PyTorch MergeBot
parent 53d68b95de
commit 4defea1e2c
4 changed files with 32 additions and 25 deletions

@ -5105,6 +5105,9 @@ def split_group(
    split_pg.bound_device_id = device_id  # type: ignore[union-attr]
    split_backend_class = split_pg._get_backend(torch.device("cuda"))
    split_backend_class._set_sequence_number_for_group()
    assert split_pg.group_name == group_name, (
        f"group name should be set to {group_name} but got {split_pg.group_name}"
    )

    # update global state
    _world.pg_map[split_pg] = (backend, split_pg.get_group_store())