mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 21:14:14 +08:00)
[c10d] Fix setGroupName and setGroupDesc in group_split and merge_remote_group (#159429)
Summary:
We found that group_name is not set correctly inside group_split: the name is written to `deviceTypeToBackend_`, which is only populated after `setBackend`. The same applies to group_desc. I added more unit tests for it.

The group name must be set correctly; otherwise the DeviceMesh use case breaks when split_group is used in DeviceMesh.

Also, ncclx needs to be aware that its Option is a subclass of BackendOption.

Test Plan: CI

Rollback Plan:

Differential Revision: D79201132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
committed by PyTorch MergeBot
parent 53d68b95de
commit 4defea1e2c
@@ -5105,6 +5105,9 @@ def split_group(
     split_pg.bound_device_id = device_id  # type: ignore[union-attr]
     split_backend_class = split_pg._get_backend(torch.device("cuda"))
     split_backend_class._set_sequence_number_for_group()
+    assert split_pg.group_name == group_name, (
+        f"group name should be set to {group_name} but got {split_pg.group_name}"
+    )
 
     # update global state
     _world.pg_map[split_pg] = (backend, split_pg.get_group_store())
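The ordering problem described in the summary can be illustrated with a toy model. This is only a sketch, not the c10d implementation: `ToyBackend`, `ToyProcessGroup`, `split_buggy`, and `split_fixed` are hypothetical names, and the assumption that the group name is stored only on registered backends is a simplification made for illustration.

```python
class ToyBackend:
    """Hypothetical stand-in for a c10d backend; holds its own group name."""

    def __init__(self):
        self.group_name = ""


class ToyProcessGroup:
    """Hypothetical stand-in for a process group.

    Assumption for illustration: the name lives only on the backends
    registered in `_backends` (a stand-in for `deviceTypeToBackend_`).
    """

    def __init__(self):
        self._backends = {}

    def set_backend(self, device_type, backend):
        self._backends[device_type] = backend

    def set_group_name(self, name):
        # Writes the name to every backend registered so far.
        # If no backend is registered yet, this loop does nothing.
        for backend in self._backends.values():
            backend.group_name = name

    @property
    def group_name(self):
        # Read the name back from any registered backend.
        for backend in self._backends.values():
            return backend.group_name
        return ""


def split_buggy(name):
    pg = ToyProcessGroup()
    pg.set_group_name(name)  # too early: no backend registered yet
    pg.set_backend("cuda", ToyBackend())
    return pg


def split_fixed(name):
    pg = ToyProcessGroup()
    pg.set_backend("cuda", ToyBackend())
    pg.set_group_name(name)  # backend exists, so the name is stored
    return pg


def check_name(pg, expected):
    # Same style of check as the assert added to split_group in this commit.
    assert pg.group_name == expected, (
        f"group name should be set to {expected} but got {pg.group_name}"
    )
```

In this toy model, `check_name(split_fixed("sub_group"), "sub_group")` passes, while the same check on `split_buggy("sub_group")` raises an `AssertionError`. That is the kind of failure the new assert in `split_group` is meant to surface early.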