[torch][c10d] fix split_group in mixed backend case (#162424)
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl"), but we can only pass one set of process group options. When we call `split_group`, we retrieve that single set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate it to every backend. This trips an assert in some user code, where ProcessGroupGloo::split expects Gloo options but receives NCCL options instead.

Arguably the APIs as currently designed are simply broken: we should never expect a single set of backend options to apply across multiple backends. Fixing this properly, however, would require changing quite a few public APIs. As a quick fix, since user-provided options really only exist for NCCL, we warn and fall back to default options for Gloo whenever non-Gloo options are detected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
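For context, here is a minimal sketch of the failing pattern described above. This is not code from the PR: the file name, rank count, and launch command are assumptions, and it presumes a CUDA machine with both Gloo and NCCL available, launched e.g. via `torchrun --nproc-per-node=2 repro.py`.

import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])  # set by the launcher (e.g. torchrun)

# Mixed-backend group: Gloo serves CPU tensors, NCCL serves CUDA tensors.
# Note that only one pg_options object can be supplied here even though
# two backends get created.
dist.init_process_group(
    backend="cpu:gloo,cuda:nccl",
    device_id=torch.device(f"cuda:{rank}"),
)

# Before this fix, split_group forwarded the parent group's (NCCL) options
# to every backend, so ProcessGroupGloo::split received NCCL options and
# asserted. After the fix, Gloo warns and falls back to default options.
subgroup = dist.split_group(split_ranks=[[0, 1]])

dist.destroy_process_group()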
@@ -5160,7 +5160,11 @@ def split_group(
             my_group = split_group
             break
 
-    group_name = _process_group_name(my_group, use_hashed_name=False)
+    # use_hashed_name is True to ensure that subgroups have unique names.
+    # This is needed as some backends (e.g. Gloo) use the group name as a
+    # PrefixStore prefix for initialization of splits. Thus, names have to be
+    # unique to avoid key collisions.
+    group_name = _process_group_name(my_group, use_hashed_name=True)
     split_pg = parent_pg.split_group(
         my_group,
         timeout=timeout,
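To see why unique names matter here, below is a hedged illustration (the port, keys, and prefix strings are made up) of how two splits sharing one group name would collide in the rendezvous store, using the public TCPStore/PrefixStore APIs:

from torch.distributed import TCPStore, PrefixStore

store = TCPStore("127.0.0.1", 29501, world_size=1, is_master=True)

# Two splits that reuse the same group name share a PrefixStore prefix,
# so their rendezvous keys land in the same namespace:
pg_a = PrefixStore("subgroup", store)
pg_b = PrefixStore("subgroup", store)
pg_a.set("rank0", "addr-A")
print(pg_b.get("rank0"))  # b'addr-A' -- pg_b reads pg_a's key

# A hashed, per-split name keeps each split's keys isolated:
pg_c = PrefixStore("subgroup:1f3a9c", store)
# pg_c.get("rank0") would now wait for a fresh key instead of reading
# stale data written under the colliding name.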