Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang