### Changes
- Move ProcessGroup::Work into its own class and update all references to it, along with the corresponding header includes.
#### Motivation
In future PRs we will repurpose ProcessGroup to instead contain a list of Backends (ProcessGroupNCCL/Gloo/UCC) and perform dispatching to them based on tensor type. This change prevents a circular dependency, with ProcessGroup depending on Backend and Backend depending on ProcessGroup::Work.
Differential Revision: [D38839212](https://our.internmc.facebook.com/intern/diff/D38839212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83680
Approved by: https://github.com/kwen2501
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race: after initializing the process group on, say, rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error, since rank 1 hasn't yet updated its `_default_pg` variable.
To resolve this issue, I've added a barrier() at the end of both of these calls.
This ensures that once these calls return, initialization is guaranteed to be
correct on all ranks.
Since these calls are typically made only during initialization, the overhead
of a barrier() here should be acceptable.
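Below is a minimal sketch of the synchronization pattern (illustrative only; the actual change adds the barrier inside `init_process_group`/`new_group` themselves, and the wrapper name here is hypothetical):

```python
import torch.distributed as dist

def init_process_group_synced(backend, rank, world_size, **kwargs):
    # Hypothetical wrapper showing the pattern this PR applies internally.
    dist.init_process_group(backend, rank=rank, world_size=world_size, **kwargs)
    # The barrier ensures every rank has finished updating its globals
    # (e.g. _default_pg) before any rank returns, closing the race above.
    dist.barrier()
```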
Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
both issues.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40944
This stack adds a Work-level timeout for blocking wait.
This PR only changes the API to accept a timeout argument with a default value for the wait function in each ProcessGroup backend. The ProcessGroup superclass honors the given timeout by changing the condition-variable (CV) wait to a wait_for.
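For illustration, this is roughly how the timeout surfaces in the Python API (a sketch; the exact keyword and default may vary across versions):

```python
from datetime import timedelta

import torch
import torch.distributed as dist

# Assumes a process group has already been initialized.
t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
# With blocking wait, this waits at most 60 seconds instead of indefinitely;
# internally, the condition-variable wait becomes a wait_for.
work.wait(timeout=timedelta(seconds=60))
```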
Closes: https://github.com/pytorch/pytorch/issues/37571
ghstack-source-id: 107835735
Test Plan: Tests in 4th PR in this stack
Reviewed By: jiayisuse
Differential Revision: D22107135
fbshipit-source-id: b38c07cb5e79e6c86c205e580336e7918ed96501
Summary:
The original behavior of PyTorch c10d only supports built-in c10d backends, such as
nccl/gloo/mpi. This patch extends the c10d capability to support dynamically
loading third-party communication libraries that are derived from the ProcessGroup base class.
The related RFC is: https://github.com/pytorch/pytorch/issues/27955
With this change, users only need to specify a third-party c10d backend name when invoking
torch.distributed.init_process_group(). The proposed logic will try to load the corresponding
c10d backend cpp extension automatically. For how to develop a new third-party c10d backend
as a cpp extension, please refer to test/cpp_extensions/cpp_c10d_extension.cpp.
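A hedged usage sketch ("dummy" is a hypothetical backend name, and the single-process rendezvous setup below is only for illustration):

```python
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Per this patch, passing a non-built-in backend name makes
# init_process_group try to load the matching c10d cpp extension.
dist.init_process_group(backend="dummy", rank=0, world_size=1)
```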
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28068
Differential Revision: D19174838
Pulled By: agolynski
fbshipit-source-id: 3409a504a43ce7260e6f9d1207c00e87471fac62