c10d: add Collectives abstraction (#125978)

This adds a new `Collectives` API for distributed collective operations. It is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The default implementation is `StoreCollectives`; other, more performant backends will be added in a follow-up PR.
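
As a rough illustration of what "collectives on top of a store" means, here is a minimal sketch. It is not the code in this PR: `InMemoryStore`, `SketchStoreCollectives`, `run_ranks`, and all method signatures are hypothetical stand-ins, and a thread-safe in-process dictionary is assumed in place of a real distributed store.

```python
import threading
from datetime import timedelta


class InMemoryStore:
    """Hypothetical stand-in for a key-value store (not torch.distributed.Store)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()

    def add(self, key, amount):
        # Atomically increment a counter and return the new value.
        with self._cond:
            self._data[key] = self._data.get(key, 0) + amount
            self._cond.notify_all()
            return self._data[key]

    def wait_for(self, predicate, timeout):
        # Block until predicate(data) holds, or raise once the timeout elapses.
        with self._cond:
            ok = self._cond.wait_for(
                lambda: predicate(self._data), timeout.total_seconds()
            )
            if not ok:
                raise TimeoutError("timed out waiting on store")

    def get(self, key, timeout):
        # Wait for the key to appear, then return its value.
        self.wait_for(lambda d: key in d, timeout)
        with self._cond:
            return self._data[key]


class SketchStoreCollectives:
    """Illustrative store-backed collectives; names and signatures are assumptions."""

    def __init__(self, store, rank, world_size):
        self.store, self.rank, self.world_size = store, rank, world_size

    def barrier(self, key, timeout):
        # Each rank bumps a shared counter, then waits for it to reach world_size.
        self.store.add(f"{key}/count", 1)
        self.store.wait_for(
            lambda d: d.get(f"{key}/count", 0) >= self.world_size, timeout
        )

    def all_gather(self, key, value, timeout):
        # Each rank publishes its value, then reads every rank's value in rank order.
        self.store.set(f"{key}/{self.rank}", value)
        return [self.store.get(f"{key}/{r}", timeout) for r in range(self.world_size)]


def run_ranks(world_size, timeout=timedelta(seconds=5)):
    """Simulate world_size ranks with threads sharing one store."""
    store = InMemoryStore()
    results = [None] * world_size

    def worker(rank):
        c = SketchStoreCollectives(store, rank, world_size)
        c.barrier("start", timeout)
        results[rank] = c.all_gather("ag", f"rank{rank}", timeout)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every operation in the sketch takes an explicit key and timeout, so a rank that never shows up surfaces as a `TimeoutError` tied to a named key rather than a silent hang.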

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests functionality using multiple threads, as well as timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
Tristan Rice
2024-05-17 05:09:06 +00:00
committed by PyTorch MergeBot
parent a8c41e0678
commit 4b2ae2ac33
11 changed files with 837 additions and 53 deletions


@@ -54,6 +54,8 @@ if is_available():
         set_debug_level,
         set_debug_level_from_env,
         _make_nccl_premul_sum,
+        _ControlCollectives,
+        _StoreCollectives,
     )
     class _DistributedPdb(pdb.Pdb):