Mirror of https://github.com/pytorch/pytorch.git, synced 2025-10-20 21:14:14 +08:00
c10d: add Collectives abstraction (#125978)
This adds a new `Collectives` API for distributed collective operations. It is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation uses `StoreCollectives`; other, more performant backends will be added in a follow-up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality (using multiple threads) and timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
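The commit message does not show how a store-backed collective works. As a rough illustration only, here is a minimal sketch of collectives layered on a key-value store, assuming a store that offers `set` and a blocking `get` with a timeout. `InMemoryStore`, `StoreCollectivesSketch` and `run` are hypothetical names, not PyTorch APIs, and this is not the actual `StoreCollectives` implementation. Threads stand in for ranks, in the spirit of the multi-threaded test plan.

```python
import threading
import time
from datetime import timedelta


class InMemoryStore:
    """Hypothetical stand-in for a key-value store with set / blocking get."""

    def __init__(self) -> None:
        self._data: dict = {}
        self._cond = threading.Condition()

    def set(self, key: str, value: bytes) -> None:
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake any ranks blocked in get()

    def get(self, key: str, timeout: timedelta) -> bytes:
        # Block until the key exists, or raise TimeoutError once the timeout expires.
        deadline = time.monotonic() + timeout.total_seconds()
        with self._cond:
            while key not in self._data:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(f"timed out waiting for key {key!r}")
                self._cond.wait(remaining)
            return self._data[key]


class StoreCollectivesSketch:
    """Hypothetical collectives built only from store set/get calls."""

    def __init__(self, store: InMemoryStore, rank: int, world_size: int) -> None:
        self.store = store
        self.rank = rank
        self.world_size = world_size

    def all_gather(self, key: str, data: bytes, timeout: timedelta) -> list:
        # Each rank publishes under its own sub-key, then waits for every rank's
        # sub-key (each blocking get uses the full timeout).
        self.store.set(f"{key}/{self.rank}", data)
        return [self.store.get(f"{key}/{r}", timeout) for r in range(self.world_size)]

    def barrier(self, key: str, timeout: timedelta) -> None:
        # A barrier is an all_gather whose payload is ignored.
        self.all_gather(key, b"", timeout)

    def broadcast_send(self, key: str, data: bytes) -> None:
        self.store.set(key, data)

    def broadcast_recv(self, key: str, timeout: timedelta) -> bytes:
        return self.store.get(key, timeout)


def run(world_size: int = 3) -> list:
    """Run one barrier and one all_gather with one thread per rank."""
    store = InMemoryStore()
    results: list = [None] * world_size

    def worker(rank: int) -> None:
        c = StoreCollectivesSketch(store, rank, world_size)
        c.barrier("start", timedelta(seconds=5))
        results[rank] = c.all_gather("ranks", str(rank).encode(), timedelta(seconds=5))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because every collective in this sketch is identified by a caller-supplied key, each call needs a fresh key; reusing one would return stale values from the earlier call. A rank whose peers never arrive gets a `TimeoutError` from the blocking `get` instead of hanging forever.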
Committed by PyTorch MergeBot
Parent: a8c41e0678
Commit: 4b2ae2ac33
```diff
@@ -54,6 +54,8 @@ if is_available():
         set_debug_level,
         set_debug_level_from_env,
         _make_nccl_premul_sum,
+        _ControlCollectives,
+        _StoreCollectives,
     )
 
     class _DistributedPdb(pdb.Pdb):
```