Fixes#127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enables the distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of the `pyproject.toml` should indicate that this is exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
def getPluginHandler():
def _create_redis_handler(params: RendezvousParameters):
from redis_rendezvous_backend import create_backend
backend, store = create_backend(params)
return create_handler(store, backend, params)
return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, the published packages can be installed using `pip3 install <plugin-name>` and the plugin is in local file systemcan be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
Fixes#127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enables the distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of the `pyproject.toml` should indicate that this is exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
def getPluginHandler():
def _create_redis_handler(params: RendezvousParameters):
from redis_rendezvous_backend import create_backend
backend, store = create_backend(params)
return create_handler(store, backend, params)
return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, the published packages can be installed using `pip3 install <plugin-name>` and the plugin is in local file systemcan be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58057
This PR refactors the store initialization logic and moves it to the `create_backend` function for both C10d and etcd backends.
ghstack-source-id: 128671579
Test Plan: Run the existing and revised tests.
Reviewed By: tierex
Differential Revision: D28356587
fbshipit-source-id: caf9416ab811eefe4834268d8a11a48f2236ed5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57764
As discussed offline this PR renames etcd-experimental backend to etcd-v2 and c10d-experimental backend to c10d.
ghstack-source-id: 128342523
Test Plan: Run the existing unit tests.
Reviewed By: kiukchung
Differential Revision: D28263739
fbshipit-source-id: c3409037ecea5a8ff6daadeeb1f2fb4205cc3852
Summary:
Pull Request resolved: https://github.com/pytorch/elastic/pull/148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56811
Moves docs sphinx `*.rst` files from the torchelastic repository to torch. Note: only moves the rst files the next step is to link it to the main pytorch `index.rst` and write new `examples.rst`
Reviewed By: H-Huang
Differential Revision: D27974751
fbshipit-source-id: 8ff9f242aa32e0326c37da3916ea0633aa068fc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
The diff resolves bug around incorrect handler resolution:
_create_static_handler pointed towards etcd, and _create_etcd_handler pointed towards static.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can transfer the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implement StaticTCPRendezvous that creates a store with listener on agent rank #0
The diff modifies caffe2/rendezvous: If the worker process launched with torchelastic agent, the worker processes will create a PrefixStore("worker/") from TCPStore without listener.
The diff adds macros functionality to torch/distributed/ealstic/utils that helps to resolve local_rank parameter.
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55637
This diff introduces the `EtcdRendezvousBackend` type that will serve as an experimental alternative to the existing `EtcdRendezvousHandler`.
The major advantage of `EtcdRendezvousBackend` is that it delegates the bulk of the rendezvous handling logic to `DynamicRendezvousHandler` which is shared with `C10dRendezvousBackend` (see D27654492) and any other potential future rendezvous backend (e.g. Amazon S3).
ghstack-source-id: 126312209
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654498
fbshipit-source-id: f3259adfc9068b7e323b947a7d8d52fcd0b8ada1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55636
This diff introduces:
- The `C10dRendezvousBackend` type to support C10d stores as rendezvous backends.
- A fix to the `TCPStore.compare_set()` function to support non-existent keys.
- A placeholder `c10d-experimental` registry to instantiate C10d-baked rendezvous backends via `get_rendezvous_handler()`.
ghstack-source-id: 126312162
Test Plan: Run the existing and newly-introduced unit/integration tests.
Reviewed By: tierex
Differential Revision: D27654492
fbshipit-source-id: 09f498138b35186de4b0e174adb33fb5b5aa4b52
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55466
Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
### Note
See the original diff (D27442325 (df299dbd7d)) that had to be reverted due to an unexpected Python version incompatibility between the internal and external PyTorch CI tests.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27623215
fbshipit-source-id: 51538d0f154f64e04f685a95d40d805b478c93f9
Summary: Improve the implementation and the unit test coverage of `RendezvousHandlerRegistry`.
Test Plan: Run the existing and newly-introduced unit tests.
Reviewed By: tierex
Differential Revision: D27442325
fbshipit-source-id: 8519a2caacbe2e3ce5d9a02e87a910503dea27d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53172
Pull Request resolved: https://github.com/pytorch/elastic/pull/141
Upstreams two modules to torch:
1. `torchelastic.rendezvous`
2. `torchelastic.utils`
These modules were chosen as `[1/n]` since they are the leaf modules in torchelastic.
==== NOTES: ====
1. I'm disabling etcd_rendezvous and etcd_server tests in CIRCLECI for the moment since I need to edit the test dockers to contain the etcd server binary (there's 4-5 test dockers - one for each platform so this is going to take some time for me to set up the environments and test) - T85992919.
2. I've fixed all lint errors on python files but there are ones on the cpp files on the ZeusRendezvous. I took a look at them, and I don't want to fix the linter errors right now for 2 major reasons:
1. Some of them are more than formatting changes (e.g. std::move vs pass by value) and I don't want to introduce bundled changes with the move
1. The old rendezvous code (the one we forked from in caffe2/fb) has the same problems and I think its better for us to deal with this when we deprecate caffe2/fb/rendezvous in favor of the one in torchelastic -T86012579.
Test Plan:
```
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/utils/data/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/test/...
buck test mode/dev-nosan //caffe2/torch/distributed/elastic/rendezvous/fb/...
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
```
\+ Sandcastle
Reviewed By: H-Huang
Differential Revision: D26718746
fbshipit-source-id: 67cc0350c3d847221cb3c3038f98f47915362f51