We are implementing our own RPC communication backend in an extension package, using the existing TensorPipe RPC backend as a reference. During development we found that the functions we need are not exported, so using them directly causes undefined-symbol errors in our extension package.
This PR adds the TORCH_API macro to the functions in the rpc module that are required to implement a custom TensorPipe-style agent, exposing them to developers. We consider this risk very manageable and hope the change can be merged into version 2.1.
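For illustration, a minimal sketch of what the annotation does (the macro definition is simplified from the real one in c10/macros/Export.h, and `syncIdWithStore` is a hypothetical helper, not one of the actual functions the PR annotates):
```cpp
// Simplified stand-in for the real TORCH_API macro from c10/macros/Export.h,
// which additionally distinguishes dllexport/dllimport on Windows builds.
#if defined(_WIN32)
#define TORCH_API __declspec(dllexport)
#else
#define TORCH_API __attribute__((visibility("default")))
#endif

// Hypothetical helper needed by a custom TensorPipe-style agent. Without
// TORCH_API the symbol has hidden visibility in libtorch, so an out-of-tree
// extension that calls it fails to load with an "undefined symbol" error;
// with the annotation the symbol is exported and the extension links cleanly.
TORCH_API void syncIdWithStore();
```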
cc @albanD, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108553
Approved by: https://github.com/kumpera, https://github.com/albanD
Headers under torch/csrc/distributed may be referenced with relative paths, e.g., `<c10d/...>`. However, relative paths cannot be handled gracefully by Meta's internal build when the NCCL PG is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, absolute paths are already the standard for header inclusion in most PyTorch components. This patch therefore refactors all header paths in torch/csrc/distributed to be absolute.
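Concretely, the change has this shape (using Store.hpp as an example header):
```cpp
// Before: relative include, only resolvable with -I./torch/csrc/distributed
// on the compiler command line (and broken for hipified copies generated
// in other directories).
#include <c10d/Store.hpp>

// After: absolute include from the repository root, which resolves for both
// the original and the hipified builds without extra include directories.
#include <torch/csrc/distributed/c10d/Store.hpp>
```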
See D39835774 for more details about the Meta-internal compilation issue.
**How to test**: commit 9e5d199 removes `-I./torch/csrc/distributed` from the compile options; use it to verify that no relative-path includes of torch/csrc/distributed headers remain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73373
This PR allows newly joined ranks in Dynamic RPC to communicate with ranks that have already joined the group. That is, rank N is able to run RPCs against all ranks <= N.
Previously:
Process 1 (init):
```python
import torch.distributed.rpc as rpc

rpc.init_rpc("worker0", rank=0)
```
Process 2 (runs an RPC against a rank that has already joined; this previously failed):
```python
import torch
import torch.distributed.rpc as rpc

rpc.init_rpc("worker1", rank=1)
rpc.rpc_sync("worker0", torch.add, args=(torch.tensor(1), torch.tensor(1)))
```
Now:
The above scenario succeeds.
Test:
`pytest test/distributed/rpc/test_tensorpipe_agent.py -vsk test_init_rpc_without_world_size`
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D35052544
Pulled By: H-Huang
fbshipit-source-id: dba48b216731c27730e7d46aefd9e7191c792170
(cherry picked from commit f3c42d8482c933fd746d4da8e64fa40cdf92a221)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65946
Add a new function in agent_utils that synchronizes active call counts across ranks using the store. This is intended to replace the barrier and all_reduce that the process group performed during RPC shutdown.
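The idea behind the store-based synchronization, as a minimal C++ sketch (the key names, signature, and single-round structure are illustrative assumptions, not the exact agent_utils implementation, which also handles repeated rounds):
```cpp
#include <cstdint>
#include <string>
#include <vector>
#include <torch/csrc/distributed/c10d/Store.hpp>

// Minimal sketch: every rank reports its in-flight call count through the
// store, and the last rank to arrive releases the others.
int syncCallCount(c10d::Store& store, int worldSize, int activeCalls) {
  // Accumulate this rank's active calls into a shared counter.
  store.add("rpc_active_calls", activeCalls);
  // Count arrivals; the last rank to arrive publishes a "done" key.
  if (store.add("rpc_arrived", 1) == worldSize) {
    store.set("rpc_sync_done", std::vector<uint8_t>{1});
  }
  // Block until every rank has contributed its count.
  store.wait({"rpc_sync_done"});
  // add(key, 0) returns the current value, i.e. the global total.
  return static_cast<int>(store.add("rpc_active_calls", 0));
}
```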
`test_ddp_comparison` and `test_ddp_comparison_uneven_inputs` fail with these changes. It appears the RPC agents are not accessing the same store, so the total count of processes never reaches the world size and the barrier never exits; we still need to investigate why this happens only for these test cases. Setting `clean_shutdown` to false skips this code path, which allows the tests to pass.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D31762736
Pulled By: H-Huang
fbshipit-source-id: cb5d0efe196f72726c63393c4293e97ec4f18548