13 Commits

Author SHA1 Message Date
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
cyy
6aa6bd4ca5 [Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528)
Follows #136439. A dangling reference to qualifiedName was found and fixed.
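
For illustration, a generic sketch of this bug class (not the actual PyTorch code): binding a reference to data owned by a temporary leaves it dangling once the full expression ends.

```cpp
#include <string>

struct QualifiedName {
  std::string qualifiedName_;
  const std::string& name() const { return qualifiedName_; }
};

QualifiedName makeName() { return {"aten::add"}; }

int main() {
  // Bug: the temporary QualifiedName dies at the end of the full
  // expression, so `ref` is left dangling into destroyed storage.
  const std::string& ref = makeName().name();
  // Fix: copy the string (or keep the owning object alive).
  const std::string safe = makeName().name();
  (void)ref;
  (void)safe;
  return 0;
}
```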

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528
Approved by: https://github.com/kwen2501
2024-09-25 20:12:08 +00:00
1dabfb68e7 Add TORCH_API to expose RPC module functions for RPC module device extension (#108553)
At present, we use the existing TensorPipe RPC backend as a reference and implement our own RPC communication backend in an extension package. During development we found that the required functions are not exposed, so using them directly leaves our extension package with undefined-symbol errors.

This change adds the TORCH_API macro to the functions in the RPC module that are required to implement a custom TensorPipe-style agent, exposing them to developers. At the same time, we think this risk is very controllable and hope it can be merged into version 2.1.
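
For illustration, a minimal sketch of the pattern (the function names here are hypothetical, not the PR's actual diff):

```cpp
// In a torch/csrc/distributed/rpc header. TORCH_API (from
// torch/csrc/Export.h) marks a symbol as exported from libtorch;
// without it the symbol has hidden visibility, and an out-of-tree
// RPC agent fails to load with an "undefined symbol" error.
#include <torch/csrc/Export.h>

namespace torch::distributed::rpc {

// Exported: callable from a custom agent in an extension package.
TORCH_API void exampleAgentHelper();

// Not exported: usable only inside libtorch itself.
void internalOnlyHelper();

} // namespace torch::distributed::rpc
```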

cc @albanD, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108553
Approved by: https://github.com/kumpera, https://github.com/albanD
2023-09-06 17:24:46 +00:00
1ad0048b64 Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with relative paths, e.g., `<c10d/...>`. However, relative paths cannot be handled gracefully by Meta's internal build when the NCCL process group is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, absolute header paths are already the convention in most PyTorch components. This patch therefore refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes `-I./torch/csrc/distributed` from the compile options; build with it to verify that no relative-path uses of torch/csrc/distributed headers were missed.
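
For illustration, the refactor rewrites includes as below (Store.hpp is the example file named in the revert message):

```cpp
// Before: resolved via -I./torch/csrc/distributed, which breaks when
// hipified copies of the headers are generated in other directories.
#include <c10d/Store.hpp>

// After: absolute path rooted at the repository top level.
#include <torch/csrc/distributed/c10d/Store.hpp>
```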
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
a50d8864fc Revert "Refactor distributed to use absolute header path (#85780)"
This reverts commit 668082718aefce95ecc1b1c312ea6f127b2c662e.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks the build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
668082718a Refactor distributed to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referenced with relative paths, e.g., `<c10d/...>`. However, relative paths cannot be handled gracefully by Meta's internal build when the NCCL process group is hipified to support AMD/RCCL, because the hipified header files are generated in other directories. Moreover, absolute header paths are already the convention in most PyTorch components. This patch therefore refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about the Meta-internal complication.

**How to test**: commit 9e5d199 removes `-I./torch/csrc/distributed` from the compile options; build with it to verify that no relative-path uses of torch/csrc/distributed headers were missed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
811ccde41a [Dynamic RPC] Add graceful shutdown for dynamic RPC members
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74561

Approved by: https://github.com/mrshenli
2022-04-26 13:12:55 +00:00
9270bccaf6 [Dynamic RPC] Allow newly joined ranks to communicate with existing ranks (#73373)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73373

This PR allows newly joined ranks in Dynamic RPC to communicate with ranks that have already joined the group. That is, rank N can run RPCs against all ranks <= N.

Previously:

Process 1 (init):
```python
init_rpc("worker0", rank=0)
```
Process 2 (an RPC against a rank that had already joined; this would previously fail):
```python
init_rpc("worker1", rank=1)
rpc.rpc_sync("worker0", torch.add, args=(torch.tensor(1), torch.tensor(1)))
```

Now:
The above scenario succeeds.

Test:
`pytest test/distributed/rpc/test_tensorpipe_agent.py -vsk test_init_rpc_without_world_size`

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D35052544

Pulled By: H-Huang

fbshipit-source-id: dba48b216731c27730e7d46aefd9e7191c792170
(cherry picked from commit f3c42d8482c933fd746d4da8e64fa40cdf92a221)
2022-03-24 16:19:28 +00:00
938afa37a3 Remove process group barrier and all_reduce function calls from tensorpipe agent (#65946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65946

Adds a new function in agent_utils that synchronizes active call counts using the store. This is intended to replace the barrier and all_reduce used by the process group in RPC shutdown.
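
The synchronization pattern, as a minimal sketch (illustrative; not the actual agent_utils function added by this PR): each worker atomically bumps a shared counter in the store and polls until all peers have checked in.

```cpp
#include <torch/csrc/distributed/c10d/Store.hpp>

#include <chrono>
#include <thread>

void storeBarrier(c10d::Store& store, int worldSize) {
  // add() atomically increments the key and returns the new value.
  store.add("rpc_shutdown/count", 1);
  // add(key, 0) reads the current value without modifying it.
  while (store.add("rpc_shutdown/count", 0) < worldSize) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```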

The `test_ddp_comparison` and `test_ddp_comparison_uneven_inputs` tests fail with these changes. It seems the RPC agents are not accessing the same store, so the total process count never reaches the world size needed to exit the barrier; we still need to investigate why this happens only for these test cases. Setting clean_shutdown to false skips this code path, which allows the tests to pass.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D31762736

Pulled By: H-Huang

fbshipit-source-id: cb5d0efe196f72726c63393c4293e97ec4f18548
2021-10-28 10:15:56 -07:00
c7b1979b6b Use Store to collect and verify names in all RPC agents (#53209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209

closes #40048
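
As an illustration of the store-based approach (a minimal sketch with hypothetical names and signature, not the actual agent_utils::collectNames): each agent publishes its own name under a rank-keyed entry and reads back every peer's, which also verifies that all expected workers have joined.

```cpp
#include <torch/csrc/distributed/c10d/Store.hpp>

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<int, std::string> collectAllNames(
    c10d::Store& store, int rank, const std::string& selfName, int worldSize) {
  // Publish this agent's name under a rank-keyed entry.
  store.set(
      "rpc_names/" + std::to_string(rank),
      std::vector<uint8_t>(selfName.begin(), selfName.end()));
  // Read back every peer's entry; get() blocks until the key is set.
  std::unordered_map<int, std::string> names;
  for (int r = 0; r < worldSize; ++r) {
    std::vector<uint8_t> value = store.get("rpc_names/" + std::to_string(r));
    names.emplace(r, std::string(value.begin(), value.end()));
  }
  return names;
}
```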

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791524

Pulled By: mrshenli

fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b
2021-03-07 16:51:46 -08:00
affdcce833 Extract TensorPipeAgent's collectNames into a standalone utility function (#53202)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53202

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791525

Pulled By: mrshenli

fbshipit-source-id: 8234c4d0350a5cd61926dce4ecc9e918960d30d2
2021-03-07 16:48:46 -08:00