39 Commits

Author SHA1 Message Date
99c8640b5d [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes try to cover C style casts into C++ alternatives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 23:27:13 +00:00
ab82456c16 Revert "[1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)"
This reverts commit e1e8491b316df810388d9fa24f135cdba27ab40e.

Reverted https://github.com/pytorch/pytorch/pull/165750 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165750#issuecomment-3422413890))
2025-10-20 14:51:58 +00:00
e1e8491b31 [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes try to cover C style casts into C++ alternatives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
755cf90672 Redirect all use of filesystem to c10/utils/FileSystem.h (#162914)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162914
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever
2025-09-15 04:30:41 +00:00
8e23a1227b [ROCm/Windows] Fix build failures and support some BLAS calls (#161981)
* Support getrsBatched/geqrfBatched/gelsBatched on Windows ROCm (fixes https://github.com/ROCm/TheRock/issues/1367)
* Fix windows pytorch build with USE_DISTRIBUTED=ON by default

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161981
Approved by: https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-03 20:26:14 +00:00
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
cyy
256c96332c [1/N] Use std::filesystem (#152288)
Maybe it is time to use std::filesystem because CXX11 ABI is now the default. The changes are for jit and distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152288
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-04-30 17:54:16 +00:00
8b5e717601 c10d/Store: add clone feature (#150966) (#150966) (#151045)
Summary:
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Approved by: https://github.com/fduwjj

Test Plan:
contbuild & OSS CI, see 205881ea4a

Test plan from GitHub:
```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Differential Revision: D72789690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151045
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2025-04-11 04:00:23 +00:00
abe41c5c9c Revert "c10d/Store: add clone feature (#150966)"
This reverts commit 205881ea4a451574c3a3de87c42484043a955d6e.

Reverted https://github.com/pytorch/pytorch/pull/150966 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150966#issuecomment-2795063574))
2025-04-10 20:17:53 +00:00
205881ea4a c10d/Store: add clone feature (#150966)
This adds a new `clone()` method to Store which will return a new Store instance that can be used from a different thread.

This is intended to better support multiple threads with stores such as when ProcessGroupNCCL needs a store to do error propagation.

Related issue: https://github.com/pytorch/pytorch/issues/150943

Test plan:

```
pytest test/distributed/test_store.py -k PythonStore
pytest test/distributed/test_store.py -k clone
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150966
Approved by: https://github.com/fduwjj
2025-04-10 01:41:50 +00:00
cyy
6a35d9aaa4 Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-24 12:22:13 +00:00
6a2b4db0a1 Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)"
This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79.

Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))
2025-01-24 00:17:34 +00:00
cyy
42f4fda2eb Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-23 22:47:18 +00:00
cyy
b0be30dd79 [19/N] Fix extra warnings brought by clang-tidy-17 (#144448)
Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448
Approved by: https://github.com/albanD
2025-01-09 15:58:05 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful to simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
cyy
00b3b61076 Add and use thread-safe strerror (#140472)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140472
Approved by: https://github.com/ezyang
2024-11-19 04:24:17 +00:00
cyy
29a1f62f23 Replace c10::invoke_result with std::invoke_result (#124169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124169
Approved by: https://github.com/swolchok
2024-05-25 02:42:13 +00:00
cyy
c2596fd3e0 [Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032)
This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032
Approved by: https://github.com/Skylion007
2024-04-16 00:42:18 +00:00
cyy
6d8bb0e984 [Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884)
This PR fixes some clang-tidy warnings in distributed code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884
Approved by: https://github.com/kwen2501
2024-03-31 09:06:35 +00:00
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b3cb987e1f3230876654f156bd11a09.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
88ebb2e321 Windows FileStore skip timeout if the file path is invalid (#103247)
On Windows in FileStore if the path to the file to be created is not valid then it will get stuck there trying to create the file until the timeout is reached.
This PR checks if the path is invalid and if it is then it will leave the loop instantly.

Fixes #48475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103247
Approved by: https://github.com/fduwjj, https://github.com/kit1980
2023-06-13 03:21:35 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
cyy
e292ddff4e More clang-tidy fixes (#92944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92944
Approved by: https://github.com/Skylion007
2023-01-25 19:11:51 +00:00
97db9fde69 Fix header-filter for clang-tidy c10 and apply some fixes to c10 and … (#91178)
…c10d

Fixes a broken header filters from #90699 and applies a few more clang-tidy fixes that are relevant from c10 and c10d. The header filter pattern was actually broken and the clang-tidy include pattern was redundant. Also fixed a few bugs in torch/distributed/c10d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91178
Approved by: https://github.com/ezyang
2022-12-27 07:34:12 +00:00
b43ae1c411 Add reference counter in FileStore (#85601)
Fixes #67566.

This diff added a reference counter in the FileStore object. The underlying file would be removed only if the reference counter became 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85601
Approved by: https://github.com/H-Huang
2022-10-07 17:59:29 +00:00
1ad0048b64 Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
2022-09-30 05:13:50 +00:00
a50d8864fc Revert "Refactor distribuetd to use absolute header path (#85780)"
This reverts commit 668082718aefce95ecc1b1c312ea6f127b2c662e.

Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>
2022-09-30 02:04:29 +00:00
668082718a Refactor distribuetd to use absolute header path (#85780)
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.

See D39835774 for more details about Meta internal complication.

**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
2022-09-30 00:27:24 +00:00
5af48581b5 In order to make pytorch headers consumable from cpp20 code bases, … (#79985)
… all instances of std::result_of and std:result_of_t are conditionally replaced by std::invoke_result and std::invoke_result_t if __cpp_lib_is_invocable >= 201703L.  std::invoke_result was only introduced in c++17, so it should probably not be required yet.

Fixes #71657  and a small part of #69290

Tested on Centos 7 / gcc11 + a private project that requires cpp20.

I think the main questions to check by a maintainer are,
- whether my choices of preprocessor blocks are appropriate
- whether there are any very subtle differences between std::result_of and std::invoke_result that I have missed
- whether in any of the replacements  the 'new' side can/should be simplified further

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79985
Approved by: https://github.com/ezyang
2022-07-04 20:14:36 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
6e16c9bb1d Add support for deleteKey for FileStore (#69953)
Summary:
torch_ucc uses `deleteKey`, and trying to run PyTorch tests with torch_ucc leads to failure about `deleteKey not implemented for FileStore`.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69953

Reviewed By: ngimel

Differential Revision: D33458457

Pulled By: H-Huang

fbshipit-source-id: f46afd59f950722ae594d9aafb8843f14019e930
2022-01-07 06:20:59 -08:00
fced51eaf7 [torch][distributed] Check for file existence before invoking cleanup logic in FileStore destructor (#68603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68603

FileStore is frequently used from the python lang, which has GC. This means, that users of FileStore from python do not have control over when FileStore destructor is invoked. If the directory for file store is created by some external logic, that has cleanup procedure, this procedure may have a race condition with the logic in the FileStore destructor.

The diff adds check for file access in destructor before actually invoking the cleanup. In long term, it makes sense to move out the cleanup logic out of the destructor to a separate method.

Test Plan:
CI

Stress tests: `buck test mode/dev-nosan //torchrec/examples/dlrm/tests:test_dlrm_main -- --exact 'torchrec/examples/dlrm/tests:test_dlrm_main - torchrec.examples.dlrm.tests.test_dlrm_main.MainTest: test_main_function' --run-disabled --jobs 18 --stress-runs 20 --record-results`

Reviewed By: colin2328

Differential Revision: D32535470

fbshipit-source-id: 6f421f2e7b0d9ac9c884a1db2f7e5a94fc59fc0e
2021-12-01 16:43:42 -08:00
999e52a795 [FileStore] log timeout in err msg (#69167)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69167

Per title
ghstack-source-id: 144378083

Test Plan: Ci

Reviewed By: H-Huang

Differential Revision: D32736119

fbshipit-source-id: f37fd3e4ac393c07eb8bd1f9202841d33d0a8aad
2021-11-30 23:29:09 -08:00
5deeaab36a minor fixes in c10d for Windows (#62953)
Summary:
Found out by triggering builds against clang on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62953

Reviewed By: gchanan

Differential Revision: D30191300

Pulled By: ezyang

fbshipit-source-id: d929119768298084c41d70dbc3a78aacd64fb715
2021-08-09 09:05:09 -07:00
a016150163 Move torch/lib/c10d to torch/csrc/distributed/c10d (#60543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60543

Since now c10d is part of libtorch, it would also be nice if the sources lived all in one place.
ghstack-source-id: 132306292

Test Plan: It builds

Reviewed By: cbalioglu

Differential Revision: D29062002

fbshipit-source-id: d9e1301e9d73e1643fa0f0119cd2d618f1ad52e6
2021-06-24 12:38:51 -07:00