31 Commits

Author SHA1 Message Date
00059db034 Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 09cb34c1dce8fe1b880bbf3115d8ddad3401d871.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/malfet due to reverted internally and now can be safely reverted in OSS ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3334176367))
2025-09-25 13:47:46 +00:00
09cb34c1dc [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-22 21:12:18 +00:00
f0078941cf Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530))
2025-09-22 05:39:07 +00:00
6c334885d4 [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-12 10:54:42 +00:00
6b59a19242 Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6e8f17c58029e5fa6bc222b2445ebbc0cbdc17c7.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3283985880))
2025-09-12 06:52:03 +00:00
6e8f17c580 [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-12 03:56:18 +00:00
dda071587f Revert "Make distributed modules importable even when backend not built (#159889)" (#162568)
This reverts commit a0d026688cd69583d5a4e0c6f3e5fda141a7f4a9.

Revert "Always build USE_DISTRIBUTED. (#160449)"

This reverts commit d80297a6846f1f2c36fd4f19e22919f2abe8fcea.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162568
Approved by: https://github.com/huydhn
2025-09-10 04:29:42 +00:00
a0d026688c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-08 19:10:36 +00:00
29e09a6545 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 01edcd4df8bf0c7b4cc2d3ec868bd2059eeea83b.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to internal changes breaks import checks, see [D81845053](https://www.internalfb.com/diff/D81845053) ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3264887002))
2025-09-08 07:04:36 +00:00
01edcd4df8 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-05 20:15:11 +00:00
70f865ac9b Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit ef3be6726f7ff4b77c22db10cec5b686f9107ea9.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal build rules, see D81756619 ([comment](https://github.com/pytorch/pytorch/pull/160449#issuecomment-3259430011))
2025-09-05 18:58:47 +00:00
ef3be6726f Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-04 20:05:50 +00:00
34aa78274d Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 4ae57d448c0a7d37e4cfd5c27d977fad2cef4051.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))
2025-09-04 13:13:52 +00:00
4ae57d448c Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-03 07:33:55 +00:00
420c52ecf3 Revert "Make distributed modules importable even when backend not built (#159889)"
This reverts commit 626cb7df8161dd4ecb4fe43b60f37ce9076f56b1.

Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))
2025-09-02 20:24:01 +00:00
626cb7df81 Make distributed modules importable even when backend not built (#159889)
This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889
Approved by: https://github.com/wconstab
ghstack dependencies: #160449
2025-09-01 23:00:21 +00:00
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5b4b3206f5b295e96f81cd6c178eb18.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c4157bb3e537fc784a51d7aef1e754157.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear.  They were introduced in #114077

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
ff51f94e32 [Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094
Approved by: https://github.com/fduwjj
2023-11-07 05:34:26 +00:00
75adb9f371 Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)"
This reverts commit f9d47e13813bbefc9f19a6c0430b7122f9d09b91.

Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor f9d47e1381 https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))
2023-11-06 22:49:53 +00:00
f9d47e1381 Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893)
Previous PRs changed the c++ default timeout for PGNccl, but this path
was only hit in some cases, and the python defaults took over in other
cases.

This PR ensures that NCCL pg always default to the changed NCCL-specific
timeout value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893
Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu
ghstack dependencies: #112611, #112803
2023-11-06 20:48:39 +00:00
43ad172c54 make ProcessGroupDefaultTimeout the same as python (#56549)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56549

This make the `kProcessGroupDefaultTimeout` be the same as the python
side, and python side directly use the pybind value instead

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27899190

Pulled By: wanchaol

fbshipit-source-id: 388a7f42358b0abed75cf4934fb7b311fd33fee6
2021-04-21 17:56:05 -07:00
5e2f17d77a Add NCCL_ASYNC_ERROR_HANDLING to docs (#46856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46856

Add reference to NCCL_ASYNC_ERROR_HANDLING in the pytorch docs,
similar to how NCCL_BLOCKING_WAIT is curently described.
ghstack-source-id: 115186877

Test Plan: CI, verifying docs change

Reviewed By: jiayisuse

Differential Revision: D24541822

fbshipit-source-id: a0b3e843bc6392d2787a4bb270118f2dfda5f4ec
2020-10-26 14:41:32 -07:00
6cb9e6b015 Back out "Revert D19871946: [distributed] pass in timeout to TCP store when initializing" (#33434)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434

Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not timeout (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include retrying on errors
due to timeout in rendezvous.
ghstack-source-id: 98558377

Test Plan: Added UT test_tcp_store_timeout_set

Differential Revision: D19935390

fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a
2020-02-19 17:17:17 -08:00
d4e4beddc4 Revert D19871946: [distributed] pass in timeout to TCP store when initializing
Test Plan: revert-hammer

Differential Revision:
D19871946

Original commit changeset: dd002180c4c8

fbshipit-source-id: 40b0676c51e43366c0700e81d16cc7927ee8efc2
2020-02-16 19:37:44 -08:00
df47a3abe0 [distributed] pass in timeout to TCP store when initializing (#33325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33325

Closes https://github.com/pytorch/pytorch/issues/32924. There was a bug where for TCPStore, we would not respect the timeout passed into `init_process_group` while constructing the TCPStore. Instead, we'd set the timeout after the rendezvous created the store, meaning that we used the default timeout of 300s while connecting to the server. This diff passes the timeout passed into `init_process_group` to rendezvous so that it can be passed into the constructor for TCPStore, so that we can use the right timeout at construction time.

Question: Should we make this change for FileStore as well? Currently the FileStore constructor does not take in a timeout at all.
ghstack-source-id: 98401875

Test Plan: Added a UT

Differential Revision: D19871946

fbshipit-source-id: dd002180c4c883216645b8a97cc472c6116ac117
2020-02-16 17:59:44 -08:00