Summary:
This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values
This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead
Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```
```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```
Differential Revision: D62924454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times.
This PR does a few things:
* Retry that connect and validate up to the specified timeout.
* Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep.
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
Test plan:
```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```
Will do internal testing with some large scale jobs to ensure TCPStore works correctly.
At 4k scale: 4x improvement
```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 1.98 secs fish external
usr time 0.93 secs 91.00 micros 0.93 secs
sys time 1.98 secs 954.00 micros 1.97 secs
tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 8.20 secs fish external
usr time 2.15 secs 0.00 micros 2.15 secs
sys time 2.76 secs 843.00 micros 2.76 secs
```
```py
import time
import os
import threading
from multiprocessing import Pool
WORLD_SIZE = 10000
import torch.distributed as dist
def run(rank):
should_log = rank % (WORLD_SIZE // 10) == 0
if should_log:
print(f"started {rank}")
store = dist.TCPStore(
host_name="devvm4382.nao0.facebook.com",
port=29500,
world_size=WORLD_SIZE,
is_master=rank == 0,
use_libuv=True,
)
if should_log:
print(f"init {rank}")
store.set(f"key{rank}", "1234")
if should_log:
print(f"set {rank}")
del store
def noop(rank):
pass
print("starting pool")
with Pool(WORLD_SIZE) as pool:
pool.map(noop, range(WORLD_SIZE), 1)
print("pool hot")
start = time.time()
pool.map(run, range(WORLD_SIZE), 1)
print("run finished", time.time()-start)
```
```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
This adds better logging of errors to the socket and TCPStore classes.
All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.
It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.
Test plan:
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
Otherwise following error is thrown when attempted to compile with WERROR enabled:
```
In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated]
340 | constexpr const size_t codecvt_result<CodeUnit>::max_size;
| ^~~~~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’
335 | static constexpr const size_t max_size = 32;
| ^~~~~~~~
```
or following if using clang as host compiler
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated]
constexpr const size_t codecvt_result<CodeUnit>::max_size;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002
Approved by: https://github.com/drisspg
Fix a bug socket.cpp in timeout detection that only shows up with 10k ranks.
Make the minimum wait time in _store_based_barrier to be adaptative based on
the number of ranks.
Longer timeouts give more room for the store to do productive work when swamped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218
Approved by: https://github.com/XilunWu
ghstack dependencies: #109217
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've ensured added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've ensured added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
The c10d socket and gloo listener both set their buffer size to 2048 which causes connection issue at 4k scale. This diff sets the buffer size to `-1` which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crash. The experiment shows the ability to successful creation of 12k ranks without crash.
split the original diff for OSS vs. internal.
Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.
Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107878
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
Current TCPStore wait logic leaves the client socket in a bad state if waiting timesout.
This happens because all recv functions raise an exception on timeout and that's it.
The problem is that on timeout we need to unregister the wait.
We implement this with client side cancelation by adding a new CANCEL_WAIT instruction.
So, if no data arrives before the deadline, the client sends a CANCEL_WAIT command.
The server sends a WAIT_CANCELED response to that command, always.
This gets us down to the last issue, which is that there's a race between timeout'ing,
canceling the wait and the wait completing. The client needs to handle the server sending
a STOP_WAITING followed by a WAIT_CANCELED answer.
This ensures client and server state are synchronized regardless of whether the wait
timeouts or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.
Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
Test Plan:
Added new unit test:
buck2 test caffe2/test/distributed:store
Differential Revision: D46622317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
Summary:
Staging an update to the latest fmt version triggered lots of build errors due to non-`const` methods on custom formatters. This fixes the `format()` methods to be `const` as they don't mutate any state anyway, as well as `parse()` methods that don't need to mutate internal state. This mitigates many future build errors.
Updates were identified and executed by using regular expression search/replacements such as:
`(constexpr auto parse\(ParseContext& [^)]*\)) \{` -> `$1 const {`
`(constexpr auto parse\(ParseContext& [^)]*\)) ->` -> `$1 const ->`
`(auto format\(.*, FormatContext& [^)]*\)) \{` -> `$1 const {`
`(auto format\(.*, FormatContext& [^)]*\)) ->` -> `$1 const ->`
Any changes to third-party code was then reverted. Some small changes detected from subsequent build errors were then applied.
Test Plan: CI
Differential Revision: D45463620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100616
Approved by: https://github.com/davidberard98
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera, https://github.com/huydhn
Headers under torch/csrc/distributed may be referened with relative path, e.g., "<c10d/...>". However, relative path cannot be gracefully handled by Meta internal build when the NCCL PG is hipified to support AMD/RCCL because the "hipified" header files are generated in other directories. Moreover, using absolute path for header inclusion is the state-of-the-art in most components in Pytorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute.
See D39835774 for more details about Meta internal complication.
**How to test**: commit 9e5d199 removes -I./torch/csrc/distributed in compile options. Thus use it to verify we don't miss any relative path use of torch/csrc/distributed headers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780
Approved by: https://github.com/kumpera
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74641
This makes dns hostname lookup failures retryable since in some environments such as Kubernetes they're not guaranteed to be resolvable until the job starts. Retrying this eliminates the race condition.
This also fixes `sandcastle_skip_if` when used on the class instead of the method. Previously they wouldn't inherit from TestCase so just wouldn't run under buck at all.
Fixes https://github.com/pytorch/pytorch/issues/73682
Test Plan:
Added a unit test
```
buck test //caffe2/test/distributed:test_store
```
Reviewed By: aivanou
Differential Revision: D35092284
fbshipit-source-id: d40bf187e52c41f551e4fe41c536b2b0015588ee
(cherry picked from commit f8908309d8ee64c25ee466a6b4922f34f2b7618e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73149
This PR improves the handling of the "server socket not yet listening" warning log in c10d `socket`. Instead of outputting it after every failed attempt (meaning every second), it is now written every 20 seconds. Note though that if the log level is set to `INFO`, we keep writing a detailed message every second as before with additional `errno` information.
With log level set to `WARN` the output looks like:
```
[W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry.
[W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry.
...
[E socket.cpp:726] [c10d] The client socket has timed out after 300s while trying to connect to (127.0.0.1, 29501).
```
With log level set to `INFO` (a.k.a. verbose or debug level) the output looks like:
```
[I socket.cpp:515] [c10d] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501).
[I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
[W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry.
[I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
[I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501.
[I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
...
[W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry.
...
[E socket.cpp:726] [c10d] The client socket has timed out after 300s while trying to connect to (127.0.0.1, 29501).
```
ghstack-source-id: 149778565
Test Plan: Run manual tests to verify the correctness of the log message.
Reviewed By: rohan-varma
Differential Revision: D34365217
fbshipit-source-id: 296d01fa8b1ba803432903c10686d8a75145e539
(cherry picked from commit 8ae5aff0c5ffcc3e87d27d2deba6fedf8cef45cd)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73167
This PR adds `C10D_DEBUG` and `C10D_TRACE` macros to enable fine grained logging in c10d. It also updates some log statements of `socket` to make its output less noisy.
ghstack-source-id: 149778567
Test Plan: Manual testing with different socket conditions.
Reviewed By: rohan-varma
Differential Revision: D34371426
fbshipit-source-id: a852b05ec353b18b0540ce5f803666c3da21ddd7
(cherry picked from commit 4519b06ac57f177dfc086bc10e8e1a746ba0870d)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71966
Fix a few issues that block migration to fmt 8.x:
1. Format strings must be known at compile time by default
2. `formatter` specialization must be visible when formatting an object
Test Plan: sandcastleit
Reviewed By: cbalioglu
Differential Revision: D33835157
fbshipit-source-id: 642d36ae7cd4a3894aff1a6ecc096f72348df864
(cherry picked from commit 970ad5bc010e48d8c3e8f5818e9ab05a3785968e)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71864
A very minor change in one of the warning messages of `socket` to make it clear that it is a transient issue and not an error.
```
[W socket.cpp:634] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused).
```
becomes
```
[W socket.cpp:634] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry.
```
ghstack-source-id: 147716736
Test Plan: No behavioral change. Run the existing unit and integration tests.
Reviewed By: H-Huang
Differential Revision: D33792888
fbshipit-source-id: 79b287325945d0353c4568d84d1b52c820783cfc
(cherry picked from commit 9e5b627551fdf3bd6d06eb669883f9423d0999f1)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226
**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**
This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:
- significantly better error handling and much more verbose logging (see the example output below)
- explicit support for IPv6 and dual-stack sockets
- correct handling of signal interrupts
- better Windows support
A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.
## Example Output
```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987
Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.
Reviewed By: Babar, wilson100hong, mrshenli
Differential Revision: D32372333
fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a