pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
cyy	a2396b2dd8	[2/N] Fix extra warnings brought by clang-tidy-17 (#137459 ) Follows #137407 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459 Approved by: https://github.com/Skylion007	2024-10-08 19:05:02 +00:00
Tristan Rice	bebf5302ba	TCPStoreLibUvBackend: trace operations (#136320) Summary: This logs all operations when the trace log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur, as it logs all hosts and the keys that they're modifying. To minimize total data, only the keys are logged, not the values. This also makes the C10D_* macros much more efficient: previously the log string was always formatted even if it would never be printed, which is very wasteful for detailed tracing. The macros are now gated with an if statement, which gives the same behavior without the formatting cost when the level is disabled. Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). 
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu	2024-09-20 00:53:21 +00:00
Tristan Rice	9027db1ab8	TCPStore: fix remote address (#131773 ) (#131913 ) Summary: This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo. This relands it since it got reverted due to a fmt::format issue internally. Original Pull Request: https://github.com/pytorch/pytorch/pull/131773 Approved by: https://github.com/kurman Test Plan: Enable debug logs and verify addresses are correct ``` TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v buck2 test @//mode/dev-nosan //caffe2/test/distributed:store ``` Differential Revision: D60296583 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131913 Approved by: https://github.com/kurman, https://github.com/rsdcastro, https://github.com/Skylion007	2024-07-30 17:27:33 +00:00
PyTorch MergeBot	696e83a1da	Revert "TCPStore: fix remote address (#131773 )" This reverts commit 9039131a89a5fdb8746bd86b0a4dd91559821e36. Reverted https://github.com/pytorch/pytorch/pull/131773 on behalf of https://github.com/clee2000 due to broke internal builds D60265883, something about formatter ([comment](https://github.com/pytorch/pytorch/pull/131773#issuecomment-2253123800))	2024-07-26 16:47:57 +00:00
Tristan Rice	9039131a89	TCPStore: fix remote address (#131773 ) This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo. Test plan: Enable debug logs and verify addresses are correct ``` TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131773 Approved by: https://github.com/kurman	2024-07-25 21:55:25 +00:00
Aaron Gokaslan	83eedf66b9	Update libfmt submodule to 11.0.1 (#130628) Update libfmt to 11.0.1; this reopens https://github.com/pytorch/pytorch/pull/129962. It requires a kineto update, and the new version moves fmt::join into a separate include, so that include was added where necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628 Approved by: https://github.com/aaronenyeshi	2024-07-16 06:12:11 +00:00
cyy	29861779ce	[2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236 ) Follows #128301. The changes were made by grep and sed Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236 Approved by: https://github.com/ezyang	2024-07-09 03:17:24 +00:00
Tristan Rice	0298560ca2	TCPStore: improve connect and retry logic (#129261 ) We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times. This PR does a few things: * Retry that connect and validate up to the specified timeout. * Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep. * Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141 Test plan: ``` python test/distributed/test_store.py -v ./build/bin/BackoffTest ``` Will do internal testing with some large scale jobs to ensure TCPStore works correctly. At 4k scale: 4x improvement ``` tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 1.98 secs fish external usr time 0.93 secs 91.00 micros 0.93 secs sys time 1.98 secs 954.00 micros 1.97 secs tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10) tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10) started 0 init 0 set 0 joined all ________________________________________________________ Executed in 8.20 secs fish external usr time 2.15 secs 0.00 micros 2.15 secs sys time 2.76 secs 843.00 micros 2.76 secs ``` ```py import time import os import threading from multiprocessing import Pool WORLD_SIZE = 10000 import torch.distributed as dist def run(rank): should_log = rank % (WORLD_SIZE // 10) == 0 if should_log: print(f"started {rank}") store = dist.TCPStore( host_name="devvm4382.nao0.facebook.com", port=29500, world_size=WORLD_SIZE, is_master=rank == 0, use_libuv=True, ) if should_log: 
print(f"init {rank}") store.set(f"key{rank}", "1234") if should_log: print(f"set {rank}") del store def noop(rank): pass print("starting pool") with Pool(WORLD_SIZE) as pool: pool.map(noop, range(WORLD_SIZE), 1) print("pool hot") start = time.time() pool.map(run, range(WORLD_SIZE), 1) print("run finished", time.time()-start) ``` ``` tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10) starting pool pool hot started 0 [W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it. started 1000 init 1000 set 1000 started 2000 init 2000 set 2000 started 3000 init 3000 set 3000 started 4000 init 4000 set 4000 started 5000 init 5000 set 5000 started 6000 init 6000 set 6000 started 7000 init 7000 set 7000 started 8000 init 8000 set 8000 started 9000 init 9000 set 9000 init 0 set 0 run finished 0.705092191696167 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261 Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o	2024-06-25 19:24:22 +00:00
Tristan Rice	52d4442a00	[c10d] Socket, TCPStore: add better logging (#128673 ) This adds better logging of errors to the socket and TCPStore classes. All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged. It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky. Test plan: ``` python test/distributed/test_store.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673 Approved by: https://github.com/c-p-i-o	2024-06-14 23:08:29 +00:00
cyy	be7be9fa16	[Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102 ) This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102 Approved by: https://github.com/ezyang	2024-05-30 16:19:53 +00:00
cyy	6d8bb0e984	[Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884 ) This PR fixes some clang-tidy warnings in distributed code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122884 Approved by: https://github.com/kwen2501	2024-03-31 09:06:35 +00:00
Nikita Shulga	e49ea87162	Fix socket.cpp compilation using gcc-9.4 (#111002) Otherwise the following warning, which WERROR turns into an error, is emitted at compile time: ``` In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30: /home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated] 340 | constexpr const size_t codecvt_result<CodeUnit>::max_size; | ^~~~~~~~~~~~~~~~~~~~~~~~ /home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ 335 | static constexpr const size_t max_size = 32; | ^~~~~~~~ ``` or the following when using clang as the host compiler: ``` In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30: /Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated] constexpr const size_t codecvt_result<CodeUnit>::max_size; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002 Approved by: https://github.com/drisspg	2023-10-11 05:16:00 +00:00
Rodrigo Kumpera	c26270c733	[C10D] Even more store scalability work. (#109218) Fix a bug in socket.cpp timeout detection that only shows up with 10k ranks. Make the minimum wait time in _store_based_barrier adaptive, based on the number of ranks. Longer timeouts give more room for the store to do productive work when swamped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109218 Approved by: https://github.com/XilunWu ghstack dependencies: #109217	2023-09-22 21:27:09 +00:00
Rodrigo Kumpera	a6dab86259	[C10d] Fix TCPSTore::wait to be robust to interruptions. (#108425 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108425 Approved by: https://github.com/daulet-askarov, https://github.com/fegin	2023-09-08 00:12:20 +00:00
Pritam Damania	704b0b3c67	[RESUBMIT] Standardize on error types for distributed errors. (#108191) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, this PR adds these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191 Approved by: https://github.com/H-Huang	2023-08-30 21:47:39 +00:00
PyTorch MergeBot	d4ff06ec84	Revert "Standardize on error types for distributed errors. (#107651 )" This reverts commit 0e2317479b3cb987e1f3230876654f156bd11a09. Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))	2023-08-28 23:58:33 +00:00
Pritam Damania	0e2317479b	Standardize on error types for distributed errors. (#107651) We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc. This results in messy error-handling code, somewhat like this: ``` if "NCCL" in exception_str: ... if "Timed out initializing process group in store based barrier on rank" in exception_str: ... if "The client socket has timed out after" in exception_str: ... if "Broken pipe" in exception_str: ... if "Connection reset by peer" in exception_str: ... ``` To address this issue, this PR adds these error types: 1. DistError - the base type of all distributed errors 2. DistBackendError - this already existed and referred to PG backend errors 3. DistStoreError - for errors originating from the store 4. DistNetworkError - for general network errors coming from the socket library Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651 Approved by: https://github.com/H-Huang	2023-08-28 21:58:15 +00:00
Xilun Wu	49fbaa29e6	[c10d] Increase socket buffer size to allow ProcessGroup init up to 12k ranks (#107878) The c10d socket and gloo listener both set their buffer size to 2048, which causes connection issues at 4k scale. This diff sets the buffer size to `-1`, which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crashing. The experiment shows that 12k ranks can be created successfully without a crash. The original diff was split for OSS vs. internal. Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit. Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107878 Approved by: https://github.com/H-Huang, https://github.com/fduwjj	2023-08-25 00:06:30 +00:00
Rodrigo Kumpera	7c3c3dd7ca	[C10D] Reimplement TCPStore wait timeout logic. (#100594) Current TCPStore wait logic leaves the client socket in a bad state if waiting times out. This happens because all recv functions raise an exception on timeout and that's it. The problem is that on timeout we need to unregister the wait. We implement this with client-side cancellation by adding a new CANCEL_WAIT instruction. So, if no data arrives before the deadline, the client sends a CANCEL_WAIT command. The server sends a WAIT_CANCELED response to that command, always. This gets us down to the last issue, which is that there's a race between timing out, canceling the wait, and the wait completing. The client needs to handle the server sending a STOP_WAITING followed by a WAIT_CANCELED answer. This ensures client and server state are synchronized regardless of whether the wait times out or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594 Approved by: https://github.com/H-Huang	2023-07-11 00:36:41 +00:00
Jon Maltiel Swenson	0da38409a0	[gloo] Make it possible for gloo TCPStore to take over an existing socket fd (#103478 ) Summary: This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port. Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved). Test Plan: Added new unit test: buck2 test caffe2/test/distributed:store Differential Revision: D46622317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478 Approved by: https://github.com/H-Huang	2023-06-16 17:15:56 +00:00
Yuxin Wu	5aefa61d2f	Fix calls to unqualified format_to to not clash with C++20's std::format_to (#103130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/103130 Approved by: https://github.com/Skylion007	2023-06-09 18:19:07 +00:00
Scott Ramsby	19d8d31c94	[fbcode/caffe2] Make fmt formatter methods const (#100616 ) Summary: Staging an update to the latest fmt version triggered lots of build errors due to non-`const` methods on custom formatters. This fixes the `format()` methods to be `const` as they don't mutate any state anyway, as well as `parse()` methods that don't need to mutate internal state. This mitigates many future build errors. Updates were identified and executed by using regular expression search/replacements such as: `(constexpr auto parse$ParseContext& [^)]$) \{` -> `$1 const {` `(constexpr auto parse$ParseContext& [^)]$) ->` -> `$1 const ->` `(auto format$., FormatContext& [^)]$) \{` -> `$1 const {` `(auto format$., FormatContext& [^)]$) ->` -> `$1 const ->` Any changes to third-party code was then reverted. Some small changes detected from subsequent build errors were then applied. Test Plan: CI Differential Revision: D45463620 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100616 Approved by: https://github.com/davidberard98	2023-05-06 01:38:25 +00:00
Min Si	1ad0048b64	Refactor distribuetd to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is standard practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta-internal complication. How to test: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options, so use it to verify that no relative-path use of torch/csrc/distributed headers was missed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera, https://github.com/huydhn	2022-09-30 05:13:50 +00:00
PyTorch MergeBot	a50d8864fc	Revert "Refactor distribuetd to use absolute header path (#85780 )" This reverts commit 668082718aefce95ecc1b1c312ea6f127b2c662e. Reverted https://github.com/pytorch/pytorch/pull/85780 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it breaks build due to a missing file <c10d/Store.hpp>	2022-09-30 02:04:29 +00:00
Min Si	668082718a	Refactor distribuetd to use absolute header path (#85780) Headers under torch/csrc/distributed may be referenced with a relative path, e.g., "<c10d/...>". However, relative paths cannot be gracefully handled by the Meta internal build when the NCCL PG is hipified to support AMD/RCCL, because the "hipified" header files are generated in other directories. Moreover, using absolute paths for header inclusion is standard practice in most components of PyTorch. Thus, this patch refactors all header paths in torch/csrc/distributed to be absolute. See D39835774 for more details about the Meta-internal complication. How to test: commit 9e5d199 removes -I./torch/csrc/distributed from the compile options, so use it to verify that no relative-path use of torch/csrc/distributed headers was missed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85780 Approved by: https://github.com/kumpera	2022-09-30 00:27:24 +00:00
jjsjann123	9e86796fe3	simple c10 implementation for std::call_once (#78051) Works around a long-standing bug in std::call_once: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146 It could hang during re-entry after exception handling. Added a c10 implementation built on a bulky mutex. Not the most efficient thing, but at least it shouldn't hang. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78051 Approved by: https://github.com/albanD	2022-06-28 15:47:03 +00:00
Michael Suo	30fb2c4aba	[lint] autoformat test/cpp and torch/csrc Let's have some fun. Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828 Approved by: https://github.com/ezyang	2022-06-11 21:11:16 +00:00
Tristan Rice	5b915e844c	c10d: retry dns lookup failures (#74641 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74641 This makes dns hostname lookup failures retryable since in some environments such as Kubernetes they're not guaranteed to be resolvable until the job starts. Retrying this eliminates the race condition. This also fixes `sandcastle_skip_if` when used on the class instead of the method. Previously they wouldn't inherit from TestCase so just wouldn't run under buck at all. Fixes https://github.com/pytorch/pytorch/issues/73682 Test Plan: Added a unit test ``` buck test //caffe2/test/distributed:test_store ``` Reviewed By: aivanou Differential Revision: D35092284 fbshipit-source-id: d40bf187e52c41f551e4fe41c536b2b0015588ee (cherry picked from commit f8908309d8ee64c25ee466a6b4922f34f2b7618e)	2022-03-24 19:51:09 +00:00
Can Balioglu	fc3c7fb756	Make "server socket not listening" warning logs less noisy (#73149 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73149 This PR improves the handling of the "server socket not yet listening" warning log in c10d `socket`. Instead of outputting it after every failed attempt (meaning every second), it is now written every 20 seconds. Note though that if the log level is set to `INFO`, we keep writing a detailed message every second as before with additional `errno` information. With log level set to `WARN` the output looks like: ``` [W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry. [W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry. ... [E socket.cpp:726] [c10d] The client socket has timed out after 300s while trying to connect to (127.0.0.1, 29501). ``` With log level set to `INFO` (a.k.a. verbose or debug level) the output looks like: ``` [I socket.cpp:515] [c10d] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29501). [I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501. [I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry. [W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry. [I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501. [I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry. [I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501. [I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry. [I socket.cpp:582] [c10d] The client socket is attempting to connect to [localhost]:29501. 
[I socket.cpp:643] [c10d] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry. ... [W socket.cpp:598] [c10d] No socket on (127.0.0.1, 29501) is listening yet, will retry. ... [E socket.cpp:726] [c10d] The client socket has timed out after 300s while trying to connect to (127.0.0.1, 29501). ``` ghstack-source-id: 149778565 Test Plan: Run manual tests to verify the correctness of the log message. Reviewed By: rohan-varma Differential Revision: D34365217 fbshipit-source-id: 296d01fa8b1ba803432903c10686d8a75145e539 (cherry picked from commit 8ae5aff0c5ffcc3e87d27d2deba6fedf8cef45cd)	2022-02-24 02:33:05 +00:00
Can Balioglu	e143f98010	Introduce debug and trace log levels in c10d (#73167 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73167 This PR adds `C10D_DEBUG` and `C10D_TRACE` macros to enable fine grained logging in c10d. It also updates some log statements of `socket` to make its output less noisy. ghstack-source-id: 149778567 Test Plan: Manual testing with different socket conditions. Reviewed By: rohan-varma Differential Revision: D34371426 fbshipit-source-id: a852b05ec353b18b0540ce5f803666c3da21ddd7 (cherry picked from commit 4519b06ac57f177dfc086bc10e8e1a746ba0870d)	2022-02-24 02:33:05 +00:00
Victor Zverovich	2821574eea	[caffe2] Fix compilation with fmt 8.x (#71966 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71966 Fix a few issues that block migration to fmt 8.x: 1. Format strings must be known at compile time by default 2. `formatter` specialization must be visible when formatting an object Test Plan: sandcastleit Reviewed By: cbalioglu Differential Revision: D33835157 fbshipit-source-id: 642d36ae7cd4a3894aff1a6ecc096f72348df864 (cherry picked from commit 970ad5bc010e48d8c3e8f5818e9ab05a3785968e)	2022-01-28 21:41:13 +00:00
Can Balioglu	76a2c22341	[c10d] Improve the "not yet listening" warning message of `socket` (#71864 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71864 A very minor change in one of the warning messages of `socket` to make it clear that it is a transient issue and not an error. ``` [W socket.cpp:634] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused). ``` becomes ``` [W socket.cpp:634] The server socket on [localhost]:29501 is not yet listening (errno: 111 - Connection refused), will retry. ``` ghstack-source-id: 147716736 Test Plan: No behavioral change. Run the existing unit and integration tests. Reviewed By: H-Huang Differential Revision: D33792888 fbshipit-source-id: 79b287325945d0353c4568d84d1b52c820783cfc (cherry picked from commit 9e5b627551fdf3bd6d06eb669883f9423d0999f1)	2022-01-27 02:28:19 +00:00
Can Balioglu	6e640a0acf	Revise the socket implementation of c10d (#68226 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226 Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review. This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer: - significantly better error handling and much more verbose logging (see the example output below) - explicit support for IPv6 and dual-stack sockets - correct handling of signal interrupts - better Windows support A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets. ## Example Output ``` [I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501). [I logging.h:21] The client socket is attempting to connect to [localhost]:29501. [W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying... [I logging.h:21] The server socket will attempt to listen on an IPv6 address. [I logging.h:21] The server socket is attempting to listen on [::]:29501. [I logging.h:21] The server socket has started to listen on [::]:29501. [I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501). [I logging.h:21] The client socket is attempting to connect to [localhost]:29501. [I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650. [I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650. [I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722. [I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722. 
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501). [I logging.h:21] The client socket is attempting to connect to [localhost]:29501. [I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724. [I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724. [I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501). [I logging.h:21] The client socket is attempting to connect to [localhost]:29501. [I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726. [I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726. ``` ghstack-source-id: 143501987 Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10. Reviewed By: Babar, wilson100hong, mrshenli Differential Revision: D32372333 fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a	2021-11-16 20:49:25 -08:00
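
The retry change described in commit 0298560ca2 (exponential backoff with jitter instead of a fixed 1s sleep, retried up to the specified timeout) can be illustrated with a minimal Python sketch. This is not the C++ code from the PR: the names `backoff_delays` and `retry_until_timeout` and all default values are assumptions chosen for illustration.

```python
import random
import time


def backoff_delays(initial=0.05, multiplier=2.0, max_delay=5.0, jitter=0.5, rng=None):
    """Yield an endless series of sleep intervals in seconds.

    The base interval grows geometrically up to max_delay; each yielded value
    is the base scaled by a random factor in [1 - jitter, 1 + jitter].
    All defaults are illustrative, not the values used by TCPStore.
    """
    rng = rng or random.Random()
    base = initial
    while True:
        yield base * rng.uniform(1.0 - jitter, 1.0 + jitter)
        base = min(base * multiplier, max_delay)


def retry_until_timeout(attempt, timeout, sleep=time.sleep, clock=time.monotonic):
    """Call attempt() until it succeeds or the overall timeout elapses."""
    deadline = clock() + timeout
    for delay in backoff_delays():
        try:
            return attempt()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise TimeoutError(f"gave up after {timeout}s")
            # Never sleep past the overall deadline.
            sleep(min(delay, remaining))
```

The jitter spreads out reconnect attempts from many workers so they do not all hit the server's listen backlog at the same instant, which is the overflow scenario the commit message describes.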

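The client-side cancellation described in commit 7c3c3dd7ca can be sketched as a small state machine. The message names CANCEL_WAIT, WAIT_CANCELED and STOP_WAITING come from the commit message; everything else here (the `ScriptedLink` stand-in for the socket, the `wait_with_cancel` function, and the choice to report a raced completion as success) is an assumption of this sketch, not the TCPStore implementation.

```python
from collections import deque
from enum import Enum, auto


class Reply(Enum):
    STOP_WAITING = auto()   # the awaited keys are now available
    WAIT_CANCELED = auto()  # acknowledgement of a CANCEL_WAIT command


class ScriptedLink:
    """Hypothetical stand-in for the client socket: scripted replies, recorded commands."""

    def __init__(self, before_deadline, after_cancel):
        self._before = deque(before_deadline)
        self._after = deque(after_cancel)
        self.sent = []

    def send(self, command):
        self.sent.append(command)

    def recv_before_deadline(self):
        # None models "no data arrived before the deadline".
        return self._before.popleft() if self._before else None

    def recv(self):
        return self._after.popleft()

    def pending(self):
        return len(self._before) + len(self._after)


def wait_with_cancel(link):
    """Return True if the wait completed, False if it timed out and was canceled."""
    link.send("WAIT")
    if link.recv_before_deadline() is Reply.STOP_WAITING:
        return True
    # Timed out locally: ask the server to unregister the wait.
    link.send("CANCEL_WAIT")
    reply = link.recv()
    if reply is Reply.STOP_WAITING:
        # Race: the wait completed while the cancel was in flight. The server
        # always answers the cancel, so consume that reply to stay in sync.
        if link.recv() is not Reply.WAIT_CANCELED:
            raise RuntimeError("expected WAIT_CANCELED after STOP_WAITING")
        return True
    if reply is not Reply.WAIT_CANCELED:
        raise RuntimeError("unexpected reply to CANCEL_WAIT")
    return False
```

In every path the client reads exactly the replies the server sends, so no stray response is left on the socket for the next request to trip over.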
33 Commits
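
The port race described in commit 0da38409a0 (bind to port 0, close the socket, then hope the port is still free) goes away if the socket is never closed. A minimal sketch using only the Python standard library follows; the function name `reserve_listening_socket` is hypothetical, and the step that hands the descriptor to a store server is deliberately not shown.

```python
import socket


def reserve_listening_socket(host="127.0.0.1"):
    """Bind to an OS-chosen free port and keep the socket open and listening.

    Returns (socket, port). The port stays claimed for as long as the socket
    is open, so no other process can take it in the meantime.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen()
    return sock, sock.getsockname()[1]
```

A server that takes over the already-bound socket, for example via `sock.fileno()`, starts listening on exactly the port the OS assigned, with no window between discovering the port and using it.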