[PyTorch Distributed] Add debug hint for NCCL async system error (#73897)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73897

Add a debug hint that an async system error can be caused by the unexpected exit of
a remote process rather than an actual network issue. For example, when a remote process
exits, a local process can see a closed network connection error. The hint helps direct
debugging toward the remote process.
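
The failure mode described above can be reproduced outside NCCL. The sketch below is only an illustration under stated assumptions: it uses a POSIX Unix-domain socket pair and `fork()` in place of NCCL's real transport, and `localSeesClosureAfterPeerExit` is a hypothetical helper name, not part of PyTorch or NCCL. It shows that when the peer process exits, the local side sees only a closed connection, with no indication of why the peer went away.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not a PyTorch or NCCL API): returns true if the local
// end of a connection observes a closure after the peer process exits.
bool localSeesClosureAfterPeerExit() {
  int fds[2];
  // A connected pair of stream sockets stands in for a network connection.
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return false;
  }

  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return false;
  }
  if (pid == 0) {
    // "Remote" peer: exits unexpectedly without sending anything.
    // Its copies of both descriptors are closed by the kernel on exit.
    _exit(1);
  }

  // "Local" process: keep only our end of the connection.
  close(fds[1]);

  char byte;
  // Blocks until the peer's end is closed; 0 means an orderly end-of-stream.
  ssize_t n = recv(fds[0], &byte, 1, 0);

  close(fds[0]);
  waitpid(pid, nullptr, 0);  // Reap the child so it does not linger as a zombie.
  return n == 0;
}
```

From the local side the only symptom is the closed connection, which is why the hint points at the remote process.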

Test Plan: unit tests

Reviewed By: pritamdamania87, rohan-varma

Differential Revision: D34702348

fbshipit-source-id: d19f9116e9efe5f6d76c0158a7a447616437ca69
(cherry picked from commit 005e74b7b6764ecd832b3410063285bff2411b56)
Ke Wen
2022-03-10 10:25:08 -08:00
committed by PyTorch MergeBot
parent 420385eb60
commit 122f8648ab

@@ -25,7 +25,8 @@ const inline char* getNcclErrorDetailStr(ncclResult_t error, c10::optional<std::
     case ncclUnhandledCudaError:
       return "ncclUnhandledCudaError: Call to CUDA function failed.";
     case ncclSystemError:
-      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
+      return "ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. "
+          "It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.";
     case ncclInternalError:
       return "ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption";
     case ncclInvalidArgument: