Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 21:14:14 +08:00)
[PyTorch Distributed] Add debug hint for NCCL async system error (#73897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73897

Add a debug hint that an async NCCL system error can be caused by the unexpected exit of a remote process when there is no actual network issue. For example, when a remote process exits, a local process can see a closed network connection error. The hint helps direct debugging toward the remote process.

Test Plan: unit tests

Reviewed By: pritamdamania87, rohan-varma

Differential Revision: D34702348

fbshipit-source-id: d19f9116e9efe5f6d76c0158a7a447616437ca69
(cherry picked from commit 005e74b7b6764ecd832b3410063285bff2411b56)
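The scenario in the summary can be reproduced in miniature without NCCL. The following is a minimal sketch, assuming a POSIX system: `observePeerExit` is a hypothetical helper (not part of PyTorch or NCCL) that uses a local socket pair in place of a real network connection and closes one end to stand in for a remote process that exits unexpectedly.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <string>

// Hypothetical helper for illustration only. It creates a connected pair of
// stream sockets, closes one end to simulate a remote peer that has exited,
// and reports what the surviving end observes when it tries to receive.
std::string observePeerExit() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
    return "socketpair failed";
  }

  // The "remote" side goes away without sending anything.
  close(fds[1]);

  // The "local" side only finds out when it next touches the connection.
  char buf[1];
  ssize_t n = recv(fds[0], buf, sizeof(buf), 0);
  close(fds[0]);

  if (n == 0) {
    // Zero bytes on a stream socket means the peer closed the connection.
    return "connection closed by peer";
  }
  if (n < 0) {
    return "recv error";
  }
  return "unexpected data";
}
```

Nothing is wrong with the local side or the transport here, yet the local side is the one that reports the failure. That is the situation the new hint is meant to flag.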
@@ -25,7 +25,8 @@ const inline char* getNcclErrorDetailStr(ncclResult_t error, c10::optional<std::
     case ncclUnhandledCudaError:
       return "ncclUnhandledCudaError: Call to CUDA function failed.";
     case ncclSystemError:
-      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
+      return "ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. "
+            "It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.";
     case ncclInternalError:
       return "ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption";
     case ncclInvalidArgument:
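The new message is split across two adjacent string literals, which the compiler joins into a single string. The following is a minimal sketch of that pattern that compiles without `nccl.h`. `MockNcclResult` and `mockErrorDetail` are hypothetical stand-ins for `ncclResult_t` and `getNcclErrorDetailStr`, and the fallback string after the switch is a placeholder of this sketch, not taken from the source.

```cpp
#include <string>

// Hypothetical stand-in for ncclResult_t so the sketch builds without NCCL.
// Only the error codes visible in the diff are modeled.
enum class MockNcclResult {
  UnhandledCudaError,
  SystemError,
  InternalError,
};

// Hypothetical, simplified analogue of getNcclErrorDetailStr.
inline const char* mockErrorDetail(MockNcclResult error) {
  switch (error) {
    case MockNcclResult::UnhandledCudaError:
      return "ncclUnhandledCudaError: Call to CUDA function failed.";
    case MockNcclResult::SystemError:
      // Adjacent literals are concatenated at compile time, so the trailing
      // space in the first literal is what separates the two sentences.
      return "ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. "
             "It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.";
    case MockNcclResult::InternalError:
      return "ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruption";
  }
  // Placeholder fallback for this sketch only.
  return "Unknown error";
}
```

A caller can treat the result as one ordinary C string, for example by wrapping it in `std::string` before appending it to a log line.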