Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 21:14:14 +08:00)
Fix logging for aborted communicators in ProcessGroupNCCL. (#33147)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33147

The log mentioned that it is aborting communicators even if `blockingWait_` was false. This was incorrect, and I updated the logging to reflect the appropriate behavior.

ghstack-source-id: 98025017

Test Plan: waitforbuildbot

Differential Revision: D19817967

fbshipit-source-id: fb3415af2cc99eb20981ceaa5203c0a1880fd6f3
Committed by: Facebook Github Bot
Parent: 1a589f50bd
Commit: c90b393c00
@@ -328,10 +328,10 @@ void ProcessGroupNCCL::ncclCommWatchdogInternal() {
         }
 
         if (checkForNCCLErrors(ncclComms)) {
-          LOG(INFO) << "Received NCCL errors for communicators in the cache, "
-                       "aborting the communicators.";
+          LOG(INFO) << "Received NCCL errors for communicators in the cache";
 
           if (blockingWait_) {
+            LOG(INFO) << "Aborting communicators that received errors";
             // We should not abort the communicators if we are performing a
             // non-blocking wait(). The reason for this is that if we abort the
             // nccl communicator, wait() might not throw exceptions and
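
The logging behavior after this change can be illustrated with a minimal standalone sketch. It is not code from ProcessGroupNCCL: the function `watchdogLogMessages` and its two boolean parameters are hypothetical stand-ins for the result of `checkForNCCLErrors(ncclComms)` and the `blockingWait_` member, and the messages are returned in a vector instead of being written with `LOG(INFO)`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of PyTorch: returns the messages the
// watchdog would log for one set of cached communicators after this change.
// hasNcclErrors stands in for checkForNCCLErrors(ncclComms);
// blockingWait stands in for the blockingWait_ member.
std::vector<std::string> watchdogLogMessages(bool hasNcclErrors,
                                             bool blockingWait) {
  std::vector<std::string> messages;
  if (hasNcclErrors) {
    // Always reported when errors are detected.
    messages.push_back("Received NCCL errors for communicators in the cache");
    if (blockingWait) {
      // Only reported when the communicators are actually aborted.
      messages.push_back("Aborting communicators that received errors");
    }
  }
  return messages;
}
```

Under these assumptions, calling `watchdogLogMessages(true, false)` yields only the "Received NCCL errors" message, while `watchdogLogMessages(true, true)` also yields the "Aborting communicators" message. Before the change, the abort wording was emitted in both cases.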