Fix logging for aborted communicators in ProcessGroupNCCL. (#33147)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33147

The watchdog logged that it was aborting communicators even when
`blockingWait_` was false, although communicators are only aborted when
`blockingWait_` is true. The logging now reports the abort only in that case.
ghstack-source-id: 98025017
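
As a rough illustration of the corrected behavior, the sketch below models which messages the watchdog emits for a single pass. It is not the ProcessGroupNCCL code: `watchdogLogLines` and its two boolean parameters are hypothetical stand-ins for the result of `checkForNCCLErrors(ncclComms)` and the `blockingWait_` member, and the messages are collected into a vector rather than sent through `LOG(INFO)`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of ProcessGroupNCCL: returns the messages the
// watchdog would log for one pass.
//   hasNcclErrors stands in for checkForNCCLErrors(ncclComms)
//   blockingWait  stands in for the blockingWait_ member
std::vector<std::string> watchdogLogLines(bool hasNcclErrors, bool blockingWait) {
  std::vector<std::string> lines;
  if (hasNcclErrors) {
    // Always reported when errors are detected.
    lines.push_back("Received NCCL errors for communicators in the cache");
    if (blockingWait) {
      // Reported only on the path that actually aborts the communicators.
      lines.push_back("Aborting communicators that received errors");
    }
  }
  return lines;
}
```

Under these assumptions, `watchdogLogLines(true, false)` yields only the "Received NCCL errors" message, while `watchdogLogLines(true, true)` yields both messages.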

Test Plan: waitforbuildbot

Differential Revision: D19817967

fbshipit-source-id: fb3415af2cc99eb20981ceaa5203c0a1880fd6f3
Pritam Damania
2020-02-17 14:41:05 -08:00
committed by Facebook Github Bot
parent 1a589f50bd
commit c90b393c00

@@ -328,10 +328,10 @@ void ProcessGroupNCCL::ncclCommWatchdogInternal() {
       }
       if (checkForNCCLErrors(ncclComms)) {
-        LOG(INFO) << "Received NCCL errors for communicators in the cache, "
-                     "aborting the communicators.";
+        LOG(INFO) << "Received NCCL errors for communicators in the cache";
         if (blockingWait_) {
+          LOG(INFO) << "Aborting communicators that received errors";
           // We should not abort the communicators if we are performing a
           // non-blocking wait(). The reason for this is that if we abort the
           // nccl communicator, wait() might not throw exceptions and