[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)

Works around issues such as #153960 and #152623.

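NCCL 2.26 appears to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode by default to work around them. Previously, torch turned it on by default under eager init (i.e. when `device_id` is passed) to avoid init overhead. Only this automatic selection changes; the higher-priority settings are still respected.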

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
Author: Ke Wen
Date: 2025-05-21 13:28:19 -07:00
Committed by: PyTorch MergeBot
Parent: fae6f6c9ca
Commit: 87fc5af1f6

@@ -1102,9 +1102,12 @@ bool ProcessGroupNCCL::useNonblocking() {
     useNonblocking_ = nbEnv;
   }
   // 3rd priority: automatically use nonblocking if we are in eager init mode
-  else if (getBoundDeviceId()) {
-    useNonblocking_ = true;
-  }
+  // Note: this automatic selection is disabled in torch 2.7.1 to work around a
+  // hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
+  // bug. See https://github.com/pytorch/pytorch/issues/153960
+  // else if (getBoundDeviceId()) {
+  //   useNonblocking_ = true;
+  // }
   // 4th priority: otherwise, nonblocking = false to preserve old behavior
   else {
     useNonblocking_ = false;
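
For readers who want the resulting selection order in isolation, here is a minimal standalone sketch of the logic after this change. It is not the actual `ProcessGroupNCCL` code: `NonblockingInputs` and `resolveNonblocking` are hypothetical names, and the first priority (an explicit user setting) is an assumption, since the hunk only shows the second priority onward.

```cpp
#include <optional>

// Hypothetical stand-in for the state the real method consults.
struct NonblockingInputs {
  std::optional<bool> userSetting;  // assumed 1st priority: explicit user option
  std::optional<bool> envSetting;   // 2nd priority: value from the environment
  bool eagerInit = false;           // true when a device_id was bound at init
};

// Resolve the mode as it behaves after this commit.
bool resolveNonblocking(const NonblockingInputs& in) {
  if (in.userSetting.has_value()) {
    return *in.userSetting;
  }
  if (in.envSetting.has_value()) {
    return *in.envSetting;
  }
  // Former 3rd priority (eagerInit implies non-blocking) is disabled,
  // so eagerInit is deliberately ignored here.
  return false;
}
```

Under these assumptions, eager init alone now resolves to blocking mode, while an explicit opt-in through either higher-priority input still enables non-blocking mode.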