[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)

Work around issues like #153960 and #152623. NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around them. Previously, torch turned it on by default under eager init (i.e. when `device_id` is passed) to avoid init overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154055
Approved by: https://github.com/atalman
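For context, here is a minimal sketch of the Python-side setup this change affects. It assumes a standard torchrun launch; the LOCAL_RANK variable, the 2-process layout, and the file name example.py are illustrative, not part of the PR. Passing device_id to init_process_group requests eager NCCL communicator init, which previously also switched NCCL into non-blocking API mode by default and, after this change, no longer does.

# Illustrative only: a typical eager-init setup, launched with e.g.
#   torchrun --nproc-per-node=2 example.py
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
device = torch.device(f"cuda:{local_rank}")

# Passing device_id requests eager NCCL communicator initialization.
# Before this PR, eager init also defaulted to NCCL's non-blocking API mode;
# after it, blocking mode is used unless explicitly requested (see the
# environment-variable sketch after the diff below).
dist.init_process_group(backend="nccl", device_id=device)

t = torch.ones(1, device=device)
dist.all_reduce(t)  # communicators are already initialized at this point

dist.destroy_process_group()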
@@ -1102,9 +1102,12 @@ bool ProcessGroupNCCL::useNonblocking() {
     useNonblocking_ = nbEnv;
   }
   // 3rd priority: automatically use nonblocking if we are in eager init mode
-  else if (getBoundDeviceId()) {
-    useNonblocking_ = true;
-  }
+  // Note: this automatic selection is disabled in torch 2.7.1 to work around a
+  // hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
+  // bug. See https://github.com/pytorch/pytorch/issues/153960
+  // else if (getBoundDeviceId()) {
+  //   useNonblocking_ = true;
+  // }
   // 4th priority: otherwise, nonblocking = false to preserve old behavior
   else {
     useNonblocking_ = false;
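The environment-variable branch shown above (useNonblocking_ = nbEnv) is unchanged, so users on an unaffected NCCL build can still opt back into non-blocking mode explicitly. A hedged sketch follows, assuming the TORCH_NCCL_USE_COMM_NONBLOCKING variable is read when the NCCL process group is constructed, so it must be set before init_process_group; exporting it in the launcher environment is the safer route.

# Sketch: explicitly opting back into non-blocking NCCL API mode after this
# change. Safest is to export the variable in the launch environment, e.g.
#   TORCH_NCCL_USE_COMM_NONBLOCKING=1 torchrun --nproc-per-node=2 example.py
# Setting it in-process before the process group is created should also work,
# assuming the value is read when the NCCL process group is constructed.
import os

import torch
import torch.distributed as dist

os.environ.setdefault("TORCH_NCCL_USE_COMM_NONBLOCKING", "1")  # opt-in override

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
device = torch.device(f"cuda:{local_rank}")
dist.init_process_group(backend="nccl", device_id=device)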