Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25012
Resubmitting https://github.com/pytorch/pytorch/pull/22907 with a build fix.
This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) Added a watchdog thread to ProcessGroupNCCL that checks the cached
communicators for errors and removes errored communicators from the cache.
3) Use ncclCommAbort instead of ncclCommDestroy in the NCCLComm destructor,
since ncclCommDestroy can block forever waiting for work.
4) Added a simulate_nccl_errors.py script to simulate NCCL errors.
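
The error-handling flow in items 1-3 can be illustrated with a small Python
sketch. This is not the C++ code in this change: FakeComm, FakeWork, and
CommCache are hypothetical stand-ins for NCCLComm, WorkNCCL, and the
communicator cache in ProcessGroupNCCL, and no real NCCL calls are made. The
sketch also assumes that a work item with an error counts as completed, so
that callers stop waiting on it.

```python
import threading


class FakeComm:
    """Hypothetical stand-in for a cached NCCL communicator."""

    def __init__(self):
        self.error = None      # set to an exception to simulate an NCCL error
        self.aborted = False

    def check_for_error(self):
        return self.error

    def abort(self):
        # Models the idea behind ncclCommAbort: tear down without waiting
        # for outstanding work (no real NCCL call is made here).
        self.aborted = True


class FakeWork:
    """Hypothetical stand-in for WorkNCCL."""

    def __init__(self, comm):
        self.comm = comm
        self.exception = None
        self.finished = False

    def _check_and_set_exception(self):
        # Record the first communicator error seen, if any.
        if self.exception is None:
            self.exception = self.comm.check_for_error()

    def is_completed(self):
        self._check_and_set_exception()
        # Assumption of this sketch: an errored work item is "completed".
        return self.exception is not None or self.finished

    def is_success(self):
        self._check_and_set_exception()
        return self.exception is None and self.finished


class CommCache:
    """Communicator cache with a watchdog thread that evicts errored entries."""

    def __init__(self, poll_interval=0.01):
        self._comms = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._thread = threading.Thread(target=self._watchdog, daemon=True)
        self._thread.start()

    def add(self, key, comm):
        with self._lock:
            self._comms[key] = comm

    def keys(self):
        with self._lock:
            return sorted(self._comms)

    def sweep(self):
        # Remove every communicator that reports an error, then abort it.
        with self._lock:
            bad = [k for k, c in self._comms.items()
                   if c.check_for_error() is not None]
            for key in bad:
                self._comms.pop(key).abort()

    def _watchdog(self):
        # Event.wait returns False on timeout, True once shutdown() is called.
        while not self._stop.wait(self._poll_interval):
            self.sweep()

    def shutdown(self):
        self._stop.set()
        self._thread.join()
```

In this sketch, setting `error` on a FakeComm makes any FakeWork that uses it
report `is_completed()` as true and `is_success()` as false, and the next
watchdog sweep evicts and aborts that communicator while leaving healthy ones
in the cache.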
Related issue: https://github.com/pytorch/pytorch/issues/17882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907
Test Plan: Run simulate_nccl_errors.py to verify that NCCL errors are caught.
Differential Revision: D16958078
fbshipit-source-id: 662b0b8b8ee250e2b6d15bdfc9306d71c4f66219
Original submission (https://github.com/pytorch/pytorch/pull/22907), with the same summary and test plan as above:
Differential Revision: D16220638
fbshipit-source-id: fbc8881ea0c38a4d09a77045691e36557b7b0b25