[NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoked a simple callback that divides the tensors by `world_size` after `allreduce`, QPS was almost 50% lower than when a simple `allreduce` hook was used with no `then` callback.
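For context, a minimal sketch of such a hook, assuming the DDP communication hook API (`hook(state, bucket) -> torch.futures.Future`); `bucket.buffer()` and the exact bucket interface are assumptions and may differ from the version this PR targets:

```python
import torch
import torch.distributed as dist

# Hedged sketch of the hook pattern discussed above: allreduce the gradient
# bucket asynchronously, then divide by world_size in a `then` callback.
# `bucket.buffer()` is assumed to return the bucket's flat gradient tensor.
def allreduce_then_divide(state, bucket):
    world_size = dist.get_world_size()
    # Launch the allreduce asynchronously and obtain a Future tied to it.
    fut = dist.all_reduce(bucket.buffer(), async_op=True).get_future()

    def divide(fut):
        # Runs once the allreduce finishes; fut.value() holds the result.
        return fut.value()[0].div_(world_size)

    # The `then` callback whose scheduling this PR optimizes.
    return fut.then(divide)
```

Such a hook would be registered on a `DistributedDataParallel` model via `model.register_comm_hook(None, allreduce_then_divide)`.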

The main problem was that calling `work.wait()` before invoking the `then` callback synchronized `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609), stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than with the default stream.
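A minimal Python sketch of the scheduling idea (the actual change lives in the C++ `FutureNCCL`; the function and stream names here are illustrative only):

```python
import torch

# Hedged sketch: enqueue the callback's kernels on a side stream that waits
# only on the NCCL work's stream, so the default stream (which runs the
# backward pass) is never forced to synchronize with the collective.
def run_callback_on_side_stream(nccl_stream, callback):
    side_stream = torch.cuda.Stream()     # analogous to at::cuda::getStreamFromPool
    side_stream.wait_stream(nccl_stream)  # sync with the collective's stream only
    with torch.cuda.stream(side_stream):
        callback()                        # callback work lands on side_stream
```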

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate that the performance issue is resolved. Also run `python test/distributed/test_c10d.py` to catch any regressions.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5

```diff
@@ -972,6 +972,10 @@ void initJITBindings(PyObject* module) {
           "done",
           // Intentionally not releasing GIL
           &PythonFutureWrapper::done)
+      .def(
+          "value",
+          &PythonFutureWrapper::value,
+          py::call_guard<py::gil_scoped_release>())
       .def(
           "wait",
           &PythonFutureWrapper::wait,
```