[NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We realized that when we invoked a simple callback that divides the tensors by `world_size` after `allreduce`, QPS was almost 50% lower than when a simple `allreduce` hook was used with no `then` callback.
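For context, a minimal sketch of such a hook, assuming the DDP communication hook API (`hook(state, bucket) -> torch.futures.Future`); `bucket.buffer()` and the exact bucket interface are assumptions and may differ from the version this PR targets:

```python
import torch
import torch.distributed as dist

# Hedged sketch of the hook pattern discussed above: allreduce the gradient
# bucket asynchronously, then divide by world_size in a `then` callback.
# `bucket.buffer()` is assumed to return the bucket's flat gradient tensor.
def allreduce_then_divide(state, bucket):
    world_size = dist.get_world_size()
    # Launch the allreduce asynchronously and obtain a Future tied to it.
    fut = dist.all_reduce(bucket.buffer(), async_op=True).get_future()

    def divide(fut):
        # Runs once the allreduce finishes; fut.value() holds the result.
        return fut.value()[0].div_(world_size)

    # The `then` callback whose scheduling this PR optimizes.
    return fut.then(divide)
```

Such a hook would be registered on a `DistributedDataParallel` model via `model.register_comm_hook(None, allreduce_then_divide)`.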

The main problem was that calling `work.wait()` before invoking the `then` callback synchronized `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609), stalling the backward computation.

In this PR, we ensure that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` gets a new stream from the device's pool using `at::cuda::getStreamFromPool` to run the `callback`, and before invoking the `callback` inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than with the default stream.
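A minimal Python sketch of the scheduling idea (the actual change lives in the C++ `FutureNCCL`; the function and stream names here are illustrative only):

```python
import torch

# Hedged sketch: enqueue the callback's kernels on a side stream that waits
# only on the NCCL work's stream, so the default stream (which runs the
# backward pass) is never forced to synchronize with the collective.
def run_callback_on_side_stream(nccl_stream, callback):
    side_stream = torch.cuda.Stream()     # analogous to at::cuda::getStreamFromPool
    side_stream.wait_stream(nccl_stream)  # sync with the collective's stream only
    with torch.cuda.stream(side_stream):
        callback()                        # callback work lands on side_stream
```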

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate that the performance issue is resolved. Also run `python test/distributed/test_c10d.py` to catch any regressions.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5

```diff
@@ -972,6 +972,10 @@ void initJITBindings(PyObject* module) {
           "done",
           // Intentionally not releasing GIL
           &PythonFutureWrapper::done)
+      .def(
+          "value",
+          &PythonFutureWrapper::value,
+          py::call_guard<py::gil_scoped_release>())
       .def(
           "wait",
           &PythonFutureWrapper::wait,
```