Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-21 13:44:15 +08:00)
[NCCL] Changed FutureNCCL's then callback logic for better efficiency. (#42869)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/42869

We observed that when a simple callback that divides the tensors by `world_size` is chained after `allreduce`, QPS drops by almost 50% compared to using a plain `allreduce` hook with no `then` callback. The root cause was that calling `work.wait()` before invoking the `then` callback synchronized `work`'s stream with the default PyTorch stream inside [`runHook`](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp#L609), stalling the backward computation.

This PR ensures that FutureNCCL's `then` callback does not stall the backward computation. Assuming single-process single-device, `FutureNCCL` obtains a new stream from the device's pool via `at::cuda::getStreamFromPool` to run the callback, and before invoking the callback inline it synchronizes `WorkNCCL`'s stream with the callback's stream rather than with the default stream.

ghstack-source-id: 110208431

Test Plan: Run performance benchmark tests to validate that the performance issue is resolved. Also run `python test/distributed/test_c10d.py` to catch any odd issues.

Reviewed By: pritamdamania87

Differential Revision: D23055807

fbshipit-source-id: 60e50993f1ed97497514eac5cb1018579ed2a4c5
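For illustration, here is a minimal sketch of the kind of communication hook described above, written against the DDP comm-hook API; the `bucket.buffer()` accessor and the registration call shown in the trailing comment are assumptions about the surrounding API, not part of this PR:

```python
import torch
import torch.distributed as dist

def allreduce_then_divide(state, bucket):
    # Launch an asynchronous allreduce on the bucket's flattened gradients.
    work = dist.all_reduce(bucket.buffer(), async_op=True)
    fut = work.get_future()

    # This `then` callback is the part this PR speeds up: with the change,
    # FutureNCCL runs it on a stream taken from the device's pool, so the
    # default stream (and hence the backward pass) is not stalled.
    def divide(fut):
        return fut.value()[0] / dist.get_world_size()

    return fut.then(divide)

# Hypothetical usage on a DistributedDataParallel model:
# ddp_model.register_comm_hook(state=None, hook=allreduce_then_divide)
```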
Committed by: Facebook GitHub Bot
Parent: 97d62bcd19
Commit: 6e1127ea3f
@@ -972,6 +972,10 @@ void initJITBindings(PyObject* module) {
           "done",
           // Intentionally not releasing GIL
           &PythonFutureWrapper::done)
+      .def(
+          "value",
+          &PythonFutureWrapper::value,
+          py::call_guard<py::gil_scoped_release>())
       .def(
           "wait",
           &PythonFutureWrapper::wait,
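The hunk above extends the `PythonFutureWrapper` bindings. For context, here is a small illustration of the Python-side future API these bindings back; using `torch.futures.Future` as the entry point is an assumption for this sketch:

```python
import torch

fut = torch.futures.Future()
fut.set_result(torch.ones(2))

assert fut.done()   # non-blocking completion check (the "done" binding)
print(fut.wait())   # blocks until a result is set, then returns it
print(fut.value())  # returns the result; raises if called before completion
```

Note the asymmetry in the bindings: `value` and `wait` release the GIL via `py::call_guard<py::gil_scoped_release>`, while `done` intentionally does not, as the in-diff comment says.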