This reverts commit e525f433e15de1f16966901604a8c4c662828a8a.
Original PR: #85849
Fixes #ISSUE_NUMBER
In addition to reverting the revert, this PR:
- defines the virtual destructor of FunctionPreHook in the header. Why? Presumably the internal build imports the header from somewhere, but does not have function_hooks.cpp (where the virtual destructor was previously defined) in the same compilation unit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92559
Approved by: https://github.com/albanD
Addresses: https://github.com/pytorch/pytorch/issues/35802
Design doc: https://docs.google.com/document/d/19xSib7FFknRQ5f3ptGFUmiOt3BrgXSUlTQH2xMcZJYg/edit#
### Changes in this PR
#### Implementation
- We have now have 3 fields: pre_hooks, retains_grad_hooks, and tensor_pre_hooks so that we can more precisely define their ordering and when they are executed.
- Since retains grad uses an entirely new field, we cannot reuse the old retains grad, logic. We refactor retains grad to call directly into the variable.cpp logic. Other logic in variable.cpp that handle cpp hooks must also be updated.
#### Hooks ordering and execution:
- Defines pre-hooks registered on tensor to run before pre-hooks registered on grad_fn
- Updates pre-hooks registered on tensor to always run, even if they are the inputs= to .grad()
- Post hooks (and pre hooks) can now observe the modifications to gradient by the tensor pre hook
#### Retains grad hooks
- retains grad hooks always execute last, even if there are other tensor pre-hooks registered
#### Unchanged:
- pre_hooks registered to grad_fn aren't expected to execute if they are the inputs= to .grad()
Follow ups:
- simplify retains_grad field to not be a vector, since it always holds a single hook
- potentially merge capture hooks with tensor pre hooks, this would involve some additional refactoring since
- python hooks registered to tensor behavior on in-place is still wrong
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85849
Approved by: https://github.com/albanD
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`
All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57792
There are two problems when using CUDA RPC with distributed autograd
and distributed optimizer:
1) In local autograd engine, all autograd functions/nodes, including
AccumualteGrad will use the forward stream for backward computation.
But distributed autograd skips AccumulateGrad autograd function/node
and directly calls into `AccumulateGrad::accumulateGrad`. As the
result, it will use the default stream to accumulate gradients
instead of the forward stream. This commit changes that and uses the
forward stream to accumulate gradients, matching forward behavior.
2) Distributed optimizer and distributed autograd backward are
separate RPC calls, and CUDA streams are not synchronized across
different RPC calls. As a result, distributed optimizer might
consume gradients before they are ready. This commit uses CUDA
events to record the completion of gradient computation, and use
those events to block current streams when getGradients() are called.
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D28274876
Pulled By: mrshenli
fbshipit-source-id: 22e607152324ae918084066cde8c5dbb418bba7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636
The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.
At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.
This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
ghstack-source-id: 128296581
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D28187972
fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56807
If I understand correctly, there's no reason to create your own instance of these global singleton types.
ghstack-source-id: 127312270
Test Plan: CI
Reviewed By: SplitInfinity
Differential Revision: D27973447
fbshipit-source-id: f12df69d185f1baaa45f2ac6eac70570a7a65912
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684
This PR attempts to address #42560 by capturing the appropriate
exception_ptr in the autograd engine and passing it over to the Future.
As part of this change, there is a significant change the Future API where we
now only accept an exception_ptr as part of setError.
For the example in #42560, the exception trace would now look like:
```
> Traceback (most recent call last):
> File "test_autograd.py", line 6914, in test_preserve_backtrace
> Foo.apply(t).sum().backward()
> File "torch/tensor.py", line 214, in backward
> torch.autograd.backward(self, gradient, retain_graph, create_graph)
> File "torch/autograd/__init__.py", line 127, in backward
> allow_unreachable=True) # allow_unreachable flag
> File "torch/autograd/function.py", line 87, in apply
> return self._forward_cls.backward(self, *args)
> File "test_autograd.py", line 6910, in backward
> raise ValueError("something")
> ValueError: something
```
ghstack-source-id: 111109637
Test Plan: waitforbuildbot
Reviewed By: albanD
Differential Revision: D23365408
fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5
Summary:
## Why doesn’t DDP work under dist_autograd?
DDP follows the steps below
1. [DDP Python constructor](8d6a8d2b3f/torch/nn/parallel/distributed.py (L389-L393)) (on a module) creates a [C++ Reducer](https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/reducer.cpp), which holds references to all parameters (or variables in C++ code).
2. The reducer installs a post hook on each model parameter.
3. The backward run starts and triggers the post hooks installed above.
4. The post hook of a parameter simply marks the parameter ready for all-reduce.
5. Once all parameters in a bucket are ready, an all-reduce process starts by reading variable `.grad` and writes to variable `.grad`.
But under dist_autograd, `.grad` of a variable is not populated at all. Instead, grads are in a global map in distributed context from variables to their grads.
## Solution of this PR
The distributed engine to set a thread_local variable in a backward run indicating we're running in distributed mode. DDP reducer can then appropriately use `.grad` or the distributed context based on the thread local. More precisely, the thread local is set before calling the post hooks installed by the DDP reducer so that DDP post hooks can retrieve this thread local.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37998
Test Plan:
```
python test/distributed/test_ddp_under_dist_autograd.py
```
FB repo
```
buck test caffe2/test/distributed/...
```
DDP accuracy benchmark workflow run
```
flow-cli canary pytorch.benchmark.accuracy_comparison.workflow --parameters-json '{"node_world_size": 4, "dist_backend": "nccl"}' --run-as-secure-group fblearner_flow --entitlement gpu_prod
```
f196173157
Reviewed By: pritamdamania87
Differential Revision: D21513795
Pulled By: hczhu
fbshipit-source-id: fe21e68ecdc9274182db4d4bb5a1e2d68ef927a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36640
We had the following race when two threads entered
'mark_graph_task_completed'.
1) Thread 1 grabs the graph_task mutex first and moves captured_vars_ to its
local 'vars'.
2) Thread 1 releases the lock.
3) Thread 2 grabs the mutex and moves an empty captured_vars_ to its local
'vars'.
4) Thread 2 now proceeds to call 'markCompleted' with empty grads.
5) Thread 1 which actually has the right grads never gets to set the grads on
the future since future_completed_ is set to True by Thread 2.
Discovered this while running our RNN example:
https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn and
verified this PR fixes the race.
ghstack-source-id: 102237850
Test Plan: waitforbuildbot
Differential Revision: D21035196
fbshipit-source-id: 1963826194d466b93f19e8016b38e4f9cad47720
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502
We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.
This change migrates the future lambda cases where there was self-capture.
- In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
This avoids the reference cycle. We use this primarily in the case where the
value ends up being moved in the callback (something we want to be careful about)
- We also add a convenience api to Future where the completed Future is returned as an arg.
This allows us to avoid self-capture, though it assumes that the markCompleted()
caller is persisting the future for the markCompleted() duration (this has been the case)
ghstack-source-id: 102130672
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20998905
fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220
The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106
Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset
Differential Revision: D20918904
fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the api.
- torch::utils::Future callback should have no args, like ivalue::future.
Many of the lines of this change are related to fixing that up downstream.
No args makes the api simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
appropriately capture the future in the lambda if necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for error to ivalue::Future.
- Change markCompleted(error) to setError(error) to ivalue::Future.
- Add setValue(FutureError) version to torch::utils::Future
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33157
This PR enables graph level thread parallelism on CPU for the Autograd
Engine. It replace https://github.com/pytorch/pytorch/pull/29574 for the
reason of task level parallelism drawbacks with the existing autograd
system.
Fixes https://github.com/pytorch/pytorch/issues/18333
The graph level parallelism on CPU design:
1. Remove the single CPU thread that init in the Engine itself and allow
the owning thread (which calls Engine::execute) to drive the Engine
execution so that we could let outer threading to enable thread
parallelism.
2. Maintain a separate ReadyQueue per CPU thread, and stash the
ReadyQueue for different devices/threads into the thread local
shared_ptr, the Engine itself will memorize the shared_ptr of the
ReadyQueue to different devices (other than CPU)
3. The CPU thread local ReadyQueue is initialized per CPU thread
Engine::execute call (or `backward()`, `grad()` call), and memorized
the shared_ptr into the GraphTask since every `backward()` call have
its own GraphTask
4. Cross device NodeTask push is accomplished by 2 and 3. we can refer
to device's ReadyQueue from Engine, and CPU's ReadyQueue from
GraphTask, which means if we can push to a different ReadyQueue
according to the device
5. Termination of the CPU thread: if we mark the graph_task as
completed, we will exit the while loop and terminate the current
backward execution, because it's guranteed that all other NodeTasks
is finished before we mark a GraphTask as complete
6. re-entrant thread logic keeps the same, reentrant thread detection is
similar as before, we set the worker_device to NO_DEVICE initially
and set to CPU afterward to detect if this is a reentrant call or not.
7. we still have the reentrant thread pool that create new threads if it's
a deep reentrant case, and reuse the ReadyQueue with the parent thread
for performance.
Since we introduce the thread parallelism on CPU, we have to ensure the
thread safety of the GraphTask. This is not a problem if we execute all
forward in different threads since we will build separate GraphTask in
different threads, and each GraphTask is a separate instance that share
nothing, i.e. Hogwild training on CPU should be fine on this case.
But there might be case that user would like to do some part of the task in
a single thread, and do the rest of work in several threads
concurrently, so thread safety is crucial in those cases. The thread
safety strategy for the multithread autograd is as follows:
1. Add a mutex to protect thread safety in Autograd Node/Function, and
hold the lock for different data racing cases
2. Lock the mutex during Node::apply(), this is to ensure Node that
writing to the shared variable are not racing across threads (i.e.
AccumulateGrad and custom C++ Autograd Node if writing to shared
variables )
3. Lock the mutex during Node::release_variables(), this serve the
purpose that when we release saved_variables from one thread, no
other threads can call the Node::apply(), this ensures the variable
references from other threads aren't dangling.
4. If we don't release any variables and no shared data read/write in
the Node i.e. purely functional, we don't lock the mutex
This way we could protect the thread safety on Autograd Node, but we
could still not protect the thread safety on Node pre/post C++ hooks
(python hooks are automatically thread safe), we rely on the user to
write thread safe C++ hooks if they want the hook to be correctly
applied in multithreading environment.
**User visiable changes**:
There're not too much user visiable changes, since we use the owning
thread to drive the autograd execution, user could write their own
threading code and does not block on the Autograd engine, some behaviors
that user should be aware of:
**Non-determinism**:
if we are calling backward() on multiple thread concurrently but with
shared inputs (i.e. Hogwild CPU training). Since parameters are automatically shared across threads, gradient accumulation might become non-deterministic on backward calls across threads, because two backward calls might access and try to accumulate the same .grad attribute. This is technically not safe, and it might result in racing condition and the result might be invalid to use.
But this is expected pattern if user are using the multithreading
approach to drive the whole training process but using shared
parameters, user who use multithreading should have the threading model
in mind and should expect this to happen. User should use the functional
interface `torch.autograd.grad()` to calculate the gradients instead of
`backward()` on loss.
**Graph retaining**:
If part of the autograd graph is shared between threads, i.e. run first
part of forward single thread, then run second part in multiple threads,
then the first part of graph is shared. In this case different threads execute grad() or backward() on the same graph might
have issue of destroying the graph on the fly of one thread, and the
other thread will crash in this case. We will error out to the user
similar to what call `backward()` twice with out `retain_graph=True`, and let the user know they should use `retain_graph=True`.
**TODOs**:
[ ] benchmark the PR with example models and datasets to demonstrate
the performance gain in CPU training
[ ] ensure that we don't regress the single thread autograd performance
**Follow ups**:
[ ] a correct and tight integration with distributed autograd
[ ] try to unify the thread pool between JIT and Autograd, and see if
there's unifying pattern that we could apply universally
Test Plan: Imported from OSS
Differential Revision: D20236771
Pulled By: wanchaol
fbshipit-source-id: 1e0bd4eec14ffebeffdb60b763b8d6f0e427eb64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34638
Fixes: https://github.com/pytorch/pytorch/issues/27643
This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.
(Note: this ignores all push blocking failures!)
Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.
Differential Revision: D20164420
fbshipit-source-id: 3d4ed74230969ac70bb763f1b5b1c16d979f66a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33214
Distributed autograd had some custom logic in terms of how we
accumulated gradients. This was mostly done early on to enable basic
functionality. Although, in the long term we should merge this logic with what
we have in the local autograd engine. A lot of work has gone into ensuring we
accumulate grads correctly and efficiently and we should reuse that as a
starting point.
We can investigate if we need further custom logic for distributed autograd
later on if we need additional optimizations.
In this PR I've merged the gradient accumulation logic and also the gradient
hooks. As a result, now gradient hooks are called in distributed autograd as
well.
ghstack-source-id: 99838019
Test Plan: waitforbuildbot
Differential Revision: D19843284
fbshipit-source-id: 7923d7e871fb6afd3e98dba7de96606264dcb5f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32506
In this PR, we've introduced a `retain_graph` parameter to distributed
autograd similar to `torch.autograd.backward`.
In terms of design, this parameter is sent over RPC to all nodes and is used to
create the GraphTask on the local nodes. This enables us to run
`dist_autograd.backward()` multiple times in the same context.
The use case currently for this is to benchmark only the backward pass for
distributed autograd. We'd like to measure the QPS for the backward pass and as
a result, running a single forward pass and multiple backward passes in a loop
is one way to benchmark backward pass performance.
ghstack-source-id: 97868900
Test Plan: waitforbuildbot
Differential Revision: D19521288
fbshipit-source-id: 7ad8521059fd400d7b5a6ab77ce56e1927ced90a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32952
When the Async() version of clearAndWaitForOutstandingRpcs() was written,
we didn't yet have the generic Future<T> class, and hadn't worked out our
error model fully.
This change fixes that method to properly propagate the first encountered error
to the future, using a bool+CAS.
ghstack-source-id: 97665749
Test Plan: existing test coverage, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D19710337
fbshipit-source-id: 66ce5593a94a16ea624930dbb9409917ef5cfd5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31230
A major issue with distributed autograd currently is that we block an
RPC thread when we call Engine::execute_with_graph_task.
To resolve this issue, I've made modifications to the local autograd engine
such that `execute_with_graph_task` returns a Future instead. The `execute()`
methods for Engine::execute() and DistEngine::execute() still wait() on this
Future which ensures there is no change in behavior yet.
In follow up PRs we can modify the distributed autograd engine to take
advantage of this Future.
Closes#26359
ghstack-source-id: 96298057
Test Plan: waitforbuildbot
Differential Revision: D18999709
fbshipit-source-id: 388f54467fd2415a0acb7df17bd063aedc105229
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29579
Per #28923, this diff is to move Future<Message> to torch::utils and extend it to be Future<T>, most of implementations are copied from FutureMessage and ivalue::Future. merge ivalue::Future with Future<T> will be done separately.
The main difference between Future<T> and FutureMessage is the error handling, instead of checking message type inside Future to handle error, this future<T> owns has_error_ and error_ states.
also this future passes value_, has_error_ and error_ states to callbacks for easily read future states.
In next diff, a torch script rpc async API will be created, before the API returns, it will create an ivalue::Future and passes it to Future<T>'s call back where state of ivalue::Future will be set. In this way, the torch script rpc async API can still return a ivalue::Future and call wait() to get its state appropriately afterwards.
ghstack-source-id: 95479525
Test Plan: unit tests
Differential Revision: D18263023
fbshipit-source-id: 48a65712656a72c2feb0bb3ec8b308c0528986a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30637
RequestCallback api currently forces work to be always synchronous, which,
as we scale, means we're going to need to throw large number of (mostly
blocked) threads at the rpc problem. For some activities like dependent
autograd rpcs, there's not a necessary reason to block in these threads.
In this change, the RequestCallback api is updated to return a
shared_ptr<FutureMessage> rather than a Message:
std::shared_ptr<FutureMessage> operator()(Message& request) const;
With a futures-style api, RPC ops that wish to be async can then be async,
while short-lived blocking functions (or Python UDFs) can just block.
In this change, we keep all of the current ops as synchronous (i.e. we block
and then return a completed FutureMessage). We also update the rpc_agents in
a manner compatible with this sort of parallelism.
Here, we only want to incur overhead when we use the async behavior.
Some modest extra cost seems unavoidable here (e.g. the allocation for the
std::make_shared<>), but we can trivially detect the synchronous/completed
case in the rpc_agent and avoid the extra thread-switches/etc. in that case.
ghstack-source-id: 95287026
Test Plan:
- Basic: buck test mode/dev-nosan caffe2/test/...
- Additional testcase in ThriftRpcAgentTest for deferred work.
Differential Revision: D18774322
fbshipit-source-id: cf49922a71707cfb1726de16f93af23b160385d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29696
The paths distributed/autograd/context/dist_autograd_context.h and
distributed/autograd/context/dist_autograd_container.h were repetitive.
Therefore renaming these to distributed/autograd/context/context.h and
distributed/autograd/context/container.h
ghstack-source-id: 93850266
Test Plan: waitforbuildbot
Differential Revision: D18467624
fbshipit-source-id: bbf3905396f553006851af296c880c1bd106ec47