Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (e.g., on linked lists), but empty() always is.
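A minimal sketch of the pattern being codemodded (illustrative, not taken from the actual diff; pre-C++11 std::list is the classic container where size() could be linear):
```cpp
#include <list>

// empty() is O(1) on every standard container, while size() could be
// O(n) on pre-C++11 std::list implementations.
bool hasPending(const std::list<int>& queue) {
  return !queue.empty();  // preferred over: queue.size() == 0
}
```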
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
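A toy sketch of the shape described above; the names follow the summary, but the bodies are illustrative stand-ins rather than the real JIT types:
```cpp
#include <memory>

struct Type {
  virtual ~Type() = default;
  // Taking a const Type& means callers never have to materialize a
  // shared_ptr<Type> (copy + atomic refcount bump) just to ask a question.
  virtual bool isSubtypeOf(const Type& rhs) const { return this == &rhs; }
};

struct FooType : Type {
  // Returning a const reference to the shared singleton avoids a copy
  // and a refcount bump on every call.
  static const std::shared_ptr<FooType>& get() {
    static const auto instance = std::make_shared<FooType>();
    return instance;
  }
};
```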
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```bash
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683
Replaces usages of throw std::runtime_error("foo") with the better
TORCH_CHECK(false, "foo"), which allows C++ stack traces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This should provide much better information
when debugging crashes/flaky tests.
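A minimal sketch of the substitution (the condition and message are illustrative):
```cpp
#include <c10/util/Exception.h>

void checkNonEmpty(bool nonEmpty) {
  // Before: throw std::runtime_error("tensor list must not be empty");
  // After: TORCH_CHECK attaches a C++ stack trace to the error when
  // TORCH_SHOW_CPP_STACKTRACES=1 is set in the environment.
  TORCH_CHECK(nonEmpty, "tensor list must not be empty");
}
```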
ghstack-source-id: 131167210
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28981327
fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59211
Reland of https://github.com/pytorch/pytorch/pull/58674
I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 130202843
Test Plan: Should fix the next diff.
Reviewed By: mrshenli
Differential Revision: D28623886
fbshipit-source-id: 6c82ed7c785ac3bf32fff7eec67cdd73b96aff28
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58674
I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 129567042
Test Plan: Should fix the next diff.
Reviewed By: mrshenli
Differential Revision: D28574083
fbshipit-source-id: 5c89902cdc5cc12f1ebeea860b90cd9c3d7c7da1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854
Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, so we can now first compute the result and then create the OwnerRRef. This means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
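A minimal sketch of the new order of operations, assuming a PyTorch build where `IValue::type()` is available (the OwnerRRef construction itself is elided):
```cpp
#include <ATen/core/ivalue.h>

// With the value computed first, its type can be read off the IValue
// itself instead of being predicted ahead of time.
c10::TypePtr typeOfResult(const c10::IValue& result) {
  return result.type();
}
```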
ghstack-source-id: 129567060
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253843
fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, so in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't careful).
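A self-contained toy illustrating the fix; the real class is c10::ivalue::Future, and this stand-in only mimics the relevant shape:
```cpp
#include <functional>
#include <memory>
#include <vector>

struct ToyFuture {
  int value = 0;
  std::vector<std::function<void(ToyFuture&)>> callbacks;

  // The completed future is passed to the callback as an argument, so a
  // lambda never needs to capture the shared_ptr that owns it -- that
  // capture is what created the Future -> callback -> Future cycle.
  void addCallback(std::function<void(ToyFuture&)> cb) {
    callbacks.push_back(std::move(cb));
  }

  void markCompleted(int v) {
    value = v;
    for (auto& cb : callbacks) {
      cb(*this);
    }
  }
};
```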
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443
Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...).
Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events; however, we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring devices to be specified when creating an OwnerRRef by hand in Python.
ghstack-source-id: 128184665
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28144365
fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442
We did this for the RPC agents and for ivalue::Future; the last one (I think) is RRef.
ghstack-source-id: 128184664
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28144368
fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085
PR #54932 fixed CUDA RPC for RRefs created through RPC. But besides
that use case, an RRef can also be created locally by directly passing
in a value, which would bypass the CUDA stream synchronization added
in #54932.
This commit covers the above gap by adding a `devices` argument to the
RRef constructor. The RRef then uses this argument to choose between
`CUDAFuture` and `ivalue::Future` to hold the value. When `devices` is
specified and non-empty, `CUDAFuture` will be used, and the `devices`
will be passed to that `CUDAFuture`.
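A toy sketch of the selection (stand-in types; the real classes are `ivalue::Future` and `CUDAFuture`):
```cpp
#include <memory>
#include <string>
#include <vector>

struct ToyFuture {
  virtual ~ToyFuture() = default;
};

struct ToyCudaFuture : ToyFuture {
  explicit ToyCudaFuture(std::vector<std::string> devices)
      : devices(std::move(devices)) {}
  std::vector<std::string> devices;  // streams get synchronized per device
};

// A non-empty `devices` list selects the CUDA-aware future, mirroring
// the constructor behavior described above.
std::unique_ptr<ToyFuture> makeValueHolder(std::vector<std::string> devices) {
  if (!devices.empty()) {
    return std::make_unique<ToyCudaFuture>(std::move(devices));
  }
  return std::make_unique<ToyFuture>();
}
```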
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D28050001
Pulled By: mrshenli
fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51697
Refactors the rest of rref_context, specifically the pendingOwners map and `getOwnerRRef`, to use JitFuture.
ghstack-source-id: 122037611
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D26243268
fbshipit-source-id: ab8874c8253274e8fe50dcd7291e0655a8f3f1df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51696
Modify this API to use JitFuture.
ghstack-source-id: 121695707
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D26239132
fbshipit-source-id: 15c0c349a79e660fe4862e1d99176989f8159bf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51695
As part of the plan to completely eliminate torch/csrc/utils/future.h,
we are converting this to JitFuture (c10::ivalue::Future).
ghstack-source-id: 121695708
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D26238873
fbshipit-source-id: 92bad1a349964ce8a9a80e2d1cf68f293cbe411c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217
Fixes this flaky test. Because the callback
finishCreatingOwnerRRef can run after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes the RRef from the owners_ map. In this case, shutdown
is fine since there are no owners. On the other hand, if the callback runs first,
there will be 1 owner, which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce that
there be 1 owner.
ghstack-source-id: 107883497
Test Plan: Ran the test 500 times with TSAN.
Reviewed By: ezyang
Differential Revision: D22469806
fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748
This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.
As discussed, we have implemented this by creating new message types that, similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler). A toy sketch of the wrapping appears after the change list below.
Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which, as discussed with ilia-cher, will be used along with the `Event`s' handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables); otherwise it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
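A toy sketch of the wrapping scheme from the list above (field names are illustrative, not the real message layout):
```cpp
#include <string>
#include <vector>

struct ToyMessage {
  std::string payload;
};

struct ToyProfilerConfig {
  bool profile_memory = false;
};

// A profiling request wraps the original RPC together with the profiler
// configuration; the receiver unwraps it, enables the profiler
// thread-locally, and processes the inner message as usual.
struct RunWithProfilingReq {
  ToyProfilerConfig cfg;
  ToyMessage wrapped;
};

// The response carries the serialized remote events back to the caller,
// where they are added to the local profiler.
struct RunWithProfilingResp {
  std::vector<std::string> serializedEvents;
  ToyMessage wrappedResponse;
};
```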
ghstack-source-id: 106134626
Test Plan: Added unittests, existing profiler unittests.
Differential Revision: D19510010
fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141
This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the to_here message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout.
The rough sequence of what happens is:
0) Node 0 creates RRef on node 1 with rpc.remote() call
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection)
2) At some point, callbacks corresponding to rpc.remote() runs and confirms the rref, adding it as a confirmed user
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, 0 sends an RRef user delete to 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the `owner_`, while processing this message we create a future that will complete when the owner exists (this is to account for the case of `to_here()` arriving before `rpc.remote()`). But this future will never complete, since the owner is already deleted, so we hang indefinitely
As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout.
A more robust, longer-term fix would be to detect if an owner has been previously deleted (such as by an RRefUserDelete). Then, we would know that the future corresponding to owner creation on the remote end will never complete, and we could error out when processing a `to_here()`.
ghstack-source-id: 106036796
Differential Revision: D22084735
fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39932
This PR makes RRef fork use the JIT type annotation_str recently
introduced in https://github.com/pytorch/pytorch/pull/39544 to allow a
consistent serialization type string format, and fixes the case where
the dict->str() format does not match the type resolver.
Test Plan: Imported from OSS
Differential Revision: D22015427
Pulled By: wanchaol
fbshipit-source-id: f64d7e3acde5312813816c8f3c7d8fa9379704e8
Summary:
Clearly indicating that a type was inferred by PyTorch rather than explicitly annotated by the user makes many error messages more user-friendly.
Currently Type has two string conversion methods: str() for IR printing and python_str() for serialization and error message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str().
annotation_str() is solely responsible for serialization; it strictly matches the format of Python type annotations. repr_str() is responsible for generating a human-readable error message that includes information like "this type is inferred, not explicitly annotated".
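A toy sketch of the split; the real methods live on the JIT `Type` class, and these bodies are illustrative:
```cpp
#include <string>

struct ToyType {
  std::string pythonName;
  bool inferred = false;

  // Strictly matches Python annotation syntax; safe for serialization.
  std::string annotation_str() const { return pythonName; }

  // Free to add human-oriented context; used only in error messages.
  std::string repr_str() const {
    return inferred ? pythonName + " (inferred)" : pythonName;
  }
};
```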
Closes https://github.com/pytorch/pytorch/issues/39449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544
Differential Revision: D21978759
Pulled By: gmagogsfm
fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590
This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:
- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown); see the sketch after this list.
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
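A toy sketch of the deferred-error pattern from the list above (stand-in types, not the real RRef API):
```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

enum class ToyRPCErrorType { TIMEOUT, UNKNOWN };

struct ToyRRef {
  std::optional<std::pair<ToyRPCErrorType, std::string>> error;

  // rpc.remote() does not block, so a creation timeout is recorded on
  // the RRef and raised at the next use (to_here() or pickling) instead.
  void to_here() const {
    if (error) {
      throw std::runtime_error(error->second);
    }
    // ... fetch and return the value ...
  }
};
```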
ghstack-source-id: 105232837
Test Plan: CI
Differential Revision: D21588165
fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012
The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.
What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.
The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738
Test Plan: The test isn't flaky anymore.
Differential Revision: D21716070
fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36785
Currently, RRef unpickle (both Python and TorchScript) will block
until the OwnerRRef has been created by the original `rpc.remote`
call, if it is an OwnerRRef. This is not ideal, as correctness would
then depend on the thread-count configuration. This commit changes
that behavior: both `rpc.remote` and the unpickle can create
OwnerRRefs. More specifically, whichever one arrives first creates
the OwnerRRef and the subsequent ones retrieve the same OwnerRRef,
so that no one blocks.
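A toy sketch of the "whichever arrives first creates it" behavior; the real entry point is `getOrCreateOwnerRRef`, and these types are stand-ins:
```cpp
#include <map>
#include <memory>
#include <mutex>

struct ToyOwnerRRef {};

class ToyRRefContext {
 public:
  // Both rpc.remote() handling and unpickling call this: the first
  // caller creates the OwnerRRef, later callers get the same instance,
  // and neither path ever blocks waiting for the other.
  std::shared_ptr<ToyOwnerRRef> getOrCreateOwnerRRef(int rrefId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& slot = owners_[rrefId];
    if (!slot) {
      slot = std::make_shared<ToyOwnerRRef>();
    }
    return slot;
  }

 private:
  std::mutex mutex_;
  std::map<int, std::shared_ptr<ToyOwnerRRef>> owners_;
};
```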
Test Plan: Imported from OSS
Differential Revision: D21083089
Pulled By: mrshenli
fbshipit-source-id: 34ef063d50549b01c968b47815c4fe9fac179d3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502
We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.
This change migrates the future lambda cases where there was self-capture (sketched below).
- In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
This avoids the reference cycle. We use this primarily in the case where the
value ends up being moved in the callback (something we want to be careful about).
- We also add a convenience API to Future where the completed Future is passed
to the callback as an argument. This allows us to avoid self-capture, though it
assumes that the markCompleted() caller keeps the future alive for the
markCompleted() duration (this has been the case).
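A self-contained toy of the weak_ptr pattern from the first bullet, using a stand-in future:
```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

struct ToyFuture {
  int value = 0;
  std::function<void()> callback;
  void markCompleted(int v) {
    value = v;
    if (callback) callback();
  }
};

int main() {
  auto fut = std::make_shared<ToyFuture>();
  // The lambda holds only a weak_ptr, so it does not keep its own
  // future alive (no Future -> callback -> Future cycle).
  std::weak_ptr<ToyFuture> weakFut = fut;
  fut->callback = [weakFut] {
    auto strong = weakFut.lock();
    assert(strong);                    // completion implies it is alive
    int v = std::move(strong->value);  // value is moved out in the callback
    (void)v;
  };
  fut->markCompleted(7);
}
```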
ghstack-source-id: 102130672
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20998905
fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36419
Since we call waitForThreadLocalPendingRRefs per-RPC, construct the future
already-satisfied in the common empty case, to avoid extra mutex/cv work.
Also, fix naming consistency for `recording_`.
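A toy sketch of the fast path (stand-in types, not the real pending-RRef table):
```cpp
#include <memory>
#include <mutex>
#include <vector>

struct ToyFuture {
  bool completed = false;
  void markCompleted() { completed = true; }
};

class ToyPendingRRefs {
 public:
  // Called once per RPC: when nothing is pending (the common case),
  // hand back an already-satisfied future so the caller never touches
  // a condition variable.
  std::shared_ptr<ToyFuture> waitForPending() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto fut = std::make_shared<ToyFuture>();
    if (pending_.empty()) {
      fut->markCompleted();  // fast path: no mutex/cv round trip later
    } else {
      waiters_.push_back(fut);
    }
    return fut;
  }

 private:
  std::mutex mutex_;
  std::vector<int> pending_;
  std::vector<std::shared_ptr<ToyFuture>> waiters_;
};
```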
ghstack-source-id: 101975739
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20977879
fbshipit-source-id: e321a33127e4b5797e44e039839c579057e778e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254
These future-use changes were all landed yesterday as part of the future
refactoring and quickly reverted due to an observed OOM; they are now being
relanded, since they have since been tested and shown to be benign.
ghstack-source-id: 101776613
Test Plan:
buck test mode/dev-nosan caffe2/test/...
not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod
Differential Revision: D20924010
fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220
The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106
Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset
Differential Revision: D20918904
fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the API.
- A torch::utils::Future callback should have no args, like ivalue::Future.
Many of the lines of this change are related to fixing that up downstream.
No args makes the API simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
appropriately capture the future in the lambda if necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for the error in ivalue::Future.
- Change markCompleted(error) to setError(error) in ivalue::Future.
- Add a setValue(FutureError) version to torch::utils::Future.
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35331
When the function called by remote() throws, it seems sensible to
surface that exception when rref.to_here() is called.
Doing this only involves simple modifications:
- we need the OwnerRRef to keep around an optional<string>
for the error
- add an OwnerRRef setError() method that's parallel to setValue(),
and plumb through the logic
We add rpc_tests to verify that the exception is propagated properly.
ghstack-source-id: 101136900
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
buck test mode/dev-nosan caffe2/test/distributed/rpc/jit:rpc_spawn
Differential Revision: D20634078
fbshipit-source-id: b5b13fdb85cdf6a43f42347d82eabae1635368ec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636
Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072
Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.
In order to effectively test whether RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).
This faulty RPC agent is pretty configurable. The tests can configure which message types to fail and how many messages to fail, but going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.
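A toy sketch of the fault-injecting agent (stand-in types; the real agent overrides the RPC agent's send path):
```cpp
#include <set>

enum class ToyMessageType { RREF_USER_DELETE, CLEANUP_AUTOGRAD, OTHER };

struct ToyAgent {
  virtual ~ToyAgent() = default;
  virtual bool send(ToyMessageType /*type*/) { return true; }  // "delivered"
};

// Configurable failure injection: drop the first `numFailSends` messages
// whose type is in `messagesToFail`, forcing the retry machinery to run.
struct ToyFaultyAgent : ToyAgent {
  std::set<ToyMessageType> messagesToFail;
  int numFailSends = 0;

  bool send(ToyMessageType type) override {
    if (numFailSends > 0 && messagesToFail.count(type)) {
      --numFailSends;
      return false;  // simulate a network failure
    }
    return ToyAgent::send(type);
  }
};
```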
Differential Revision: D20019236
fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34988
In https://github.com/pytorch/pytorch/pull/31893, we introduced a confirmedUsers_ map in RRefContext.
For the case where the fork is shared from the owner, there is no `pendingUsers_` intermediate phase for this fork, so we should put it into `confirmedUsers_` immediately.
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
```
Differential Revision: D7735909
fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497
Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.
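A toy sketch of the interception (stand-in types, with the confirmation-waiting loop elided):
```cpp
#include <memory>
#include <utility>
#include <vector>

struct ToyUserRRef {
  bool confirmed = false;
};

// Filled as a side effect of argument deserialization on the server.
thread_local std::vector<std::shared_ptr<ToyUserRRef>> pendingUserRRefs;

// Deserialization registers each UserRRef it materializes ...
void onUserRRefDeserialized(std::shared_ptr<ToyUserRRef> rref) {
  pendingUserRRefs.push_back(std::move(rref));
}

// ... and the server drains the table, waiting for confirmations,
// before launching the user function.
std::vector<std::shared_ptr<ToyUserRRef>> takePendingUserRRefs() {
  return std::exchange(pendingUserRRefs, {});
}
```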
Differential Revision: D20347464
Test Plan: Imported from OSS
Pulled By: mrshenli
fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893
In order to resolve the issue summarized in https://github.com/pytorch/pytorch/issues/31325.
The overall solution is to proactively send out delete fork messages from user nodes, before user nodes detect RRef leaks.
As the first step, we want to have a weak ref tracker to track all user rrefs.
ghstack-source-id: 100023142
Test Plan:
V22 is the version that makes the User wait on the delete UserRRef message.
# Unit tests
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_rref_forward_chain
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency
```
Reviewed By: mrshenli
Differential Revision: D19292254
fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9