Commit Graph

74 Commits

Author SHA1 Message Date
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
cyy
bbff667e32 [Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713)
Follows #136528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713
Approved by: https://github.com/kwen2501
2024-09-27 10:11:53 +00:00
cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
cyy
30875953a4 [1/N] Remove inclusion of c10/util/string_utils.h (#128300)
As a first step to remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-10 23:40:47 +00:00
cyy
4d51c8532c Some simple fixes (#93221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93221
Approved by: https://github.com/Skylion007
2023-01-30 05:14:03 +00:00
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (for example, on linked lists), whereas empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
cyy
f172feae0d More tidy fixes (#93069)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069
Approved by: https://github.com/Skylion007
2023-01-27 06:40:50 +00:00
c5cf8f6b28 fix [rpc] Wrong usage of RRefContext::handleException #71458 (#83166)
Fixes #71458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83166
Approved by: https://github.com/kumpera
2022-09-08 18:22:51 +00:00
2d885ab73d [jit] Reduce refcounting of Types (#65345)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345

FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
ghstack-source-id: 140044165

Test Plan:
CI

perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.

Reviewed By: hlu1

Differential Revision: D31027361

fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
2021-10-08 09:03:04 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with it, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
d433a55c94 Replace throw std::runtime_error with torch_check in torch/csrc/distributed (#59683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683

Replaces usages of throw std::runtime_error("foo") with the better
TORCH_CHECK(false, "foo"), which allows C++ stack traces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This will hopefully provide much better
information when debugging crashes and flaky tests.
ghstack-source-id: 131167210

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981327

fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
2021-06-11 11:15:49 -07:00
258a991027 [reland] Set and propagate devices in RRef completion future (#59211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59211

Reland of https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 130202843

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28623886

fbshipit-source-id: 6c82ed7c785ac3bf32fff7eec67cdd73b96aff28
2021-06-02 05:46:04 -07:00
89c81c5bba Revert D28574083: Set and propagate devices in RRef completion future
Test Plan: revert-hammer

Differential Revision:
D28574083 (23df70359a)

Original commit changeset: 5c89902cdc5c

fbshipit-source-id: e48043b6c4fb8a6f383f78e1aa88f7614f9fa13a
2021-05-21 19:24:12 -07:00
23df70359a Set and propagate devices in RRef completion future (#58674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.

I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 129567042

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28574083

fbshipit-source-id: 5c89902cdc5cc12f1ebeea860b90cd9c3d7c7da1
2021-05-21 13:15:34 -07:00
20d02cb7dd Remove getScriptRemoteCallType (#57854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854

Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef, which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
ghstack-source-id: 129567060

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253843

fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
2021-05-21 13:15:07 -07:00
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
7d4121d1d2 Make RRefContext get devices from RPC agent when creating OwnerRRef (#57443)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443

Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...).

Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events, however we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring to specify devices when creating a OwnerRRef by hand in Python.
ghstack-source-id: 128184665

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28144365

fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3
2021-05-06 01:12:56 -07:00
7ffadf6e46 Replace DeviceIndexes with Devices in RRefs (#57442)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442

We did this for the RPC agents and for ivalue::Future, the last one (I think) is RRef.
ghstack-source-id: 128184664

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28144368

fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3
2021-05-06 01:12:54 -07:00
1ee54cc7b4 Add devices argument to RRef constructor (#57085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085

PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.

This commit covers the above gap by adding a `devices` argument
to RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28050001

Pulled By: mrshenli

fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
2021-04-28 19:11:10 -07:00
5b62b0d9bc [RPC] Fix typo in rref_context.cpp (#53978)
Summary:
untill -> until

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53978

Reviewed By: jbschlosser

Differential Revision: D27039043

Pulled By: rohan-varma

fbshipit-source-id: c9178e79fe8b2a3dc61665148fe55dba5adb0abf
2021-03-15 11:08:59 -07:00
3b11822825 [RPC] Refactor rref_context to not use utils::Future (#51697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51697

Refactors the rest of rref_context, specifically pendingOwners map and `getOwnerRRef` to use JitFuture.
ghstack-source-id: 122037611

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D26243268

fbshipit-source-id: ab8874c8253274e8fe50dcd7291e0655a8f3f1df
2021-02-19 00:59:38 -08:00
8ff5a46c32 [RPC] waitForThreadLocalRRefs returns jitFuture (#51696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51696

Modify this API to use JitFuture.
ghstack-source-id: 121695707

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26239132

fbshipit-source-id: 15c0c349a79e660fe4862e1d99176989f8159bf4
2021-02-13 17:43:16 -08:00
87c0b6bffc [RPC] Move confirmation future in rref context to jit future (#51695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51695

As part of the plan to completely eliminate torch/csrc/utils/future.h,
we are converting this to JitFuture (c10::ivalue::Future).
ghstack-source-id: 121695708

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D26238873

fbshipit-source-id: 92bad1a349964ce8a9a80e2d1cf68f293cbe411c
2021-02-13 17:40:55 -08:00
d730c7e261 Replace FutureMessage with ivalue::Future in RpcAgent retry logic (#49995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49995

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25745301

Pulled By: mrshenli

fbshipit-source-id: b5e3a7e0b377496924847d8d70d61de32e2d87f4
2021-01-07 19:50:23 -08:00
008206decc Replace FutureMessage with ivalue::Future in RRefContext (#49960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49960

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25730530

Pulled By: mrshenli

fbshipit-source-id: 5d54572c653592d79c40aed616266c87307a1ad8
2021-01-07 19:50:19 -08:00
b5e32528d0 Fix flaky test_udf_remote_message_delay_timeout_to_self (#41217)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217

Fixes this flaky test. Due to the possibility of callback
finishCreatingOwnerRRef running after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes from the owners_ map. In this case, shutdown is fine
since there are no owners. On the other hand, if the callback runs first, there
will be 1 owner which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce there to
be 1 owner.
ghstack-source-id: 107883497

Test Plan: Ran the test 500 times with TSAN.

Reviewed By: ezyang

Differential Revision: D22469806

fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
2020-07-16 11:20:56 -07:00
7e82382ad5 Allow profiler to be enabled remotely with RPC (#38748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748

This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.

As discussed, we have implemented this by creating new message types, that similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler).

Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which as discussed with ilia-cher , will be used along with the `Event`s handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables), otherwise, it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
ghstack-source-id: 106134626

Test Plan: Added unittests, existing profiler unittests.

Differential Revision: D19510010

fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
2020-06-18 17:01:57 -07:00
f4ffe99da5 Fix flaky rref timeout test (#40141)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141

This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the to_here message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout.

The rough sequence of what happens is:
0) Node 0 creates RRef on node 1 with rpc.remote() call
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection)
2) At some point, callbacks corresponding to rpc.remote() runs and confirms the rref, adding it as a confirmed user
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, 0 sends an RRef user delete to 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the `owner_`, while processing this message we create a future that will be completed when the owner exists (this is to account for the case of to_here() arriving before rpc.remote). But this future will never complete, since the owner is already deleted, so we hang indefinitely

As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout.

A more robust, longer-term fix would be to detect if an owner has been previously deleted (such as by an RRefUserDelete). Then, we know that the future corresponding to owner creation on the remote end will never complete, and we can error out when processing a `to_here()`.
ghstack-source-id: 106036796

Differential Revision: D22084735

fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
2020-06-17 15:48:38 -07:00
356d564886 [rpc] use annotation_str for RRef type serialization (#39932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39932

This PR makes RRef fork use the JIT type annotation_str recently introduced in
https://github.com/pytorch/pytorch/pull/39544 to allow a consistent
serialization type-string format, and fixes the case where the dict->str()
format does not match the type resolver.

Test Plan: Imported from OSS

Differential Revision: D22015427

Pulled By: wanchaol

fbshipit-source-id: f64d7e3acde5312813816c8f3c7d8fa9379704e8
2020-06-12 17:15:57 -07:00
c22bbb2124 [JIT] Add Type::repr_str to return human-readable str (#39544)
Summary:
Clearly expressing that a type is inferred by PyTorch, rather than explicitly annotated by the user, makes many error messages more user-friendly.

Currently, Type has two string conversion methods: str() for IR printing and python_str() for serialization and error-message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str().

annotation_str() is solely responsible for serialization; it strictly matches the format of Python type annotations. repr_str() is responsible for generating a human-readable error message that includes information like "this type is inferred, not explicitly annotated".

Closes https://github.com/pytorch/pytorch/issues/39449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544

Differential Revision: D21978759

Pulled By: gmagogsfm

fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0
2020-06-10 12:01:24 -07:00
8b2bb02e09 Implement timeout support for RRefs (#38590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw from within a callback, resulting in `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown)
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
ghstack-source-id: 105232837

Test Plan: CI

Differential Revision: D21588165

fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
2020-06-04 02:14:42 -07:00
f58cc4b444 [RPC] Fix flaky test by waiting for async rref calls (#39012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012

The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.

What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
        self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.

The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738

Test Plan: The test isn't flaky anymore.

Differential Revision: D21716070

fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
2020-05-28 10:48:34 -07:00
3d934c3d36 Add using torch::utils::Future to simplify code in RRefContext (#36811)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36811

Test Plan: Imported from OSS

Differential Revision: D21093846

Pulled By: mrshenli

fbshipit-source-id: 61a6b1483ef1533803a18bec216ebe82aa187458
2020-04-25 09:37:33 -07:00
269ec9a139 Prevent RRef.to_here() to block an RPC thread on the callee using Future callbacks (#36805)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36805

Test Plan: Imported from OSS

Differential Revision: D21093847

Pulled By: mrshenli

fbshipit-source-id: 81b0934874af36e03329fe6176628e3aca12811f
2020-04-25 09:37:28 -07:00
6e1e55c134 Prevent RRef unpickle to block waiting for OwnerRRef creation (#36785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36785

Currently, RRef unpickle (both Python and TorchScript) will block
until the OwnerRRef has been created by the original `rpc.remote`
call, if it is an OwnerRRef. This is not ideal as the correctness
would then depend on the thread-count configuration. This
commit changed that behavior. Both `rpc.remote` and the unpickle
can create OwnerRRefs. More specifically, whichever one arrives
first will create the OwnerRRef and the subsequent ones will
retrieve the same OwnerRRef, so that no one is blocking.

Test Plan: Imported from OSS

Differential Revision: D21083089

Pulled By: mrshenli

fbshipit-source-id: 34ef063d50549b01c968b47815c4fe9fac179d3d
2020-04-25 09:36:02 -07:00
37aab14d14 [future] Avoid some future callback self-captures. (#36502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502

We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.

This change migrates the future lambda cases where there was self-capture.
 - In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
   This avoids the reference cycle. We use this primarily in the case where the
   value ends up being moved in the callback (something we want to be careful about)

 - We also add a convenience api to Future where the completed Future is returned as an arg.
   This allows us to avoid self-capture, though it assumes that the markCompleted()
   caller is persisting the future for the markCompleted() duration (this has been the case)

ghstack-source-id: 102130672

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20998905

fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
2020-04-14 17:52:44 -07:00
e3b6dd1708 [rref] Minor tweaks in rref_context (#36419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36419

Since we call waitForThreadLocalPendingRRefs per-RPC, construct it
already-satisfied in the common empty case, to avoid extra mutex/cv work.

Also, naming consistency for recording_.
ghstack-source-id: 101975739

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20977879

fbshipit-source-id: e321a33127e4b5797e44e039839c579057e778e5
2020-04-13 10:19:02 -07:00
291c910e85 [future] Re-land some safe portions of the future change. (#36254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254

These future use changes were all landed yesterday as part of the future
refactoring, quickly reverted due to an observed OOM, but now being relanded, since
they've since been tested to be benign.
ghstack-source-id: 101776613

Test Plan:
buck test mode/dev-nosan caffe2/test/...
   not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod

Differential Revision: D20924010

fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
2020-04-08 20:05:33 -07:00
a91535930f [future] Undo some recent torch::utils::Future api changes (#36220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220

The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda  capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106

Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset

Differential Revision: D20918904

fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
2020-04-08 11:28:22 -07:00
72b55fea6b [jit] Make torch::utils::Future and ivalue::future apis closer (#35849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849

This change harmonizes some aspects of the api.

- torch::utils::Future callback should have no args, like ivalue::future.
   Many of the lines of this change are related to fixing that up downstream.

   No args makes the api simpler to use, particularly since many/most of the
   downstream use cases ignore the passed-in args. It's simple enough to
   appropriately capture the future in the lambda if necessary.

 - Add error/hasError methods to ivalue::Future.
 - Use c10::optional underneath for error to ivalue::Future.
 - Change markCompleted(error) to setError(error) to ivalue::Future.
 - Add setValue(FutureError) version to torch::utils::Future

ghstack-source-id: 101684435

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20803251

fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
2020-04-07 17:05:35 -07:00
f182b43760 [rref] Handle exceptions returned via remote() calls (#35331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35331

When the function called by remote() throws, it seems sensible to
surface that exeption when rref.to_here() is called.

Doing this only involves simple modifications:
 - we need the OwnerRRef to keep around an optional<string>
   for the error
 - add an OwnerRRef setError() method that's parallel to setValue(),
   and plumb through the logic

We add rpc_tests to verify that the exception is propagated properly.
ghstack-source-id: 101136900

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
  buck test mode/dev-nosan caffe2/test/distributed/rpc/jit:rpc_spawn

Differential Revision: D20634078

fbshipit-source-id: b5b13fdb85cdf6a43f42347d82eabae1635368ec
2020-03-31 10:06:15 -07:00
4025729e88 [1.5 Release][RPC Reliability] RRef Idempotency and RPC Retry enablement (#33636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636

Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072

Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.

In order to effectively test whether these RRef and distributed autograd cleanup work with network failures/retries, I implemented an  RPC Agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).

This faulty RPC Agent is pretty configurable. The tests can configure which messages types to fail, and how many messages to fail, but going forward, other RPC functionality can be overriden with faulty methods to test with failures injected.

Differential Revision: D20019236

fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
2020-03-20 20:07:47 -07:00
e5ee95e448 [RPC] Add to confirmed users immediately if the fork is shared from owner, instead of adding nothing to pending users (#34988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34988

In https://github.com/pytorch/pytorch/pull/31893, we introduced a confirmedUsers_ map in RRefContext.

For the case where the fork is shared from the owner, there is no `pendingUsers_` intermediate phase, so we should put this fork into `confirmedUsers_` immediately.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
```

Differential Revision: D7735909

fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d
2020-03-18 18:17:41 -07:00
422e348619 Don't run user function until all UserRRefs in the args are confirmed (#34497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497

Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.

Differential Revision: D20347464

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
2020-03-16 18:30:06 -07:00
4e07c35679 Delete all user forks tracked in RRefContext before graceful shutting down (#31893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893

In order to resolve the issue summarized in https://github.com/pytorch/pytorch/issues/31325.

The overall solution is to proactively send out delete-fork messages from user nodes, before the user nodes detect RRef leaks.

As the first step, we want to have a weak ref tracker to track all user rrefs.
ghstack-source-id: 100023142

Test Plan:
V22 is the version that makes the user wait on the delete UserRRef message.

# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_rref_forward_chain

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency
```

Reviewed By: mrshenli

Differential Revision: D19292254

fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9
2020-03-12 10:23:08 -07:00
7da24b36b1 Apply clang-format to RPC files (#34139)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34139

Test Plan: Imported from OSS

Differential Revision: D20227342

Pulled By: mrshenli

fbshipit-source-id: 01b478bde1f6a51f69eb5277fa90ba6ac2d4b5dc
2020-03-03 16:44:35 -08:00
4dad00b64b [rpc] special case tensor type check when getting RRef (#33582)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33582

Test Plan: Imported from OSS

Differential Revision: D20009837

Pulled By: wanchaol

fbshipit-source-id: 7e9ab87d4dddb822c7575891a2b620eff83bfa00
2020-02-26 18:44:40 -08:00