Not only is this change usually shorter and more readable, it can also yield better performance: size() is not always a constant-time operation (e.g., on linked lists), but empty() always is.
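A minimal sketch of the pattern being codemodded (illustrative, not taken from the actual diff; pre-C++11 std::list is the classic container where size() could be linear):
```cpp
#include <list>

// empty() is O(1) on every standard container, while size() could be
// O(n) on pre-C++11 std::list implementations.
bool hasPending(const std::list<int>& queue) {
  return !queue.empty();  // preferred over: queue.size() == 0
}
```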
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345
FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership.
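A toy sketch of the shape described above; the names follow the summary, but the bodies are illustrative stand-ins rather than the real JIT types:
```cpp
#include <memory>

struct Type {
  virtual ~Type() = default;
  // Taking a const Type& means callers never have to materialize a
  // shared_ptr<Type> (copy + atomic refcount bump) just to ask a question.
  virtual bool isSubtypeOf(const Type& rhs) const { return this == &rhs; }
};

struct FooType : Type {
  // Returning a const reference to the shared singleton avoids a copy
  // and a refcount bump on every call.
  static const std::shared_ptr<FooType>& get() {
    static const auto instance = std::make_shared<FooType>();
    return instance;
  }
};
```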
ghstack-source-id: 140044165
Test Plan:
CI
perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial.
Reviewed By: hlu1
Differential Revision: D31027361
fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```bash
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`; do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683
Replaces usages of throw std::runtime_error("foo") with the better
TORCH_CHECK(false, "foo"), which allows C++ stack traces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This should provide much better information
when debugging crashes/flaky tests.
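A minimal sketch of the substitution (the condition and message are illustrative):
```cpp
#include <c10/util/Exception.h>

void checkNonEmpty(bool nonEmpty) {
  // Before: throw std::runtime_error("tensor list must not be empty");
  // After: TORCH_CHECK attaches a C++ stack trace to the error when
  // TORCH_SHOW_CPP_STACKTRACES=1 is set in the environment.
  TORCH_CHECK(nonEmpty, "tensor list must not be empty");
}
```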
ghstack-source-id: 131167210
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28981327
fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59211
Reland of https://github.com/pytorch/pytorch/pull/58674
I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 130202843
Test Plan: Should fix the next diff.
Reviewed By: mrshenli
Differential Revision: D28623886
fbshipit-source-id: 6c82ed7c785ac3bf32fff7eec67cdd73b96aff28
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58674
I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 129567042
Test Plan: Should fix the next diff.
Reviewed By: mrshenli
Differential Revision: D28574083
fbshipit-source-id: 5c89902cdc5cc12f1ebeea860b90cd9c3d7c7da1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854
Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, so we can now first compute the result and then create the OwnerRRef. This means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
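A minimal sketch of the new order of operations, assuming a PyTorch build where `IValue::type()` is available (the OwnerRRef construction itself is elided):
```cpp
#include <ATen/core/ivalue.h>

// With the value computed first, its type can be read off the IValue
// itself instead of being predicted ahead of time.
c10::TypePtr typeOfResult(const c10::IValue& result) {
  return result.type();
}
```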
ghstack-source-id: 129567060
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253843
fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, so in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't careful).
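A self-contained toy illustrating the fix; the real class is c10::ivalue::Future, and this stand-in only mimics the relevant shape:
```cpp
#include <functional>
#include <memory>
#include <vector>

struct ToyFuture {
  int value = 0;
  std::vector<std::function<void(ToyFuture&)>> callbacks;

  // The completed future is passed to the callback as an argument, so a
  // lambda never needs to capture the shared_ptr that owns it -- that
  // capture is what created the Future -> callback -> Future cycle.
  void addCallback(std::function<void(ToyFuture&)> cb) {
    callbacks.push_back(std::move(cb));
  }

  void markCompleted(int v) {
    value = v;
    for (auto& cb : callbacks) {
      cb(*this);
    }
  }
};
```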
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443
Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...).
Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events; however, we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring devices to be specified when creating an OwnerRRef by hand in Python.
ghstack-source-id: 128184665
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28144365
fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442
We did this for the RPC agents and for ivalue::Future; the last one (I think) is RRef.
ghstack-source-id: 128184664
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28144368
fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085
PR #54932 fixed CUDA RPC for RRefs created through RPC. But besides
that use case, an RRef can also be created locally by directly passing
in a value, which would bypass the CUDA stream synchronization added
in #54932.
This commit covers the above gap by adding a `devices` argument to the
RRef constructor. The RRef then uses this argument to choose between
`CUDAFuture` and `ivalue::Future` to hold the value. When `devices` is
specified and non-empty, `CUDAFuture` will be used, and the `devices`
will be passed to that `CUDAFuture`.
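A toy sketch of the selection (stand-in types; the real classes are `ivalue::Future` and `CUDAFuture`):
```cpp
#include <memory>
#include <string>
#include <vector>

struct ToyFuture {
  virtual ~ToyFuture() = default;
};

struct ToyCudaFuture : ToyFuture {
  explicit ToyCudaFuture(std::vector<std::string> devices)
      : devices(std::move(devices)) {}
  std::vector<std::string> devices;  // streams get synchronized per device
};

// A non-empty `devices` list selects the CUDA-aware future, mirroring
// the constructor behavior described above.
std::unique_ptr<ToyFuture> makeValueHolder(std::vector<std::string> devices) {
  if (!devices.empty()) {
    return std::make_unique<ToyCudaFuture>(std::move(devices));
  }
  return std::make_unique<ToyFuture>();
}
```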
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D28050001
Pulled By: mrshenli
fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51697
Refactors the rest of rref_context, specifically the pendingOwners map and `getOwnerRRef`, to use JitFuture.
ghstack-source-id: 122037611
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D26243268
fbshipit-source-id: ab8874c8253274e8fe50dcd7291e0655a8f3f1df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51696
Modify this API to use JitFuture.
ghstack-source-id: 121695707
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D26239132
fbshipit-source-id: 15c0c349a79e660fe4862e1d99176989f8159bf4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51695
As part of the plan to completely eliminate torch/csrc/utils/future.h,
we are converting this to JitFuture (c10::ivalue::Future).
ghstack-source-id: 121695708
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D26238873
fbshipit-source-id: 92bad1a349964ce8a9a80e2d1cf68f293cbe411c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217
Fixes this flaky test. Because the callback
finishCreatingOwnerRRef can run after request_callback has processed and
created the owner RRef, we could actually end up with 0 owners on the node,
since the callback removes the RRef from the owners_ map. In this case, shutdown
is fine since there are no owners. On the other hand, if the callback runs first,
there will be 1 owner, which we will delete in shutdown when we detect it has no
forks. So either way, shutdown works fine and we don't need to enforce that
there be 1 owner.
ghstack-source-id: 107883497
Test Plan: Ran the test 500 times with TSAN.
Reviewed By: ezyang
Differential Revision: D22469806
fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38748
This diff contains the message scaffolding and profiler changes in order to be able to remotely run the profiler across different nodes and aggregate the results on a single node.
As discussed, we have implemented this by creating new message types that, similar to autograd messages, wrap the profiling information with the original message, and send this new message over the wire. On the receiving end, this wrapped message is detected, we fetch the original message from it, and process the original message with the profiler enabled. When sending a response with profiling information, we serialize the profiled `Events` and send them back over RPC. When such a message is received, the events profiled on the remote node are stored (added back to the local profiler). A toy sketch of the wrapping appears after the change list below.
Changes in this PR:
- New message types (run_with_profiling_req, run_with_profiling_resp) to send profiling info over the wire. Message parsing logic is added to handle these wrapped types.
- Handling of sending profiler data over the wire, in particular, the attributes of the `ProfilerConfig` and the serialized profiled `Event`s
- The logic for wrapping RPC messages is deduped with that in `rpc_with_autograd`, and the common payload wrapping/unwrapping logic is moved to helper functions in `rpc/utils.cpp`
- Changes in `autograd/utils.cpp` to detect if we have enabled the profiler and are sending an RPC, if so, uses the above new message types
- Changes in request_callback to parse and turn on the profiler in a thread-local fashion
- Serialization and deserialization of profiling `Events`, and support to add the remote events to the thread-local profiler
- Introduction of the concept of `node_id`, which, as discussed with ilia-cher, will be used along with the `Event`s' handle attribute to distinguish between events. When there are events from different nodes, this node information is rendered in the profile output (e.g. when printing tables); otherwise it is not, since it is irrelevant.
- Some changes to profiler.cpp to add useful helper methods/guards
- toHere() is now profiled for RRefs
- Unittests
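A toy sketch of the wrapping scheme from the list above (field names are illustrative, not the real message layout):
```cpp
#include <string>
#include <vector>

struct ToyMessage {
  std::string payload;
};

struct ToyProfilerConfig {
  bool profile_memory = false;
};

// A profiling request wraps the original RPC together with the profiler
// configuration; the receiver unwraps it, enables the profiler
// thread-locally, and processes the inner message as usual.
struct RunWithProfilingReq {
  ToyProfilerConfig cfg;
  ToyMessage wrapped;
};

// The response carries the serialized remote events back to the caller,
// where they are added to the local profiler.
struct RunWithProfilingResp {
  std::vector<std::string> serializedEvents;
  ToyMessage wrappedResponse;
};
```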
ghstack-source-id: 106134626
Test Plan: Added unittests, existing profiler unittests.
Differential Revision: D19510010
fbshipit-source-id: 044347af992f19a9e3b357c9567f6fc73e988157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40141
This rref timeout test could be flaky because we could end up processing `RRefUserDelete` messages on the owner node before processing the to_here message. This would result in a hang in `ProcessGroupAgent::sync()` that eventually results in a timeout.
The rough sequence of what happens is:
0) Node 0 creates RRef on node 1 with rpc.remote() call
1) rref.to_here() is called with a timeout. Because of delay injection, the processing of this message can be delayed (this is also technically possible in applications without delay injection)
2) At some point, callbacks corresponding to rpc.remote() runs and confirms the rref, adding it as a confirmed user
3) RPC shutdown starts, as part of which we send out RRef user deletes. In this case, 0 sends an RRef user delete to 1, and node 1 removes the owner from the `owners_` field.
4) The `to_here()` message is finally processed by node 1. But since we have deleted the `owner_`, while processing this message we create a future that will complete when the owner exists (this is to account for the case of `to_here()` arriving before `rpc.remote()`). But this future will never complete, since the owner is already deleted, so we hang indefinitely
As a workaround for now, we can force `to_here()` to run before RPC shutdown by adding a blocking `to_here()` call with no timeout.
A more robust, longer-term fix would be to detect if an owner has been previously deleted (such as by an RRefUserDelete). Then, we would know that the future corresponding to owner creation on the remote end will never complete, and we could error out when processing a `to_here()`.
ghstack-source-id: 106036796
Differential Revision: D22084735
fbshipit-source-id: fe7265a4fe201c4d6d2f480f64fe085cd59dbfb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39932
This PR makes RRef fork use the JIT type annotation_str recently
introduced in https://github.com/pytorch/pytorch/pull/39544 to allow a
consistent serialization type string format, and fixes the case where
the dict->str() format does not match the type resolver.
Test Plan: Imported from OSS
Differential Revision: D22015427
Pulled By: wanchaol
fbshipit-source-id: f64d7e3acde5312813816c8f3c7d8fa9379704e8
Summary:
Clearly indicating that a type was inferred by PyTorch rather than explicitly annotated by the user makes many error messages more user-friendly.
Currently Type has two string conversion methods: str() for IR printing and python_str() for serialization and error message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str().
annotation_str() is solely responsible for serialization; it strictly matches the format of Python type annotations. repr_str() is responsible for generating a human-readable error message that includes information like "this type is inferred, not explicitly annotated".
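A toy sketch of the split; the real methods live on the JIT `Type` class, and these bodies are illustrative:
```cpp
#include <string>

struct ToyType {
  std::string pythonName;
  bool inferred = false;

  // Strictly matches Python annotation syntax; safe for serialization.
  std::string annotation_str() const { return pythonName; }

  // Free to add human-oriented context; used only in error messages.
  std::string repr_str() const {
    return inferred ? pythonName + " (inferred)" : pythonName;
  }
};
```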
Closes https://github.com/pytorch/pytorch/issues/39449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544
Differential Revision: D21978759
Pulled By: gmagogsfm
fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590
This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:
- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown); see the sketch after this list.
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
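A toy sketch of the deferred-error pattern from the list above (stand-in types, not the real RRef API):
```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>

enum class ToyRPCErrorType { TIMEOUT, UNKNOWN };

struct ToyRRef {
  std::optional<std::pair<ToyRPCErrorType, std::string>> error;

  // rpc.remote() does not block, so a creation timeout is recorded on
  // the RRef and raised at the next use (to_here() or pickling) instead.
  void to_here() const {
    if (error) {
      throw std::runtime_error(error->second);
    }
    // ... fetch and return the value ...
  }
};
```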
ghstack-source-id: 105232837
Test Plan: CI
Differential Revision: D21588165
fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39012
The `test_rref_context_debug_info` test was flaky with the TensorPipe agent, and I think the issue is the test itself.
What was happening is that on line 1826 the test was clearing a global variable on the remote side which was holding a rref. Even though the RPC call that unset the global variable was synchronous, the messages that the rref context needs to send around to delete that rref are asynchronous. Therefore, sometimes, when we reached line 1845 we saw the following check fail:
```
self.assertEqual(2, int(info["num_owner_rrefs"]))
```
because `num_owner_rrefs` was still 3, as the deletion hadn't yet been processed.
The only way I found to fix it is to add a synchronization step where we wait for all the futures from the rref context to complete. Since we must wait for this to happen on all workers, we synchronize with a barrier.
ghstack-source-id: 104810738
Test Plan: The test isn't flaky anymore.
Differential Revision: D21716070
fbshipit-source-id: e5a97e520c5b10b67c335abf2dc7187ee6227643
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36785
Currently, RRef unpickle (both Python and TorchScript) will block
until the OwnerRRef has been created by the original `rpc.remote`
call, if it is an OwnerRRef. This is not ideal, as correctness would
then depend on the thread-count configuration. This commit changes
that behavior: both `rpc.remote` and the unpickle can create
OwnerRRefs. More specifically, whichever one arrives first creates
the OwnerRRef and the subsequent ones retrieve the same OwnerRRef,
so that no one blocks.
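A toy sketch of the "whichever arrives first creates it" behavior; the real entry point is `getOrCreateOwnerRRef`, and these types are stand-ins:
```cpp
#include <map>
#include <memory>
#include <mutex>

struct ToyOwnerRRef {};

class ToyRRefContext {
 public:
  // Both rpc.remote() handling and unpickling call this: the first
  // caller creates the OwnerRRef, later callers get the same instance,
  // and neither path ever blocks waiting for the other.
  std::shared_ptr<ToyOwnerRRef> getOrCreateOwnerRRef(int rrefId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto& slot = owners_[rrefId];
    if (!slot) {
      slot = std::make_shared<ToyOwnerRRef>();
    }
    return slot;
  }

 private:
  std::mutex mutex_;
  std::map<int, std::shared_ptr<ToyOwnerRRef>> owners_;
};
```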
Test Plan: Imported from OSS
Differential Revision: D21083089
Pulled By: mrshenli
fbshipit-source-id: 34ef063d50549b01c968b47815c4fe9fac179d3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502
We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.
This change migrates the future lambda cases where there was self-capture (sketched below).
- In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
This avoids the reference cycle. We use this primarily in the case where the
value ends up being moved in the callback (something we want to be careful about).
- We also add a convenience API to Future where the completed Future is passed
to the callback as an argument. This allows us to avoid self-capture, though it
assumes that the markCompleted() caller keeps the future alive for the
markCompleted() duration (this has been the case).
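A self-contained toy of the weak_ptr pattern from the first bullet, using a stand-in future:
```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

struct ToyFuture {
  int value = 0;
  std::function<void()> callback;
  void markCompleted(int v) {
    value = v;
    if (callback) callback();
  }
};

int main() {
  auto fut = std::make_shared<ToyFuture>();
  // The lambda holds only a weak_ptr, so it does not keep its own
  // future alive (no Future -> callback -> Future cycle).
  std::weak_ptr<ToyFuture> weakFut = fut;
  fut->callback = [weakFut] {
    auto strong = weakFut.lock();
    assert(strong);                    // completion implies it is alive
    int v = std::move(strong->value);  // value is moved out in the callback
    (void)v;
  };
  fut->markCompleted(7);
}
```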
ghstack-source-id: 102130672
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20998905
fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36419
Since we call waitForThreadLocalPendingRRefs per-RPC, construct the future
already-satisfied in the common empty case, to avoid extra mutex/cv work.
Also, fix naming consistency for `recording_`.
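A toy sketch of the fast path (stand-in types, not the real pending-RRef table):
```cpp
#include <memory>
#include <mutex>
#include <vector>

struct ToyFuture {
  bool completed = false;
  void markCompleted() { completed = true; }
};

class ToyPendingRRefs {
 public:
  // Called once per RPC: when nothing is pending (the common case),
  // hand back an already-satisfied future so the caller never touches
  // a condition variable.
  std::shared_ptr<ToyFuture> waitForPending() {
    std::lock_guard<std::mutex> lock(mutex_);
    auto fut = std::make_shared<ToyFuture>();
    if (pending_.empty()) {
      fut->markCompleted();  // fast path: no mutex/cv round trip later
    } else {
      waiters_.push_back(fut);
    }
    return fut;
  }

 private:
  std::mutex mutex_;
  std::vector<int> pending_;
  std::vector<std::shared_ptr<ToyFuture>> waiters_;
};
```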
ghstack-source-id: 101975739
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20977879
fbshipit-source-id: e321a33127e4b5797e44e039839c579057e778e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254
These future-use changes were all landed yesterday as part of the future
refactoring and quickly reverted due to an observed OOM; they are now being
relanded, since they have since been tested and shown to be benign.
ghstack-source-id: 101776613
Test Plan:
buck test mode/dev-nosan caffe2/test/...
not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod
Differential Revision: D20924010
fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220
The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106
Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset
Differential Revision: D20918904
fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the API.
- A torch::utils::Future callback should have no args, like ivalue::Future.
Many of the lines of this change are related to fixing that up downstream.
No args makes the API simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
appropriately capture the future in the lambda if necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for the error in ivalue::Future.
- Change markCompleted(error) to setError(error) in ivalue::Future.
- Add a setValue(FutureError) version to torch::utils::Future.
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35331
When the function called by remote() throws, it seems sensible to
surface that exception when rref.to_here() is called.
Doing this only involves simple modifications:
- we need the OwnerRRef to keep around an optional<string>
for the error
- add an OwnerRRef setError() method that's parallel to setValue(),
and plumb through the logic
We add rpc_tests to verify that the exception is propagated properly.
ghstack-source-id: 101136900
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn
buck test mode/dev-nosan caffe2/test/distributed/rpc/jit:rpc_spawn
Differential Revision: D20634078
fbshipit-source-id: b5b13fdb85cdf6a43f42347d82eabae1635368ec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33636
Fixes https://github.com/pytorch/pytorch/issues/32119, https://github.com/pytorch/pytorch/issues/26116,
https://github.com/pytorch/pytorch/issues/33072
Makes RRef control messages idempotent and enables sending with retries for distributed autograd cleanup and RRef internal messages.
In order to effectively test whether RRef and distributed autograd cleanup work with network failures/retries, I implemented an RPC agent with a faulty send function, and enabled running tests using this as a third backend (in addition to Thrift and PGA). The tests using this backend are in a separate class (the test cases are similar but with minor changes to ensure short-running tests wait for retried RPCs to finish).
This faulty RPC agent is pretty configurable. The tests can configure which message types to fail and how many messages to fail, but going forward, other RPC functionality can be overridden with faulty methods to test with failures injected.
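A toy sketch of the fault-injecting agent (stand-in types; the real agent overrides the RPC agent's send path):
```cpp
#include <set>

enum class ToyMessageType { RREF_USER_DELETE, CLEANUP_AUTOGRAD, OTHER };

struct ToyAgent {
  virtual ~ToyAgent() = default;
  virtual bool send(ToyMessageType /*type*/) { return true; }  // "delivered"
};

// Configurable failure injection: drop the first `numFailSends` messages
// whose type is in `messagesToFail`, forcing the retry machinery to run.
struct ToyFaultyAgent : ToyAgent {
  std::set<ToyMessageType> messagesToFail;
  int numFailSends = 0;

  bool send(ToyMessageType type) override {
    if (numFailSends > 0 && messagesToFail.count(type)) {
      --numFailSends;
      return false;  // simulate a network failure
    }
    return ToyAgent::send(type);
  }
};
```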
Differential Revision: D20019236
fbshipit-source-id: 540a977e96b2e29aa0393ff12621fa293fe92b48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34988
In https://github.com/pytorch/pytorch/pull/31893, we introduced a confirmedUsers_ map in RRefContext.
For the case where the fork is shared from the owner, there is no `pendingUsers_` intermediate phase for this fork, so we should put it into `confirmedUsers_` immediately.
Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork
```
Differential Revision: D7735909
fbshipit-source-id: 14c36a16486f0cc9618dcfb111fe5223781b647d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497
Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.
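A toy sketch of the interception (stand-in types, with the confirmation-waiting loop elided):
```cpp
#include <memory>
#include <utility>
#include <vector>

struct ToyUserRRef {
  bool confirmed = false;
};

// Filled as a side effect of argument deserialization on the server.
thread_local std::vector<std::shared_ptr<ToyUserRRef>> pendingUserRRefs;

// Deserialization registers each UserRRef it materializes ...
void onUserRRefDeserialized(std::shared_ptr<ToyUserRRef> rref) {
  pendingUserRRefs.push_back(std::move(rref));
}

// ... and the server drains the table, waiting for confirmations,
// before launching the user function.
std::vector<std::shared_ptr<ToyUserRRef>> takePendingUserRRefs() {
  return std::exchange(pendingUserRRefs, {});
}
```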
Differential Revision: D20347464
Test Plan: Imported from OSS
Pulled By: mrshenli
fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893
In order to resolve the issue summarized in https://github.com/pytorch/pytorch/issues/31325.
The overall solution is to proactively send out delete fork messages from user nodes, before user nodes detect RRef leaks.
As the first step, we want to have a weak ref tracker to track all user rrefs.
ghstack-source-id: 100023142
Test Plan:
V22 is the version that makes the User wait on the delete UserRRef message.
# Unit tests
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_rref_forward_chain
buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency
```
Reviewed By: mrshenli
Differential Revision: D19292254
fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9