pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
cyy	f048569c24	[Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439 ) Follows #131671 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439 Approved by: https://github.com/kwen2501	2024-09-24 13:05:15 +00:00
cyy	95dbbf713e	[Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109 ) Follows #125102 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109 Approved by: https://github.com/ezyang	2024-07-16 04:23:42 +00:00
cyy	f4dcf2ae93	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-07-08 07:03:53 +00:00
PyTorch MergeBot	846bb30e13	Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )" This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5. Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build `bd72e28314`. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))	2024-06-15 01:58:20 +00:00
cyy	bd72e28314	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang	2024-06-14 23:21:01 +00:00
Richard Barnes	ed327876f5	[codemod] `c10:optional` -> `std::optional` (#126135 ) Generated by running the following from PyTorch root: ``` find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/' ``` `c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi	2024-05-14 19:35:51 +00:00
Scott Wolchok	fff1948b02	[PyTorch] intrusive_ptr: don't guarantee release_resources will be called Pull Request resolved: https://github.com/pytorch/pytorch/pull/76767 We're spending a virtual function call in the common case where there are no weak references just to save a small amount of care in intrusive_ptr_target subclasses that override release_resources, of which there aren't very many. Differential Revision: [D36109757](https://our.internmc.facebook.com/intern/diff/D36109757/) NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36109757/)! Approved by: https://github.com/ezyang	2022-06-10 19:30:35 +00:00
Richard Barnes	ee44d73e59	Modernize override (#61744 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744 Test Plan: Sandcastle Reviewed By: malfet Differential Revision: D29717320 fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc	2021-07-23 23:04:46 -07:00
Luca Wehrstedt	7bcd8f94a5	Avoid re-doing CUDA stream sync in OwnerRRef (#57355 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355 We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or, well, ivalue::Future nowadays, after they got merged) this is all done automatically for us, hence we can undo these "fixes" as they're now duplicated. ghstack-source-id: 130583771 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28118182 fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b	2021-06-04 06:52:33 -07:00
Luca Wehrstedt	797dff55b5	Unify fetching RRefs (#57859 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859 Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to future, we can have that logic once, and then connect it to different follow-up steps. ghstack-source-id: 129567050 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28286172 fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a	2021-05-21 13:15:15 -07:00
Luca Wehrstedt	45012da298	Migrate from shared_ptr to intrusive_ptr for Future (#57636 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636 The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge. At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC. This PR was produced mainly by find-and-replace, plus a couple of manual fixes. ghstack-source-id: 128296581 Test Plan: CI Reviewed By: pritamdamania87 Differential Revision: D28187972 fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b	2021-05-07 03:59:20 -07:00
Luca Wehrstedt	7d4121d1d2	Make RRefContext get devices from RPC agent when creating OwnerRRef (#57443 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443 Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...). Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events, however we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring to specify devices when creating a OwnerRRef by hand in Python. ghstack-source-id: 128184665 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28144365 fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3	2021-05-06 01:12:56 -07:00
Luca Wehrstedt	7ffadf6e46	Replace DeviceIndexes with Devices in RRefs (#57442 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442 We did this for the RPC agents and for ivalue::Future, the last one (I think) is RRef. ghstack-source-id: 128184664 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28144368 fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3	2021-05-06 01:12:54 -07:00
Nikita Shulga	eac02f85cf	Fix more clang-tidy errors (#57235 ) Summary: In my last PR I've missed CUDA and distributed folders, fixing this now This change is autogenerated by `python tool/clang_tidy.py -s` Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235 Reviewed By: janeyx99 Differential Revision: D28084444 Pulled By: malfet fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda	2021-04-28 23:29:10 -07:00
Shen Li	1ee54cc7b4	Add devices argument to RRef constructor (#57085 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085 PR #54932 fixed the CUDA RPC for RRef when RRef is created through RPC. But besides that use case, RRef can also be created locally by directly passing in a value, which would bypass the CUDA stream synchronization in #54932. This commit covers the above gap by adding a `devices` argument to RRef constructor. The RRef will then use this argument to choose between `CUDAFuture` and `ivalue::Future` to hold the value. When `devices` is specified and non-empty, `CUDAFuture` will be used, and the `devices` will be passed to that `CUDAFuture`. Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D28050001 Pulled By: mrshenli fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e	2021-04-28 19:11:10 -07:00
Pavel Belevich	9f89b53d7d	Synchronize RRef.to_here() CUDA Streams properly (#54932 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54932 Test Plan: Imported from OSS Reviewed By: mrshenli Differential Revision: D27684022 Pulled By: pbelevich fbshipit-source-id: 2bae51ab6649258d0219ca4e9dbbf45ac6a76c28	2021-04-13 23:24:38 -07:00
Rohan Varma	3b11822825	[RPC] Refactor rref_context to not use utils::Future (#51697 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51697 Refactors the rest of rref_context, specifically pendingOwners map and `getOwnerRRef` to use JitFuture. ghstack-source-id: 122037611 Test Plan: CI Reviewed By: wanchaol Differential Revision: D26243268 fbshipit-source-id: ab8874c8253274e8fe50dcd7291e0655a8f3f1df	2021-02-19 00:59:38 -08:00
Shen Li	008206decc	Replace FutureMessage with ivalue::Future in RRefContext (#49960 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49960 Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25730530 Pulled By: mrshenli fbshipit-source-id: 5d54572c653592d79c40aed616266c87307a1ad8	2021-01-07 19:50:19 -08:00
Shen Li	25ef605132	Replace FutureMessage with ivalue::Future in distributed/autograd/utils.* (#49927 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49927 Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25724241 Pulled By: mrshenli fbshipit-source-id: d608e448f5224e41fbb0b5be6b9ac51a587f25b4	2021-01-07 19:50:16 -08:00
Pritam Damania	f1624b82b5	Preserve python backtrace in autograd engine errors. (#43684 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43684 This PR attempts to address #42560 by capturing the appropriate exception_ptr in the autograd engine and passing it over to the Future. As part of this change, there is a significant change to the Future API: setError now only accepts an exception_ptr. For the example in #42560, the exception trace would now look like: ``` > Traceback (most recent call last): > File "test_autograd.py", line 6914, in test_preserve_backtrace > Foo.apply(t).sum().backward() > File "torch/tensor.py", line 214, in backward > torch.autograd.backward(self, gradient, retain_graph, create_graph) > File "torch/autograd/__init__.py", line 127, in backward > allow_unreachable=True) # allow_unreachable flag > File "torch/autograd/function.py", line 87, in apply > return self._forward_cls.backward(self, *args) > File "test_autograd.py", line 6910, in backward > raise ValueError("something") > ValueError: something ``` ghstack-source-id: 111109637 Test Plan: waitforbuildbot Reviewed By: albanD Differential Revision: D23365408 fbshipit-source-id: 1470c4776ec8053ea92a6ee1663460a3bae6edc5	2020-09-01 01:28:47 -07:00
Shihao Xu	b803b4ce09	[torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974 # Problem When this assertion happens, I don't know - which worker_id it is on, even with the worker_name "trainer:0". - which rref is throwing this exception. ```shell File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items() File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp> trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items() File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception raise result.exception_type(result.msg) RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0') Traceback (most recent call last): File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function result = python_udf.func(python_udf.args, python_udf.kwargs) File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke return getattr(rref.local_value(), func_name)(args, **kwargs) RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0 ``` Changes, - Add stringify WorkerInfo - Make localValue() assertion message clearer about the case. 
ghstack-source-id: 105840918 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork Reviewed By: mrshenli Differential Revision: D5690653 fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170	2020-06-13 12:57:05 -07:00
Rohan Varma	8b2bb02e09	Implement timeout support for RRefs (#38590 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType`s (currently just timeout and unknown) - A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 105232837 Test Plan: CI Differential Revision: D21588165 fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3	2020-06-04 02:14:42 -07:00
Shen Li	155a287aea	Enforce const on PyRRef functions (#38415 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38415 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D21554722 Pulled By: mrshenli fbshipit-source-id: 53c2abd8de43545873be486e1fb893bc329d65a1	2020-05-14 19:01:28 -07:00
Shihao Xu	3d0279862d	Consolidate builtin/python_udf RPC to return ivalue::Future like torchscript RPC does (#35154 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154 This is for issue https://github.com/pytorch/pytorch/issues/34999. close https://github.com/pytorch/pytorch/issues/34999. https://github.com/pytorch/pytorch/issues/34997 need more work. This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future. Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100 buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \ buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \ -r test_call_method_on_rref ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class \(fb\.test_rpc_fork\.RpcTestWithFork\)' --stress-runs 100 test_rref_proxy_reuse test_handle_send_exceptions ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \ buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \ -r test_script_call_python_return_future ``` Differential Revision: D7722184 fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e	2020-05-08 21:28:56 -07:00
Shihao Xu	615235fc80	Migrate OwnerRRef value store to generic torch Future (#38143 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38143 It's a followup of https://github.com/pytorch/pytorch/pull/32556, where an error handling boilerplate code path was added to the FutureMessage callback. However, I noticed that the FutureMessage could never be set with an error, because the FutureMessage is a member in OwnerRRef, - OwnerRRef does not have a setError method yet. - The FutureMessage is only used for signaling - The value of the RRef is contained in the `value_` field. With the Future being generalized, it could contain more value types, not limited to Message. This PR migrates the OwnerRRef value from the `value_` field to the generic Future. In a later PR, it will be super easy to add a `setError` method for OwnerRRef, which calls `future_.setError(..)`. (I decide to do it later. I think it's better to migrate the call sites together with adding the new `setError` method.) Also, this fixes the issue pointed out by https://github.com/pytorch/pytorch/pull/31086/files#r422256916. This PR was submitted as https://github.com/pytorch/pytorch/pull/32608. ghstack-source-id: 103757743 Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \ buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \ -r test_call_method_on_rref ``` Differential Revision: D5707692 fbshipit-source-id: 83ce0e5e5e97acb9ce8230fce5e4a3d806478b02	2020-05-08 15:10:32 -07:00
Shen Li	d5b38984c8	Let RPC return FutureIValue instead of FutureMessage (#37519 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37519 closes #37446 Currently FutureMessage is used in several places: 1. `rpc_async` returns a `FutureMessage` object and we expose it as `torch.distributed.rpc.Future`. From applications perspective, they are expecting a `py::object` instead of a `Message`, and we do the conversion in the `Future.wait()` pybind method. 2. RPC autograd profiler takes `FutureMessage` and installs callbacks to it. The profiler actually only need a `Future<T>` and does not care what `T` is. 3. `OwnerRRef` exposes a `getFuture()` API which returns a `FutureMessage`. This `FutureMessage` will be marked completed when the value referenced by the `OwnerRRef` is ready. `OwnerRRef` does not need it to be a Message type either, it actually creates an empty `Message` to mark the `Future`. The above places are using `FutureMessage`, but they don't really need a `Message`, and `Message` is a communication layer type that applications or profiler or the RRef shouldn't be aware of. Another motivation for making this change is that for async RPC UDF #36071, we are going to allow application to call `markCompleted` in Python. If we still use `FutureMessage`, then in the `markCompleted` pybind function, it needs to convert the provided `py::object` into a specific message type, which is leaking communication layer code to pybind functions. Even if this is doable, we will have two entities (RPC agent and pybind Python frontend) accessing the same request callback logic. This is too messy. This commit replaces all surface `FutureMessage` with `FutureIValue`, so that `FutureMessage` is no longer visible from Python land. Note that this does not cause BC issues, as the Python Future type name and its API stay intact. Internally, we still have `FutureMessage` in the communication layer. 
Test Plan: Imported from OSS Reviewed By: xush6528 Differential Revision: D21308887 Pulled By: mrshenli fbshipit-source-id: 4f574f38e83125081f142813cfdde56119522089	2020-04-29 19:10:29 -07:00
Rohan Varma	752d3c281a	[profiler] Allow record_function ctx manager to profile futures (#35055 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055 This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future. This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future. Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well. These code paths will be merged once the futures are merged. ghstack-source-id: 102478180 Test Plan: Added unit tests Differential Revision: D20452003 fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d	2020-04-20 12:37:54 -07:00
Omkar Salpekar	5927a6731c	[PyTorch Docs] Updated RRef docs to indicate RPC Retries (#36678 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36678 Updated the docs to explicitly indicate that RRef control messages are idempotent and retried upon failure. ghstack-source-id: 102225791 Test Plan: build bot Differential Revision: D20828041 fbshipit-source-id: ca4d71c65a453664c16c32134c47637a966b1a19	2020-04-15 17:33:20 -07:00
Jeremy Lilley	f182b43760	[rref] Handle exceptions returned via remote() calls (#35331 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35331 When the function called by remote() throws, it seems sensible to surface that exception when rref.to_here() is called. Doing this only involves simple modifications: - we need the OwnerRRef to keep around an optional<string> for the error - add an OwnerRRef setError() method that's parallel to setValue(), and plumb through the logic We add rpc_tests to verify that the exception is propagated properly. ghstack-source-id: 101136900 Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:rpc_spawn buck test mode/dev-nosan caffe2/test/distributed/rpc/jit:rpc_spawn Differential Revision: D20634078 fbshipit-source-id: b5b13fdb85cdf6a43f42347d82eabae1635368ec	2020-03-31 10:06:15 -07:00
Shihao Xu	b5edf329f8	[JIT] Make RPC RRef Owner WorkerInfo.name available to TorchScript (#34896 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34896 Make TorchScript support calling ref.owner() to get owner worker id and calling ref.owner_name() to get owner worker name. Differential Revision: D7652208 fbshipit-source-id: a60125bb316ac2cf19a993cbd2affc933c0af7c9	2020-03-17 20:28:18 -07:00
Shen Li	422e348619	Don't run user function until all UserRRefs in the args are confirmed (#34497 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497 Use a thread_local table to intercept UserRRefs created during user function args deserialization, and then wait for confirmations of those UserRRefs before launching the given user function. Differential Revision: D20347464 Test Plan: Imported from OSS Pulled By: mrshenli fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07	2020-03-16 18:30:06 -07:00
Shen Li	ad4bc8c9b8	Best-effort Error Detection for Using Deleted UserRRefs (#34673 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34673 Test Plan: Imported from OSS Differential Revision: D20427839 Pulled By: mrshenli fbshipit-source-id: b1b12ca42a9ed5294806c53fa7d6f54e7dc8b188	2020-03-12 21:39:15 -07:00
Shihao Xu	4e07c35679	Delete all user forks tracked in RRefContext before graceful shutting down (#31893 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893 This resolves the issue summarized in https://github.com/pytorch/pytorch/issues/31325. The overall solution is to proactively send out delete fork messages from user nodes before the user nodes detect rref leaks. As the first step, we want to have a weak ref tracker to track all user rrefs. ghstack-source-id: 100023142 Test Plan: V22 is the version that makes the user wait on the delete UserRRef message. # Unit tests ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100 buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_rref_forward_chain buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency ``` Reviewed By: mrshenli Differential Revision: D19292254 fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9	2020-03-12 10:23:08 -07:00
Shen Li	7da24b36b1	Apply clang-format to RPC files (#34139 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34139 Test Plan: Imported from OSS Differential Revision: D20227342 Pulled By: mrshenli fbshipit-source-id: 01b478bde1f6a51f69eb5277fa90ba6ac2d4b5dc	2020-03-03 16:44:35 -08:00
Yanli Zhao	4d9b649261	jit pickling rref (#32959 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959 In the RPC TorchScript call path, we need to pickle/unpickle rrefs; this diff makes the JIT pickler/unpickler able to pickle/unpickle an rref. It is similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle(). The pickling/unpickling design assumes it is always coupled with RPC calls. It is not needed for checkpointing a model with an rref: before checkpointing the model, the user should call ref.to_here() to get the value inside the rref. The pickling process is: 1. push the torch.distributed.rpc.rref global string 2. call rref.fork() and create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include rref id, fork id, caller worker id, callee worker id, owner worker id 3. push the rrefForkData The unpickling process is: 1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function 2. the global lambda function will get the rrefForkData 3. if the callee is also the owner worker, then get the owner rref based on the IDs inside rrefForkData and return the ownerRRef 4. if the callee is not the owner worker, then create a user rref using the rrefForkData and return the userRRef 5. meanwhile the owner rref will be notified and do reference counting correctly During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from rpc_agent and pass it to the unpickler during construction. So we added a type_resolver argument to the jit unpickler constructor in this diff. ghstack-source-id: 98814793 Test Plan: unit test Differential Revision: D19713293 fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e	2020-02-24 11:16:35 -08:00
Wanchao Liang	9ae4d38a21	[rpc] Switch RRef to be managed by intrusive_ptr (#33189 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33189 Add RRefInterface to Aten/Core, which will later be used by IValue. Switch the whole rpc code base to use intrusive_ptr instead of shared_ptr, so that we can add it to IValue. The actual addition to IValue and JIT will be in the next PR. Test Plan: Imported from OSS Differential Revision: D19871241 Pulled By: wanchaol fbshipit-source-id: d7e1fd04b46320e0f26c18591b49c92ad30a4032	2020-02-13 20:15:31 -08:00
Shihao Xu	12bcfa7c77	Remove Python dependency (toPyTuple/fromPyTuple, jitCompilationUnit, deserialize) in rref_impl.h/cpp (#32753 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32753 Functions to be bound as an Aten operator could not have Python dependency. This is to refactor and remove Python dependency. ghstack-source-id: 97485800 Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported ``` ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call ``` Differential Revision: D5741675 fbshipit-source-id: 31ee60955be8d815d0773f3699e3ff2f1f9d8849	2020-01-30 17:52:48 -08:00
Yanli Zhao	b474c351dd	[rpc] Remove template on RRef and add Type to RRef creation (#30630 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30630 This removes the template and all of its specializations in rpc; we universally use IValue as the inner value, since we support holding a Python object inside an IValue. This also ensures that we have the correct type information when creating the RRef: we use the return type from the schema when creating UserRRef and OwnerRRef, which will enable IValue to always have the correct type if the IValue is the RRef object (next PR). Test Plan: Imported from OSS Differential Revision: D19502235 fbshipit-source-id: 0d5decae8a9767e0893f3b8b6456b231653be3c5	2020-01-23 21:15:46 -08:00
Shen Li	e8e47c0a1b	Split RRef class into abstract RRef and RRefBase (#28942 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/28942 The new abstract RRef class contains only user-facing RRef APIs. It will be later moved to a common folder so that it can be shared by jit and distributed packages to provide TorchScript support. Test Plan: Imported from OSS Differential Revision: D18240590 Pulled By: mrshenli fbshipit-source-id: ac28cfc2c8039ab7131b537b2971ed4738710acb	2019-12-28 20:01:02 -08:00

39 Commits