pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
albanD	1a8af1503f	Upgrade Pybind submodule to 2.10.4 (#103989 ) This is not ready for review, this is to make sure asan is fixed. Not sure what is the most effective way to track down the bad dec_ref within deploy yet. The asan silencing is done to match this comment: `1c79003b3c/test/test_cpp_extensions_jit.py (L749-L752)` EDIT: since the final failing function is in libtorch_python.so, we would need to skip that whole lib (not ok). So now we're skipping based on the function name which should be restrictive enough to not hide any real bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103989 Approved by: https://github.com/malfet	2023-06-27 20:22:39 +00:00
Sergii Dymchenko	edec9698ab	Fix ScripModule typo (#84444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/84444 Approved by: https://github.com/malfet	2022-09-01 23:55:25 +00:00
Pritam Damania	05e17e7ff6	Add API usage logging for several other RPC APIs. (#67722 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/67722 ghstack-source-id: 142259452 Test Plan: waitforbuildbot Reviewed By: jaceyca, fduwjj Differential Revision: D32118872 fbshipit-source-id: 041ab5601221b1846c56ce4bb63364bec9ad28b0	2021-11-03 14:02:00 -07:00
Scott Wolchok	82f7f8d471	[PyTorch] Adopt IValue::toTupleRef() where obvious (#65505 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65505 Generated with `fastmod -m 'toTuple(\s)->' 'toTupleRef()${1}.'` , followed by `fastmod '(std::move$.)toTupleRef\($.' '${1}toTuple()->'` to unbreak 2 callsites. ghstack-source-id: 142065835 Test Plan: CI Reviewed By: gchanan Differential Revision: D31131025 fbshipit-source-id: 54457ae5bbeb38db9c7f196d469b98521c3d3f34	2021-11-02 10:22:18 -07:00
Scott Wolchok	e88d1c4f10	[PyTorch] Add tuple inline storage (#64066 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64066 I noticed a bunch of time being spent heap-allocating Tuples in the unpickler. 1-, 2-, and 3-element Tuples are apparently common enough that they get their own bytecode instructions, so I decided to try also giving them their own representation. We store up to 3 IValues inline in `Tuple` rather than doing a second heap allocation for a `std::vector<IValue>`. ghstack-source-id: 140695395 Test Plan: Added automated tests for TupleElements. Pixel 3 before: https://www.internalfb.com/intern/aibench/details/761596366576284 Pixel 3 after: https://www.internalfb.com/intern/aibench/details/591414145082422 We went from 347 ms to 302 ms. Reviewed By: dhruvbird Differential Revision: D30592622 fbshipit-source-id: 93625c54c9dca5f765ef6d5c191944179cb281a8	2021-10-15 12:16:51 -07:00
Scott Wolchok	2d885ab73d	[jit] Reduce refcounting of Types (#65345 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/65345 FooType::get() can return a const reference. Inconveniently, converting shared_ptr<FooType> to shared_ptr<Type> requires a copy & refcount bump, so to properly take advantage of this in unshapedType() we need to take a const Type& in isSubtypeOf(), which is good practice anyway -- don't require a shared_ptr if you don't need to take ownership. ghstack-source-id: 140044165 Test Plan: CI perf says c10::unshapedType time decreased from 2.8% to 2.2% during static runtime startup, though I expect this to be generally beneficial. Reviewed By: hlu1 Differential Revision: D31027361 fbshipit-source-id: 676feb81db9f74ad7b8651d8774f4ecb4cfa6ab8	2021-10-08 09:03:04 -07:00
Rohan Varma	d433a55c94	Replace throw std::runtime_error with torch_check in torch/csrc/distributed (#59683 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683 Replaces usages of throw std::runtime_error("foo") with the better TORCH_CHECK(false, "foo"), which allows C++ stacktraces to show up when TORCH_SHOW_CPP_STACKTRACES=1. This will hopefully provide much better information when debugging crashes/flaky tests. ghstack-source-id: 131167210 Test Plan: CI Reviewed By: cbalioglu Differential Revision: D28981327 fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b	2021-06-11 11:15:49 -07:00
Luca Wehrstedt	797dff55b5	Unify fetching RRefs (#57859 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859 Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to future, we can have that logic once, and then connect it to different follow-up steps. ghstack-source-id: 129567050 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28286172 fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a	2021-05-21 13:15:15 -07:00
Luca Wehrstedt	7d4121d1d2	Make RRefContext get devices from RPC agent when creating OwnerRRef (#57443 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57443 Based on the comments in https://github.com/pytorch/pytorch/pull/57355, I started looking at the callsites of `getOrCreateOwnerRRef` and `createOwnerRRef`, and noticed that many of them didn't specify the `devices` argument, which was optional and thus defaulted to `{}`, which created a CPU-only Future inside the OwnerRRef. (Such callsites were, for example, in `processPythonRemoteCall` and `processBaseScriptRemoteCall`, or `PyRRef::unpickle`, ...). Some (or all?) of these callsites might still have worked thanks to the RRef's own handling of CUDA streams and events, however we intend to remove that in https://github.com/pytorch/pytorch/pull/57355. I think it would be a safer and more generic solution to always create OwnerRRefs with the full set of devices supported by the RPC agent, and this is in fact easy to do since the RRefContext has access to the RPC agent. This means that all OwnerRRefs, no matter how they're created, will support CUDA if the agent does. This also allows us to stop requiring to specify devices when creating a OwnerRRef by hand in Python. ghstack-source-id: 128184665 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28144365 fbshipit-source-id: 1f2d446873f31ee297415c46b94126b6502b12d3	2021-05-06 01:12:56 -07:00
Luca Wehrstedt	7ffadf6e46	Replace DeviceIndexes with Devices in RRefs (#57442 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57442 We did this for the RPC agents and for ivalue::Future, the last one (I think) is RRef. ghstack-source-id: 128184664 Test Plan: CI Reviewed By: mrshenli Differential Revision: D28144368 fbshipit-source-id: eeacab6006f72118cbec542a02322f2e391c67a3	2021-05-06 01:12:54 -07:00
Shen Li	1ee54cc7b4	Add devices argument to RRef constructor (#57085 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085 PR #54932 fixed the CUDA RPC for RRef when RRef is created through RPC. But besides that use case, RRef can also be created locally by directly passing in a value, which would bypass the CUDA stream synchronization in #54932. This commit covers the above gap by adding a `devices` argument to the RRef constructor. The RRef will then use this argument to choose between `CUDAFuture` and `ivalue::Future` to hold the value. When `devices` is specified and non-empty, `CUDAFuture` will be used, and the `devices` will be passed to that `CUDAFuture`. Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D28050001 Pulled By: mrshenli fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e	2021-04-28 19:11:10 -07:00
Meghan Lele	6866c033d5	[JIT] Add recursive scripting for class type module attributes (#55124 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55124 Summary This commit modifies type inference (used by the module scripting code) so that it tries to script the type of any class instances that it encounters. This enables recursive, automatic scripting of class type module attributes. Test Plan This commit adds a test case for this to `TestClassType`. Test Plan: Imported from OSS Reviewed By: gmagogsfm Differential Revision: D23971883 Pulled By: SplitInfinity fbshipit-source-id: 7a5a2e7c12ee68cbdeb0a07e6aaf98734a79cb06	2021-04-02 12:16:21 -07:00
Rohan Varma	c3f2f3294e	[RPC] Add option to make rref.get_type not block. (#50977 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50977 Adds a `blocking` flag that can be set to False to make this API return a `Future` to the type. This is to make this function non-blocking, mostly for a future change that will allow `rref.rpc_async()` to be completely non-blocking (it currently calls and waits for this function that issues an RPC in-line). ghstack-source-id: 121021433 Test Plan: Modified UT Reviewed By: mrshenli Differential Revision: D25944582 fbshipit-source-id: e3b48a52af2d4578551a30ba6838927b489b1c03	2021-02-04 20:18:50 -08:00
Rohan Varma	d64184ef4c	[RPC] Support timeout for RRef proxy functions (#50499 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50499 Adds a timeout API to the following functions: ``` rref.rpc_sync() rref.rpc_async() rref.remote() ``` so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases: 1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised 2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change. Test Plan: Added UT Reviewed By: wanchaol Differential Revision: D25897495 fbshipit-source-id: f9ad5b8f75121f50537677056a5ab16cf262847e	2021-01-15 13:23:23 -08:00
Rohan Varma	ab1ba8f433	[RPC] Support timeout in rref._get_type() (#50498 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50498 This change is mostly needed for the next diff in this stack, where rref._get_type() is called in the rpc_async/rpc_sync RRef proxy function and can block indefinitely if there is no timeout. It will also be useful to have a timeout argument when we publicize this API to keep it consistent with other RPC APIs. ghstack-source-id: 119859767 Test Plan: Added UT Reviewed By: pritamdamania87 Differential Revision: D25897588 fbshipit-source-id: 2e84aaf7e4faecf80005c78ee2ac8710f387503e	2021-01-15 13:18:39 -08:00
Shen Li	f9f758e349	Apply clang-format to rpc cpp files (#50236 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50236 Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25847892 Pulled By: mrshenli fbshipit-source-id: b4af1221acfcaba8903c629869943abbf877e04e	2021-01-08 11:47:43 -08:00
Shen Li	2d5f57cf3b	Completely remove FutureMessage from RRef Implementations (#50004 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50004 Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25750602 Pulled By: mrshenli fbshipit-source-id: 06854a77f4fb5cc4c34a1ede843301157ebf7309	2021-01-07 19:50:27 -08:00
Shen Li	25ef605132	Replace FutureMessage with ivalue::Future in distributed/autograd/utils.* (#49927 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49927 Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25724241 Pulled By: mrshenli fbshipit-source-id: d608e448f5224e41fbb0b5be6b9ac51a587f25b4	2021-01-07 19:50:16 -08:00
Shen Li	84e3237a53	Let RpcAgent::send() return JitFuture (#49906 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906 This commit modifies RPC Message to inherit from `torch::CustomClassHolder`, and wraps a Message in an IValue in `RpcAgent::send()`. Test Plan: Imported from OSS Reviewed By: lw Differential Revision: D25719518 Pulled By: mrshenli fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b	2021-01-07 19:47:14 -08:00
Pritam Damania	781e0ed835	Support RRef.backward() for Owner RRefs. (#46641 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46641 Second part of https://github.com/pytorch/pytorch/pull/46568, allows RRef.backward() to work for owner RRefs. ghstack-source-id: 115440252 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D24441300 fbshipit-source-id: 64af28e6b6ae47ea27e611a148f217bc344a4c5b	2020-11-07 21:25:32 -08:00
Pritam Damania	adafd3d4b2	Support RRef.backward() for local RRefs. (#46568 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568 This PR adds support for an RRef.backward() API. This would be useful in applications like pipeline parallelism as described here: https://github.com/pytorch/pytorch/issues/44827 This PR only adds support for local RRefs, remote RRef support will be added in a follow up PR. ghstack-source-id: 115100729 Test Plan: 1) unit tests. 2) waitforbuildbot Reviewed By: mrshenli Differential Revision: D24406311 fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513	2020-10-26 17:31:17 -07:00
Shen Li	924717bf51	Add _get_type() API to RRef (#44663 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44663 The new API returns the type of the data object referenced by this `RRef`. On the owner, this is same as `type(rref.local_value())`. On a user, this will trigger an RPC to fetch the `type` object from the owner. After this function is run once, the `type` object is cached by the `RRef`, and subsequent invocations no longer trigger RPC. closes #33210 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D23691990 Pulled By: mrshenli fbshipit-source-id: a2d87cd601a691dd75164b6bcd7315245e9cf6bd	2020-09-16 11:59:22 -07:00
Michael Suo	c93e96fbd9	[jit] move script-related implementation out of torch/jit/__init__.py (#40902 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40902 See the bottom of this stack for context. Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D22360210 Pulled By: suo fbshipit-source-id: 4275127173a36982ce9ad357aa344435b98e1faf	2020-07-08 11:38:34 -07:00
Shihao Xu	b803b4ce09	[torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974 # Problem When this assertion happens, I don't know - which worker_id it is on, even with the worker_name "trainer:0". - which rref is throwing this exception. ```shell File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items() File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp> trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items() File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception raise result.exception_type(result.msg) RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0') Traceback (most recent call last): File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function result = python_udf.func(python_udf.args, python_udf.kwargs) File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke return getattr(rref.local_value(), func_name)(args, **kwargs) RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0 ``` Changes, - Add stringify WorkerInfo - Make localValue() assertion message clearer about the case. 
ghstack-source-id: 105840918 Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork Reviewed By: mrshenli Differential Revision: D5690653 fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170	2020-06-13 12:57:05 -07:00
Yanan Cao	c22bbb2124	[JIT] Add Type::repr_str to return human-readable str (#39544 ) Summary: Clearly expressing a type is inferred by PyTorch instead of explicitly annotated by user makes many error messages more user-friendly Currently Type has two string conversion methods. str() for IR printing and python_str() for serialization and error message generation. If we want to include more information in type printing while maintaining serialization/deserialization correctness, we need to split python_str() into annotation_str() and repr_str(). annotation_str is solely responsible for serialization, it strictly matches format of python type annotation. repr_str() is responsible for generating a human-readable error message that includes information like "this type is inferred, not explicitly annotated" Closes https://github.com/pytorch/pytorch/issues/39449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39544 Differential Revision: D21978759 Pulled By: gmagogsfm fbshipit-source-id: 733566f5a62e748b5ca4bb3c5943ebb6d5b664d0	2020-06-10 12:01:24 -07:00
Rohan Varma	8b2bb02e09	Implement timeout support for RRefs (#38590 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590 This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works: - A timeout parameter is added to rpc.remote. If the rpc.remote call times out, the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or by a to_here call). - Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error, resulting in a `std::terminate`. Instead, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorType` values (currently just timeout and unknown). - A timeout parameter is added to `to_here()`, which gives the user control over the max amount of time it can block for. - `ctx.prepareChildForFork()`, which is called when the RRef is pickled (i.e. used as an arg over RPC), checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user. - Tests are added, primarily via delay injection. ghstack-source-id: 105232837 Test Plan: CI Differential Revision: D21588165 fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3	2020-06-04 02:14:42 -07:00
Shen Li	155a287aea	Enforce const on PyRRef functions (#38415 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38415 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D21554722 Pulled By: mrshenli fbshipit-source-id: 53c2abd8de43545873be486e1fb893bc329d65a1	2020-05-14 19:01:28 -07:00
Rohan Varma	4d4895a62a	Use Future's then() API to fix RPC profiling (#38352 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38352 Fixes the RPC profiling by using the `then()` API added in https://github.com/pytorch/pytorch/pull/37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value) To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests. Re-enabled profiling tests and stress tested them 1000 times to verify the fix ghstack-source-id: 104086114 Test Plan: Re-enabled profiling tests Differential Revision: D21506940 fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a	2020-05-14 12:52:45 -07:00
Shen Li	f99a693cd9	Remove unnecessary py::object copy in PyRRef ctor (#38402 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/38402 Test Plan: Imported from OSS Differential Revision: D21554724 Pulled By: mrshenli fbshipit-source-id: abab45010810ec53628ea2c7a9c76cdc50eb2f74	2020-05-13 22:00:13 -07:00
Shihao Xu	3d0279862d	Consolidate builtin/python_udf RPC to return ivalue::Future like torchscript RPC does (#35154 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154 This is for issue https://github.com/pytorch/pytorch/issues/34999. close https://github.com/pytorch/pytorch/issues/34999. https://github.com/pytorch/pytorch/issues/34997 need more work. This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future. Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100 buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \ buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \ -r test_call_method_on_rref ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class $fb\.test_rpc_fork\.RpcTestWithFork$' --stress-runs 100 test_rref_proxy_reuse test_handle_send_exceptions ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \ buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \ -r test_script_call_python_return_future ``` Differential Revision: D7722184 fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e	2020-05-08 21:28:56 -07:00
Shen Li	322e564ee3	Minor format cleanup in py_rref.cpp (#37520 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37520 Test Plan: Imported from OSS Reviewed By: xush6528 Differential Revision: D21308889 Pulled By: mrshenli fbshipit-source-id: 36d5efc4d9c3e6cc0b2abec35675a338a2f81424	2020-04-29 19:12:40 -07:00
Shen Li	d5b38984c8	Let RPC return FutureIValue instead of FutureMessage (#37519 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37519 closes #37446 Currently FutureMessage is used in several places: 1. `rpc_async` returns a `FutureMessage` object and we expose it as `torch.distributed.rpc.Future`. From applications perspective, they are expecting a `py::object` instead of a `Message`, and we do the conversion in the `Future.wait()` pybind method. 2. RPC autograd profiler takes `FutureMessage` and installs callbacks to it. The profiler actually only need a `Future<T>` and does not care what `T` is. 3. `OwnerRRef` exposes a `getFuture()` API which returns a `FutureMessage`. This `FutureMessage` will be marked completed when the value referenced by the `OwnerRRef` is ready. `OwnerRRef` does not need it to be a Message type either, it actually creates an empty `Message` to mark the `Future`. The above places are using `FutureMessage`, but they don't really need a `Message`, and `Message` is a communication layer type that applications or profiler or the RRef shouldn't be aware of. Another motivation for making this change is that for async RPC UDF #36071, we are going to allow application to call `markCompleted` in Python. If we still use `FutureMessage`, then in the `markCompleted` pybind function, it needs to convert the provided `py::object` into a specific message type, which is leaking communication layer code to pybind functions. Even if this is doable, we will have two entities (RPC agent and pybind Python frontend) accessing the same request callback logic. This is too messy. This commit replaces all surface `FutureMessage` with `FutureIValue`, so that `FutureMessage` is no longer visible from Python land. Note that this does not cause BC issues, as the Python Future type name and its API stay intact. Internally, we still have `FutureMessage` in the communication layer. 
Test Plan: Imported from OSS Reviewed By: xush6528 Differential Revision: D21308887 Pulled By: mrshenli fbshipit-source-id: 4f574f38e83125081f142813cfdde56119522089	2020-04-29 19:10:29 -07:00
Shen Li	5c2b273089	Add RRef Python Helper to launch function on the referenced object (#36619 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36619 With this PR, applications no longer need to create dedicated helpers to run functions on the object referenced by an RRef. Instead, `rref.rpc_sync().some_func()` will use `rpc_sync` to run `some_func` on the owner of the RRef using the object referenced by the RRef. Similar helpers for `rref.rpc_async().some_func()` and `rref.remote().some_func()` are also added. An alternative design is to expose PyRRef as RRefBase and then implement everything in a new Python RRef class. However, the RRef class cannot directly inherit from PyRRef/RRefBase, otherwise we would need to let the pyRemote* C++ functions load RRef from Python and return an RRef instance. It is possible to let RRef hold an instance of PyRRef instead of inheriting from it, but this does not look like an elegant design, as we would have RRef holding PyRRef and PyRRef holding the C++ RRef. Another alternative is to use dynamic method loading, by installing member methods to PyRRef instances. However, this would require different solutions to handle RRef(data) and rpc.remote(...). Based on the above thinking, we decided to go with the current implementation for simplicity, and we can also keep all RRef-related APIs in one place. Test Plan: Imported from OSS Differential Revision: D21028333 Pulled By: mrshenli fbshipit-source-id: fe90f56ef7183d18874e357900093755e1601eb4	2020-04-21 19:29:54 -07:00
Rohan Varma	752d3c281a	[profiler] Allow record_function ctx manager to profile futures (#35055 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055 This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future. This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future. Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well. These code paths will be merged once the futures are merged. ghstack-source-id: 102478180 Test Plan: Added unit tests Differential Revision: D20452003 fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d	2020-04-20 12:37:54 -07:00
Shihao Xu	87582ae6c4	Make RRef type_hint mismatch exception message more actionable to users (#35943 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35943 This change adds a message explaining why the concrete Module type is not a subtype of the Interface type, by naming the missing method. For example, users may have forgotten to tag that method with torch.jit.export. Test Plan: Differential Revision: D7993693 fbshipit-source-id: 1a5b1d9ef483e5e120ab53c2427586560fbb9bcd	2020-04-03 10:25:09 -07:00
Yanli Zhao	ec9f680973	Enforce rref python pickling to be in the scope of RPC call (#34755 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34755 This diff disallows using the Python pickler to pickle an RRef. An RRef can only be pickled in the scope of an RPC call using _InternalRPCPickler. ghstack-source-id: 100481337 Test Plan: unit tests Differential Revision: D20453806 fbshipit-source-id: ebd4115ee01457ba6958cde805afd0a87c686612	2020-03-19 23:43:45 -07:00
Shihao Xu	b5edf329f8	[JIT] Make RPC RRef Owner WorkerInfo.name available to TorchScript (#34896 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34896 Make TorchScript support calling ref.owner() to get owner worker id and calling ref.owner_name() to get owner worker name. Differential Revision: D7652208 fbshipit-source-id: a60125bb316ac2cf19a993cbd2affc933c0af7c9	2020-03-17 20:28:18 -07:00
Shen Li	422e348619	Don't run user function until all UserRRefs in the args are confirmed (#34497 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497 Use a thread_local table to intercept UserRRefs created during user function args deserialization, and then wait for confirmations of those UserRRefs before launching the given user function. Differential Revision: D20347464 Test Plan: Imported from OSS Pulled By: mrshenli fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07	2020-03-16 18:30:06 -07:00
Shen Li	f9aa0c870f	Use c10::str in py_rref.cpp (#34681 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34681 Test Plan: Imported from OSS Differential Revision: D20428827 Pulled By: mrshenli fbshipit-source-id: 847486b3114f0e9a2ad5f80c5e44db82d977c6a2	2020-03-12 21:39:10 -07:00
Michael Suo	c235be42dd	[jit] kill script namespace (#34515 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34515 Once upon a time we thought this was necessary. In reality it is not, so removing it. For backcompat, our public interface (defined in `api/`) still has typedefs to the old `script::` names. There was only one collision: `Pass` as a `Stmt` and `Pass` as a graph transform. I renamed one of them. Test Plan: Imported from OSS Differential Revision: D20353503 Pulled By: suo fbshipit-source-id: 48bb911ce75120a8c9e0c6fb65262ef775dfba93	2020-03-11 23:32:48 -07:00
Shen Li	18ef09f5ac	Remove _load_return_value from RPC internal.py (#34492 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34492 Differential Revision: D20347468 Test Plan: Imported from OSS Pulled By: mrshenli fbshipit-source-id: 92388d0d50a08fb895bacacf94c7b5495b4ae2b6	2020-03-09 20:40:50 -07:00
Shihao Xu	17ceb6941f	[RPC] Create local RRef<ModuleInterface> remotely in Python, use it remotely in TorchScript (#34183 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34183 https://github.com/pytorch/pytorch/pull/33263 enhanced the RRef Python constructor to infer most types, by `jit::tryToInferType(..)`. But this helper function can't infer the `ScriptModule` type due to `ScriptModule`'s special per-Module type singleton logic, so it's still not possible for a Python-created RRef to know the JIT type of its contained `ScriptModule`. Instead of inferring the specific type of a Module, which could lead to too many candidate types (due to Module's multiple inheritance possibility), it's more straightforward to set its type as a user-specified `ModuleInterface` type. We added an optional argument `type_hint` for users to mark which `ModuleInterface` type an `RRef` holds. ghstack-source-id: 99649379 (Note: this ignores all push blocking failures!) Test Plan: Aspects that need to be confirmed in the test cases https://fb.quip.com/aGxRAh2lCg05 ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_create_local_script_class_rref buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_create_local_script_module_rref buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_return_local_script_class_rref_in_py_and_use_in_script buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \ && buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_return_local_script_module_rref_in_py_and_use_in_script buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \ 
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_torchscript_function_exception ``` Differential Revision: D7065050 fbshipit-source-id: e10210c0996622969e499e4a35b0659b36787c1c	2020-03-06 08:28:22 -08:00
Shen Li	7da24b36b1	Apply clang-format to RPC files (#34139 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34139 Test Plan: Imported from OSS Differential Revision: D20227342 Pulled By: mrshenli fbshipit-source-id: 01b478bde1f6a51f69eb5277fa90ba6ac2d4b5dc	2020-03-03 16:44:35 -08:00
Wanchao Liang	64aab3260a	[jit] allow RRef local creation with IValue objects (#33263 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33263 This PR allows PyRRef local creation to inspect the pyobject: if it finds that the object can be converted to an IValue, it converts it first; otherwise it holds it as a PyObjectType. Test Plan: Imported from OSS https://fb.quip.com/aGxRAh2lCg05 Differential Revision: D19871243 Pulled By: wanchaol fbshipit-source-id: ae5be3c52fb1e6db33c64e64ef64bc8b9ea63a9a	2020-02-27 22:49:53 -08:00
Michael Suo	dbe850af5b	[jit] do the code reorg (#33851 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33851 Rationale and context described in #33828. Script to reproduce the move: https://gist.github.com/suo/16cbefaaeb67ca5a7c6caffd49b7f6e9 ghstack-source-id: 99079645 Test Plan: Make sure CI passes Reviewed By: jamesr66a Differential Revision: D20133869 fbshipit-source-id: 390e9241a9c85366d9005c492ac31f10aa96488e	2020-02-27 13:02:51 -08:00
Yanli Zhao	4d9b649261	jit pickling rref (#32959 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959 In the rpc torch script call path, we need to pickle/unpickle rref; this diff makes the jit pickler/unpickler able to pickle/unpickle rref. It is similar to what is implemented for PyRef::pickle() and PyRef::unpickle(). The pickling/unpickling design assumes it is always coupled with RPC calls. It is not needed for checkpointing a model with rref: before checkpointing the model, the user should call ref.to_here() to get the value inside the rref. The pickling process is: 1. push the torch.distributed.rpc.rref global string 2. call rref.fork() and create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include rref id, fork id, caller work id, callee work id, owner work id 3. push the rrefForkData The unpickling process is: 1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function 2. the global lambda function will get the rrefForkData 3. if the callee is also the owner work id, then get the owner rref based on the IDs inside rrefForkData and return the ownerRRef 4. if the callee is not the owner work id, then create a user rref using the rrefForkData and return the userRRef 5. meanwhile the owner rref will be notified and do reference counting correctly During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a python dependency, so we get it from rpc_agent and pass it to the unpickler during construction. So we added a type_resolver argument to the jit unpickler constructor in this diff. ghstack-source-id: 98814793 Test Plan: unit test Differential Revision: D19713293 fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e	2020-02-24 11:16:35 -08:00
Wanchao Liang	93179b1c1c	[jit] Initial use RRef in TorchScript (#33190 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33190 This enables the initial RRef type to be used inside TorchScript: a user can pass a python RRef into a torchscript function and call to_here inside. Specifically, this PR: - Add RRef schema type parsing - Add python interop for RRef in Python and into JIT - register to_here op in register_distributed_ops More support for RRef in TorchScript will be added in future PRs. Test Plan: Imported from OSS Differential Revision: D19871244 Pulled By: wanchaol fbshipit-source-id: 7eca6c491a84666b261c70806254b705603bd663	2020-02-13 20:17:25 -08:00
Wanchao Liang	9ae4d38a21	[rpc] Switch RRef to be managed by intrusive_ptr (#33189 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33189 Add RRefInterface to Aten/Core, which will later be used by IValue. Switch the whole rpc code base to use intrusive_ptr instead of shared_ptr, so that we can add RRef to IValue. The actual addition to IValue and JIT will be in the next PR. Test Plan: Imported from OSS Differential Revision: D19871241 Pulled By: wanchaol fbshipit-source-id: d7e1fd04b46320e0f26c18591b49c92ad30a4032	2020-02-13 20:15:31 -08:00
Shihao Xu	12bcfa7c77	Remove Python dependency (toPyTuple/fromPyTuple, jitCompilationUnit, deserialize) in rref_impl.h/cpp (#32753 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32753 Functions to be bound as an Aten operator cannot have a Python dependency. This refactor removes the Python dependency. ghstack-source-id: 97485800 Test Plan: ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported ``` ``` buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call ``` Differential Revision: D5741675 fbshipit-source-id: 31ee60955be8d815d0773f3699e3ff2f1f9d8849	2020-01-30 17:52:48 -08:00
Shen Li	a40a19ccab	Remove GIL from RRefContext (#32807 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32807 After this commit, RRefContext no longer depends on pybind. Test Plan: Imported from OSS Differential Revision: D19636316 Pulled By: mrshenli fbshipit-source-id: 88faa101c32e9019e979ae8e5da6706e49842726	2020-01-30 10:53:25 -08:00

66 Commits