Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59298
After recent changes, LazyStreamContext had in fact become fully eager, making it equivalent to a plain vector of streams. It therefore makes more sense now to remove this abstraction and use a more self-descriptive type.
This PR migrates the RequestCallback internals. The next PR migrates the TensorPipe agent.
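As an illustrative sketch (with toy stand-in types, not the real c10 ones), the now-eager context amounts to nothing more than a vector, which is why the abstraction can be dropped:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a device stream handle (not the real c10::Stream).
struct Stream { int device; std::uint64_t id; };

// What LazyStreamContext had effectively become after the eager change:
// a thin wrapper over an eagerly-populated list of streams.
struct EagerStreamContext {
  std::vector<Stream> streams;
};

// Hence call sites can simply take the self-descriptive type directly.
inline std::size_t numStreams(const std::vector<Stream>& streams) {
  return streams.size();
}
```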
ghstack-source-id: 130583774
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28789175
fbshipit-source-id: fa581a50f9a6a1e42c2ad8c808a9b099bea7433e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355
We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or rather ivalue::Future nowadays, after the two got merged) this is all done automatically for us, hence we can undo these "fixes" as they are now redundant.
ghstack-source-id: 130583771
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28118182
fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59210
Reland of https://github.com/pytorch/pytorch/pull/58427
Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
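The "make the streams current right around the UDF call" pattern can be sketched with a toy RAII guard (names and types here are illustrative, not the real c10 stream guards):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-ins for streams and a thread-local "current stream" registry.
struct Stream { int id; };
thread_local std::vector<Stream> currentStreams;

// RAII guard: makes the given streams current for the duration of a scope.
struct MultiStreamGuard {
  std::vector<Stream> saved;
  explicit MultiStreamGuard(std::vector<Stream> streams)
      : saved(currentStreams) {
    currentStreams = std::move(streams);
  }
  ~MultiStreamGuard() { currentStreams = saved; }
};

// The pattern from the PR: set the context's streams as current right
// where the user-defined function is invoked, not higher up the stack.
template <typename Udf>
auto runWithStreams(std::vector<Stream> streams, Udf&& udf) {
  MultiStreamGuard guard(std::move(streams));
  return udf();  // the inputs' streams are current exactly while the UDF runs
}
```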
ghstack-source-id: 130202847
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623889
fbshipit-source-id: ed38242f813dac075d162685d52ae89f408932f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58427
Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 129567052
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474983
fbshipit-source-id: 358292764d0a6832081c34bf6736f0961475ff3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859
Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this logic was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to futures, we can have that logic once, and then connect it to different follow-up steps.
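The shape of the deduplication can be sketched with a toy, already-completed future type (the real code uses ivalue::Future): one shared fetch, with per-consumer post-processing chained on.

```cpp
#include <cassert>
#include <string>

// Minimal value-carrying "future" stand-in (trivially completed), just to
// show the shape of the refactor; not the real ivalue::Future API.
template <typename T>
struct Ready {
  T value;
  template <typename F>
  auto then(F&& f) -> Ready<decltype(f(value))> {
    return {f(value)};
  }
};

// One shared code path fetches the value...
inline Ready<int> fetchOwnerRRefValue() { return Ready<int>{5}; }

// ...and each caller chains its own post-processing instead of duplicating
// the fetch (JIT / Python / autograd in the real code; these are toys).
inline std::string fetchForPython() {
  return fetchOwnerRRefValue().then([](int v) { return std::to_string(v); }).value;
}
inline int fetchForJit() {
  return fetchOwnerRRefValue().then([](int v) { return v * 2; }).value;
}
```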
ghstack-source-id: 129567050
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28286172
fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57857
There used to be a whole lot of methods: `processPythonCall`, `processScriptCall`, `processScriptRemoteCall`, `processPythonRemoteCall`, `processScriptCallOp`, `processBaseScriptRemoteCall` and `processScriptRemoteCallOp`. Thanks to the previous simplification, we can now drop all but the first four, which map nicely 1:1 to the four message types we need to handle. Also their signatures become much simpler: they take an RPC command and return a future.
ghstack-source-id: 129567070
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253848
fbshipit-source-id: e0e45345c414a96900f9d70ee555359d28908833
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57855
We already had a helper to run Python functions, which was nice (it de-duplicated some code). However, this helper took a callback which, as mentioned earlier in the stack, isn't as nice as returning a Future. Hence here I change this.
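A toy before/after sketch of the helper's shape (names are hypothetical, and the "future" here is a trivially-completed stand-in):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type for the Python-call helper.
struct Result { std::string payload; };

// Before: the caller supplies a callback that the helper invokes on completion.
inline void runPythonCallWithCallback(std::function<void(Result)> cb) {
  cb(Result{"ok"});
}

// After: the helper returns a future-like value (here trivially completed)
// that the caller can chain, inspect, or pass around.
template <typename T>
struct Ready { T value; };

inline Ready<Result> runPythonCall() { return Ready<Result>{Result{"ok"}}; }
```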
ghstack-source-id: 129567054
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253846
fbshipit-source-id: d854d4aa163798fb015cd6d46932f9ff1d18262e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854
Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef. Which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
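A minimal sketch of the idea, with stand-in types: compute the value first, then derive its type from the value itself rather than predicting it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Toy stand-in for a dynamically-typed RPC result value.
using Value = std::variant<int, double, std::string>;

// Once the result exists, its type can simply be read off the value.
inline std::string typeOf(const Value& v) {
  switch (v.index()) {
    case 0: return "int";
    case 1: return "double";
    default: return "str";
  }
}

// Creating the (hypothetical) owner record after computing the value means
// the type is derived, never guessed ahead of time.
struct OwnerRecord { Value value; std::string type; };
inline OwnerRecord makeOwnerRecord(Value v) {
  std::string t = typeOf(v);
  return OwnerRecord{std::move(v), std::move(t)};
}
```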
ghstack-source-id: 129567060
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253843
fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57853
A bunch of methods received an OwnerRRef to "fill in". I think it will be more flexible to do it the other way around, and have these methods return a value (wrapped in a Future), which can then be "connected" to an OwnerRRef, but which can also potentially be consumed in different ways.
ghstack-source-id: 129567059
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253844
fbshipit-source-id: 7e3772312dbacfc75a6ac0f62189fc9828001fc7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57852
Another great example of the benefits of Futures. Thanks to the "right abstraction" (i.e., the `thenAsync` method), adding support for async execution becomes trivial, and the code becomes much simpler than it used to be.
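The role `thenAsync` plays can be sketched with a toy future type (not the real API): the continuation itself returns a future, so an async user function slots straight in.

```cpp
#include <cassert>

// Toy, trivially-completed future, just to show the shape of the two
// chaining primitives; not the real ivalue::Future interface.
template <typename T>
struct Fut {
  T value;
  // then: the continuation returns a plain value.
  template <typename F>
  auto then(F&& f) -> Fut<decltype(f(value))> { return {f(value)}; }
  // thenAsync: the continuation returns another future, which is flattened.
  template <typename F>
  auto thenAsync(F&& f) -> decltype(f(value)) { return f(value); }
};

// A user function that is itself asynchronous (returns a future).
inline Fut<int> asyncUserFunction(int x) { return Fut<int>{x + 1}; }

// Supporting async execution becomes a one-liner: chain the async UDF.
inline int runAsync(int x) {
  return Fut<int>{x}.thenAsync(asyncUserFunction).value;
}
```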
ghstack-source-id: 129567063
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253842
fbshipit-source-id: b660151ca300f3d6078db0f3e380c80a4d8f5190
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57851
The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253841
fbshipit-source-id: 2b8affde16c106f5c76efa8be49af070213708bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57848
This PR looks large, but all it does is add a dozen lines and remove a lot of other ones.
One first advantage of using Futures is that we can easily chain some "post-processing" to them. Until now, we needed to pass the message ID around everywhere because it was set separately by each method. Instead, we can simply add a follow-up step to the final future which sets this ID, and remove all the former logic.
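A minimal sketch of the pattern, with toy types: handlers stop threading the id through, and one chained step stamps it onto the final response.

```cpp
#include <cassert>
#include <cstdint>

// Toy response message and trivially-completed future, for illustration only.
struct Message { std::uint64_t id = 0; int payload = 0; };

template <typename T>
struct Fut {
  T value;
  template <typename F>
  auto then(F&& f) -> Fut<decltype(f(value))> { return {f(value)}; }
};

// Handlers no longer need to know the message id...
inline Fut<Message> handleRequest(int payload) {
  return Fut<Message>{Message{0, payload * 2}};
}

// ...because one follow-up step on the final future sets it.
inline Message process(int payload, std::uint64_t requestId) {
  return handleRequest(payload)
      .then([requestId](Message m) { m.id = requestId; return m; })
      .value;
}
```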
ghstack-source-id: 129567065
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28224477
fbshipit-source-id: 7b6e21646262abe5bbbf268897e2d792e5accc27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847
This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.
With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then waiting on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.
In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.
If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.
I also think that, even without these concerns about CUDA, there are benefits to using Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).
In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.
---
Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various methods could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.
In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
ghstack-source-id: 129567062
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28224478
fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636
The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.
At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.
This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
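The difference between the two pointer types can be sketched with a minimal intrusive refcount (this is an illustration, not the real c10::intrusive_ptr): the count lives inside the object itself, so no separate control block is allocated and a raw pointer can always be safely re-wrapped.

```cpp
#include <atomic>
#include <cassert>

// Base class holding the intrusive reference count inside the object.
struct RefCounted {
  std::atomic<int> refcount{0};
};

// Toy intrusive smart pointer: copying bumps the in-object count,
// destruction drops it and deletes at zero.
template <typename T>
struct IntrusivePtr {
  T* ptr = nullptr;
  explicit IntrusivePtr(T* p) : ptr(p) { if (ptr) ptr->refcount++; }
  IntrusivePtr(const IntrusivePtr& o) : ptr(o.ptr) { if (ptr) ptr->refcount++; }
  ~IntrusivePtr() {
    if (ptr && --ptr->refcount == 0) delete ptr;
  }
  T* operator->() const { return ptr; }
};

// A future-like object managed intrusively (purely illustrative).
struct MyFuture : RefCounted { int value = 7; };
```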
ghstack-source-id: 128296581
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D28187972
fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
Summary:
Addresses step 1 of https://github.com/pytorch/pytorch/issues/46564
Took processing logic for each case in request_callback_no_python.cpp and put it in a dedicated function.
cc: izdeby
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47816
Reviewed By: izdeby
Differential Revision: D25090207
Pulled By: H-Huang
fbshipit-source-id: bfa38e38db02e077d859125739aaede90ba492e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of `pipeWrite`,
and makes sure the error is thrown before the active call count is
incremented.
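The ordering fix can be sketched abstractly (names are illustrative): anything that can throw is checked before the counter is bumped, so a failure never leaves the counter stuck.

```cpp
#include <cassert>
#include <stdexcept>

// Toy agent: validate before incrementing the active-call counter, so a
// failed send never leaves the counter permanently incremented (which
// would stall shutdown in the real agent).
struct Agent {
  int activeCalls = 0;

  static void checkDevices(bool devicesOk) {
    if (!devicesOk) throw std::runtime_error("unsupported tensor device");
  }

  void send(bool devicesOk) {
    checkDevices(devicesOk);  // may throw; counter untouched on failure
    ++activeCalls;
    // ... pipeWrite(...) would happen here ...
    --activeCalls;  // decremented on completion in the real code
  }
};
```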
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967
When enabling the profiler on the server, if it is a different machine it may
not have CUDA while the caller does. In this case, we used to crash, but now
we fall back to CPU and log a warning.
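A sketch of the fallback decision (the enum and function names are hypothetical):

```cpp
#include <cassert>
#include <string>

// Toy model of the profiler-mode negotiation: if the server lacks CUDA,
// downgrade the requested mode instead of crashing, and surface a warning.
enum class ProfilerMode { CPU, CUDA };

struct Choice { ProfilerMode mode; std::string warning; };

inline Choice chooseProfilerMode(ProfilerMode requested, bool cudaAvailable) {
  if (requested == ProfilerMode::CUDA && !cudaAvailable) {
    return {ProfilerMode::CPU,
            "CUDA profiling requested but CUDA is unavailable; falling back to CPU"};
  }
  return {requested, ""};
}
```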
ghstack-source-id: 112977906
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D23790729
fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893
Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.
The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.
As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167
Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D21112208
fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497
Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.
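The interception pattern can be sketched as follows (names are illustrative, and the deserialization is simulated):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Thread-local table: deserialization code registers each UserRRef it
// creates here, without needing to thread a collection through every call.
thread_local std::vector<std::uint64_t> interceptedUserRRefs;

inline void onUserRRefDeserialized(std::uint64_t rrefId) {
  interceptedUserRRefs.push_back(rrefId);
}

// Deserialize args (simulated here), then hand back the intercepted ids so
// the caller can wait for their confirmations before running the UDF.
inline std::vector<std::uint64_t> deserializeAndCollect() {
  interceptedUserRRefs.clear();
  onUserRRefDeserialized(101);  // simulated: args contained two UserRRefs
  onUserRRefDeserialized(102);
  auto collected = interceptedUserRRefs;
  interceptedUserRRefs.clear();
  return collected;
}
```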
Differential Revision: D20347464
Test Plan: Imported from OSS
Pulled By: mrshenli
fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30637
RequestCallback api currently forces work to be always synchronous, which,
as we scale, means we're going to need to throw a large number of (mostly
blocked) threads at the rpc problem. For some activities like dependent
autograd rpcs, there's no real need to block in these threads.
In this change, the RequestCallback api is updated to return a
shared_ptr<FutureMessage> rather than a Message:
std::shared_ptr<FutureMessage> operator()(Message& request) const;
With a futures-style api, RPC ops that wish to be async can then be async,
while short-lived blocking functions (or Python UDFs) can just block.
In this change, we keep all of the current ops as synchronous (i.e. we block
and then return a completed FutureMessage). We also update the rpc_agents in
a manner compatible with this sort of parallelism.
Here, we only want to incur overhead when we use the async behavior.
Some modest extra cost seems unavoidable here (e.g. the allocation for the
std::make_shared<>), but we can trivially detect the synchronous/completed
case in the rpc_agent and avoid the extra thread-switches/etc. in that case.
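The synchronous fast path can be sketched with toy types: when the future is already completed, run the continuation inline and skip the thread switch.

```cpp
#include <cassert>

// Toy future: either already completed (synchronous op) or pending.
template <typename T>
struct FutureLike {
  bool completed = false;
  T value{};
};

// If the future is already completed, invoke the continuation inline;
// otherwise (sketched here by bumping a counter) the real code would
// enqueue it onto another thread.
template <typename T, typename F>
void addContinuation(const FutureLike<T>& fut, F&& f, int& threadSwitches) {
  if (fut.completed) {
    f(fut.value);  // inline: no extra thread hop for synchronous ops
  } else {
    ++threadSwitches;  // stand-in for scheduling onto another thread
    f(fut.value);
  }
}
```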
ghstack-source-id: 95287026
Test Plan:
- Basic: buck test mode/dev-nosan caffe2/test/...
- Additional testcase in ThriftRpcAgentTest for deferred work.
Differential Revision: D18774322
fbshipit-source-id: cf49922a71707cfb1726de16f93af23b160385d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28312
1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal.
This diff makes sure an RPC with autograd metadata is sent only if the autograd context is valid and the tensors require grads.
2. Meanwhile, create a utility to attach autograd info and functions as needed.
3. Add autograd send/recv functions for python rpc calls.
4. Make changes to support nested python rpc calls.
5. Disallow nested dist autograd contexts (was landed in #27022).
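Point 1 boils down to a simple conjunction, sketched here with a hypothetical helper:

```cpp
#include <cassert>

// Attach autograd metadata only when both conditions hold: the distributed
// autograd context is valid AND some tensor actually requires grad.
// (Illustrative helper, not the real function name.)
inline bool shouldAttachAutogradMeta(bool contextValid, bool anyTensorRequiresGrad) {
  return contextValid && anyTensorRequiresGrad;
}
```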
ghstack-source-id: 92240367
Test Plan: unit tests
Differential Revision: D18017554
fbshipit-source-id: dbe79a5171063901a78a9b3322b9b31c159d098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27576
1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal.
This diff makes sure an RPC with autograd metadata is sent only if the autograd context is valid and the tensors require grads.
2. Meanwhile, create a utility to attach autograd info and functions as needed.
3. Add autograd send/recv functions for python rpc calls.
4. Make changes to support nested python rpc calls.
5. Disallow nested dist autograd contexts (was landed in #27022).
ghstack-source-id: 92154535
Test Plan: unit tests
Differential Revision: D17819153
fbshipit-source-id: 37d8a85855bf591f2f2da48d475a06e870a30ea1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527
Master GH issue: https://github.com/pytorch/pytorch/issues/23110.
This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.
Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
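One common scheme for such a globally unique id, sketched here purely as an illustration (not necessarily the exact encoding PyTorch uses): pack the worker id into the high bits and a per-worker monotonic counter into the low bits.

```cpp
#include <cassert>
#include <cstdint>

// Globally unique id sketch: no two workers can collide because the worker
// id occupies the high bits, and no single worker repeats an id because the
// local counter is monotonically increasing.
inline std::int64_t makeAutogradMessageId(std::int64_t workerId,
                                          std::int64_t& localCounter) {
  return (workerId << 48) | (localCounter++ & ((std::int64_t(1) << 48) - 1));
}
```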
ghstack-source-id: 91240466
Test Plan: unit tests.
Differential Revision: D17148077
fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233